Most modern speech processing systems require large amounts of audio and text data for training acoustic and language models. Depending on the type of application, the required data varies from high-quality microphone read speech to conversational telephone speech, and from continuous speech to connected and isolated words. Currently, the read-speech corpus (RSC) contains more than 40 hours of high-quality microphone read Kazakh speech from 169 native speakers, intended for large vocabulary continuous speech recognition tasks.

The text materials to be uttered were carefully selected from the primary section of the corpus and divided into two parts: sentences and stories. The sentences part contains more than 12 000 distinct sentences extracted randomly and in equal proportions from texts covering all five genres present in the corpus. The sentences were chosen so that, in total, they contain more than 120 000 words belonging to the set of the most frequent words, which covers 95% of all the texts in the corpus. Additionally, the sentences were grouped by their length in words into ten groups: the first group contains sentences of length six, the second of length seven, and so on up to length 15. The stories part contains short online news items extracted from the publicistic genre section of the corpus; each story consists of up to 300 words. All the materials were subdivided into non-intersecting sets of texts and distributed among the speakers as follows: each speaker was assigned exactly 75 sentences and one story. Of the 75 sentences, 50 belonged to the first five (short-sentence) groups (10 sentences per group) and the remaining 25 to the last five (long-sentence) groups (5 sentences per group).
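The grouping and per-speaker assignment scheme above can be sketched as follows; this is an illustrative reconstruction, not the tooling actually used to build the corpus, and the function and variable names are hypothetical:

```python
def group_sentences_by_length(sentences):
    """Group sentences into the ten length groups (6 to 15 words) used in the corpus."""
    groups = {length: [] for length in range(6, 16)}
    for sentence in sentences:
        n_words = len(sentence.split())
        if n_words in groups:  # sentences outside 6-15 words are skipped
            groups[n_words].append(sentence)
    return groups

# Per-speaker assignment described above: 10 sentences from each of the five
# short groups (6-10 words) and 5 from each of the five long groups
# (11-15 words), i.e. 75 sentences per speaker in total.
PER_GROUP = {length: (10 if length <= 10 else 5) for length in range(6, 16)}
assert sum(PER_GROUP.values()) == 75
```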

The main criteria for speaker selection were the following: the region where the speaker learned Kazakh or spent most of his/her life; age; gender; and the ability to read Kazakh. The first criterion helps to capture the various accents attributed to speakers' places of settlement, both domestic and foreign. From the regional perspective, the speakers are divided into 15 groups: 14 domestic (one per administrative region, i.e. oblast, of Kazakhstan) and one abroad (all foreign countries). Furthermore, the speakers are divided into the following four age groups (not including children and school students):

  • I group – 18-27 years;
  • II group – 28-37 years;
  • III group – 38-47 years;
  • IV group – 48 years and above.

The female-to-male distribution of speakers is 57% to 43%, respectively, with no more than three speakers of the same gender per age-regional group. Additionally, a record of each speaker's education is kept, i.e. whether they attended and graduated from a university, or graduated from a school or a college without attending a university. The speakers were encoded using the following scheme: <Region><Gender><Year of birth><Initials><Education>, where "Region" takes values in the range [1-15], "Gender" is F or M, "Year of birth" is the last two digits of the year of birth, "Initials" are the initials of the name followed by the surname, and "Education" is 1 for school, 2 for college, and 3 for university, e.g. 06F70ZK3. In total, 169 speakers were recorded.
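A parser for this coding scheme might look as follows. It is a sketch based solely on the scheme description and the example 06F70ZK3; details such as whether "Region" is always zero-padded to two digits are assumptions:

```python
import re

# Fields: <Region><Gender><Year of birth><Initials><Education>, e.g. 06F70ZK3.
# The field widths are inferred from the example and may need adjustment.
CODE_RE = re.compile(r"^(\d{1,2})([FM])(\d{2})([A-Z]+)([123])$")
EDUCATION = {"1": "school", "2": "college", "3": "university"}

def parse_speaker_code(code):
    """Split a speaker code into its named fields (hypothetical helper)."""
    m = CODE_RE.match(code)
    if m is None:
        raise ValueError(f"unrecognized speaker code: {code!r}")
    region, gender, yob, initials, edu = m.groups()
    if not 1 <= int(region) <= 15:
        raise ValueError(f"region out of range [1-15]: {region}")
    return {
        "region": int(region),
        "gender": gender,
        "year_of_birth": yob,   # last two digits only, as in the scheme
        "initials": initials,
        "education": EDUCATION[edu],
    }
```

For example, `parse_speaker_code("06F70ZK3")` yields region 6, gender F, year of birth "70", initials "ZK", and education "university".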

Table 1 presents the distribution of speakers across the age, gender and regional groups. Blank cells indicate speaker profiles that could not be recruited; these cases mostly correspond to distant regions and the older male groups.

Table 1
Age Groups 1 2 3 4
Code Province F1 M1 F2 M2 F3 M3 F4 M4 Sum
1 Akmola 3 3 2 1 2 1 2 1 15
2 Aktobe 2 3 2 1 2 1 11
3 Almaty 1 1 2 3 2 1 1 11
4 Atyrau 3 2 1 1 7
5 East Kazakhstan 2 2 2 1 2 2 2 1 14
6 Zhambyl 2 2 2 2 2 1 2 13
7 West Kazakhstan 2 2 1 2 2 2 1 12
8 Karagandy 2 1 1 2 1 1 2 1 11
9 Kostanay 3 2 2 1 3 1 1 1 14
10 Kyzylorda 1 1 2 2 1 1 2 1 11
11 Mangystau 2 1 2 1 1 2 9
12 Pavlodar 2 2 2 2 1 2 1 12
13 North Kazakhstan 2 2 2 1 1 1 1 1 11
14 South Kazakhstan 2 1 1 1 1 2 1 2 11
15 Other 1 3 1 2 7
Sum 30 28 23 20 22 12 21 13 169

Recording setup

The actual recording sessions took place in a sound-proof studio of the university with the assistance of a sound operator. Before the recordings, the speakers were instructed, their details were documented, and they were given some time to prepare; they were also asked to sign a copyright transfer form for the recordings of their voice. They were not constrained in manner, speed or time, except for the correctness of reading. An average recording session lasted about 40-45 minutes per speaker, though some sessions lasted up to two hours. Audio data was captured using a professional vocal microphone (Neumann TLM 49) and digitized by a LEXICON I-ONIX U82S sound card. The recorded audio files are 44.1 kHz, 16-bit, PCM-encoded mono WAVE files. All the recorded audio files were manually post-processed so that each utterance (sentence or story) is stored in a separate file in the corresponding directory. The speech corpus occupies about 8.5 GB on disk, and the total duration of the audio files exceeds 40 hours.

Each audio file is accompanied by its corresponding orthographic transcription and TIMIT-style word-level segmentation, as well as morpho-syntactic annotation files. Both the transcript generation and the annotation were performed manually by trained linguists. The transcription files contain the exact orthographic transcriptions of the utterances, which may differ from the original text. For example, numbers, abbreviations, foreign words and dates are expanded according to how they were uttered by the speakers. In addition, the transcriptions of the stories have sentence boundaries labeled with <s> and </s>.
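Given the <s> and </s> boundary labels, the sentences of a story transcription can be recovered with a simple regular expression; the function name is hypothetical:

```python
import re

def story_sentences(transcript):
    """Return the sentences of a story transcription, using the
    <s> ... </s> sentence-boundary labels described above."""
    return [s.strip() for s in re.findall(r"<s>(.*?)</s>", transcript, re.DOTALL)]
```

For example, `story_sentences("<s>bul birinshi soilem</s> <s>bul ekinshi soilem</s>")` returns the two sentences as a list of strings.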

For the segmentation, WaveSurfer, an open-source tool for sound visualization and manipulation, was used. The tool supports the TIMIT word-level transcription format. Although it supports Unicode, it does not provide proper support for the Kazakh-specific symbols. Therefore, an ASCII version of the Kazakh letters was used, cf. Table 2. The # symbol was used to mark pauses and silence, and other non-speech events were marked by the ^ symbol.
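TIMIT-style label files are plain text with one segment per line: a start offset, an end offset, and a label. A minimal reader for such files, including handling of the # and ^ markers described above, might look as follows; it is a sketch, and the assumption that offsets are integer sample indices follows TIMIT convention rather than anything stated in this section:

```python
def read_segments(lines, keep_nonspeech=False):
    """Parse TIMIT-style word-level segmentation lines of the form
    '<start> <end> <label>'. Labels '#' (pauses and silence) and '^'
    (other non-speech events) are dropped unless keep_nonspeech is set."""
    segments = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        start, end, label = line.split(None, 2)
        if label in ("#", "^") and not keep_nonspeech:
            continue
        segments.append((int(start), int(end), label))
    return segments
```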

Table 2
ASCII versions of the Kazakh letters
Letter  ASCII version  Letter  ASCII version

[Figures: example recordings of speakers 03M75ZI3 and 09F74BA3]