Currently, the size of the primary corpus exceeds 135 million words and it contains more than 400 000 documents classified into five major genres (domains):

  1. Literary: comprises Kazakh literary works, including novels, stories, poems, etc., published from the beginning of the 20th century to the present;
  2. Official: includes mainly official statutes, orders, acts and other legal documents produced by governmental organizations between 2009 and 2012;
  3. Scientific: includes books, research monographs, dissertations, articles and essays from various areas, such as informatics, biology, chemistry, etc.;
  4. Publicistic (mass media): comprises periodicals and articles from online sources, i.e. newspapers and magazines published over the last ten years;
  5. Informal: includes documents with colloquial Kazakh texts extracted from popular blog platforms, starting from 2009.
During corpus compilation, documents were not selected according to strict criteria such as domain, time and medium of texts. This decision was dictated mainly by the scarcity of materials, partly for the reasons mentioned in the introduction.

Each document is stored in a plain text format in the UTF-8 encoding. Documents contain both the content and the meta-data in a single file, and have the following simple structure:

  • TITLE – the title of a document;
  • SOURCE – the source of a document;
  • AUTHOR – the author(s) of a document;
  • DATE – the date when a document was published;
  • META – additional information;
  • TEXT – the content of a document.

Provided that the corresponding information is present in a source, the <META> tag contains both the name of the corpus section to which a document belongs and a further sub-category, such as the type of a literary work (e.g. a poem). For sources that lack meta-data, such as digitized books, dissertations and scientific papers, the corresponding categories (informatics, biology, chemistry, etc.) are assigned manually.
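
The structure above can be read with a few lines of code. The following is only a minimal sketch: the text names the fields but does not spell out the exact serialization, so the paired <TAG>...</TAG> wrapping, the function name parse_document and the sample document are all assumptions made for illustration.

```python
import re

# Field names as listed in the text. The paired <TAG>...</TAG> wrapping
# is an assumption; the actual file layout may differ.
FIELDS = ("TITLE", "SOURCE", "AUTHOR", "DATE", "META", "TEXT")


def parse_document(raw):
    """Split one corpus file into its meta-data fields and content.

    Fields that are absent from the file are returned as None.
    """
    doc = {}
    for name in FIELDS:
        match = re.search(rf"<{name}>(.*?)</{name}>", raw, flags=re.DOTALL)
        doc[name] = match.group(1).strip() if match else None
    return doc


# Hypothetical sample document, invented for this example.
sample = (
    "<TITLE>Example title</TITLE>\n"
    "<SOURCE>example source</SOURCE>\n"
    "<DATE>2010</DATE>\n"
    "<META>Literary, poem</META>\n"
    "<TEXT>\nқазақ тілі\n</TEXT>\n"
)

doc = parse_document(sample)
print(doc["META"])    # section name and sub-division, kept as a raw string
print(doc["AUTHOR"])  # None, because the sample has no AUTHOR field
```

Since the files are stored in UTF-8, a real file would be read with an explicit encoding, e.g. open(path, encoding="utf-8"), before being passed to the function.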

The main sources of data were Internet websites as well as digitized books, dissertations and articles from public and personal libraries. Various types of noise, such as texts in other languages and web banners, were filtered out. It took about 7 months to grow the corpus to its current size, and data collection is still in progress. Table 1 provides a general quantitative description of the corpus.

Table 1: Genre statistics
Genre           # docs     # all words     # unique words
Literary         8 255       7 733 456            423 445
Publicistic    404 884      79 302 154            951 659
Official        25 302      44 670 856            335 264
Scientific         527       2 227 878            153 877
Informal         6 110       1 337 953            162 074
TOTAL          445 078     135 272 297          1 365 202
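
The figures in Table 1 can be cross-checked with a short script. The sketch below copies the numbers from the table; the helper names column_sum and word_share are introduced here for illustration only. The document and word columns add up to the TOTAL row, while the per-genre unique-word counts add up to more than the total, presumably because the genres share vocabulary.

```python
# Figures copied from Table 1: genre -> (documents, all words, unique words).
TABLE_1 = {
    "Literary": (8_255, 7_733_456, 423_445),
    "Publicistic": (404_884, 79_302_154, 951_659),
    "Official": (25_302, 44_670_856, 335_264),
    "Scientific": (527, 2_227_878, 153_877),
    "Informal": (6_110, 1_337_953, 162_074),
}
TOTAL = (445_078, 135_272_297, 1_365_202)


def column_sum(index):
    """Sum one column of Table 1 over the five genres."""
    return sum(row[index] for row in TABLE_1.values())


def word_share(genre):
    """Percentage of all corpus words that belong to the given genre."""
    return 100 * TABLE_1[genre][1] / TOTAL[1]


for genre in TABLE_1:
    print(f"{genre:<12} {word_share(genre):6.2f} % of all words")
```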

The current Kazakh Cyrillic alphabet consists of 42 letters, of which 9 are specifically Kazakh and the rest are adopted from the Russian alphabet. Figure 1 shows the distribution of Kazakh letters in the corpus in descending order of frequency. It can be seen that the purely Russian letters (underlined) have a small but non-zero share. This can be explained by the unavoidable use of Russian words, either because a proper Kazakh translation is lacking or because the vocabulary was inherited from Russian.

Figure 1: Letter distribution
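
A distribution like the one in Figure 1 can be computed along the following lines. This is a minimal sketch, not the procedure used to produce the figure: the function name letter_distribution and the decision to fold uppercase into lowercase and to ignore every character outside the 42-letter alphabet are assumptions.

```python
from collections import Counter

# The 42 letters of the Kazakh Cyrillic alphabet, in alphabetical order.
KAZAKH_ALPHABET = "аәбвгғдеёжзийкқлмнңоөпрстуұүфхһцчшщъыіьэюя"


def letter_distribution(text):
    """Return (letter, percentage) pairs in descending order of frequency.

    Uppercase letters are folded into lowercase; characters that are not
    in the alphabet (digits, punctuation, Latin letters) are ignored.
    """
    counts = Counter(ch for ch in text.lower() if ch in KAZAKH_ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(letter, 100 * n / total) for letter, n in counts.most_common()]


for letter, percent in letter_distribution("Қазақ тілі"):
    print(f"{letter} {percent:.2f}")
```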


Table 2 lists, in alphabetical order, the total number of occurrences of each letter together with its position-specific counts (leading, middle, trailing).
Table 2: Letter statistics
Letter  HTML code  Unicode (lowercase)  Unicode (uppercase)  %  Occurrences  Leading  Middle  Trailing
а &#1072; u+0430 u+0410 12.4727 99654792 9529745 79561513 10563534
ә &#1241; u+04d9 u+04d8 0.8693 6945654 1879539 5021725 44390
б &#1073; u+0431 u+0411 2.373 18959613 12193649 6566276 199688
в &#1074; u+0432 u+0412 0.2869 2292241 317629 1440292 534320
г &#1075; u+0433 u+0413 1.2187 9736944 410610 9086048 240286
ғ &#1171; u+0493 u+0492 1.5419 12319232 436842 11868245 14145
д &#1076; u+0434 u+0414 4.2401 33877999 3952698 29643851 281450
е &#1077; u+0435 u+0415 7.7358 61807524 3640844 49398668 8768012
ё &#1105; u+0451 u+0401 0.0002 1894 133 1529 232
ж &#1078; u+0436 u+0416 1.8113 14471977 11275286 3029432 167259
з &#1079; u+0437 u+0417 1.4466 11557878 720115 9177930 1659833
и &#1080; u+0438 u+0418 1.478 11809258 755410 10406482 647366
й &#1081; u+0439 u+0419 1.2743 10181216 4390 8308612 1868214
к &#1082; u+043a u+041a 2.9624 23669282 7228054 12918398 3522830
қ &#1179; u+049b u+049a 3.3929 27108431 11005931 10564192 5538308
л &#1083; u+043b u+041b 5.1998 41545645 257572 39049114 2238959
м &#1084; u+043c u+041c 3.4132 27270997 6682926 18922872 1665199
н &#1085; u+043d u+041d 6.7189 53682469 1761785 35101031 16819653
ң &#1187; u+04a3 u+04a2 1.5403 12306815 1508 3556079 8749228
о &#1086; u+043e u+041e 2.5321 20231322 5340291 14567580 323451
ө &#1257; u+04e9 u+04e8 0.8853 7073380 2447075 4617496 8809
п &#1087; u+043f u+041f 1.4 11185839 1633197 6059293 3493349
р &#1088; u+0440 u+0420 5.656 45190041 1567103 37812903 5810035
с &#1089; u+0441 u+0421 4.2802 34197856 6689242 25331877 2176737
т &#1090; u+0442 u+0422 6.2169 49671631 8337601 39100402 2233628
у &#1091; u+0443 u+0423 2.1609 17265091 408440 12226529 4630122
ұ &#1201; u+04b1 u+04b0 0.8227 6572874 1221711 5319102 32061
ү &#1199; u+04af u+04ae 0.6747 5390413 1503082 3883683 3648
ф &#1092; u+0444 u+0424 0.1265 1010650 341540 647235 21875
х &#1093; u+0445 u+0425 0.2325 1857959 738671 1031537 87751
һ &#1211; u+04bb u+04ba 0.0078 62643 5338 47858 9447
ц &#1094; u+0446 u+0426 0.1491 1190959 60043 1115215 15701
ч &#1095; u+0447 u+0427 0.0343 274205 52708 203642 17855
ш &#1096; u+0448 u+0428 1.2641 10100192 2382882 7258931 458379
щ &#1097; u+0449 u+0429 0.0059 46787 3849 41774 1164
ъ &#1098; u+044a u+042a 0.0151 120832 1109 115939 3784
ы &#1099; u+044b u+042b 7.7921 62257817 211539 47916592 14129686
і &#1110; u+0456 u+0406 5.1672 41285230 1303034 30925559 9056637
ь &#1100; u+044c u+042c 0.0703 561953 322 391310 170321
э &#1101; u+044d u+042d 0.0796 635713 447705 184848 3160
ю &#1102; u+044e u+042e 0.0835 667340 18042 589922 59376
я &#1103; u+044f u+042f 0.3668 2930963 73030 2285595 572338
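
Position-specific counts of the kind reported in Table 2 could be obtained as sketched below. The tokenization (maximal runs of letters), the treatment of a one-letter word as a leading occurrence and the function name position_counts are assumptions; the text does not state how these cases were handled. In Table 2 the three positional counts add up to the total for a letter (for а, 9529745 + 79561513 + 10563534 = 99654792), and the sketch preserves that property by placing every occurrence in exactly one slot.

```python
import re
from collections import defaultdict

# A word is taken to be a maximal run of letters (an assumption).
WORD = re.compile(r"[^\W\d_]+")


def position_counts(text):
    """Count, for every letter, its leading, middle and trailing occurrences.

    Each occurrence falls into exactly one slot, so the three counts of a
    letter always add up to its total number of occurrences. A one-letter
    word is counted as a leading occurrence (an assumption).
    """
    counts = defaultdict(lambda: {"leading": 0, "middle": 0, "trailing": 0})
    for word in WORD.findall(text.lower()):
        last = len(word) - 1
        for i, ch in enumerate(word):
            if i == 0:
                slot = "leading"
            elif i == last:
                slot = "trailing"
            else:
                slot = "middle"
            counts[ch][slot] += 1
    return counts


counts = position_counts("Қазақ тілі")
for letter, slots in counts.items():
    print(letter, slots, sum(slots.values()))
```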