Welcome to KLC, an open Kazakh Language Corpus!

This site hosts the KLC project and provides a search mechanism, sample downloads, documentation, and some interesting statistics and information. The project is one of the first attempts within the local research community to assemble an open annotated Kazakh corpus. KLC is designed as a large-scale corpus containing over 135 million words and covering five major stylistic genres (domains): literary, publicistic, official, scientific, and informal. Along with its primary part, KLC comprises (i) a deeply annotated sub-corpus of segmented documents encoded in XML, with markup for the morphological, syntactic, and structural characteristics of the texts; and (ii) a read-speech corpus, a sub-corpus with annotated speech data.

Kazakh is an agglutinative, highly inflected language that belongs to the Turkic group. It is the official state language of Kazakhstan and the mother tongue of more than 10 million people around the world. However, until the early 1990s, for historical reasons dating from the Soviet era, Russian was the predominant language of spoken and written communication in Kazakhstan. As a result, Kazakh is underrepresented in fields such as science, entertainment, and official documentation. For this reason, categories that are usually treated as separate genres/domains were grouped into five major categories; for example, poems, novels, and stories were all grouped under the literary category. Texts were thus included as they became available, without the intention of filling a predefined set of categories.

A portion of the data has been manually annotated with morpho-syntactic and structural markup encoded in XML, following the general notions outlined in the Corpus Encoding Standard (CES). The syntactic tagset comprises a set of syntactic categories that are well defined in classical Kazakh grammar. The part-of-speech (POS) tagset is based on a positional system in which each tag is formed by concatenating the POS of a word form with a chain of encoded linguistic properties, such as number, case, and voice. The annotation was carried out by Kazakh philology students specializing in morphology and syntax. A web-based annotation tool was designed to make the annotation process accurate and comfortable.
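As an illustration of how a positional tag can be built by concatenation, here is a minimal Python sketch. The one-character codes, the slot order, the "-" placeholder, and the function names make_tag and parse_tag are all hypothetical and chosen only for this example; they are not the actual KLC tagset or tooling.

```python
# Hypothetical slot layout: NOT the real KLC tagset.
# Each slot holds one character; the first slot is the POS,
# the remaining slots encode linguistic properties in a fixed order.
SLOTS = [
    ("pos", {"noun": "N", "verb": "V"}),
    ("number", {"singular": "s", "plural": "p"}),
    ("case", {"nominative": "n", "genitive": "g", "dative": "d"}),
]
UNSPECIFIED = "-"  # assumed placeholder for a property that is not set


def make_tag(**features):
    """Concatenate the code of each slot, in order, into one tag string."""
    chars = []
    for name, codes in SLOTS:
        value = features.get(name)
        chars.append(UNSPECIFIED if value is None else codes[value])
    return "".join(chars)


def parse_tag(tag):
    """Decode a tag back into a feature dictionary, slot by slot."""
    if len(tag) != len(SLOTS):
        raise ValueError(f"expected {len(SLOTS)} characters, got {len(tag)}")
    features = {}
    for char, (name, codes) in zip(tag, SLOTS):
        if char == UNSPECIFIED:
            features[name] = None
        else:
            inverse = {code: value for value, code in codes.items()}
            features[name] = inverse[char]
    return features


print(make_tag(pos="noun", number="plural", case="dative"))
print(parse_tag("V--"))
```

Because every property occupies a fixed position in this sketch, a tag can be decoded without a separator, and a property that does not apply simply keeps the placeholder character.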

Finally, KLC contains an annotated read-speech corpus (RSC), which includes audio recordings of words, phrases, sentences (from all genres), news articles, and excerpts from books, all carefully chosen from the primary part of the corpus. All text materials were read by volunteers who represented different ages, genders, regions, and educational backgrounds in a balanced way. Each audio file is accompanied by a label file and a corresponding text transcript. Moreover, some of the transcripts have been grammatically annotated, so a portion of the data has multiple layers of annotation: audio (per-word segmentation), lexical, and morpho-syntactic. In total, the RSC contains 10 GB, or more than 40 hours, of speech.

To cite KLC please use the following:

Olzhas Makhambetov, Aibek Makazhanov, Zhandos Yessenbayev, Bakhyt Matkarimov, Islam Sabyrgaliyev, and Anuar Sharafudinov. 2013. Assembling the Kazakh Language Corpus. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1022–1031, Seattle, Washington, USA, October. Association for Computational Linguistics.

The KLC project is managed by the Natural Language and Information Processing group of the Computer Science Lab of the Nazarbayev University Research and Innovation System. Learn more here.