To make the corpus more useful as a research tool, a portion of the data was annotated with syntactic and POS tags, lemmata, and morpheme types and boundaries. Since the annotation is still in progress at the time of writing, Table 1 reports the absolute counts and the percentages (relative to the current size of the corpus) of the annotated data in terms of documents, words, unique words, and lemmata.

Table 1
                 total      %
documents        927        0.2
all words        499 992    0.4
unique words     75 189     5.5
lemmata          31 711

The annotation was carried out entirely manually. Manual annotation was preferred over semi-automatic annotation for two reasons: (i) language-independent tools (let alone Kazakh-specific ones) that support a fine-grained level of annotation proved hard to find; (ii) the annotation tool was equipped with a fairly advanced recommendation system, which gave the annotators a semi-automatic-like experience. The annotation was performed mainly by undergraduate students majoring in Kazakh philology. As a quality control measure, two validators were assigned to check a random sample of about 10% of the annotated data.
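The quality control step can be sketched as below. This is only an illustration: it assumes that sampling is done at the document level, and the function name, the document identifiers and the fixed seed are hypothetical; only the 10% fraction and the count of 927 annotated documents come from the text.

```python
import random

def sample_for_validation(doc_ids, fraction=0.10, seed=None):
    """Return a random subset of roughly `fraction` of the annotated documents."""
    doc_ids = list(doc_ids)
    if not doc_ids:
        return []
    rng = random.Random(seed)  # a seed makes the draw reproducible
    k = max(1, round(len(doc_ids) * fraction))
    return rng.sample(doc_ids, k)  # sampling without replacement

# 927 annotated documents (Table 1); the identifiers themselves are made up.
annotated = [f"doc_{i:04d}" for i in range(927)]
to_validate = sample_for_validation(annotated, seed=0)
print(len(to_validate))  # 93
```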

The syntactic tagset. The syntactic tagset comprises a compact set of syntactic categories that are well defined in classical grammar. Table 2 describes the tagset along with the equivalent tags of the widely used Penn Treebank tagset. In addition, proverbials, which are rather common elements of the Kazakh language, are also labeled. They are not treated as a separate syntactic category, because they typically serve as a single syntactic unit (e.g. predicate, adverbial, clause, etc.). Instead, each syntactic tag has a corresponding binary property that marks the proverbial case.

The POS tagset. Kazakh is an agglutinative Turkic language in which word forms are generated by attaching inflectional affixes. Different affix types mark different linguistic properties. For instance, consider a simple Kazakh sentence and its translation:
Mektepke bardym. - I went to school.
In this example the pronoun "I" and the preposition "to" are "hidden" in the case and person affixes, i.e.:
Mektep(NN = a school) + ke (dative case = to school)
bar(VB, imperative = go) + dy (past tense = he/she went) + m (1st person = I went)

As the example shows, affix chains carry important information that is not always present in the context, so a tagset should be designed to capture this information to the extent possible. For this reason, a positional tagset was designed, in which the final tags are constructed by concatenating the basic tag (often the POS of a word form) and the encoded chain of linguistic properties (LPs). Table 3 lists the main LPs defined in Kazakh grammar along with their codes and cardinalities, i.e. the number of values each LP can take.
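A minimal sketch of this concatenation is given below. The function name `build_tag` is hypothetical. The underscore separator and the single-letter codes followed by a numeric value are assumptions taken from the tagged example given later in this section (ZEP_A0N0S0P3C3 and ET_G0T3M1V0P1), not from a formal specification of the tagset.

```python
def build_tag(basic_tag, lps):
    """Concatenate a basic tag with an ordered chain of (code, value) pairs.

    `lps` is a list such as [("A", 0), ("N", 0)]; an empty list yields
    the bare basic tag.
    """
    chain = "".join(f"{code}{value}" for code, value in lps)
    return f"{basic_tag}_{chain}" if chain else basic_tag

# Mektepke: non-personal noun, inanimate, singular, no possessor, 3rd person, dative
noun_tag = build_tag("ZEP", [("A", 0), ("N", 0), ("S", 0), ("P", 3), ("C", 3)])
# bardym: regular verb, not negated, past tense, indicative, active, 1st person
verb_tag = build_tag("ET", [("G", 0), ("T", 3), ("M", 1), ("V", 0), ("P", 1)])

print(noun_tag)  # ZEP_A0N0S0P3C3
print(verb_tag)  # ET_G0T3M1V0P1
```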

Table 2: The syntactic tagset description
Tag    Description                  PTB equivalents
S      Simple declarative clause    S
BSS    Independent clause           S
BGS    Dependent clause             SBAR(Q)
X      Void, unknown, uncertain     X

Table 3: POS tagset design
#    Linguistic property    Code    Cardinality

Figure 1 describes the designed tagset (not including punctuation) both qualitatively and quantitatively. It lists the tags grouped by the ten major POS (in bold). For each tag it gives the accepted set of LPs and the generative capacity, i.e. the upper bound on the number of tags that can be generated from a given basic tag and the different combinations of its LPs. The list of 36 basic tags was compiled following the best practices of the Penn tagset design, bearing in mind the specifics of Kazakh grammar.

The maximum size of the tagset equals the total generative capacity, i.e. 3844 tags. However, depending on the level of granularity required by the application, some or even all LPs may be dropped or added back in, which provides additional flexibility.
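One way to read the generative capacity is as the product of the cardinalities of the LPs accepted by a basic tag, with the maximum tagset size being the sum over all basic tags. The sketch below illustrates this reading only. The cardinalities, the placeholder tag names and the accepted LP sets are all invented for the illustration, so the resulting total does not reproduce the 3844 reported above.

```python
from math import prod

# HYPOTHETICAL cardinalities per LP code (not the values of Table 3).
CARDINALITY = {"A": 2, "N": 2, "S": 3, "P": 4, "C": 7}

# HYPOTHETICAL basic tags with the LP codes each one accepts.
ACCEPTED_LPS = {
    "TAG_X": ["A", "N", "C"],
    "TAG_Y": ["N", "P"],
    "TAG_Z": [],  # a tag that accepts no LPs generates exactly one tag
}

def generative_capacity(lp_codes):
    """Upper bound on the tags derivable from one basic tag: product of LP cardinalities."""
    return prod(CARDINALITY[code] for code in lp_codes)

capacities = {tag: generative_capacity(lps) for tag, lps in ACCEPTED_LPS.items()}
total = sum(capacities.values())

print(capacities)  # {'TAG_X': 28, 'TAG_Y': 8, 'TAG_Z': 1}
print(total)       # 37
```

Dropping an LP from a tag's accepted set removes its factor from the product, which is how a coarser granularity shrinks the tagset under this reading.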

Figure 1:

Given the designed tagset, the aforementioned Kazakh sentence can be tagged as follows:
Mektepke/ZEP_A0N0S0P3C3 (ZEP - non-personal noun; A0 - inanimate; N0 - singular; S0 - no possessor; P3 - 3rd person; C3 - dative case)
bardym/ET_G0T3M1V0P1 (ET - regular verb; G0 - not negated; T3 - past tense; M1 - indicative mood; V0 - active voice; P1 - 1st person)
./.
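Tags of this shape can be split back into the basic tag and its LP chain, and LPs can then be dropped to obtain a coarser tag. The sketch below assumes the format seen in the example above (an underscore after the basic tag, then single uppercase letter codes each followed by a number); the function names `parse_tag` and `reduce_tag` are hypothetical and are not part of the annotation tool.

```python
import re

# One LP: a single uppercase code letter followed by its numeric value.
LP_PATTERN = re.compile(r"([A-Z])(\d+)")

def parse_tag(tag):
    """Split a positional tag into (basic_tag, [(code, value), ...])."""
    basic, _, chain = tag.partition("_")
    lps = [(code, int(value)) for code, value in LP_PATTERN.findall(chain)]
    return basic, lps

def reduce_tag(tag, keep):
    """Rebuild the tag keeping only the LP codes listed in `keep`."""
    basic, lps = parse_tag(tag)
    chain = "".join(f"{code}{value}" for code, value in lps if code in keep)
    return f"{basic}_{chain}" if chain else basic

print(parse_tag("ZEP_A0N0S0P3C3"))
# ('ZEP', [('A', 0), ('N', 0), ('S', 0), ('P', 3), ('C', 3)])
print(reduce_tag("ZEP_A0N0S0P3C3", keep={"N", "C"}))  # ZEP_N0C3
print(reduce_tag("ET_G0T3M1V0P1", keep=set()))        # ET
```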

To ease the annotation process, a special tool was developed. It is designed as a web application with a logging and document management system that allows for (auto)saving the current work and for reviewing and revising already annotated documents.

Here are the links for trying out the tool:

Here are the links to the annotated and validated documents: