Projects / Programmes source: ARIS

Linguistic annotation of Slovene language: methods and resources

Research activity

Code  Science  Field  Subfield
2.07.07  Engineering sciences and technologies  Computer science and informatics  Intelligent systems - software 

Code  Science  Field
P176  Natural sciences and mathematics  Artificial intelligence 
Keywords
language technologies, Slovene language, language resources
Evaluation (rules)
source: COBISS
Researchers (5)
no.  Code  Name and surname  Research area  Role  Period  No. of publications
1.  05023  PhD Tomaž Erjavec  Linguistics  Head  2007 - 2009  636 
2.  17137  Marko Grobelnik  Computer science and informatics  Technical associate  2007 - 2009  439 
3.  18947  PhD Nataša Hirci  Linguistics  Researcher  2009  147 
4.  26166  PhD Simon Krek  Linguistics  Researcher  2007 - 2009  373 
5.  12570  PhD Dunja Mladenić  Computer science and informatics  Researcher  2007 - 2009  662 
Organisations (2)
no.  Code  Research organisation  City  Registration number  No. of publications
1.  0106  Jožef Stefan Institute  Ljubljana  5051606000  90,682 
2.  0581  University of Ljubljana, Faculty of Arts  Ljubljana  1627058  97,913 
Abstract
The project will develop automatic inductive methods and tools for morphosyntactic, syntactic and semantic annotation, which will be used for building manually corrected and publicly accessible Slovene language resources, namely annotated corpora and lexicons. These results will provide the urgently needed infrastructure for further development of language technologies for Slovene. As these resources will be accessible not only to the project members but to any research team in Slovenia and abroad, they are expected to act as a catalyst for R&D in the field of language technologies for the Slovene language, an area that is of vital importance for effective use of Slovene in the Information Society.
The project comprises four work packages. The first, horizontal work package addresses technical and legal aspects of resource accessibility, i.e. making resources available to developers for use as learning and testing datasets, and to linguists for research on Slovene. The remaining three work packages are concerned with three levels of linguistic analysis. The first is morphosyntactic tagging and the related lemmatization, which is the basic level of annotation indispensable to virtually every language-oriented computer program; the project will improve on existing methods and produce an annotated corpus, manually checked for errors. The second level, comprising automatic syntactic analysis, is of key importance for in-depth text analyses, since it reveals the interdependence of syntactic units. The project will produce a syntactically annotated corpus and a valency lexicon, both hand-corrected, and a syntactic parser for Slovene. The last level deals with the lexical semantics of Slovene, needed, for example, in machine translation and information retrieval. The project will upgrade the existing semantic lexicon (ontology) for Slovene, annotate a corpus using concepts from this lexicon and develop methods for automatic ontology building and disambiguation of polysemous lexemes.
The project will draw on ample experience of the project partners in the development of Slovene language resources and machine learning. The point of departure will be the morphosyntactically annotated reference corpus Fida PLUS, the syntactically annotated prototype corpus SDT and the prototype semantic lexicon sloWNet. Work in the project will be closely tied to simultaneous Slovene and EU projects concerned with the development of machine learning methods for machine translation and ontology building.
Significance for science
The modules and technology developed in this project place Slovene among the languages with at least a basic computerised language infrastructure. This enables further research on Slovene texts, in Slovenia as well as in a wider European context. The project belongs to the scientific discipline of computational linguistics, where it advances the state of the art in the following fields:
• Development of methods for machine learning of language models: to develop the technology behind the project software modules, we used state-of-the-art methods for the analysis of unstructured and partially structured data. These methods were taken primarily from the field of machine learning, which has made significant strides in this direction in the last few years. Due to the specific features of Slovene (esp. compared to English), many of the existing methods were not useful without adaptation. In the course of the project we developed these adaptations, evaluated them and used them in the final software modules. The project developed machine learning methods for the disambiguation of word-level morphosyntactic tags and for lemmatisation. Additional advances were made in combining various learning methods to improve tagger accuracy.
• Research on empirically grounded linguistic analyses of several levels of the Slovene language: linguistics in Slovenia is, to a large extent, still bound to the generative paradigm, which is based on introspection and "artificial" examples. The project offers an alternative in which the examples are taken from actual language use, and so supports the development of contemporary, empirically based linguistics.
• Development in the area of encoding and standardisation of linguistic data: given the growing complexity of analytical annotations added to corpora, the area of annotation vocabularies, encoding and the combination of annotations has been attracting increasing interest. The resources developed in this project combine three levels of linguistic annotation; the project had to ensure that the tools can operate on these annotations and to offer standardised corpora that contain them. It was therefore imperative to take into account international standards and recommendations in these areas. The project also proposed new solutions (esp. ones based on TEI P5 and MULTEXT-East), which represent a scientific advance in this area.
Significance for the country
Just as it was once important to have books written in one's own language, then newspapers, and later electronic media such as radio, television and the Internet, it is today imperative to have computer support for a language. Due to the specifics of languages and cultures, this task can only be accomplished by native speakers. A sufficient computational infrastructure is a prerequisite for a language to join the family of languages that are already developing methods of analysis going beyond the lexical and syntactic levels. One way of viewing the importance of a language on a global scale is through its accessibility and its connectedness with other languages. Without widely available results, such as those we have ensured in the project, the Slovene language would have difficulty attaining this connection. The developed resources will also help in preserving cultural heritage, as the foreseen development of language technology ensures that the materials which define the language heritage of Slovene become much closer and more accessible to the general public than they would be otherwise. A key part of the project is to maximise the impact of its results by making all the developed Slovene language resources freely available. The foreseen users of these resources are:
• The developers of language technologies, as they are able to process texts in Slovene at a technological level similar to that available for other, "larger" languages. The use of the developed technologies will enable Slovene academic and commercial partners to participate in projects and global cooperation with their own contributions that support work with the Slovene language.
• Linguists, esp. those studying the Slovene language, who are now able to annotate their own texts and analyse the developed resources via Internet tools, and who also have the option of downloading the complete dataset for research with their own analytic tools.
• Indirectly, all "users" of the Slovene language, as the project results stimulate the development of language technologies for Slovene, and hence the development of directly usable applications, such as information retrieval, machine translation, speech synthesis and analysis, etc.
Most important scientific results: Annual report 2008, final report, complete report on dLib.si
Most important socioeconomically and culturally relevant results: Annual report 2008, final report, complete report on dLib.si