Linguistic annotation of Slovene language: methods and resources
Code | Science | Field | Subfield
2.07.07 | Engineering sciences and technologies | Computer science and informatics | Intelligent systems - software
Code | Science | Field
P176 | Natural sciences and mathematics | Artificial intelligence
language technologies, Slovene language, language resources
Researchers (5)
no. | Code | Name and surname | Research area | Role | Period | No. of publications
1. | 05023 | PhD Tomaž Erjavec | Linguistics | Head | 2007 - 2009 | 636
2. | 17137 | Marko Grobelnik | Computer science and informatics | Technical associate | 2007 - 2009 | 439
3. | 18947 | PhD Nataša Hirci | Linguistics | Researcher | 2009 | 147
4. | 26166 | PhD Simon Krek | Linguistics | Researcher | 2007 - 2009 | 373
5. | 12570 | PhD Dunja Mladenić | Computer science and informatics | Researcher | 2007 - 2009 | 662
Organisations (2)
Abstract
The project will develop automatic inductive methods and tools for morphosyntactic, syntactic and semantic annotation, which will be used to build manually corrected and publicly accessible Slovene language resources, namely annotated corpora and lexicons. These results will provide the urgently needed infrastructure for further development of language technologies for Slovene. As the resources will be accessible not only to the project members but to any research team in Slovenia and abroad, they are expected to act as a catalyst for R&D in language technologies for Slovene, an area of vital importance for the effective use of Slovene in the Information Society.
The project comprises four work packages. The first, horizontal work package addresses technical and legal aspects of resource accessibility, i.e. making resources available to developers as learning and testing datasets, and to linguists for research on Slovene. The remaining three work packages correspond to three levels of linguistic analysis. The first is morphosyntactic tagging and the related lemmatization, the basic level of annotation indispensable to virtually every language-oriented computer program; the project will improve on existing methods and produce an annotated corpus, manually checked for errors. The second level, automatic syntactic analysis, is of key importance for in-depth text analysis, since it reveals the interdependence of syntactic units. The project will produce a syntactically annotated corpus and a valency lexicon, both hand-corrected, and a syntactic parser for Slovene. The last level deals with the lexical semantics of Slovene, needed e.g. in machine translation and information retrieval. The project will upgrade the existing semantic lexicon (ontology) for Slovene, annotate a corpus using concepts from this lexicon, and develop methods for automatic ontology building and for the disambiguation of polysemous lexemes.
The project will draw on ample experience of the project partners in the development of Slovene language resources and machine learning. The point of departure will be the morphosyntactically annotated reference corpus Fida PLUS, the syntactically annotated prototype corpus SDT and the prototype semantic lexicon sloWNet. Work in the project will be closely tied to simultaneous Slovene and EU projects concerned with the development of machine learning methods for machine translation and ontology building.
Significance for science
The modules and technology developed in this project position the Slovene language in the family of languages with at least a basic computerised language infrastructure. This enables further research on Slovene texts, in Slovenia as well as in a wider European context. The project belongs to the scientific discipline of computational linguistics, where it advances the state of the art in the following fields:
Development of methods for machine learning of language models: to develop the technology behind the project's software modules, we used state-of-the-art methods for the analysis of unstructured and partially structured data. These methods come primarily from the field of machine learning, which has made significant strides in this direction in recent years. Because of the specific features of Slovene (esp. compared to English), many existing methods were not usable without adaptation. In the course of the project we developed these adaptations, evaluated them and used them in the final software modules. The project has developed machine learning methods for the disambiguation of word-level morphosyntactic tags and for lemmatisation. Further advances were made in combining various learning methods to improve the accuracy of taggers.
Research on empirically grounded linguistic analyses of several levels of the Slovene language: linguistics in Slovenia is, to a large extent, still bound to the generative paradigm, which is based on introspection and "artificial" examples used in the analyses. The project offers an alternative in which the examples are taken from actual language use, and so supports the development of contemporary, empirically based linguistics.
Development in the area of encoding and standardisation of linguistic data: given the growing complexity of analytical annotations added to corpora, annotation vocabularies, encoding and the combination of annotations have been attracting increasing interest. The resources developed in this project combine three levels of linguistic annotation; the project had to ensure that the tools can operate on these annotations and to offer standardised corpora that contain them. It was therefore imperative to take into account international standards and recommendations in these areas. The project also proposed new solutions (esp. based on TEI P5 and MULTEXT-East), which represent a scientific advance in this area.
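To make the idea of a standardised, multi-level token annotation more concrete, the following is a minimal sketch of how a word form, its lemma and its morphosyntactic description (MSD) could be serialised in a TEI-like XML form. It is an illustration under stated assumptions, not the project's actual encoding: the function name encode_sentence is hypothetical, the choice of the s and w elements with lemma and ana attributes is an assumption about a TEI-style layout, and the MSD strings in the usage line are illustrative values that have not been checked against the MULTEXT-East specification.

```python
import xml.etree.ElementTree as ET


def encode_sentence(tokens):
    """Serialise (form, lemma, msd) triples as a TEI-like <s> element.

    The element and attribute names are assumptions made for this
    sketch, not the schema used by the project.
    """
    sentence = ET.Element("s")
    for form, lemma, msd in tokens:
        # One <w> per token; the MSD is referenced as a pointer ("#...").
        word = ET.SubElement(sentence, "w", {"lemma": lemma, "ana": "#" + msd})
        word.text = form
    return ET.tostring(sentence, encoding="unicode")


# Illustrative input only: forms, lemmas and MSD strings are examples.
xml_text = encode_sentence([("Jezik", "jezik", "Ncmsn"), ("živi", "živeti", "Vmpr3s")])
print(xml_text)
```

Keeping the lemma and the MSD as attributes of the token element means that further annotation levels (for instance syntactic or semantic links) could be layered on without changing the token text itself.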
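The combination of learning methods for taggers mentioned among the fields above can be illustrated with a simple scheme. The text does not say which combination method the project used, so the sketch below swaps in plain majority voting over the outputs of several taggers, with ties resolved in favour of the tagger listed first. The function name combine_tags and the tag strings are hypothetical and serve only as an example.

```python
from collections import Counter


def combine_tags(predictions):
    """Combine per-token tags from several taggers by majority vote.

    predictions: one tag sequence per tagger, all of the same length.
    Ties go to the earliest-listed tagger whose tag has the top count.
    This is an illustrative scheme, not the project's method.
    """
    if len({len(seq) for seq in predictions}) > 1:
        raise ValueError("all taggers must tag the same number of tokens")
    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        top = max(counts.values())
        # token_tags is in tagger order, so the first match breaks ties.
        combined.append(next(tag for tag in token_tags if counts[tag] == top))
    return combined


# Hypothetical outputs of three taggers for a three-token sentence.
tagger_outputs = [
    ["Ncfsn", "Vmpr3s", "Ncfsa"],
    ["Ncfsn", "Vmpr3s", "Ncfsn"],
    ["Ncfsg", "Vmpr3s", "Ncfsa"],
]
print(combine_tags(tagger_outputs))
```

Voting only helps when the taggers make different kinds of errors; weighting each tagger by its accuracy on held-out data would be a natural refinement.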
Significance for the country
Just as it used to be important to have books written in one's own language, then newspapers, and later electronic media such as radio, television and the Internet, it is today imperative to have computer support for a language. Due to the specifics of languages and cultures, this task can only be accomplished by native speakers. The development of sufficient computational infrastructure is a prerequisite for a language to belong to the family of languages that are already developing methods of analysis going beyond the lexical and syntactic levels. One possible measure of the importance of a language on a global scale is its accessibility and connectedness with other languages. Without widely available results, such as those we have ensured in the project, the Slovene language will have difficulties in attaining this connection.
The developed resources will also help in preserving cultural heritage, as the foreseen language technology development ensures that the materials which define the language heritage of Slovene become much closer and more accessible to the general public than they would be otherwise.
A key part of the project is to maximise the impact of its results, by making all the developed Slovene language resources freely available. The foreseen users of these resources are:
• The developers of language technologies, as they are able to process texts in Slovene at a technological level similar to that available for other, "larger" languages. The use of the developed technologies will enable Slovene academic and commercial partners to participate in projects and global cooperation with their own contributions that support work with the Slovene language.
• Linguists, esp. those studying the Slovene language, who are now able to annotate their own texts, analyse the developed resources via Internet tools, and download the complete dataset for research with their own analytic tools.
• Indirectly, all "users" of the Slovene language, as the project results stimulate the development of language technologies for Slovene, and hence the development of directly usable applications, such as information retrieval, machine translation, speech synthesis and analysis, etc.
Most important scientific results
Annual report 2008, final report, complete report on dLib.si
Most important socioeconomically and culturally relevant results
Annual report 2008, final report, complete report on dLib.si