Linguistic annotation of Slovene: methods and resources (original title: Jezikoslovno označevanje slovenskega jezika: metode in viri)

Code

J2-9180 (C) - included in ARIS records

Head

PhD Tomaž Erjavec

Period

1/1/2007 - 12/31/2009

Range in 2009

0.62 FTE

Science

Engineering sciences and technologies (2)
Humanities (3)

Researcher status

Researcher (4)
Junior expert or technical associate (1)

Education

Doctoral degree (4)
Other (1)

Sex

Woman (2)
Man (3)

Status

Employed at RO and RRD (5)

No. of publications

100–999 (5)

Projects / Programmes source: ARIS

Research activity

Code	Science	Field	Subfield
2.07.07	Engineering sciences and technologies	Computer science and informatics	Intelligent systems - software

Code	Science	Field
P176	Natural sciences and mathematics	Artificial intelligence

Keywords

language technologies, Slovene language, language resources

Evaluation (methodology)

Evaluation of bibliographic research performance indicators according to ARIS methodology

Citations: citations for bibliographic records in COBIB.SI that are linked to records in citation databases

Organisations (2), Researchers (6)

0106 Jožef Stefan Institute

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	05023	PhD Tomaž Erjavec	Linguistics	Head	2007 - 2009	710
2.	17137	Marko Grobelnik	Computer science and informatics	Technical associate	2007 - 2009	502
3.	26166	PhD Simon Krek	Linguistics	Researcher	2007 - 2009	433
4.	12570	PhD Dunja Mladenić	Computer science and informatics	Researcher	2007 - 2009	720

0581 University of Ljubljana, Faculty of Arts

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	18947	PhD Nataša Hirci	Linguistics	Researcher	2009	160
2.	26166	PhD Simon Krek	Linguistics	Researcher	2007 - 2009	433

Abstract

The project will develop automatic inductive methods and tools for morphosyntactic, syntactic and semantic annotation. These will be used to build manually corrected, publicly accessible Slovene language resources, namely annotated corpora and lexicons. The results will provide the urgently needed infrastructure for further development of language technologies for Slovene. Because the resources will be accessible not only to project members but to any research team in Slovenia and abroad, they are expected to act as a catalyst for R&D in language technologies for Slovene, an area of vital importance for the effective use of Slovene in the Information Society.
The project comprises four work packages. The first, horizontal work package addresses technical and legal aspects of resource accessibility, i.e. making resources available to developers as learning and testing datasets, and to linguists for research on Slovene. The remaining three work packages correspond to three levels of linguistic analysis.
The first level is morphosyntactic tagging and the related lemmatization, the basic level of annotation indispensable to virtually every language-oriented computer program; the project will improve on existing methods and produce an annotated corpus, manually checked for errors. The second level, automatic syntactic analysis, is of key importance for in-depth text analysis, since it reveals the interdependence of syntactic units. The project will produce a syntactically annotated corpus and a valency lexicon, both hand-corrected, and a syntactic parser for Slovene. The last level deals with the lexical semantics of Slovene, needed e.g. in machine translation and information retrieval. The project will upgrade the existing semantic lexicon (ontology) for Slovene, annotate a corpus using concepts from this lexicon, and develop methods for automatic ontology building and disambiguation of polysemous lexemes.
The project will draw on the project partners' ample experience in the development of Slovene language resources and in machine learning. The point of departure will be the morphosyntactically annotated reference corpus FidaPLUS, the syntactically annotated prototype corpus SDT and the prototype semantic lexicon sloWNet. Work in the project will be closely tied to concurrent Slovene and EU projects concerned with the development of machine learning methods for machine translation and ontology building.

Significance for science

The modules and technology developed in this project position the Slovene language in the family of languages with at least a basic computerised language infrastructure. This enables further research on Slovene texts, in Slovenia as well as in a wider European context. The project belongs to the scientific discipline of computational linguistics, where it advances the state of the art in the following fields:

Development of methods for machine learning of language models: to build the technology behind the project's software modules, we used state-of-the-art methods for the analysis of unstructured and partially structured data. These methods come primarily from machine learning, which has made significant strides in this direction in recent years. Owing to the specifics of Slovene (especially compared to English), many existing methods were not usable without adaptation. In the course of the project we developed these adaptations, evaluated them and used them in the final software modules. The project developed machine learning methods for disambiguating word-level morphosyntactic tags and for lemmatisation. Further advances were made in combining various learning methods to achieve better tagger accuracy.
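The record does not say how the learning methods were combined. As a purely illustrative sketch, one common scheme is per-token majority voting over the outputs of several taggers. The function name `combine_tags`, the tie-breaking rule and the sample tags below are assumptions made for this example, not the project's actual method or data.

```python
from collections import Counter


def combine_tags(predictions):
    """Combine per-token tag sequences from several taggers by majority vote.

    `predictions` is a list of tag sequences, one per tagger, all covering
    the same tokens. Ties go to the tagger listed first (an assumption of
    this sketch, e.g. the tagger believed to be most accurate).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all taggers must label the same number of tokens")

    combined = []
    for token_tags in zip(*predictions):
        counts = Counter(token_tags)
        best = max(counts.values())
        # Pick the first tag, in tagger order, that reaches the top count.
        combined.append(next(t for t in token_tags if counts[t] == best))
    return combined


# Hypothetical outputs of three taggers for a two-token sentence.
tagger_a = ["Ncfsn", "TAG_X"]
tagger_b = ["Ncfsa", "TAG_Y"]
tagger_c = ["Ncfsn", "TAG_Y"]
voted = combine_tags([tagger_a, tagger_b, tagger_c])
```

Voting only helps when the taggers make different errors, which is why combinations usually mix models trained with different learning methods.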

Research on empirically grounded linguistic analyses of several levels of the Slovene language: linguistics in Slovenia is, to a large extent, still bound to the generative paradigm, which is based on introspection and "artificial" examples used in the analyses. The project offers an alternative in which examples are taken from actual language use, and so supports the development of contemporary, empirically based linguistics.

Development in the area of encoding and standardisation of linguistic data: given the growing complexity of analytical annotations added to corpora, the area of annotated vocabularies, encoding and the combination of annotations has been attracting increasing interest. The resources developed in this project combine three levels of linguistic annotation, so the project had to ensure that the tools can operate on these annotations and to offer standardised corpora that contain them. It was therefore imperative to take into account international standards and recommendations in these areas. The project also proposed new solutions (especially ones based on TEI P5 and MULTEXT-East), which represent a scientific advance in this area.
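To make the idea of a standardised morphosyntactic annotation concrete, the sketch below decodes a positional tag of the kind used in MULTEXT-East-style tagsets, where the first letter gives the word class and each following position encodes one attribute. The attribute tables are a simplified assumption covering only a few noun attributes; they are not the project's specification, and `decode_noun_msd` is a hypothetical helper name.

```python
# Illustrative subset only: noun attributes, by position after the category letter.
NOUN_POSITIONS = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "d": "dual", "p": "plural"}),
    ("case", {
        "n": "nominative", "g": "genitive", "d": "dative",
        "a": "accusative", "l": "locative", "i": "instrumental",
    }),
]


def decode_noun_msd(msd):
    """Expand a positional noun tag such as 'Ncfsn' into named features."""
    if not msd or msd[0] != "N":
        raise ValueError("this sketch only handles noun tags starting with 'N'")
    features = {"category": "noun"}
    for (name, values), code in zip(NOUN_POSITIONS, msd[1:]):
        if code == "-":
            # A hyphen marks an attribute that does not apply or is unspecified.
            continue
        features[name] = values.get(code, "unknown")
    return features


decoded = decode_noun_msd("Ncfsn")
```

A compact positional string keeps corpus files small, while a decoder like this gives tools and linguists named features to query.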

Significance for the country

Just as it used to be important to have books written in one's own language, then newspapers, and later electronic media such as radio, television and the Internet, it is today imperative to have computer support for a language. Due to the specifics of languages and cultures, this task can only be accomplished by native speakers. The development of sufficient computational infrastructure is a prerequisite for a language to belong to the family of languages which are already developing methods of analysis that go beyond the lexical and syntactic levels. One possible measure of the importance of a language on a global scale is its accessibility and connectedness with other languages. Without widely available results such as those ensured by this project, the Slovene language would have difficulty attaining this connectedness.
The developed resources will also help in preserving cultural heritage, as the foreseen language technology development ensures that the materials which define the language heritage of Slovene become much closer and more accessible to the general public than they would be otherwise.

A key part of the project is to maximise the impact of its results, by making all the developed Slovene language resources freely available. The foreseen users of these resources are:
• The developers of language technologies, as they are able to process texts in Slovene at a technological level similar to that available for other, "larger" languages. The use of the developed technologies will enable Slovene academic and commercial partners to participate in projects and global cooperation with their own contributions that support work with the Slovene language.
• Linguists, especially those studying the Slovene language, who are now able to annotate their own texts and analyse the developed resources via Internet tools, and who have the option of downloading the complete dataset for research with their own analytic tools.
• Indirectly, all "users" of the Slovene language, as the project results stimulate the development of language technologies for Slovene, and hence the development of directly usable applications, such as information retrieval, machine translation, speech synthesis and analysis, etc.

Most important scientific results

Annual report 2008, final report

Most important socioeconomically and culturally relevant results

Annual report 2008, final report
