Projects / Programmes source: ARIS

New grammar of modern standard Slovene: resources and methods

Research activity

Code Science Field Subfield
6.05.00  Humanities  Linguistics   

Code Science Field
H352  Humanities  Grammar, semantics, semiotics, syntax 

Code Science Field
6.02  Humanities  Languages and Literature 
Keywords: grammar, corpus linguistics, computational linguistics, valency, multi-word expressions, collocations
Evaluation (rules)
source: COBISS
Researchers (13)
no. Code Name and surname Research area Role Period No. of publications
1.  27674  PhD Špela Arhar Holdt  Linguistics  Researcher  2017 - 2020  227 
2.  22278  PhD Janez Brank  Computer science and informatics  Researcher  2017 - 2020  94 
3.  36914  PhD Jaka Čibej  Linguistics  Researcher  2018 - 2020  151 
4.  36491  PhD Kaja Dobrovoljc  Linguistics  Researcher  2018 - 2020  142 
5.  16313  PhD Apolonija Gantar  Linguistics  Researcher  2017 - 2020  216 
6.  52176  Teja Goli    Technical associate  2019 - 2020  13 
7.  14681  PhD Vojko Gorjanc  Linguistics  Researcher  2017 - 2020  477 
8.  32887  MSc Bojan Klemenc  Computer science and informatics  Technical associate  2017 - 2020  54 
9.  33796  PhD Iztok Kosem  Linguistics  Researcher  2017 - 2020  296 
10.  26166  PhD Simon Krek  Linguistics  Head  2017 - 2020  358 
11.  37653  PhD Cyprian Adam Laskowski  Linguistics  Researcher  2017 - 2020  35 
12.  15295  PhD Marko Robnik Šikonja  Computer science and informatics  Researcher  2017 - 2020  417 
13.  37487  Katja Zupan  Linguistics  Junior researcher  2017 - 2020  22 
Organisations (3)
no. Code Research organisation City Registration number No. of publications
1.  0106  Jožef Stefan Institute  Ljubljana  5051606000  89,990 
2.  0581  University of Ljubljana, Faculty of Arts  Ljubljana  1627058  97,066 
3.  1539  University of Ljubljana, Faculty of Computer and Information Science  Ljubljana  1627023  16,002 
The project aims to explore the linguistic methodological foundations of a complex analysis of written and spoken Slovene, as found in the new corpora developed in recent projects. The resulting methodology and data will provide a solid starting point for future work on an empirically based description of Slovene. Following from the methodology, we intend to compile and publish extensive collections of material extracted from corpora, which will be useful for the development of language technology applications for Slovene. The extracted data will be used for a linguistic analysis of real language, which represents the first step towards the compilation of a new descriptive corpus grammar of Slovene. The project proposal is based on the fact that in the last three decades language description has witnessed a noticeable paradigm shift from researching language as a system, on the level of phonology and (morpho)syntax, to a more empirically oriented language description which aims to describe the workings of language in real life and is linked to fields such as psychology, neurobiology and artificial intelligence. To enable research within the new paradigm, reliable empirical data about different language phenomena are needed. These are provided by modern computational or corpus linguistics, which uses automatic methods to analyse extensive collections of written and spoken language data. The project work plan is divided into several work packages whose titles reveal the types of proposed corpus analyses: Morphology and word formation, Collocations, Multi-word expressions, Valency, and Formulaic sequences. Written language will be analysed primarily using the Kres corpus, along with comparative data from Gigafida. Kres is a reference corpus of 100 million words, sampled and balanced from Gigafida. Spoken language will be analysed using the Gos corpus, which contains 1 million words of transcribed Slovene speech, and the SST corpus with manual syntactic dependency annotations.
All extracted data collections, programs and algorithms will be published under open access or open source licenses, and structured with the intent to be useful for language technology applications.
Significance for science
The project provides the basis for the compilation of a new Slovene grammar and other important works (collocations, valency, word-formation dictionary), and it provides relevant data for research on different topics in the modern standard Slovene language, from morphology to stylistics. A communication-oriented description of Slovene will bridge the widening gap between the discursive reality of the language and its current description. This means that the results will be important not only for linguistics but also for Slovene language teaching at all levels. The results of the project will include numerous extensive databases in various formats (XML, tables, lists etc.) under the Creative Commons Attribution licence. The resulting development of language tools will enable easier analysis and exploitation of information contained in unstructured Slovene texts. These tools can be used in a number of new products (e.g. recommendation systems, virtual assistants, speech recognition and synthesis) and in many different domains (e.g. medicine, security, transport), which is strategically important for the productivity and competitiveness of Slovene companies and the public sector. The lack of a grammatical description of modern Slovene is also emphasized in the new Resolution on the national programme for language policy 2014-2018: "in the period covered by the resolution it is necessary to start planning the compilation of a scientific grammar of Slovene which will describe contemporary grammatical composition of standard Slovene as a cohesive language variant of all Slovenes." One of the important goals of the resolution is also "promotion of the development of language technologies for Slovene which includes the development of necessary infrastructure and freely available resources and tools for Slovene."
Significance for the country
Language technologies are one of the important enabling technologies in today's information society; they can be found in all applications that require interaction between humans and machines or the acquisition of knowledge from large data resources in Slovene. The research conducted in the proposed project will make an important contribution to the integration of Slovene into products that use these services, e.g. those described in the Smart specialization strategy (smart cities). The impact of the research results will be directly and indirectly visible mainly in the field of language infrastructure for Slovene. It is anticipated that the project will contribute to the successful participation of Slovene in state-of-the-art technological trends, which demand automatic language processing for different applications, from virtual assistants (e.g. Siri, Cortana, Alexa) and machine translation systems to artificial intelligence. In these applications, Slovene will need to be on the same level as languages with considerably higher numbers of speakers; this cannot be achieved without research focused on the specific characteristics of Slovene in terms of language technology needs. Since the members of the research group are already involved in international research, especially in lexicography and machine learning, we expect that the results will achieve international impact and recognition and be relevant for other languages.