New grammar of modern standard Slovene: resources and methods

grammar, corpus linguistics, computational linguistics, valency, multi word expressions, collocations
The project aims to explore the linguistic methodological foundations of a complex analysis of written and spoken Slovene, as found in the new corpora developed in recent projects. The resulting methodology and data will provide a solid starting point for future work on an empirically based description of Slovene. Following from the methodology, we intend to compile and publish extensive collections of material extracted from corpora, which will be useful for the development of language technology applications for Slovene. The extracted data will be used for a linguistic analysis of real language, which represents the first step towards the compilation of a new descriptive corpus grammar of Slovene. The project proposal is based on the fact that in the last three decades language description has witnessed a noticeable paradigm shift from researching language as a system, on the level of phonology and (morpho)syntax, to a more empirically oriented language description which aims to describe the workings of language in real life and is linked to fields such as psychology, neurobiology and artificial intelligence. To enable research within the new paradigm, reliable empirical data about different language phenomena are needed. These data are provided by modern computational or corpus linguistics, which uses automatic methods to analyse extensive collections of written and spoken language data. The project work plan is divided into several work packages whose titles reveal the types of proposed corpus analyses: Morphology and word formation, Collocations, Multi word expressions, Valency and Formulaic sequences. Written language will be analysed primarily using the Kres corpus, along with comparative data from Gigafida. Kres is the reference corpus of 100 million words, sampled and balanced from Gigafida. Spoken language will be analysed using the Gos corpus, which contains 1 million words of transcribed Slovene speech, and the SST corpus with manual syntactic dependency annotations.
All extracted data collections, programs and algorithms will be published under open access or open source licenses, and structured with the intent to be useful for language technology applications.
Significance for science
The project provides the basis for the compilation of a new Slovene grammar and other important works (collocations, valency and word-formation dictionaries), and it provides relevant data for research on different topics in modern standard Slovene, from morphology to stylistics. A communication-oriented description of Slovene will bridge the widening gap between the discursive reality of the language and its current description. This means that the results will be important not only for linguistics but also for Slovene language teaching at all levels. The results of the project will include numerous extensive databases in various formats (XML, tables, lists etc.) under the Creative Commons–Attribution licence. The resulting development of language tools will enable easier analysis and exploitation of the information contained in unstructured Slovene texts. These tools can be used in a number of new products (e.g. recommendation systems, virtual assistants, speech recognition and synthesis) and in many different domains (e.g. medicine, security, transport), which is strategically important for the productivity and competitiveness of Slovene companies and the public sector. The lack of a grammatical description of modern Slovene is also emphasized in the new Resolution on the national programme for language policy 2014-2018: "in the period covered by the resolution it is necessary to start planning the compilation of a scientific grammar of Slovene which will describe contemporary grammatical composition of standard Slovene as a cohesive language variant of all Slovenes." One of the important goals of the resolution is also "promotion of the development of language technologies for Slovene which includes the development of necessary infrastructure and freely available resources and tools for Slovene."
Significance for the country
Language technologies are one of the important enabling technologies in today’s information society; they can be found in all applications that require interaction between humans and machines or the acquisition of knowledge from large data resources in Slovene. The research conducted in the proposed project will make an important contribution to the integration of Slovene into products that use these services, e.g. those described in the Smart specialization strategy (smart cities). The impact of the research results will be directly and indirectly visible mainly in the field of language infrastructure for Slovene. It is anticipated that the project will contribute to the successful participation of Slovene in state-of-the-art technological trends, which demand automatic language processing for different applications, from virtual assistants (e.g. Siri, Cortana, Alexa) and machine translation systems to artificial intelligence. In these applications, Slovene will need to be on the same level as languages with considerably higher numbers of speakers; this cannot be achieved without research focused on the specific characteristics of Slovene in terms of language technology needs. Considering that the members of the research group are already involved in international research, especially in lexicography and machine learning, we expect that the results will achieve international impact and recognition and be relevant for other languages.
