New grammar of modern standard Slovene: resources and methods

Code

J6-8256 (A) - included in ARIS records

Head

PhD Simon Krek

Period

5/1/2017 - 4/30/2020

Range in 2020

0.69 FTE

Science

Engineering sciences and technologies (3)
Humanities (9)
Other (1)

Researcher status

Researcher (12)
Junior expert or technical associate (1)

Education

Doctoral degree (10)
Master's degree (1)
Other (2)

Sex

Woman (5)
Man (8)

Status

Employed at RO and RRD (11)
No data on employment in RO (2)

No. of publications

10–99 (4)
100–999 (9)

Projects / Programmes source: ARIS

Research activity

Code	Science	Field	Subfield
6.05.00	Humanities	Linguistics

Code	Science	Field
H352	Humanities	Grammar, semantics, semiotics, syntax

Code	Science	Field
6.02	Humanities	Languages and Literature

Keywords

grammar, corpus linguistics, computational linguistics, valency, multi-word expressions, collocations

Evaluation (methodology)

Evaluation of bibliographic research performance indicators according to ARIS methodology

Citations: citations for bibliographic records in COBIB.SI that are linked to records in citation databases

Organisations (3), Researchers (13)

0106 Jožef Stefan Institute

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	22278	PhD Janez Brank	Computer science and informatics	Researcher	2017 - 2020	108
2.	36914	PhD Jaka Čibej	Linguistics	Researcher	2018 - 2020	227
3.	36491	PhD Kaja Dobrovoljc	Linguistics	Researcher	2018 - 2020	215
4.	26166	PhD Simon Krek	Linguistics	Head	2017 - 2020	433
5.	37487	Katja Zupan	Linguistics	Young researcher	2017 - 2020	23

0581 University of Ljubljana, Faculty of Arts

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	16313	PhD Apolonija Gantar	Linguistics	Researcher	2017 - 2020	241
2.	14681	PhD Vojko Gorjanc	Linguistics	Researcher	2017 - 2020	514
3.	33796	PhD Iztok Kosem	Linguistics	Researcher	2017 - 2020	370
4.	37653	PhD Cyprian Adam Laskowski	Linguistics	Researcher	2017 - 2020	43

1539 University of Ljubljana, Faculty of Computer and Information Science

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	27674	PhD Špela Arhar Holdt	Linguistics	Researcher	2017 - 2020	310
2.	52176	Teja Goli		Technical associate	2019 - 2020	17
3.	32887	MSc Bojan Klemenc	Computer science and informatics	Technical associate	2017 - 2020	70
4.	15295	PhD Marko Robnik Šikonja	Computer science and informatics	Researcher	2017 - 2020	511

Abstract

The project aims to explore the linguistic and methodological foundations for a complex analysis of written and spoken Slovene, as found in the new corpora developed in recent projects. The resulting methodology and data will provide a solid starting point for future work on an empirically based description of Slovene. Following from the methodology, we intend to compile and publish extensive collections of material extracted from corpora, which will be useful for the development of language technology applications for Slovene. The extracted data will be used for a linguistic analysis of real language, which represents the first step towards the compilation of a new descriptive corpus grammar of Slovene. The project proposal is based on the fact that in the last three decades language description has witnessed a noticeable paradigm shift from researching language as a system, on the level of phonology and (morpho)syntax, to a more empirically oriented language description which aims to describe the workings of language in real life and is linked to fields such as psychology, neurobiology and artificial intelligence. To enable research within the new paradigm, reliable empirical data about different language phenomena are needed; these are provided by modern computational or corpus linguistics, which uses automatic methods to analyse extensive collections of written and spoken language data. The project work plan is divided into several work packages whose titles reveal the types of proposed corpus analyses: Morphology and word formation, Collocations, Multi-word expressions, Valency, and Formulaic sequences. Written language will be analysed primarily using the Kres corpus, along with comparative data from Gigafida. Kres is a reference corpus of 100 million words, sampled and balanced from Gigafida. Spoken language will be analysed using the Gos corpus, which contains 1 million words of transcribed Slovene speech, and the SST corpus with manual syntactic dependency annotations.
All extracted data collections, programs and algorithms will be published under open access or open source licenses, and structured with the intent to be useful for language technology applications.

Significance for science

The project provides the basis for the compilation of a new Slovene grammar and other important works (collocations, valency, word-formation dictionary), and it provides relevant data for research on different topics in the modern standard Slovene language, from morphology to stylistics. A communication-oriented description of Slovene will bridge the widening gap between the discursive reality of the language and its current description. This means that the results will be important not only for linguistics but also for Slovene language teaching at all levels. The results of the project will include numerous extensive databases in various formats (XML, tables, lists etc.) under the Creative Commons Attribution licence. The resulting development of language tools will enable easier analysis and exploitation of information contained in unstructured Slovene texts. These tools can be used in a number of new products (e.g. recommendation systems, virtual assistants, speech recognition and synthesis) and in many different domains (e.g. medicine, security, transport), which is strategically important for the productivity and competitiveness of Slovene companies and the public sector.
The lack of a grammatical description of modern Slovene is also emphasized in the new Resolution on the national programme for language policy 2014-2018: "in the period covered by the resolution it is necessary to start planning the compilation of a scientific grammar of Slovene which will describe contemporary grammatical composition of standard Slovene as a cohesive language variant of all Slovenes." One of the important goals of the resolution is also "promotion of the development of language technologies for Slovene which includes the development of necessary infrastructure and freely available resources and tools for Slovene."

Significance for the country

Language technologies are one of the important enabling technologies in today's information society; they can be found in all applications that require interaction between humans and machines or the acquisition of knowledge from large data resources in Slovene. The research conducted in the proposed project will make an important contribution to the integration of Slovene into products that use these services, e.g. those described in the Smart specialization strategy (smart cities).
The impact of the research results will be directly and indirectly visible mainly in the field of language infrastructure for Slovene. It is anticipated that the project will contribute to the successful participation of Slovene in state-of-the-art technological trends, which demand automatic language processing for different applications, from virtual assistants (e.g. Siri, Cortana, Alexa) and machine translation systems to artificial intelligence. In these applications, Slovene will need to be on the same level as languages with considerably higher numbers of speakers; this cannot be achieved without research focused on the specific characteristics of Slovene in terms of language technology needs. Considering that the members of the research group are already involved in international research, especially in lexicography and machine learning, we expect that the results will achieve international impact and recognition and be relevant for other languages.

Most important scientific results

Interim report, final report

Most important socioeconomically and culturally relevant results

Final report
