Projects / Programmes source: ARIS

Basic Research for the Development of Spoken Language Resources and Speech Technologies for the Slovenian Language

Research activity

Code Science Field Subfield
6.05.00  Humanities  Linguistics   

Code Science Field
6.02  Humanities  Languages and Literature 
Keywords
spoken language resources, spoken language, research of speech, language technologies, speech technologies, corpus linguistics, lexicography
Evaluation (methodology)
source: COBISS
Points
22,672.7
A''
4,087.89
A'
8,887.94
A1/2
12,823.81
CI10
7,664
CImax
496
h10
40
A1
77.67
A3
23.34
Data for the last 5 years (citations for the last 10 years) as of November 12, 2025; data for the A3 score calculation refer to the period 2020-2024
Data for ARIS tenders (04.04.2019 – Programme tender, archive)
Database Linked records Citations Pure citations Average pure citations
WoS  287  3,716  3,400  11.85 
Scopus  564  7,249  6,362  11.28 
Organisations (9), Researchers (39)
0796  University of Maribor, Faculty of Electrical Engineering and Computer Science
no. Code Name and surname Research area Role Period No. of publications
1.  53072  Špela Antloga  Linguistics  Researcher  2022 - 2025  75 
2.  54519  MSc Andreja Bizjak  Linguistics  Researcher  2022 - 2025  32 
3.  33286  PhD Gregor Donaj  Telecommunications  Researcher  2022 - 2025  92 
4.  51357  Simona Majhenič  Linguistics  Researcher  2022 - 2024  45 
5.  50218  PhD Grega Močnik  Telecommunications  Researcher  2022 - 2025  47 
6.  18168  PhD Mirjam Sepesy Maučec  Telecommunications  Researcher  2022 - 2025  266 
7.  23838  PhD Darinka Verdonik  Linguistics  Head  2022 - 2025  224 
8.  20032  PhD Andrej Žgank  Telecommunications  Researcher  2022 - 2025  254 
0106  Jožef Stefan Institute
no. Code Name and surname Research area Role Period No. of publications
1.  05023  PhD Tomaž Erjavec  Linguistics  Researcher  2022 - 2025  695 
2.  55962  Taja Kuzman Pungeršek  Linguistics  Researcher  2022 - 2025  114 
3.  36871  PhD Nikola Ljubešić  Linguistics  Researcher  2022 - 2025  476 
4.  56348  Peter Rupnik    Technical associate  2022 - 2025  97 
0581  University of Ljubljana, Faculty of Arts
no. Code Name and surname Research area Role Period No. of publications
1.  27674  PhD Špela Arhar Holdt  Linguistics  Researcher  2022 - 2025  280 
2.  36914  PhD Jaka Čibej  Linguistics  Researcher  2022 - 2025  206 
3.  36491  PhD Kaja Dobrovoljc  Linguistics  Researcher  2024 - 2025  201 
4.  16313  PhD Apolonija Gantar  Linguistics  Researcher  2022 - 2025  235 
5.  33796  PhD Iztok Kosem  Linguistics  Researcher  2022 - 2025  350 
6.  26166  PhD Simon Krek  Linguistics  Researcher  2022 - 2025  421 
7.  57100  Nejc Robida  Linguistics  Researcher  2022 - 2025  31 
8.  05799  PhD Vera Smole  Linguistics  Researcher  2022 - 2025  534 
9.  19059  PhD Mojca Smolej  Humanities  Researcher  2022 - 2025  386 
10.  11651  PhD Marko Stabej  Linguistics  Researcher  2022 - 2025  656 
0618  Research Centre of the Slovenian Academy of Sciences and Arts
no. Code Name and surname Research area Role Period No. of publications
1.  15689  PhD Helena Dobrovoljc  Linguistics  Researcher  2022 - 2025  416 
2.  32205  PhD Januška Gostenčnik  Linguistics  Researcher  2022 - 2025  145 
3.  37555  PhD Janoš Ježovnik  Linguistics  Researcher  2022 - 2025  128 
4.  10288  PhD Carmen Kenda-Jež  Linguistics  Researcher  2022 - 2025  320 
5.  34592  PhD Tanja Mirtič  Linguistics  Researcher  2023 - 2025  103 
6.  10353  PhD Jožica Škofic  Linguistics  Researcher  2022 - 2025  711 
1538  University of Ljubljana, Faculty of Electrical Engineering
no. Code Name and surname Research area Role Period No. of publications
1.  11805  PhD Simon Dobrišek  Computer science and informatics  Researcher  2022 - 2025  296 
2.  31985  PhD Janez Križaj  Systems and cybernetics  Researcher  2022 - 2025  47 
3.  21310  PhD Janez Perš  Systems and cybernetics  Researcher  2025  257 
1539  University of Ljubljana, Faculty of Computer and Information Science
no. Code Name and surname Research area Role Period No. of publications
1.  16154  PhD Marko Bajec  Computer science and informatics  Researcher  2022 - 2025  507 
2.  21404  PhD Iztok Lebar Bajec  Computer science and informatics  Researcher  2022 - 2025  200 
1822  University of Primorska, Faculty of Humanities
no. Code Name and surname Research area Role Period No. of publications
1.  32126  PhD Klara Šumenjak  Linguistics  Researcher  2022 - 2025  60 
2.  27530  PhD Jana Volk  Linguistics  Researcher  2022 - 2025  139 
1986  ALPINEON R & D
no. Code Name and surname Research area Role Period No. of publications
1.  12000  PhD Jerneja Žganec Gros  Computer science and informatics  Researcher  2022 - 2025  292 
2565  University of Maribor, Faculty of Arts
no. Code Name and surname Research area Role Period No. of publications
1.  12507  PhD Mihaela Koletnik  Linguistics  Researcher  2022 - 2025  560 
2.  20763  PhD Mira Krajnc Ivič  Humanities  Researcher  2022 - 2025  253 
3.  18502  PhD Melita Zemljak Jontes  Linguistics  Researcher  2022 - 2025  513 
Abstract
Spoken language resources are scarce and underdeveloped compared to written language resources, especially for small languages like Slovenian. To be able to perform basic research on spoken language or speech technologies with significant scientific impact, the problem of scarce spoken language resources needs to be addressed first. However, the development of spoken language resources is not only a matter of applied data collection; it also opens up a number of basic research questions. These research questions will be addressed in this project, with a focus on the Slovenian language. The project is large and is divided into 4 Work Packages (WPs), each including 2-4 tasks, 14 tasks in total. 4 tasks are solely linguistic, 2 tasks are solely technical, while the majority of the tasks (8) are interdisciplinary. The specific objectives of the WPs and their corresponding tasks are as follows:

WP1: ACQUIRING RECORDINGS OF SPEECH
- Objective 1.1: Analyse the needs for spoken language resources in different linguistic and technical disciplines.
- Objective 1.2: Analyse the advantages and disadvantages of different recording techniques, with particular attention to crowdsourcing as a time- and money-efficient technique.
- Objective 1.3: Evaluate the efficiency of speech recognition models trained on domain-specific speech data obtained with low-cost unsupervised or semi-supervised techniques, compared to general-domain data obtained with high-cost techniques.
- Objective 1.4: Identify speech/speaker tasks that need further investment into labelled data for Slovene speech recognition.

WP2: DIALECT VARIATION
- Objective 2.1: Geolinguistic analysis of selected phonetic features, creation of diachronic phonetic maps of the non-standard phonetic inventory, and creation of a proposal for the standardisation of Slovenian dialect transcription and its conversion into IPA (and SAMPA).
- Objective 2.2: Creation of synthetic synchronic phonetic maps to define the areas of non-standard phonemes in Slovenian dialects. Making recommendations to improve pronunciation-based transcription for the Slovenian spoken corpus.
- Objective 2.3: Creation and testing of diasystemic contrastive tables of phonemes (dialect vs. standard). Establishment of standards for phonetic transcription in spoken corpora.
- Objective 2.4: Definition and evaluation of an optimal Slovenian phoneme set for speech recognition, taking into account newly defined dialect phonemes, similarity metrics and the various available speech data.

WP3: SPEECH SEGMENTATION AND ANNOTATION
- Objective 3.1: Evaluation of the existing speech segments/utterances in Slovene spoken language resources regarding their appropriateness as the basic units for the analysis of speech on the syntactic and semantic levels.
- Objective 3.2: Analysis of different types of disfluencies in spoken text, creation of a disfluency training corpus, and experiments in the automatic annotation of disfluencies.
- Objective 3.3: Development of a linguistic processing pipeline based on speech and transcription data (both manual and automatic), and linguistic annotation of the GOS 2.0 corpus.
- Objective 3.4: Evaluation of the GORDAN dialogue act annotation scheme, its adjustment to the ISO 24617-2 standard, and creation of a training corpus with dialogue act annotations.

WP4: SPOKEN LEXIS
- Objective 4.1: Evaluation of the existing information on spoken Slovene in the Sloleks lexicon, and creation of linguistically sound guidelines for the inclusion of (non-standard) spoken data in Sloleks, comparable with machine-readable lexicons for other languages.
- Objective 4.2: Analysis of the existing semantic information included in lexicographic resources for Slovene from the perspective of spoken Slovene, together with analysis of the complementary spoken corpus data, and exploration of the principles for including the findings in lexicographic resources.