Projects / Programmes source: ARIS

Resources, Tools and Methods for the Research of Nonstandard Internet Slovene

Research activity

Code Science Field Subfield
6.05.00  Humanities  Linguistics   

Code Science Field
H350  Humanities  Linguistics 

Code Science Field
6.02  Humanities  Languages and Literature 
Keywords
- corpus linguistics - language technologies - user-generated content - linguistic annotation of corpora
Evaluation (rules)
source: COBISS
Researchers (12)
no. Code Name and surname Research area Role Period No. of publications
1.  27674  PhD Špela Arhar Holdt  Linguistics  Researcher  2016 - 2017  236 
2.  36914  PhD Jaka Čibej  Linguistics  Researcher  2015 - 2017  152 
3.  05023  PhD Tomaž Erjavec  Linguistics  Researcher  2014 - 2017  636 
4.  26294  PhD Darja Fišer  Linguistics  Head  2014 - 2017  412 
5.  16313  PhD Apolonija Gantar  Linguistics  Researcher  2016 - 2017  223 
6.  14681  PhD Vojko Gorjanc  Linguistics  Researcher  2017  479 
7.  08949  PhD Nada Lavrač  Computer science and informatics  Researcher  2014 - 2017  869 
8.  36871  PhD Nikola Ljubešić  Linguistics  Researcher  2014 - 2017  397 
9.  31844  PhD Senja Pollak  Linguistics  Researcher  2014 - 2017  290 
10.  33783  PhD Damjan Popič  Linguistics  Researcher  2015 - 2017  126 
11.  20453  PhD Špela Vintar  Linguistics  Researcher  2014 - 2015  265 
12.  24440  PhD Ana Zwitter Vitez  Linguistics  Researcher  2014 - 2017  118 
Organisations (2)
no. Code Research organisation City Registration number No. of publications
1.  0106  Jožef Stefan Institute  Ljubljana  5051606000  90,742 
2.  0581  University of Ljubljana, Faculty of Arts  Ljubljana  1627058  97,992 
Abstract
The past decade has witnessed rapid growth of user-generated content, such as blogs, forums and social media. Such content offers an important source of information to diverse fields, such as social sciences, economics and computer science, both for research and business. But when dealing with user-generated content it is necessary to come to grips with the language of computer-mediated communication which is, due to social and technical characteristics, often very different from the standard. This language is characterized by colloquialisms and borrowings, dialect-specific phonetic orthography and syntax, specific abbreviations and fast uptake of new vocabulary, etc. Standard Slovene is well researched and supported with linguistic resources and tools. But there are no representative corpora for studies of non-standard language, no tools for its analysis and processing, and characteristics of non-standard language are hardly ever included in language descriptions, textbooks or school curricula. The proposed project aims to overcome this gap by developing an infrastructure and methodology for the analysis of user-generated content in (mostly non-standard) Slovene. The proposed project uses a combination of state-of-the-art methods from corpus and computational linguistics to enable a comprehensive study into a segment of the Slovene language, which is changing rapidly, gaining increasing importance in all our activities but has been, so far, ignored for various reasons. In the scope of the project we will compile a large and representative corpus containing Slovene tweets, blogs, Internet forums and comments on news articles and on Wikipedia entries. These text types cover a large portion of publicly available user-generated text. 
The corpus will be linguistically annotated with standardized spelling, lemma, part-of-speech, syntactic structure and names and will be freely available via a powerful concordancer to make it useful for theoretical and applied linguistic research. The corpus will be used for a series of linguistic analyses, in particular a comparison of non-standard Slovene with standard written and spoken Slovene, a study of offensive language and three corpus-driven studies of collocations, terms and semantic shifts in non-standard language. The project will also produce two datasets, a manually annotated corpus and a lexicon, which will be used to develop methods for automatic processing of non-standard texts. Based on the lexicon, a web dictionary of non-standard Slovene will also be produced, useful for teachers, students, linguists, lexicographers and the general public. At the end of the project, we will make the developed resources openly available for download under the Creative Commons license, to make them available for R&D in computational linguistics and other automatic data processing fields. The developed tools will be incorporated into a workflow construction and execution environment, and a prototype platform for the continuous construction of a monitor corpus will be developed. We plan to include the resources, workflows and platforms in the Slovene research infrastructure CLARIN after the end of the project, in order to ensure longevity of its results and to maintain and further develop them. The project also aims to disseminate its results through two workshops and the publication of a book. The developed resources, tools and methods will enable transfer of knowledge to all fields that deal with user-generated content. This will increase e-inclusion of Slovene speakers, who are often tied to foreign-language applications, so that Slovene can function and develop in the digital age. 
As the methodology for corpus construction and development of the tools will be language independent, it will also be useful for related languages that still lack them, giving the results an important multilingual dimension as well.
Significance for science
The project contributes a thorough understanding of the characteristics of non-standard Slovene, which is becoming an increasingly frequent and important type of written communication in the information era. The extensive contrastive quantitative and qualitative corpus studies yielded valuable information about the language of user-generated content at all levels of linguistic description, enabling further research and development of applications in lexicography, sociolinguistics and language pedagogy. In addition, the methodological and instrumental apparatus of corpus linguistics was utilized and further developed in the course of the project, thus benefitting all further corpus-linguistic studies of Slovene. The project also advances the development of automated methods for processing user-generated content, which are important not only for Slovene but have international relevance as well, because researchers working with other languages face similar challenges. State-of-the-art natural-language processing methods were implemented, such as statistical machine translation and machine learning. The project can now boost the development of methods for identification and extraction of targeted user-generated content, word standardization, morphosyntactic and syntactic annotation, lemmatization and named-entity identification. In addition, improvements to the integration of individual processes into a complex annotation model were achieved. The construction of state-of-the-art resources of non-standard Slovene is the most tangible result of the project and comprises a large, contemporary and annotated corpus, and a lexical database and on-line dictionary of non-standard Slovene. 
The developed resources enable a comprehensive insight into the characteristics and development of the language used in computer-mediated communication and open possibilities for new research avenues in linguistics and language technologies for Slovene, not only by project partners but also by other researchers in Slovenia and abroad. The project also implements and promotes good scientific practices, which are still not the norm in Slovene science, especially in linguistics. As opposed to introspective linguistics and linguistic analysis based on opportunistically collected examples, this project promotes empirical corpus linguistics; it structures linguistic resources in accordance with international standards rather than in undocumented ad-hoc formats; and it makes the resources developed within the project available under the Creative Commons license, allowing their open dissemination and full utilization, which enables replicability of the analyses, prevents double financing of similar research and promotes economic progress in order to move beyond the prevailing situation in Slovenia as described by Štebe et al. (2013): “Except for some isolated cases [of allowing access to research data] this area is critically undernourished due to the prevailing culture of closing and monopolizing the data.”
Significance for the country
The National Program for Language Policy 2014-2018 states: “The development of information and communication technologies in the last 10 years is creating a digital divide that makes the languages that are excluded from this development less attractive and competitive in a globally connected world. The digital divide separates the languages that are sufficiently present on the web and equipped with advanced digital resources and language technologies, from those with an increasing backlog due to the fast development of ICT technologies.” For this reason, one of the major objectives of the Program is “promoting the development of language technologies for Slovene, which includes the establishment of the necessary infrastructure and construction of freely accessible resources and tools”. This is also one of the main objectives of the proposed project, which will develop freely accessible resources and tools for the unexplored user-generated Slovene texts. The project partner JSI also hosts the Slovene CLARIN research infrastructure, which will enable permanent access and maintenance of the developed resources and tools. Slovenian companies from the field of ICT, in areas such as semantic web, information mining, text mining, and text summarization, have an increasing need to provide products based on processing the Slovene language, and here they have to tackle non-standard language in user-generated texts. Within this project, we will not prohibit the commercial use of the results, which will make the developed resources directly useful for companies, also increasing their competitiveness. Language, especially in Slovenia, is often seen as either standard, correct and good or non-standard, incorrect and bad. While it is true that in some communicative circumstances only standard language is appropriate, there are many others in which non-standard language is acceptable and predominantly used, which is why we need to be able to process and understand it. 
The proposed project will offer invaluable insights into this matter, achieved with contrastive linguistic research as well as very tangible and publicly available resources of non-standard Slovene that will be useful in primary and secondary education as well as in second-language learning. This will enhance the awareness and understanding of a wide range of Slovene language varieties and registers among students and foreigners as well as increase their communicative competence. Free availability and documented encoding of the lexical database will allow its inclusion in other lexical resources, such as the planned new dictionary of Slovene. With this, the results of the proposed project will be ready for integration into a major lexicographic project of national importance and thus have an impact on the entire language community. Direct impact of the results of the project in terms of enhanced support for ICT companies will indirectly have an impact on the whole society as well because speakers of Slovene will have access to products supporting the Slovene language.
Most important scientific results Annual report 2014, 2015, final report
Most important socioeconomically and culturally relevant results Annual report 2014, 2015, final report