Projects / Programmes source: ARIS

Slovene scientific texts: resources and description

Research activity

Code Science Field Subfield
6.05.02  Humanities  Linguistics  Theoretical and applied linguistics 

Code Science Field
H350  Humanities  Linguistics 

Code Science Field
6.02  Humanities  Languages and Literature 
Keywords
academic language corpus terminology language technologies
Evaluation (rules)
source: COBISS
Researchers (11)
no. Code Name and surname Research area Role Period No. of publications
1.  27674  PhD Špela Arhar Holdt  Linguistics  Researcher  2016 - 2018  236 
2.  30672  PhD Maja Bitenc  Linguistics  Researcher  2017 - 2018  60 
3.  23982  PhD Borko Bošković  Computer science and informatics  Researcher  2016 - 2018  230 
4.  36914  PhD Jaka Čibej  Linguistics  Researcher  2016  152 
5.  05023  PhD Tomaž Erjavec  Linguistics  Head  2016 - 2018  636 
6.  36341  Marko Ferme  Computer science and informatics  Researcher  2016 - 2018  72 
7.  26294  PhD Darja Fišer  Linguistics  Researcher  2016 - 2018  412 
8.  26166  PhD Simon Krek  Linguistics  Researcher  2016 - 2018  373 
9.  36871  PhD Nikola Ljubešić  Linguistics  Researcher  2016 - 2018  397 
10.  20482  PhD Nataša Logar  Linguistics  Researcher  2016 - 2018  354 
11.  06823  PhD Milan Ojsteršek  Computer science and informatics  Researcher  2016 - 2018  526 
Organisations (4)
no. Code Research organisation City Registration number No. of publications
1.  0106  Jožef Stefan Institute  Ljubljana  5051606000  90,664 
2.  0581  University of Ljubljana, Faculty of Arts  Ljubljana  1627058  97,831 
3.  0582  University of Ljubljana, Faculty of Social Sciences  Ljubljana  1626957  40,391 
4.  0796  University of Maribor, Faculty of Electrical Engineering and Computer Science  Maribor  5089638003  27,536 
Abstract
The development and use of Slovene academic language at universities and in research is one of the central questions of Slovene language policy. The problem is highlighted in the National Programme for Language Policy of the Republic of Slovenia 2014–2018, and a number of European studies also draw attention to the impact that the knowledge and development of academic discourse have on language vitality. It is therefore of fundamental importance to develop contemporary reference language resources that will help empower Slovene academic language and to undertake comprehensive research based on a representative sample of such language. In recent years, Slovene universities have started to establish institutional repositories of scientific publications, containing various types of texts from PhD theses to scientific and professional papers. An important milestone is the establishment of the National Portal for Open Science, http://openscience.si/, launched in 2013, which aggregates access to the digital libraries of individual universities. The portal already offers access to over 123,000 Slovene language publications from a wide range of disciplines. These publications are a highly valuable but so far completely unused source of data on Slovene academic writing, including terminological data. The goal of the project is to overcome these limitations in several ways. First, it will compile a large corpus of Slovene academic writing containing texts harvested from the Open Science portal. The texts will be extracted from their source (usually PDF) format, which involves developing methods for text clean-up and structure extraction, and up-conversion to a uniform and standardised XML representation. The corpus will be linguistically annotated, with new tools and resources developed to improve the quality of the annotations. 
Text classification and keyphrase extraction methods will be developed as well, in order to enhance the usability of the Open Science portal by allowing better faceted search and recommender systems for university librarians entering the publications into the repositories. The corpus will serve as the basis for studies in terminology extraction. The extracted term candidates will be exported to a public online dictionary viewer and editor, so that Slovene scientific communities from a range of subject fields will be able to engage in the management of their terminologies. A very important aspect of the work undertaken in the project will be the first empirically based study of Slovene academic discourse, founded on a representative corpus. Data usability studies and in-depth interviews will also be conducted in an attempt to determine the process of and obstacles to academic writing in Slovene, resulting in an online manual of style for academic writing in Slovene. The project will make its results as widely available as possible: the produced language resources and tools will be made freely and openly available to the wider research community, which will also improve the state of the art of corpus linguistics, digital humanities, and language technologies for Slovene. The resources will be archived in the repository of the research infrastructure CLARIN.SI, which will undertake the maintenance of the corpus after the close of the project. Furthermore, the project will engage with the Slovene scientific community through workshops and a conference. The project will be conducted by ten researchers from four academic institutions with distinct but complementary expertise to attain its goals: to strengthen Slovene academic language; to make Slovene better equipped for functioning in the information society; and to promote open dissemination of scientific results.
Significance for science
Linguistics: Slovene still lacks a comprehensive description of - as well as the resources for the study of - its academic language, and the proposed project will fill a major gap in this area. The research will bring new insights, approaches and activities to Slovene terminology and terminography. With the empowerment of academic and other interested groups collaborating on the terminology management of their scientific fields on a uniform terminological portal, the production of term descriptions will be strengthened. The developed language resources will also facilitate the traditional terminographic work and enable new, interdisciplinary analyses of specialized vocabulary. From a linguistic point of view, the proposed research is also important because it will help to improve the language competences of university graduates and will offer them resources with which their academic and technical writing in Slovene will be easier and more successful. Digital humanities: This field, which combines the humanities with modern computer technology and digital resources, is relatively new but internationally very active. In Slovenia it is still in its infancy, with few practitioners and almost no university courses, much less degrees. The project results, in particular the concordancer-available and downloadable corpus, the terminology extraction web-service, available text-processing tools, and the web-based collaborative terminography, will strengthen this field, especially in connection with the outreach activities planned in the project. Language technologies: The project will generate scientific results in this area mostly due to the application of novel machine learning methods to problems that have traditionally been approached with either rule-based methods (e.g. for terminology extraction) or specialised statistical methods (such as HMMs). 
The project will develop novel methods for morphosyntactic tagging and lemmatisation, for noisy text clean-up with context-aware CSMT, and for structure extraction from PDF documents. In computational terminology we will be able to report on advances in identification, extraction, structuring and presentation of multilingual terminological knowledge from semi-structured resources. We also expect contributions to science in the area of language data encoding, especially in connection with applications of the TEI. While the preceding is novel internationally as well, we expect even more substantial scientific advances in the area of Slovene language processing: a new basic linguistic annotation tool chain, document classification, structure extraction, metadata enrichment, terminology extraction and linking, and keyphrase identification.
Significance for the country
The most visible impact of the project will be the setting up of terminologies on a public portal. The terminologies will give scholars, and especially graduate and postgraduate students writing academic and professional papers or theses in Slovene, much-needed web access to domain-specific terms for a wide variety of subject fields. As the portal will also offer editing facilities, it will encourage collaborative lexicography, where users can improve and add entries to the glossaries. Given the rapid development of scientific fields, this is a prerequisite for the long-term viability of terminological dictionaries. An important direct consequence of the project will also be the improved usability of the Open Science portal due to the better classification of the aggregated texts and automatic generation of keywords that will be developed in the project. This will enable better recommender systems for the library staff entering classifications and correcting keywords for new publications in the digital libraries, as well as significantly improve the faceted search over the texts on the portal. The project will develop a complete tool chain to turn PDF documents into clean, structured and annotated text. This functionality can be implemented directly in the Open Science portal, which could then offer sophisticated full-text search over its texts, informed by linguistic annotation, such as lemmatisation and markup of terms. Further levels of functionality can then be envisioned, already present in some services for English, such as bibliography identification, export and linking, similar-document suggestions, cross-linking of documents, etc. As Slovenia is now introducing the mandatory deposit of PDFs of all published scientific papers, the importance of the above functionalities will only increase in the future. 
The freely available web-based concordancer hosting the developed corpus of Slovene academic writing will enable a rich search and display of terms as well as general language used in academic texts. This will facilitate not only further linguistic studies of academic writing but also fact-finding in the body of contained scientific texts; until the Open Science portal is equipped with full-text search, the corpus can serve as a proxy for this functionality. The project will also produce a number of open-source language technology tools and resources for Slovene, which will either outperform existing ones (text correction, structure extraction, tagging, lemmatisation) or provide functionalities that have so far been missing entirely (term and keyphrase extraction, text classification). Furthermore, the project will compile reference annotated datasets of Slovene, a key resource for training language analysis tools. These tools and resources will be directly accessible for use by other researchers and, where possible, for commercial use as well. This will substantially facilitate further development of language technologies for Slovene.
Most important scientific results Interim report, final report
Most important socioeconomically and culturally relevant results Interim report, final report