Projects / Programmes source: ARIS

Improving Reproducibility of Experiments and Reusability of Research Outputs in Complex Data Analysis

Research activity

Code  Science  Field  Subfield
2.07.00  Engineering sciences and technologies  Computer science and informatics   

Code  Science  Field
P170  Natural sciences and mathematics  Computer science, numerical analysis, systems, control 

Code  Science  Field
1.02  Natural Sciences  Computer and information sciences 
Keywords
reproducible research; reuse of research results; machine learning; data mining; complex data analysis; semantic technologies;
Evaluation (rules)
source: COBISS
Researchers (11)
no.  Code  Name and surname  Research area  Role  Period  No. of publications
1.  53798  Jure Brence  Computer science and informatics  Researcher  2020 - 2022  21 
2.  36220  PhD Martin Breskvar  Computer science and informatics  Researcher  2018 - 2022  36 
3.  11130  PhD Sašo Džeroski  Computer science and informatics  Researcher  2018 - 2022  1,204 
4.  31050  PhD Dragi Kocev  Computer science and informatics  Researcher  2018 - 2022  204 
5.  53530  Ana Kostovska  Computer science and informatics  Junior researcher  2020 - 2022  41 
6.  28291  PhD Petra Kralj Novak  Computer science and informatics  Researcher  2018 - 2022  130 
7.  36356  PhD Aljaž Osojnik  Computer science and informatics  Researcher  2018 - 2022  47 
8.  27759  PhD Panče Panov  Computer science and informatics  Head  2018 - 2022  155 
9.  38206  PhD Matej Petković  Computer science and informatics  Junior researcher  2018 - 2020  65 
10.  34452  PhD Nikola Simidjievski  Computer science and informatics  Researcher  2018 - 2022  58 
11.  39156  PhD Tomaž Stepišnik  Computer science and informatics  Junior researcher  2018 - 2022  28 
Organisations (1)
no.  Code  Research organisation  City  Registration number  No. of publications
1.  0106  Jožef Stefan Institute  Ljubljana  5051606000  90,724 
Abstract
Advances in science rest on the concept of trusted discovery: the premise that research is performed correctly and can be reproduced by other scientists. To increase the reusability of research outputs, such as developed models and produced data, these outputs should be Findable, Accessible, Interoperable and Reusable (the FAIR principles). The main point of FAIR is to ensure that research outputs are reusable and will actually be used by others, thus becoming more valuable. The DG for Research and Innovation of the EC has adopted the reusability of research data as one of its priorities, which led to the rapid endorsement of the FAIR principles by different stakeholders. Research outputs that are to fulfill the FAIR principles must be represented in a widely accepted machine-readable framework. Currently, a popular solution for data sharing that fulfills the FAIR requirements is the use of semantic web technologies. Complex data analysis methods, originating from machine learning (ML) and data mining (DM), are increasingly being used in applications from various domains of science (e.g., life sciences and space research). To provide reproducibility of experiments (e.g., executions of methods) and reuse of research outputs (e.g., predictive models), one needs to formally describe the entities involved in the analysis process and store them, together with their descriptions (e.g., metadata), as digital objects in a database-like structure. Having “semantically aware” stores of entities for complex data analytics, enhanced with automatic reasoning capabilities, would be beneficial for improving the reproducibility of experiments and the reuse of research outputs. In this way, we would move closer to a FAIR data analysis process. The main objective of the proposed project is to improve the repeatability of experiments and the reusability of research outputs in complex data analysis.
We will address this objective by combining approaches and ideas from the areas of complex data analytics, ontologies for science, the semantic web and inductive databases. More specifically, we will develop a modular system for executing complex data analysis experiments and for semantically annotating, storing, querying and reusing their outputs. To meet the project's main objective, we plan to: (1) design, implement and populate ontologies for complex data analysis to be used for semantic annotation; (2) design and implement a prototype system for storing semantically annotated data, experiments and models; (3) develop querying strategies and test the querying capabilities of the prototype system; and (4) test the developed system in different use-case scenarios from various domains, such as machine learning, life sciences, space research and chemoinformatics. The proposed research will significantly advance the state of the art in the general area of computer science, in the specific area of machine learning and data mining, and particularly in the topic of complex data analytics. It will develop a new architecture for semantically aware experimentation. It will also improve the storing, reusing, revising and querying of models produced by the analytics methods. This is of particular importance for application domains that rely heavily on data analytics tools. The proposed project will also have a large impact in the context of automating data science. Experiments will be repeatable because they are performed in a sound, documented fashion, supported by an architecture designed for such analysis. Current experimentation architectures are applicable to a very limited set of tasks and do not deal with querying, collaborative validation and revision of models, which represents a serious development bottleneck.
Finally, in a wider societal context, the project will increase Slovenia’s research and innovation potential in this area of extreme practical importance.
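The workflow outlined in the abstract (semantically annotating experiments, storing the annotations, and querying them for reuse) can be illustrated with a minimal sketch. This is not the project's system: the `TripleStore` class, the `models_trained_on` helper, and all `ex:` identifiers are hypothetical names chosen for illustration, and a plain in-memory set of subject-predicate-object statements stands in for a real RDF store and ontology.

```python
from typing import Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class TripleStore:
    """Toy in-memory store of (subject, predicate, object) statements."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def match(
        self,
        s: Optional[str] = None,
        p: Optional[str] = None,
        o: Optional[str] = None,
    ) -> Iterator[Triple]:
        """Yield stored triples that fit the pattern; None is a wildcard."""
        for triple in sorted(self._triples):
            if all(q is None or q == v for q, v in zip((s, p, o), triple)):
                yield triple


# Annotate two hypothetical experiment runs with the dataset and algorithm
# they used and the model they produced.
store = TripleStore()
for run, algorithm, model in [
    ("ex:run1", "ex:RandomForest", "ex:model1"),
    ("ex:run2", "ex:DecisionTree", "ex:model2"),
]:
    store.add(run, "rdf:type", "ex:Experiment")
    store.add(run, "ex:usedDataset", "ex:datasetA")
    store.add(run, "ex:usedAlgorithm", algorithm)
    store.add(run, "ex:producedModel", model)


def models_trained_on(store: TripleStore, dataset: str) -> List[str]:
    """Find models produced by any experiment that used the given dataset."""
    runs = [s for s, _, _ in store.match(p="ex:usedDataset", o=dataset)]
    return sorted(
        o for run in runs for _, _, o in store.match(s=run, p="ex:producedModel")
    )


print(models_trained_on(store, "ex:datasetA"))
```

Because every run is described with the same vocabulary, a question such as "which models were trained on this dataset?" becomes a two-step pattern match over the annotations. A real deployment would presumably use a shared ontology and a standard query language such as SPARQL instead of this hand-written matching.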
Significance for science
Within the proposed project, we will develop a new architecture for executing, storing, semantically annotating, and querying experiments and results in complex data analytics, to improve the reproducibility of experiments and the reusability of data, experiments and models. This is a highly relevant topic that can have a large impact in application domains that rely heavily on data analytics tools. For example, given our involvement in the Medical Informatics Platform of the H2020 FET Flagship Human Brain Project, we can instantiate the architecture to support reproducibility and reuse for the tasks of biomarker discovery and the discovery of biological signatures of diseases. Moreover, we are also involved in an Interreg project with the Italian partner ICGEB, Trieste, where we will be able to instantiate the architecture to support reproducibility and reuse for the task of disease modeling in the context of building a cross-border platform for validated biotech industry kits. The proposed research will significantly advance the state of the art in the general area of computer science, in the specific area of machine learning and data mining, and particularly in the topic of complex data analytics. It will develop a new architecture for semantically aware experimentation. It will also improve the storing, reusing, revising and querying of models, which is of particular importance for application areas that use analytics as a service. In a wider societal context, the project will increase Slovenia’s research and innovation potential in this area of extreme practical importance. The proposed project can have a large impact in the context of automating data science. Experiments will be repeatable because they are performed in a sound, documented fashion, supported by an architecture designed for such analysis.
Current experimentation architectures are applicable to a very limited set of tasks and do not deal with querying, collaborative validation and revision of models, which represents a serious development bottleneck. Finally, the project results will be disseminated and communicated to maximize outreach. This will be done by attending workshops and conferences and by publishing the research results in journals. The proposed architecture and the produced resources (ontologies, RDF stores) will be made publicly available to ensure wider outreach. The obtained results will also be communicated to external stakeholders in Europe (such as the Research Data Alliance) that can help increase the outreach of the project and exploit the obtained knowledge and resources.
Significance for the country
Within the proposed project, we will develop a new architecture for executing, storing, semantically annotating, and querying experiments and results in complex data analytics, to improve the reproducibility of experiments and the reusability of data, experiments and models. This is a highly relevant topic that can have a large impact in application domains that rely heavily on data analytics tools. For example, given our involvement in the Medical Informatics Platform of the H2020 FET Flagship Human Brain Project, we can instantiate the architecture to support reproducibility and reuse for the tasks of biomarker discovery and the discovery of biological signatures of diseases. Moreover, we are also involved in an Interreg project with the Italian partner ICGEB, Trieste, where we will be able to instantiate the architecture to support reproducibility and reuse for the task of disease modeling in the context of building a cross-border platform for validated biotech industry kits. The proposed research will significantly advance the state of the art in the general area of computer science, in the specific area of machine learning and data mining, and particularly in the topic of complex data analytics. It will develop a new architecture for semantically aware experimentation. It will also improve the storing, reusing, revising and querying of models, which is of particular importance for application areas that use analytics as a service. In a wider societal context, the project will increase Slovenia’s research and innovation potential in this area of extreme practical importance. The proposed project can have a large impact in the context of automating data science. Experiments will be repeatable because they are performed in a sound, documented fashion, supported by an architecture designed for such analysis.
Current experimentation architectures are applicable to a very limited set of tasks and do not deal with querying, collaborative validation and revision of models, which represents a serious development bottleneck. Finally, the project results will be disseminated and communicated to maximize outreach. This will be done by attending workshops and conferences and by publishing the research results in journals. The proposed architecture and the produced resources (ontologies, RDF stores) will be made publicly available to ensure wider outreach. The obtained results will also be communicated to external stakeholders in Europe (such as the Research Data Alliance) that can help increase the outreach of the project and exploit the obtained knowledge and resources.
Most important scientific results Interim report
Most important socioeconomically and culturally relevant results