Improving Reproducibility of Experiments and Reusability of Research Outputs in Complex Data Analysis
Code | Science | Field | Subfield
2.07.00 | Engineering sciences and technologies | Computer science and informatics |
Code | Science | Field
P170 | Natural sciences and mathematics | Computer science, numerical analysis, systems, control
Code | Science | Field
1.02 | Natural Sciences | Computer and information sciences
reproducible research; reuse of research results; machine learning; data mining; complex data analysis; semantic technologies;
Researchers (11)
no. | Code | Name and surname | Research area | Role | Period | No. of publications
1. | 53798 | Jure Brence | Computer science and informatics | Researcher | 2020 - 2022 | 22
2. | 36220 | PhD Martin Breskvar | Computer science and informatics | Researcher | 2018 - 2022 | 36
3. | 11130 | PhD Sašo Džeroski | Computer science and informatics | Researcher | 2018 - 2022 | 1,209
4. | 31050 | PhD Dragi Kocev | Computer science and informatics | Researcher | 2018 - 2022 | 206
5. | 53530 | Ana Kostovska | Computer science and informatics | Junior researcher | 2020 - 2022 | 45
6. | 28291 | PhD Petra Kralj Novak | Computer science and informatics | Researcher | 2018 - 2022 | 130
7. | 36356 | PhD Aljaž Osojnik | Computer science and informatics | Researcher | 2018 - 2022 | 47
8. | 27759 | PhD Panče Panov | Computer science and informatics | Head | 2018 - 2022 | 157
9. | 38206 | PhD Matej Petković | Computer science and informatics | Junior researcher | 2018 - 2020 | 67
10. | 34452 | PhD Nikola Simidjievski | Computer science and informatics | Researcher | 2018 - 2022 | 58
11. | 39156 | PhD Tomaž Stepišnik | Computer science and informatics | Junior researcher | 2018 - 2022 | 28
Organisations (1)
no. | Code | Research organisation | City | Registration number | No. of publications
1. | 0106 | Jožef Stefan Institute | Ljubljana | 5051606000 | 91,767
Abstract
Advances in science rest on the premise of trusted discovery: research must be performed correctly and be reproducible by other scientists. To increase the reusability of research outputs, such as developed models and produced data, these outputs should be Findable, Accessible, Interoperable and Reusable (the FAIR principles). The main point of the FAIR principles is to ensure that research outputs are reusable and will actually be used by others, thus becoming more valuable. The DG for Research and Innovation of the EC has adopted the reusability of research data as one of its priorities, which led to the rapid endorsement of the FAIR principles by different stakeholders. Research outputs that are to fulfill the FAIR principles must be represented in a widely accepted machine-readable framework. Currently, a popular solution to data sharing that fulfills the FAIR requirements is the use of semantic web technologies.
Complex data analysis methods, originating from machine learning (ML) and data mining (DM), are increasingly being used in applications from various domains of science (e.g., life sciences and space research). To provide reproducibility of experiments (e.g., executions of methods) and reuse of research outputs (e.g., predictive models), one needs to formally describe the entities involved in the process of analysis and store them together with their descriptions (e.g., metadata) as digital objects in a database-like structure. Having “semantically aware” stores of entities for complex data analytics, enhanced with automatic reasoning capabilities, would be beneficial for improving reproducibility of experiments and reuse of research outputs. In this way, we would move closer towards a FAIR data analysis process.
The main objective of the proposed project is to improve the repeatability of experiments and reusability of research outputs in complex data analysis. We will address this objective by combining approaches and ideas from the areas of complex data analytics, ontologies for science, semantic web and inductive databases. More specifically, we will develop a modular system for executing complex data analysis experiments, and for semantically annotating, storing, querying and reusing their outputs. To meet the project's main objective, we plan to: (1) design, implement and populate ontologies for complex data analysis to be used for semantic annotation; (2) design and implement a prototype system for storing semantically annotated data, experiments and models; (3) develop querying strategies and test the querying capabilities of the prototype system; and (4) test the developed system in different use-case scenarios from various domains, such as machine learning, life sciences, space research and chemoinformatics.
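To make the idea of semantically annotating, storing and querying experiments more concrete, the following is a minimal sketch and not the project's actual system. It assumes a toy in-memory store of subject-predicate-object statements. All identifiers in it (the `ex:` terms, `TripleStore`, `annotate_experiment`, `models_for_dataset`) are hypothetical names chosen for illustration and are not taken from the project's ontologies.

```python
from dataclasses import dataclass, field

# A statement is a (subject, predicate, object) triple of identifiers.
Triple = tuple[str, str, str]


@dataclass
class TripleStore:
    """Toy in-memory store of statements; a stand-in for a real RDF store."""
    triples: set[Triple] = field(default_factory=set)

    def add(self, s: str, p: str, o: str) -> None:
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None) -> list[Triple]:
        """Return all triples matching the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        )


def annotate_experiment(store, exp_id, dataset, algorithm, model_id, task):
    """Describe one experiment and its output model with hypothetical terms."""
    store.add(exp_id, "rdf:type", "ex:Experiment")
    store.add(exp_id, "ex:usesDataset", dataset)
    store.add(exp_id, "ex:executesAlgorithm", algorithm)
    store.add(exp_id, "ex:producesModel", model_id)
    store.add(model_id, "rdf:type", "ex:PredictiveModel")
    store.add(model_id, "ex:addressesTask", task)


def models_for_dataset(store, dataset):
    """Find models produced by any experiment that used the given dataset."""
    models = []
    for exp, _, _ in store.match(p="ex:usesDataset", o=dataset):
        models.extend(m for _, _, m in store.match(s=exp, p="ex:producesModel"))
    return sorted(models)


store = TripleStore()
annotate_experiment(store, "ex:exp1", "ex:datasetA", "ex:decisionTree",
                    "ex:model1", "ex:classification")
annotate_experiment(store, "ex:exp2", "ex:datasetA", "ex:randomForest",
                    "ex:model2", "ex:classification")
annotate_experiment(store, "ex:exp3", "ex:datasetB", "ex:decisionTree",
                    "ex:model3", "ex:regression")

print(models_for_dataset(store, "ex:datasetA"))
```

In a realistic setting, the annotations would use terms from the designed ontologies and be kept in an RDF store queried with a language such as SPARQL. The sketch only shows how stored descriptions let a later user find reusable models by asking which experiments used a given dataset.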
The proposed research will significantly advance the state-of-the-art in the general area of computer science, the specific area of machine learning and data mining, and particularly the topic of complex data analytics. It will develop a new architecture for semantically aware experimentation. It will also improve the storing, reusing, revising and querying of models produced by the analytics methods. This is of particular importance for the application domains that heavily use data analytics tools in their work. The proposed project will also have a large impact in the context of automating data science. Experiments would be repeatable because they would be performed in a sound, documented fashion, supported by an architecture available for such analyses. Current experimentation architectures are applicable to a very limited set of tasks and do not deal with querying, collaborative validation and revision of models, which represents a serious development bottleneck. Finally, in a wider societal context, the project will increase Slovenia’s research and innovation potential in this area of extreme practical importance.
Significance for science
Within the proposed project, we will develop a new architecture for executing, storing, semantically annotating, and querying experiments and results in complex data analytics, to improve the reproducibility of experiments and the reusability of data, experiments and models. This is a highly relevant topic that can have a large impact in application domains that heavily use data analytics tools in their work. For example, given our involvement in the Medical Informatics Platform of the H2020 FET Flagship Human Brain Project, we can instantiate the architecture to support reproducibility and reuse for the tasks of biomarker discovery and discovery of biological signatures of diseases. Moreover, we are also involved in an Interreg project with an Italian partner (ICGEB, Trieste), where we will be able to instantiate the architecture to support reproducibility and reuse for the task of disease modeling in the context of building a cross-border platform for validated biotech industry kits.
The proposed research will significantly advance the state-of-the-art in the general area of computer science, the specific area of machine learning and data mining, and particularly the topic of complex data analytics. It will develop a new architecture for semantically aware experimentation. It will also improve the storing, reusing, revising and querying of models, which is of particular importance for the application areas that use analytics as a service. In a wider societal context, the project will increase Slovenia’s research and innovation potential in this area of extreme practical importance.
The proposed project can have a large impact in the context of automating data science. Experiments would be repeatable because they would be performed in a sound, documented fashion, supported by an architecture available for such analyses. Current experimentation architectures are applicable to a very limited set of tasks and do not deal with querying, collaborative validation and revision of models, which represents a serious development bottleneck.
Finally, the project results will be disseminated and communicated to provide maximal outreach. This will be done by attending various workshops and conferences as well as by publishing the research results in journals. The proposed architecture and the produced resources (ontologies, RDF stores) will be made publicly available to ensure larger outreach. The obtained results will also be communicated to external stakeholders in Europe (such as the Research Data Alliance) that can help increase the outreach of the project and exploit the obtained knowledge and resources.
Significance for the country
Within the proposed project, we will develop a new architecture for executing, storing, semantically annotating, and querying experiments and results in complex data analytics, to improve the reproducibility of experiments and the reusability of data, experiments and models. This is a highly relevant topic that can have a large impact in application domains that heavily use data analytics tools in their work. For example, given our involvement in the Medical Informatics Platform of the H2020 FET Flagship Human Brain Project, we can instantiate the architecture to support reproducibility and reuse for the tasks of biomarker discovery and discovery of biological signatures of diseases. Moreover, we are also involved in an Interreg project with an Italian partner (ICGEB, Trieste), where we will be able to instantiate the architecture to support reproducibility and reuse for the task of disease modeling in the context of building a cross-border platform for validated biotech industry kits.
The proposed research will significantly advance the state-of-the-art in the general area of computer science, the specific area of machine learning and data mining, and particularly the topic of complex data analytics. It will develop a new architecture for semantically aware experimentation. It will also improve the storing, reusing, revising and querying of models, which is of particular importance for the application areas that use analytics as a service. In a wider societal context, the project will increase Slovenia’s research and innovation potential in this area of extreme practical importance.
The proposed project can have a large impact in the context of automating data science. Experiments would be repeatable because they would be performed in a sound, documented fashion, supported by an architecture available for such analyses. Current experimentation architectures are applicable to a very limited set of tasks and do not deal with querying, collaborative validation and revision of models, which represents a serious development bottleneck.
Finally, the project results will be disseminated and communicated to provide maximal outreach. This will be done by attending various workshops and conferences as well as by publishing the research results in journals. The proposed architecture and the produced resources (ontologies, RDF stores) will be made publicly available to ensure larger outreach. The obtained results will also be communicated to external stakeholders in Europe (such as the Research Data Alliance) that can help increase the outreach of the project and exploit the obtained knowledge and resources.
Most important scientific results
Interim report
Most important socioeconomically and culturally relevant results