Projects / Programmes source: ARIS

Literature-based discovery

Research activity

Code Science Field Subfield
5.13.00  Social sciences  Information science and librarianship   

Code Science Field
H100  Humanities  Documentation, information, library science, archivistics 

Code Science Field
5.08  Social Sciences  Media and communications 
Keywords
Information science, bibliometry, scientometrics, medical informatics, literature-based discovery, MEDLINE
Evaluation (rules)
source: COBISS
Researchers (1)
No.  Code  Name and surname  Research area  Role  Period  No. of publications
1.  26484  PhD Andrej Kastrin  Medical sciences  Head  2018 - 2020  148 
Organisations (1)
No.  Code  Research organisation  City  Registration number  No. of publications
1.  0381  University of Ljubljana, Faculty of Medicine  Ljubljana  1627066  48,195 
Abstract
Literature-based discovery (LBD) is an important research problem. LBD is a text mining technology for automatically generating research hypotheses. The aim is to uncover hidden, previously unknown relationships from existing literature. In the proposed project we translate the problem of LBD into link prediction in complex networks. Complex networks are evolving structures that change over time. The fundamental problem in complex network research is understanding how links form between nodes. Formally, link prediction refers to the discovery of associations between nodes that are not directly connected in the current snapshot of a network but will be connected in the future. In this project we plan to examine link prediction from the novel perspective of LBD. We will model the process of LBD as a classification problem, where features will be represented by topological and semantic similarity measures between biomedical concepts. The objectives of the proposed project can be summarized as follows: (i) theoretical analysis of the MEDLINE bibliographic database and the SemMed database of semantic relations, (ii) development of programming tools for link prediction in large-scale networks, (iii) extension and validation of link prediction methodology for LBD on heterogeneous networks using the concept of meta-path, (iv) implementation of network embedding methodology for LBD link prediction, and finally (v) development of a Web application for real-time LBD on the basis of link prediction methodology. Our preliminary analysis shows that there is considerable interest in LBD among professionals, especially in the domain of life sciences. However, previous approaches to LBD have generated a large number of false-positive results, which has made substantive interpretation of the results difficult. The proposed project addresses this problem. As the main data source we will use MEDLINE, the largest literature database in the biomedical domain.
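The formal definition of link prediction given above can be illustrated with a minimal sketch. It assumes, purely for illustration, that an edge links two concepts mentioned in the same citation, that the network is split into a "current" and a "future" snapshot by publication year, and that a single topological feature (the number of common neighbours) describes each candidate pair. The toy citations, the year cut-off, and all function names are hypothetical and are not taken from the project.

```python
from itertools import combinations

# Hypothetical toy data: (publication year, concepts mentioned in one citation).
CITATIONS = [
    (2000, {"fish oil", "blood viscosity"}),
    (2001, {"blood viscosity", "Raynaud disease"}),
    (2002, {"fish oil", "platelet aggregation"}),
    (2003, {"platelet aggregation", "Raynaud disease"}),
    (2004, {"fish oil", "aspirin"}),
    (2010, {"fish oil", "Raynaud disease"}),
]


def build_edges(citations, first_year, last_year):
    """Return the undirected concept pairs that co-occur within the period."""
    edges = set()
    for year, concepts in citations:
        if first_year <= year <= last_year:
            # Sorting gives every pair one canonical orientation.
            for a, b in combinations(sorted(concepts), 2):
                edges.add((a, b))
    return edges


def neighbours(edges):
    """Build an adjacency map: node -> set of adjacent nodes."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def candidate_pairs(edges):
    """List pairs unconnected in the snapshot with their common-neighbour count."""
    adj = neighbours(edges)
    rows = []
    for a, b in combinations(sorted(adj), 2):
        if (a, b) not in edges:
            rows.append(((a, b), len(adj[a] & adj[b])))
    return rows


# Current snapshot (assumed cut-off: 2004) and the later period used for labels.
train_edges = build_edges(CITATIONS, 2000, 2004)
future_edges = build_edges(CITATIONS, 2005, 2010)

# Each row: (pair, feature value, label). The label is True when the pair,
# unconnected in the current snapshot, becomes connected in the future one.
dataset = [
    (pair, score, pair in future_edges)
    for pair, score in candidate_pairs(train_edges)
]

for pair, score, label in sorted(dataset, key=lambda row: -row[1]):
    print(pair, score, label)
```

Rows of this kind (one feature vector and one binary label per unconnected pair) are what a classifier would be trained on. A real setting would use many more similarity measures and far larger networks.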
We will also employ SemMedDB, a database of semantic predications parsed from MEDLINE. To set up a network we will treat each biomedical concept as a node. An edge between two nodes will be defined if they appear together in the same MEDLINE citation. Our approach to link prediction relies on the assumption that similar nodes are more likely to establish an edge in the future. Using the derived similarity measures, we will apply unsupervised and supervised machine learning methods and thus assess their ability to predict new relations. The majority of link prediction studies have been performed on homogeneous networks. However, a heterogeneous network, such as SemMedDB, contains multiple types of nodes and edges. Heterogeneous networks also provide non-lexical information (e.g., author and citation relations). We will use the concept of meta-path to encode semantic information in the SemMedDB network. For feature extraction we will use state-of-the-art similarity measures, including PathCount, RandomWalk, PathSim, and HeteSim. Statistical learning will utilize different classification algorithms (e.g., random forest, extreme gradient boosting) for link prediction on embedded networks. In order to apply machine learning on networks, it is essential to learn informative node representations. Representation learning approaches offer a powerful alternative to traditional feature engineering. The general idea behind representation learning is to learn a mapping that embeds nodes as points in a low-dimensional space. The goal is to optimize this mapping so that geometric relationships in this space reflect the structure of the original network. In this project we will examine different state-of-the-art methods for network embedding in order to perform link prediction (e.g., factorization methods, DeepWalk, node2vec). The workplan of the proposed research is designed to achieve the project's goals in an efficient and effective manner.
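Two of the meta-path similarity measures named above, PathCount and PathSim, can be illustrated on a toy heterogeneous network. The sketch below assumes a symmetric Drug-Gene-Drug meta-path with unweighted edges. PathCount is the number of path instances between two drugs, and PathSim normalises that count by the number of such paths from each drug back to itself. The drug and gene identifiers are hypothetical placeholders, not SemMedDB records, and the function names are illustrative.

```python
# Hypothetical typed edges of a toy heterogeneous network:
# each drug maps to the set of genes it is linked to.
DRUG_GENE = {
    "drug_A": {"gene_1", "gene_2", "gene_3"},
    "drug_B": {"gene_2", "gene_3"},
    "drug_C": {"gene_4"},
}


def path_count(x, y, links=DRUG_GENE):
    """Count Drug-Gene-Drug path instances between drugs x and y.

    With unweighted edges, every shared gene contributes exactly one path.
    """
    return len(links[x] & links[y])


def path_sim(x, y, links=DRUG_GENE):
    """PathSim for a symmetric meta-path.

    2 * paths(x, y) / (paths(x, x) + paths(y, y)); returns 0.0 when both
    nodes have no path instances at all.
    """
    denominator = path_count(x, x, links) + path_count(y, y, links)
    if denominator == 0:
        return 0.0
    return 2 * path_count(x, y, links) / denominator


for pair in [("drug_A", "drug_B"), ("drug_A", "drug_C")]:
    print(pair, path_count(*pair), path_sim(*pair))
```

The normalisation is the reason to prefer PathSim over raw counts in some settings: a highly connected concept accumulates many path instances to almost everything, and dividing by the self-path counts dampens that effect.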
Workplan is partitioned into six work packages, each with sever
Significance for science
The proposed project offers an innovative combination of statistical and knowledge-based techniques to improve the literature-based discovery (LBD) process. According to our preliminary results, it is better able to discover latent associations in the literature that may be too complex to be modeled using any existing approach to LBD. Special attention will be paid to performance evaluation, an issue that is neglected in most current approaches to LBD. With the innovative integration of methodologies from different research fields (scientometrics, link prediction, and machine learning), the proposed project offers a fresh perspective on how the LBD problem could be addressed. Moreover, the importance of new LBD technologies is even greater because they serve as a basis for other scientific fields (e.g., question-answering, gene discovery, drug repurposing). In particular, the following stakeholders will benefit from the outcome of the project: (i) researchers in biomedicine and biology interested in early detection of relationships between scientific instances, (ii) curators and maintainers of biomedical databases and resources, and (iii) entrepreneurs seeking business opportunities in high-tech bioscience. We believe that the results of the proposed project will contribute significantly to global knowledge in the field of LBD, to the further establishment of the Ljubljana school of LBD on the European and global scale, and to the transfer of scientific knowledge into practice (especially in the field of life sciences). The results will also contribute significantly to the consolidation of new research areas, such as temporal relational data mining and bibliomics, an emerging scientific discipline that deals with causative modeling based on textual data.