Advancement of computationally intensive methods for efficient modern general-purpose statistical analysis and inference

Code

L1-7542 (A) - included in ARIS records

Head

PhD Erik Štrumbelj

Period

3/1/2016 - 2/28/2019

Range in 2019

0.36 FTE

Science

Natural sciences and mathematics (1)
Engineering sciences and technologies (12)
Medical sciences (1)
Social sciences (3)
Humanities (7)

Researcher status

Researcher (24)
Junior expert or technical associate (0)

Education

Doctoral degree (21)
Other (3)

Sex

Woman (5)
Man (19)

Status

Employed at RO (1)
Employed at RO and RRD (18)
No data on employment in RO (2)
Retired (3)

No. of publications

0 (1)
10–99 (6)
100–999 (14)
1,000–9,999 (3)

Projects / Programmes source: ARIS

Research activity

Code	Science	Field	Subfield
1.07.01	Natural sciences and mathematics	Computer intensive methods and applications	Algorithms

Code	Science	Field
P160	Natural sciences and mathematics	Statistics, operations research, programming, actuarial mathematics

Code	Science	Field
1.01	Natural Sciences	Mathematics

Keywords

applied statistics, Bayesian statistics, Markov Chain Monte Carlo, parallelization, graphical processing units, hierarchical models, clustering

Evaluation (methodology)

Evaluation of bibliographic research performance indicators according to ARIS methodology

Citations for bibliographic records in COBIB.SI that are linked to records in citation databases

Organisations (5), Researchers (24)

1539 University of Ljubljana, Faculty of Computer and Information Science

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	28779	PhD Zoran Bosnić	Computer science and informatics	Researcher	2016 - 2019	246
2.	33795	Rok Češnovar	Computer science and informatics	Researcher	2016 - 2019	23
3.	37645	PhD Jure Demšar	Computer science and informatics	Researcher	2018	118
4.	29485	PhD Jana Faganeli Pucer	Computer science and informatics	Researcher	2016 - 2019	58
5.	04242	PhD Igor Kononenko	Computer science and informatics	Researcher	2016 - 2019	478
6.	14565	PhD Matjaž Kukar	Computer science and informatics	Researcher	2016 - 2018	243
7.	22723	PhD Polona Oblak	Mathematics	Researcher	2016 - 2018	152
8.	15295	PhD Marko Robnik Šikonja	Computer science and informatics	Researcher	2016 - 2018	511
9.	33385	PhD Davor Sluga	Computer science and informatics	Researcher	2018 - 2019	39
10.	29486	PhD Erik Štrumbelj	Computer science and informatics	Head	2016 - 2019	128

0581 University of Ljubljana, Faculty of Arts

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	37430	PhD Vida Ana Politakis	Psychology	Young researcher	2016 - 2019	31
2.	17893	PhD Grega Repovš	Psychology	Researcher	2016 - 2019	520
3.	36162	PhD Anka Slana Ozimič	Neurobiology	Researcher	2016 - 2018	153

0587 University of Ljubljana, Faculty of Sport

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	14970	PhD Frane Erčulj	Educational studies	Researcher	2016 - 2019	631

0618 Research Centre of the Slovenian Academy of Sciences and Arts

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	27510	PhD Mateja Breg Valjavec	Geography	Researcher	2016 - 2019	223
2.	30791	PhD Rok Ciglič	Geography	Researcher	2016 - 2019	468
3.	13179	PhD Mauro Hrvatin	Humanities	Researcher	2016 - 2019	375
4.	07553	PhD Drago Kladnik	Geography	Researcher	2016 - 2019	1,175
5.	21464	PhD Blaž Komac	Geography	Researcher	2016 - 2019	656
6.	08294	PhD Drago Perko	Geography	Researcher	2016 - 2019	1,060
7.	22245	PhD Matija Zorn	Geography	Researcher	2016 - 2019	1,330

3346 OPTILAB d.o.o., information technology and business services

no.	Code	Name and surname	Research area	Role	Period	No. of publications
1.	28473	Štefan Furlan	Computer science and informatics	Researcher	2016	35
2.	36887	Gaber Terseglav	Computer science and informatics	Researcher	2016	0
3.	33290	PhD Gašper Zadnik	Computer science and informatics	Researcher	2016 - 2019	10

Abstract

It is difficult to overstate the importance of statistical data analysis in today's world: the empirical sciences, health, finance, fraud detection, telecommunications, social networking, and marketing are just a few of the areas that rely heavily on data and their analysis. While applied statistics, especially modern Bayesian statistics, has progressed tremendously and has become much more accessible, progress has recently been slowing down, because current state-of-the-art computation cannot handle the models and volumes of data we want to analyze today. The issue of inefficient statistical computation has recently been highlighted as one of the top 5 open problems in statistics. Our primary objective is to contribute to solving this problem by researching an approach to more efficient general-purpose computation and implementing the findings in a tool that would allow us to analyze ever-growing volumes of data at a reasonable cost. We plan to achieve this objective by automatically parallelizing the most expensive parts of general-purpose Markov Chain Monte Carlo computation algorithms (in particular, Metropolis-Hastings and Hamiltonian Monte Carlo) and using graphical processing units. As a result of our project, we anticipate at least 100-fold speedups at a low cost (less than €1,000). Furthermore, we have attracted top researchers and experts from the University of Ljubljana, the Slovenian Academy of Sciences and Arts, and industry to participate in the project. Every data set and statistical inference problem we use to gain insight and to develop, evaluate, and validate our methodology will be part of a relevant practical problem faced by Slovenian researchers. There have been successful attempts at efficient statistical computation for very limited cases, but what we are aiming for - general-purpose inference that is automatically parallelized for highly efficient computation - is novel and has so far not been achieved.
This makes the project extremely relevant both as a significant scientific achievement in the field of computation and due to the numerous practical benefits of low-cost, accessible, high-performance statistical inference. Indications from related work suggest that the speedups we are aiming for are achievable. While this is a research project and several technical details and implementation issues remain to be resolved, we are confident of the project's feasibility, as we have a set of well-defined and directly measurable requirements, have laid out a clear plan for achieving them, and have assembled a project team of experts from varied backgrounds with all the required knowledge and know-how. We have also attracted co-financing from industry to supplement our budget, and we will actively promote student participation. The main contributions of the project will be the theoretical research that leads to efficient computation, the practical implementation of this research in a software tool for general-purpose statistical computation, and, as a by-product, empirical research achievements in other fields of science made possible by our methodological research. Efficient computation will cut time and costs, which will directly benefit industry and, given the ubiquity and growing volumes of data, everyday life. And last, but not least, the collaboration between researchers, applied researchers, industry, and students will raise the general level of applied statistical knowledge, a field that is extremely underdeveloped in Slovenia.
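
To make the idea of parallelizing the most expensive part of Markov Chain Monte Carlo concrete, the following is a minimal illustrative sketch and not the project's actual tool. It assumes a toy model (Gaussian observations with known unit variance and a flat prior on the mean) and a random-walk Metropolis-Hastings sampler. The function names (split_into_chunks, chunk_loglik, log_posterior, metropolis_hastings) and the mapper parameter are hypothetical and introduced only for this example. The point it shows is that the log-likelihood is a sum over observations, so it can be evaluated chunk by chunk and the partial sums combined, which is the step a GPU or a pool of workers could take over.

```python
import math
import random
from functools import partial


def split_into_chunks(data, n_chunks):
    """Split the observations into n_chunks roughly equal parts."""
    return [data[i::n_chunks] for i in range(n_chunks)]


def chunk_loglik(chunk, mu, sigma=1.0):
    """Gaussian log-likelihood contribution of one chunk of observations."""
    const = -0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const - 0.5 * ((y - mu) / sigma) ** 2 for y in chunk)


def log_posterior(mu, chunks, mapper=map):
    """Log-posterior under a flat prior: the sum of per-chunk log-likelihoods.

    The call to mapper is the data-parallel step. With the built-in map it
    runs serially; a parallel map could be substituted (an assumption of
    this sketch, not something the toy example needs).
    """
    return sum(mapper(partial(chunk_loglik, mu=mu), chunks))


def metropolis_hastings(chunks, n_iter=2000, step=0.1, init=0.0, seed=0,
                        mapper=map):
    """Random-walk Metropolis-Hastings for the mean parameter mu."""
    rng = random.Random(seed)
    mu = init
    lp = log_posterior(mu, chunks, mapper)
    samples = []
    for _ in range(n_iter):
        proposal = mu + rng.gauss(0.0, step)
        lp_new = log_posterior(proposal, chunks, mapper)
        # Symmetric proposal, so accept with probability min(1, ratio).
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            mu, lp = proposal, lp_new
        samples.append(mu)
    return samples
```

In this sketch every iteration needs one full pass over the data, so the cost of the chain is dominated by log_posterior. Passing a parallel map as mapper, for instance the map method of a concurrent.futures.ProcessPoolExecutor, would distribute that pass over several workers without changing the sampler itself; whether this pays off depends on the data size and the overhead of dispatching work.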

Significance for science

Successful realization of the research project objectives is, in scientific terms, directly relevant to the fields of computationally intensive methods and statistical computing. Statistical computing has also been highlighted as the 2nd most important open problem in applied statistics. Our research is therefore also an important step forward in the field of applied statistics at the world-class level. The key contribution will be the research on how to automatically parallelize MCMC methods for a general class of statistical models and a practical implementation of the research findings in a tool for efficient general-purpose statistical inference. While the primary focus will be on functions that are relevant to statistics, our findings will be relevant to any field that deals with the computation of high-dimensional integrals.

The computational tools developed in this project will also have a secondary effect on the development of other scientific fields. Statistical inference has broad applicability and is the methodological foundation of all empirical investigation. The statistical analyses in the applied problems we have identified together with our project partners will be made more efficient, which will allow us to investigate research questions faster. And, more importantly, it will allow Slovenian empirical scientists, who in most cases do not have supercomputers or substantial budgets for computation, to investigate new research questions that were until now prohibitive due to model complexity and/or scale of data. All our applications deal with open research questions in their respective fields, so each of them has the potential to lead to significant scientific progress. We anticipate at least one substantial research achievement in each of the following fields: geography, neuroscience, and sports.

Significance for the country

Successful realization of the project will have a strong impact on applied statistical analysis and will have long-term benefits to Slovenian research and industry. Our results will improve the flexibility of statistical inference and cut computational costs. Statistical analysis has as broad an applicability in industry as it does in the sciences - fraud detection, marketing, and finance are just some of the industries that rely heavily on data analysis.

Specifically, at least one of the application areas in the project will be in collaboration with our partner from industry, Optilab, where we plan for our methods to improve one of their services or products in either direct marketing or fraud detection. This will directly contribute to the quality of the partner's services and their know-how, and is very likely to lead to new employment opportunities and future projects.

The results of this project can just as easily be applied in health, medical prognostics and diagnostics, natural disaster prediction, and other similar fields where statistical analysis can have a strong positive impact on society. To give an example, our research group has in the past worked with the Institute of Oncology Ljubljana on automated breast cancer recurrence prediction, leading to positive results on a small data set of approximately 1,000 patients. However, today there are over 10,000 new cancer patients in Slovenia alone, each with a medical history, gene profile, test results, etc., which amounts to large volumes of data. Statistical analysis of this data could lead to substantial improvement in cancer prognosis or recurrence prediction and in turn to more efficient treatment, which would benefit patients as well as reduce healthcare costs. However, in order to analyze such data, we require efficient computation.

Furthermore, the project is a collaboration between computational and statistical scientists, researchers from other applied sciences, and a partner from industry, which will facilitate the flow of ideas. Several applications included in the project have substantial economic value, such as direct marketing, predicting natural disasters (landslides, etc.), and automated player scouting. These or potentially some of the other applications are likely to lead to new projects or a spin-off company.

Most important scientific results

Interim report, final report

Most important socioeconomically and culturally relevant results

Final report
