BioMining: Data Mining for Biomedical Problems

Details of project

Identifier

111710

Type

Principal investigator

Buza, Krisztián Antal

Title in Hungarian

BioMining: Gépi tanulás orvosbiológiai feladatokra

Title in English

BioMining: Data Mining for Biomedical Problems

Keywords in Hungarian

gépi tanulás, osztályozás, link prediction, mátrix kitöltés, orvosbiológiai adatok, idősorok, electroencephalográf (EEG), electrokardiográf (EKG), gene expression data

Keywords in English

machine learning, classification, link prediction, matrix completion, biomedical data, time series, electroencephalograph (EEG), electrocardiograph (ECG), gene expression data

Discipline

Information Technology (Council of Physical Sciences)	100 %
Ortelius classification: Informatics

Panel

Informatics and Electrical Engineering

Department or equivalent

Brain Imaging Centre (Research Center of Natural Sciences)

Starting date

2014-09-01

Closing date

2017-08-31

Funding (in million HUF)

15.434

FTE (full time equivalent)

1.50

State

closed project

Summary in Hungarian

A kutatás összefoglalója, célkitűzései szakemberek számára
A projekt során gépi tanuláson alapuló eljárásokat fogunk fejleszteni orvosbiológiai feladatokra. Konkrétan három területtel fogunk foglalkozni: (i) genetikai adatok osztályozása, (ii) orvosi idősorok (elektrokardiográf, elektroencephalográf adatok) osztályozása, és (iii) új kapcsolatok keresése (link prediction) orvosbiológiai hálózatokban.

Mi a kutatás alapkérdése?
Az orvosbiológiai adatokhoz kapcsolódó osztályozási és kapcsolatkeresési (link prediction) feladatok sikeres megoldásának két alapvető feltétele: (i) olyan osztályozó algoritmusok fejlesztése, amelyek az adatok lényeges tulajdonságait -- többek között: nagy dimenziószám (pl. gene expression adatok esetén), kiegyensúlyozatlan osztályeloszlás, zaj, bizonytalanság (uncertain data), hiányzó értékek, hierarchikus viszonyban lévő osztályok -- egyidejűleg veszik figyelembe, valamint (ii) az orvosi/biológiai szaktudás integrálása az algoritmusba. A létező eljárások a fenti kihívásokra (pl. nagy dimenziószám, kiegyensúlyozatlan osztályeloszlás, hiányzó értékek, stb.) általában egyenként próbálnak választ adni. Ezzel szemben, mi ebben a projektben olyan algoritmusok fejlesztését tűzzük ki célul, amelyek olyan orvosbiológiai adatok jó minőségű osztályozására is képesek, amelyekben a fenti kihívások egyszerre vannak jelen. Friss kutatási eredmények alapján a csomósodás-alapú (hubness-aware) osztályozók jól teljesítenek nagydimenziós térbeli adatok, egyenletlen osztályeloszlású adatok (class-imbalanced data) és idősorok osztályozása esetén egyaránt. Ezért az eljárások fejlesztése során különös figyelemmel fogjuk kísérni a hubness-aware gépi tanulás paradigmáját, és várhatóan ehhez (is) kapcsolódó eljárásokat fogunk fejleszteni.

Mi a kutatás jelentősége?
A projekt során fejlesztendő osztályozó eljárások többek között az orvosi diagnosztikát segíthetik. Génkifejeződési (gene expression) adatok osztályozása például a rák különböző típusainak és altípusainak diagnosztikáját támogathatja, míg az EKG adatok elemzése a szív- és érrendszeri megbetegedések korai felismeréséhez járulhat hozzá. Ennek jelentősége egyrészt abban áll, hogy az említett betegségtípusok a vezető halálozási okok közt szerepelnek Magyarországon és a fejlett világban is. Másrészt például a mellrák (ösztrogén receptor státusz szerinti) különböző altípusai más-más kezelést igényelnek, ugyanazon kezelés, amely az egyik altípus során eredményes, egy másik altípus esetében akár káros is lehet. Fontos tehát a betegség altípusának a lehető legpontosabb felismerése. Az EEG (elektroencephalográf) adatok osztályozása az orvosi diagnosztika (pl. epilepszia) mellett számos gyakorlati alkalmazásban is jelentős, úgy mint súlyosan bénult emberek számára készített EEG headsettel vezérelhető webböngészők és gépelést (és ezáltal a külvilággal történő kommunikációt) segítő eszközök, illetve több órás vagy egész napos utazások során gépjárművezetők éberségét mérő, EEG-n alapuló eszközök. Kapcsolatkeresési feladatok sikeres megoldása többek között a gyógyszerek molekuláris szintű hatásmechanizmusának alaposabb megértését segítheti drug-target interakciók predikciója révén. A kapcsolatkeresés révén új hipotézisek fogalmazhatók meg arra vonatkozóan, hogy a már létező, engedélyezett, és adott betegségre használatban lévő gyógyszerek esetlegesen más betegségek gyógyítására is képesek lehetnek (reuse of drugs). A projekt az innovációs folyamat első lépésére, algoritmusok, elemző eljárások kidolgozására összpontosít, ennek eredményeit tudományos publikációkban (folyóiratcikkek, konferencia megjelenések) fogjuk összegezni, miközben bízunk abban, hogy sikerül felkelteni az alkalmazók (orvosi műszergyártók, gyógyszerfejlesztők, stb.) figyelmét, és ezáltal a projekt eredményei hosszabb távon (a projekt lezárása után is) hasznosulnak majd.

A kutatás összefoglalója, célkitűzései laikusok számára
Orvosbiológiai adatokhoz kapcsolódó, automatikusan, számítógéppel megoldandó felismerési feladatok közös elméleti háttere az osztályozás. Ilyen felismerési feladatok többek között diagnosztikai területen merülnek fel: pl. mellrák vagy más betegség különböző altípusainak felismerése génkifejeződés-adatok alapján, vagy szívműködés rendellenességeinek felismerése EKG (elektrokardiográf) jelek alapján. Az EEG (elektroencephalográf) adatok osztályozása az orvosi diagnosztika (pl. epilepszia) mellett számos gyakorlati alkalmazásban is jelentős, úgy mint súlyosan bénult emberek számára készített EEG headsettel vezérelhető webböngészők és gépelést (és ezáltal a külvilággal történő kommunikációt) segítő eszközök, illetve több órás vagy egész napos utazások során gépjárművezetők éberségét mérő, EEG-n alapuló eszközök. Kapcsolatkeresési feladatok sikeres megoldása többek között a gyógyszerek molekuláris szintű hatásmechanizmusának alaposabb megértését segítheti drug-target interakciók predikciója révén. A kapcsolatkeresés révén új hipotézisek fogalmazhatók meg arra vonatkozóan, hogy a már létező, engedélyezett, és adott betegségre használatban lévő gyógyszerek esetlegesen más betegségek gyógyítására is képesek lehetnek (reuse of drugs). Agyhullámokhoz kapcsolódó felismerési feladatok jelentőségét külön kiemeli Obama elnök által meghirdetett BRAIN kezdeményezés, amely várhatóan új lendületet ad az agykutatásnak. A projektben eljárásokat fogunk fejleszteni az említett osztályozási és kapcsolatkeresési feladatokra. Az automatikus felismerő eljárások az orvost nem helyettesíthetik, de orvosi műszerekbe beépítve az orvos munkáját nagyban segíthetik.

Summary

Summary of the research and its aims for experts
In this project, we address data mining and machine learning problems related to biomedical data, focusing on classification and link prediction algorithms. We propose to develop new methods that are expected to contribute to medical diagnostic tasks as well as to drug-target prediction, one of the primary data mining problems aimed at a better understanding of the chemical mechanisms of drugs. Successful drug-target prediction is also expected to contribute to the re-use of existing drugs in new contexts.
In particular, we will focus on (i) the development of new methods for the classification of genetic data, (ii) the development of new methods for the classification of biomedical time series, such as electrocardiograph (ECG) and electroencephalograph (EEG) signals, and (iii) link prediction in biomedical networks.
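Link prediction in a drug-target network can be illustrated with a small sketch. The code below is a generic weighted-profile (nearest-neighbor) baseline and is not the method developed in the project. The interaction matrix `Y`, the drug similarity matrix `S` and the function name `weighted_profile_scores` are hypothetical and chosen only for illustration.

```python
import numpy as np

def weighted_profile_scores(Y, S, drug):
    """Score every target for one drug from the interaction profiles
    of the other drugs, weighted by drug-drug similarity.

    Y: (n_drugs, n_targets) binary matrix of known interactions (toy data).
    S: (n_drugs, n_drugs) drug similarity matrix (toy data).
    """
    Y = np.asarray(Y, dtype=float)
    S = np.asarray(S, dtype=float)
    weights = S[drug].copy()
    weights[drug] = 0.0            # exclude the drug's own profile
    total = weights.sum()
    if total == 0:
        return np.zeros(Y.shape[1])
    return weights @ Y / total     # similarity-weighted average profile

# Hypothetical toy network: 3 drugs, 3 targets.
Y = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 0]])
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])

scores = weighted_profile_scores(Y, S, drug=1)

# Rank only the pairs that are not yet known to interact.
candidates = np.where(Y[1] == 0)[0]
best_target = int(candidates[np.argmax(scores[candidates])])
print(scores, best_target)
```

In this toy example, the highest-scoring unknown pair for drug 1 is a target that its most similar drug already interacts with. This is the kind of hypothesis that drug-target prediction is meant to deliver.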

What is the major research question?
Classification and link prediction in biomedical data have two essential requirements: (i) the development of classification and link prediction algorithms that take all the relevant properties of the data into account (such as high dimensionality, e.g. in gene expression data, imbalanced classes, noise, uncertainty, missing values, and hierarchical class labels), and (ii) the integration of biological/medical background knowledge into the algorithm. Existing approaches usually address these challenges separately. In this project, by contrast, we aim to develop classifiers that deal with them simultaneously, i.e., classifiers that classify biomedical data with high accuracy even if the data exhibit many of these properties at once. Algorithms developed recently under the umbrella of hubness-aware data mining have been shown to work well for high-dimensional data, imbalanced classes, and even time series classification. Therefore, we will most probably follow the paradigm of hubness-aware data mining in this project.
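Hubness is usually described through the k-occurrence count N_k(x): the number of times an instance x appears among the k nearest neighbors of other instances. In high-dimensional data the distribution of N_k tends to be skewed, so that a few instances ("hubs") appear in many neighbor lists. The following minimal sketch measures this. It assumes Euclidean distance and synthetic Gaussian data, and the function names `k_occurrence` and `skewness` are illustrative only (they are not taken from the PyHubs library).

```python
import numpy as np

def k_occurrence(X, k):
    """Return N_k for each row of X: how often the row appears among
    the k nearest neighbors (Euclidean distance) of the other rows."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Squared Euclidean distances via the Gram matrix.
    sq = np.sum(X ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(dist, np.inf)      # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    return np.bincount(neighbors.ravel(), minlength=n)

def skewness(values):
    """Third standardized moment; large positive values indicate hubs."""
    v = np.asarray(values, dtype=float)
    std = v.std()
    if std == 0:
        return 0.0
    return float(np.mean((v - v.mean()) ** 3) / std ** 3)

# Tiny deterministic example: four points on a line, k = 1.
toy_counts = k_occurrence([[0.0], [1.0], [3.0], [10.0]], k=1)

# Synthetic high-dimensional data (an assumption for illustration).
rng = np.random.default_rng(0)
X_high = rng.normal(size=(200, 50))
counts = k_occurrence(X_high, k=5)
print("skewness of N_5:", skewness(counts))
```

Hubness-aware classifiers build on such counts, for example by giving less weight to hubs that often appear as neighbors of instances from a different class. That step is not shown here.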

What is the significance of the research?
In this project, we will develop classifiers that support medical diagnosis. For example, classification of gene expression data may contribute to the diagnosis of cancer and cancer subtypes, while the analysis of ECG signals may contribute to the early recognition of cardiovascular diseases. The relevance of classification in these domains is clearly shown by the fact that both cardiovascular diseases and cancer are leading causes of death worldwide.
Furthermore, different subtypes of cancer often require different treatments. For example, breast cancer subtypes may be distinguished by estrogen receptor status, and a treatment that is appropriate for one subtype may even be harmful for another. Therefore, it is essential to recognize the subtype of the disease as accurately as possible. Classification of EEG (electroencephalograph) data is relevant not only to medical diagnosis (e.g. epilepsy), but is also a core component of numerous applications, such as EEG-controlled web browsers and spelling devices for paralyzed patients, or EEG-based assessment of the drowsiness of car and truck drivers. Link prediction in biomedical networks, in particular drug-target prediction, is expected to contribute to a better understanding of the chemical mechanisms of drugs and to the re-use of existing drugs in novel contexts: as an outcome of drug-target prediction, we expect promising hypotheses that existing drugs used for particular diseases may also be useful for treating other diseases. This project is clearly positioned at the beginning of the innovation process, i.e., we will develop new analytic algorithms and publish them in scientific papers at international conferences and in journals. We hope to communicate our results to "users", i.e., developers of medical devices, drug developers, etc., so that the results of the project will be applied after its completion.

Summary and aims of the research for the public
Classification is the common denominator of various recognition tasks related to biomedical data, where the recognition is to be performed by computers in an automated way. Such recognition problems appear, for example, in medical diagnosis: classifiers may contribute to the diagnosis of cancer and its subtypes, and to the early detection of cardiovascular diseases. Classification of EEG (electroencephalograph) data, i.e. "brain waves", is relevant not only to medical diagnosis (e.g. epilepsy), but is also a core component of numerous applications, such as EEG-controlled web browsers and spelling devices for paralyzed patients, or EEG-based assessment of the drowsiness of car and truck drivers. The relevance of analytic tasks related to "brain waves" is also emphasized by the BRAIN initiative announced by President Barack Obama, which is expected to give new momentum to brain research. Link prediction in biomedical networks, in particular drug-target prediction, is expected to contribute to a better understanding of the chemical mechanisms of drugs and to the re-use of existing drugs in novel contexts: as an outcome of drug-target prediction, we expect promising hypotheses that existing drugs used for particular diseases may also be useful for treating other diseases. In this project, we aim to develop methods for the aforementioned classification and link prediction problems related to biomedical data. While an automated recognition algorithm will not replace medical doctors, such algorithms may be built into medical devices that support doctors' work.

Final report

Results in Hungarian

A BioMining projektben adatbányászati eljárások orvosbiológiai alkalmazásaival foglalkoztunk. A csomósodás-alapú gépi tanulás paradigmáját követve fejlesztettünk új eljárásokat orvosbiológiai feladatokra, beleértve a génkifejezés-adatok és orvosbiológiai idősorok osztályozását, valamint a hatóanyagok és farmakológiai támadáspontok közötti kapcsolatok statisztikai predikcióját. A projekt során vizsgált osztályozási technikák a diagnosztikai eljárásokat támogathatják, míg a hatóanyagok és farmakológiai támadáspontok közötti kapcsolatok predikciója a gyógyszerfejlesztés folyamatát segítheti. A kutatási tervben vállalt összes feladatot megfelelően teljesítettük. Kiemeljük továbbá, hogy: (i) 9 impakt faktoros (Web of Science által indexált) folyóiratcikket jelentettünk meg, többek között a Knowledge-Based Systems-ben, Neurocomputing-ban, ill. a Frontiers in Neuroscience-ben, és további 8 konferencián/workshop-on mutattuk be munkánkat, (ii) 16 rövid videót vettünk fel, amelyek egy része egy online előadást alkot a csomósodás-alapú gépi tanulásról (http://www.biointelligence.hu/course.html), más része pedig a projekt legfontosabb eredményeit mutatja be (https://www.youtube.com/playlist?list=PLNWnqkAEYZk1ENQAcydQMHdgQ_KV36-oU); (iii) kifejlesztettük a PyHubs szoftverkönyvtárat (http://www.biointelligence.hu/pyhubs/) és további szoftverkódokat publikáltunk.

Results in English

The BioMining project focused on data mining for biomedical tasks. In particular, we aimed to develop new hubness-aware machine learning techniques for biomedical tasks, including the classification of gene expression data and biomedical signals as well as drug-target interaction prediction. Classification is related to medical diagnosis, whereas drug-target interaction prediction may increase the efficiency of drug development by delivering promising hypotheses. All tasks of the research plan were completed appropriately. Regarding the outcome of the project, we point out that: (i) 9 articles were published in journals indexed by Web of Science, including Knowledge-Based Systems, Neurocomputing and Frontiers in Neuroscience, and a further 8 papers were presented at international conferences or workshops; (ii) we recorded 16 short videos with a total length of approx. 100 minutes, organized into two playlists: one is an online lecture about hubness-aware machine learning (http://www.biointelligence.hu/course.html), while the other summarizes key achievements of the project (https://www.youtube.com/playlist?list=PLNWnqkAEYZk1ENQAcydQMHdgQ_KV36-oU); (iii) we developed the PyHubs software library (http://www.biointelligence.hu/pyhubs/) and published further software code.

Full text

https://www.otka-palyazat.hu/download.php?type=zarobeszamolo&projektid=111710

Decision

Yes

List of publications

Krisztian Buza, Ladislav Peska: Drug–target interaction prediction with Bipartite Local Models and hubness-aware regression, Neurocomputing, Volume 260, 284-293, 2017

Krisztian Buza: Classification of Gene Expression Data: A Hubness-aware Semi-Supervised Approach, Computer Methods and Programs in Biomedicine, Volume 127, 105-113, 2016

Ladislav Peska, Krisztian Buza, Julia Koller: Drug-Target Interaction Prediction: a Bayesian Ranking Approach, Computer Methods and Programs in Biomedicine, Vol. 152, pp. 15-21, 2017

Regina J. Meszlényi, Petra Hermann, Krisztian Buza, Viktor Gál, Zoltán Vidnyánszky: Resting State fMRI Functional Connectivity Analysis Using Dynamic Time Warping, Frontiers in Neuroscience, Volume 11, Article 75, 2017

Nenad Tomasev, Krisztian Buza, Dunja Mladenic: Correcting the Hub Occurrence Prediction Bias in Many Dimensions, Computer Science and Information Systems, Vol. 13, Issue 1, 2016

Krisztian Buza, Júlia Koller: Classification of Electroencephalograph Data: A Hubness-aware Approach, Acta Polytechnica Hungarica, Vol. 13, No. 2, pp. 27-46, 2016

Krisztian Buza, Noémi Ágnes Varga: ParkinsoNET: Estimation of UPDRS Score using Hubness-aware Feed-Forward Neural Networks, Applied Artificial Intelligence, Volume 30, Issue 6, pp. 541-555, 2016

Krisztian Buza, Alexandros Nanopoulos, Gabor Nagy: Nearest neighbor regression in the presence of bad hubs, Knowledge-Based Systems, Volume 86, 250-260, http://www.sciencedirect.com/science/article/pii/S0950705115002282, 2015

Nenad Tomasev, Krisztian Buza: Hubness-aware kNN classification of high-dimensional data in presence of label noise, Neurocomputing, Volume 160, 157-172, http://www.sciencedirect.com/science/article/pii/S0925231215001228, 2015

Krisztian Buza, Julia Koller, Kristof Marussy: PROCESS: Projection-Based Classification of Electroencephalograph Signals, Artificial Intelligence and Soft Computing, Lecture Notes in Computer Science, Vol. 9120, pp. 91-100, Springer, http://link.springer.com/chapter/10.1007/978-3-319-19369-4, 2015

Krisztian Buza, Noémi Ágnes Varga: Machine Learning for the Estimation of UPDRS score, VII. Dubrovnik Conference on Cognitive Science (DUCOG), 2015

Krisztian Buza: Hubness: An Interesting Property of Nearest Neighbor Graphs and its Impact on Classification, 9th Japanese-Hungarian Symposium on Discrete Mathematics and Its Applications, invited talk, 2015

Kristof Marussy, Ladislav Peška, Krisztian Buza: Recommendations of Unique Items Based on Bipartite Graphs, 9th Japanese-Hungarian Symposium on Discrete Mathematics and Its Applications, 2015

Krisztian Buza, Kristof Marussy: PROGRESS: Projection-Based Gene Expression Classification, Innovations in Medicine Conference, 2014

Krisztian Buza: Drug-Target Interaction Prediction with Hubness-aware Machine Learning, 11th IEEE International Symposium on Applied Computational Intelligence and Informatics, 2016

Krisztian Buza: Person Identification Based on Keystroke Dynamics: Demo and Open Challenge, Forum at the 28th International Conference on Advanced Information Systems Engineering (CAiSE'16), 2016

Regina Meszlényi, Ladislav Peska, Viktor Gal, Zoltán Vidnyánszky, Krisztian Buza: A model for classification based on the functional connectivity pattern dynamics of the brain, Proceedings of the Third European Network Intelligence Conference, 2016

Regina Meszlényi, Ladislav Peska, Viktor Gal, Zoltán Vidnyánszky, Krisztian Buza: Classification of fMRI data using Dynamic Time Warping based functional connectivity analysis, Proceedings of the 24th European Signal Processing Conference, 2016

Krisztian Buza: Semi-supervised Naive Hubness-Bayesian k-Nearest Neighbor for Gene Expression Data, Proceedings of the 9th International Conference on Computer Recognition Systems CORES 2015, pp. 101-110, Springer, 2015

Krisztian Buza, Dora Neubrandt: How You Type Is Who You Are, 11th IEEE International Symposium on Applied Computational Intelligence and Informatics, pp. 453-456, 2016

Rodica Ioana Lung, Mihai Suciu, Regina Meszlényi, Krisztian Buza, Noémi Gaskó: Community structure detection for the functional connectivity networks of the brain, Parallel Problem Solving from Nature - PPSN XIV, pp 633-643, Springer, 2016

K. Buza, D. Neubrandt: A New Proposal for Person Identification Based on the Dynamics of Typing: Preliminary Results, Theoretical and Applied Informatics, Vol. 28, No. 1-2, 2016
