Finno-Ugric Digital Natives: Linguistic support for Finno-Ugric digital communities in generating online content

Details of project

Identifier

107885

Type

FNN

Principal investigator

Váradi, Tamás

Title in Hungarian

Finnugor nyelvű közösségek nyelvtechnológiai támogatása online tartalmak létrehozásában

Title in English

Finno-Ugric Digital Natives: Linguistic support for Finno-Ugric digital communities in generating online content

Keywords in Hungarian

finnugor nyelvek, párhuzamos korpusz, párbeszédszemantika, diskurzus a közösségi médiában, digitális közösségek nyelvtechnológiai támogatása

Keywords in English

finno-ugric languages, comparable corpora, dialogue semantics, language technology for digital communities, social media discourse

Discipline

Linguistics (Council of Humanities and Social Sciences)	100 %
Ortelius classification: Computational linguistics

Panel

Linguistics

Department or equivalent

HUN-REN Hungarian Research Centre for Linguistics

Participants

Benyeda, Ivett Zsuzsanna
Ferenczi, Zsanett
Héja, Enikő
Koczka, Péter
Kuti, Judit
Lendvai, Piroska
Ludányi, Zsófia
Mittelholcz, Iván
Oszkó, Beatrix
Simon, Eszter
Tóth, Bianka

Starting date

2013-09-01

Closing date

2018-02-28

Funding (in million HUF)

40.212

FTE (full time equivalent)

16.73

State

closed project

Summary in Hungarian

Summary and aims of the research for the public
Describe here the major aims of the research for a lay audience with a general education. This summary is especially important for the NRDI Office in order to inform decision-makers, the media, and other interested parties.
Today the internet is one of the most important sources for obtaining and sharing information. The generation now growing up often gets its information exclusively from web content, so for language use it matters whether that content is available in the native language or in a foreign language. This is especially important when a community is trying to preserve its native language alongside an expanding majority language that is the language of the state.
In this research we aim to provide language technology support for smaller, endangered Finno-Ugric languages in creating online content. First, we will provide multilingual dictionary information generated automatically from aligned texts, with which native speakers can develop community content on the web (for example, Wikipedia articles). Second, we will provide the machine dialogue system Wikitalk, with which they can converse in their native language, for example about the content of Wikipedia articles. We will extend the system's communication strategies with the project's models of the new types of discourse appearing on social media sites. We expect that the databases and tools developed can play a significant role in revitalizing the smaller Finno-Ugric languages. The target languages to be supported are Sami, Mari, Udmurt, Komi, and Komi-Permyak.

Summary

Summary of the research and its aims for experts
Describe the major aims of the research for experts.
The objective of the project is to provide linguistically based support for several small Finno-Ugric digital communities in generating online content, thereby promoting multilingualism and helping to revitalize the digital functions of endangered Finno-Ugric languages. The project will be based on comparable corpora collected from the web as well as during fieldwork. (i) We will generate proto-dictionaries for several Finno-Ugric language pairs and deploy the enriched lexical material on the web within the collaborative dictionary project Wiktionary. (ii) We will model social media communication and test it for Finno-Ugric languages in the Wikitalk application (Jokinen and Wilcock 2012).

Building on partially existing language resources for the Komi, Komi-Permyak, Udmurt, Mari, and Sami languages, and on proto-dictionary generation technology (Héja 2010) applied to comparable corpora that the project will collect and annotate, we will implement a workflow for creating and deploying freely accessible online lexical resources. The goal is to enable the translation of collaboratively created encyclopedia content such as that found in Wikipedia. For (ii), to analyze and characterize language use in social media, we will collect data from dialogue-related genres such as online discussions, web forum posts, and blog comments, and annotate them at levels ranging from grammatical to discourse phenomena. This will make it possible to learn novel forms of language use, which will be used for displaying or conveying information in Wikitalk for Finno-Ugric language communities.

What is the major research question?
Describe here briefly the problem to be solved by the research, the starting hypothesis, and the questions addressed by the experiments.
Speakers of the smaller Finno-Ugric languages would benefit strongly from language support related to new media. The project aims to devise a corpus-linguistics-based workflow and resources that provide online, interlinked linguistic support for Finno-Ugric language communities in translating articles that already exist in some language editions of Wikipedia.

The project will create datasets of annotated comparable corpora based on data from Wikipedia and social media, compiled in less-supported Finno-Ugric languages as well as in relatively well-supported languages (English, Russian, Finnish, Hungarian). A semi-automatic lexicographic procedure will generate proto-dictionaries for the various language pairs. The dictionaries will be recast as lexico-grammatical data in Wiktionary and linked in a standardized way across languages. The project will empirically test an NLP-supported workflow for crowdsourcing, since any community member can further edit, enrich, and interlink any entry in this collaboratively created lexical database.

Learning to model novel forms of language use from social media discourse is hypothesized to be beneficial for communicating information on these forums for small language communities. The Finno-Ugric digital community will be mobilized and revitalized by means of the Wikitalk application (Jokinen and Wilcock 2012), which can be used to increase the visibility of, and interest in, small Finno-Ugric languages.
Wikitalk will be adapted to FU languages and tested in chatting about interesting topics on the basis of Wikipedia articles, experimenting with the coherence and consistency of presentation.

What is the significance of the research?
Describe the new perspectives opened by the results achieved, including the scientific basics of potential societal applications. Please describe the unique strengths of your proposal in comparison to your domestic and international competitors in the given field.
Cross-domain, multilingual linguistic processing is not only a cutting-edge issue in language technology; it also holds immediate significance for educating the interested public and promoting the survival of their language in the digital age. In Finno-Ugric language communities, the availability of interesting interactive systems, together with opportunities to play, look for information, and use information provided in minor languages, is expected to increase interest in the language concerned. This will contribute to the culture itself, help the communities adapt to the globalized world through their own language, and equip them with digital skills and knowledge.

Based on contributions from the two research teams, the proposed topic will be an important testbed whose output will enhance several lines of scholarly research in theoretical and application-oriented domains. Most importantly, the results are expected to demonstrate the potential of language technology in the revitalization of digitally endangered Finno-Ugric languages, and to enable community forming as well as linguistically supported crowdsourcing. We aim to produce project results (workflow, research environment, technology, resources) that qualify for direct inclusion in training programs that enhance the competence and expertise of young Finnish and Hungarian researchers as well.

The project draws on several complementary workflows, tools, resources, and expertise available at the two research sites. During the proposed project, these will be adapted to specific discourse and thematic areas, serving the needs of the target FU language communities in generating online content. Linguistic support for these communities will be provided within a standardized, linked infrastructure rather than through standalone applications, to allow for interoperability, machine reading, and sustainability. Beyond serving the proposed project's goals, these novel resources will also be an asset for creating language technology software for digitally endangered FU languages, ranging from spelling and grammar checkers to interactive personal assistants on smartphones.

Summary and aims of the research for the public
Describe here the major aims of the research for an audience with average background information. This summary is especially important for NRDI Office in order to inform decision-makers, media, and others.
The digital revolution of our era has a dramatic impact on nearly all aspects of society. Today, information needs are typically met by online, collaboratively edited material, of which Wikipedia is a prominent example. In more personal spheres of life, a novel phenomenon is that interaction takes place via social media platforms and applications. Language communities are highly sensitive to new paradigms in communication technology. The new concepts brought to versatile, small language communities, such as speakers of Finno-Ugric languages, will affect their linguistic landscape to a significant extent, shifting new segments of native language use towards 'globalized', non-native language use.

Members of the smaller Finno-Ugric language communities would benefit strongly from linguistic support related to new media. The project will investigate how modern language technology and corpus-based linguistic research can make a significant contribution to meeting this challenge. Based on large amounts of parallel texts, we will semi-automatically generate online dictionaries to support translating online content, such as Wikipedia, from English, Russian, Finnish, and Hungarian into endangered languages like Komi, Komi-Permyak, Udmurt, Mari, and Sami. The project also plans to mobilize these small online communities via the Wikitalk application, an automated conversational agent with which users can chat about interesting topics in their native language. We expect that the novel workflow and resources generated by the project will help revitalize digitally endangered languages and help raise Finno-Ugric digital natives.

Final report

Results in Hungarian

The aim of the project was to help minority Finno-Ugric language communities in digital revitalization by creating online content. During the project we produced dictionaries for six minority Finno-Ugric languages (Komi-Permyak and Komi-Zyrian, Udmurt, Meadow and Hill Mari, Northern Saami) and four widely used languages (English, Finnish, Hungarian, Russian), that is, for 24 language pairs in total. Because the minority Finno-Ugric languages we studied lack corpora of adequate size and the text processing tools needed for standard dictionary-building methods, we experimented with alternative methods. We evaluated the methods and the resulting dictionaries and published the results. From the bilingual dictionaries we generated Wiktionary entries fully automatically. For example, for the Northern Saami–Finnish language pair, the Northern Saami word became a new headword in the Finnish Wikisanakirja, while its Finnish equivalent became the definition. We uploaded the entries to the Hungarian and Finnish editions of Wiktionary. The dictionary items can be linked across the different language versions of Wiktionary, and interwiki links provide a path to Wikipedia. This allows the language communities to access rich lexical material. In addition, Wiktionary content is incorporated into BabelNet, which is updated daily, so that our dictionaries become part of the semantic web. We have also made the project results available through the finnotka.nytud.hu site, where our dictionaries can be downloaded, and we developed a search tool that can search all 24 dictionaries we produced.

Results in English

The aim of the project was to support Finno-Ugric (FU) minority language communities in digital revitalization by creating online content. For this purpose, we created dictionaries for six FU minority languages (Komi-Permyak, Komi-Zyrian, Hill Mari, Meadow Mari, Northern Saami, Udmurt) and four widely used languages (English, Finnish, Hungarian, Russian), 24 bilingual dictionaries altogether. Since standard dictionary creation methods require large corpora and text processing tools, which do not exist for these FU languages, we investigated several alternative methods. The methods and the resulting dictionaries were thoroughly evaluated, and the results have been published. The bilingual dictionaries served as input for fully automatic generation of Wiktionary entries. For example, in the case of the Northern Saami–Finnish language pair, the Northern Saami word became a new entry in the Finnish Wikisanakirja, while its Finnish counterpart became the definition of this new entry. The entries have already been uploaded to the Finnish and Hungarian editions of Wiktionary. Wiktionary entries in different languages are linked to each other, while interwiki links grant access to Wikipedia. Digital language communities can therefore access rich lexical material. Moreover, Wiktionary dumps are updated daily in BabelNet, through which our dictionaries became part of the semantic web. We make our results available through the finnotka.nytud.hu web site, from which the dictionaries can be downloaded, and we developed a dictionary search tool for searching in all 24 directions.
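
The pairing of six minority languages with four widely used languages, and the way a dictionary pair maps onto an entry, can be illustrated with a minimal Python sketch. It is not the project's actual generator: the function names and the plain-text entry layout are hypothetical (real Wiktionary entries use edition-specific wikitext templates), and the example word pair is chosen for illustration only and is not taken from the project's dictionaries.

```python
from itertools import product

# Language names as listed in the project report.
MINORITY = ["Komi-Permyak", "Komi-Zyrian", "Hill Mari",
            "Meadow Mari", "Northern Saami", "Udmurt"]
MAJOR = ["English", "Finnish", "Hungarian", "Russian"]


def language_pairs():
    """Return every (minority, major) pair: 6 x 4 = 24 dictionaries."""
    return list(product(MINORITY, MAJOR))


def entry_stub(headword, source_lang, translation):
    """Build a plain-text entry stub (hypothetical layout, not real wikitext).

    The minority-language word is the headword; its translation in the
    language of the target Wiktionary edition serves as the definition.
    """
    return (f"{headword}\n"
            f"  language: {source_lang}\n"
            f"  definition: {translation}")


if __name__ == "__main__":
    print(len(language_pairs()))
    # Illustrative pair only: Northern Saami "beaivi", Finnish "päivä".
    print(entry_stub("beaivi", "Northern Saami", "päivä"))
```

In this sketch the entry is keyed by the minority-language headword, mirroring the report's statement that the Northern Saami word becomes the new entry and the Finnish counterpart its definition.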

Full text

https://www.otka-palyazat.hu/download.php?type=zarobeszamolo&projektid=107885

Decision

Yes

List of publications

Zsanett Ferenczi, Iván Mittelholcz, Eszter Simon: Automatic Generation of Wiktionary Entries for Finno-Ugric Minority Languages, In: Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages. Association for Computational Linguistics, 2018. pp. 39-50., 2018

Eszter Simon, Iván Mittelholcz: Evaluation of Dictionary Creating Methods for Under-Resourced Languages, In: Kamil Ekštein, Václav Matoušek (eds.): Text, Speech and Dialogue: 20th International Conference, TSD 2017, Prague. Cham: Springer, 2017. pp. 246-254., 2017

Eszter Simon, Ivett Zs. Benyeda, Péter Koczka, Zsófia Ludányi: Automatic creation of bilingual dictionaries for Finno-Ugric languages, In: Proceedings of the First International Workshop on Computational Linguistics for Uralic Languages, Tromsø, 2015. pp. 119-131., 2015

Benyeda Ivett, Koczka Péter, Ludányi Zsófia, Simon Eszter, Váradi Tamás: Finnugor nyelvű közösségek nyelvtechnológiai támogatása online tartalmak létrehozásában, In: Tanács Attila, Varga Viktor, Vincze Veronika (eds.): XI. Magyar Számítógépes Nyelvészeti Konferencia, SzTE, Szeged, 2015. pp. 133-144., 2015

Simon Eszter: Finnugor nyelvű közösségek támogatása online tartalmak létrehozásában, In: Édes Anyanyelvünk (ISSN: 0139-0457) 36(5): 14. (2014), 2014

Ivett Benyeda, Péter Koczka and Tamás Váradi: Creating seed lexicons for under-resourced languages, In: Proceedings of the GLOBALEX 2016 workshop. ELRA, 2016. pp. 52-56., http://www.lrec-conf.org/proceedings/lrec2016/workshops/LREC2016Workshop-GLOBALEX_Proceedings-v2.pdf, 2016

Zsanett Ferenczi, Iván Mittelholcz, Eszter Simon, Tamás Váradi: Evaluation of Dictionary Creating Methods for Finno-Ugric Minority Languages, In: Proceedings of LREC2018 (accepted for publication), 2018

Simon Eszter, Mittelholcz Iván, Ferenczi Zsanett: Lexikai erőforrások automatikus előállítása kisebbségi finnugor nyelvekre, In: Vincze Veronika (ed.): XIV. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2018). Szeged: Szegedi Tudományegyetem Informatikai Tanszékcsoport, 2018. pp. 260-271., 2018

Simon Eszter, Mittelholcz Iván, Ferenczi Zsanett: Automatikus szótárépítés kisebbségi finnugor nyelvekre, In: Pletl Rita, Kovács Gabriella (eds.): Trans-Linguistica – Multilingualism and Plurilingualism in Europe. EME-Scientia Publishing House, Cluj-Napoca (accepted for publication), 2018

Benyeda Ivett, Koczka Péter, Ludányi Zsófia, Simon Eszter: Automatikus szótárgenerálás finnugor nyelvekre, In: 25. Magyar Alkalmazott Nyelvészeti Kongresszus, book of abstracts. p. 2, 2015

Ivett Benyeda, Eszter Simon, Péter Koczka: How Can Language Technology Fight Against Language Death?, In: Book of Abstracts for the Second Digital Humanities Benelux Conference, University of Antwerp, 2015. pages 53-54., 2015

Ivett Benyeda, Péter Koczka, Zsófia Ludányi, Eszter Simon: Language technology support for Finno-Ugric digital communities, In: Congressus Duodecimus Internationalis Fenno-Ugristarum, Oulu 2015. Book of Abstracts. pages 326-327, 2015

Tommi A. Pirinen, Francis M. Tyers, Eszter Simon, Veronika Vincze: Guest editors’ note, Acta Linguistica Academica, 64(3), pp. 325–326, 2017

Tommi A Pirinen; Eszter Simon; Francis M Tyers; Veronika Vincze: Report on the Second International Workshop on Computational Linguistics for Uralic Languages, FINNO-UGRIC LANGUAGES AND LINGUISTICS 5: (1) pp. 1-5. (2016), 2016

Tommi A Pirinen; Eszter Simon; Francis M Tyers; Veronika Vincze: Proceedings of the Second International Workshop on Computational Linguistics for Uralic Languages, Szeged: Szegedi Tudományegyetem, 2016., 2016

Tommi A Pirinen; Trond Trosterud; Francis M Tyers; Veronika Vincze; Eszter Simon; Jack Rueter: Foreword to the Special Issue on Uralic Languages, NORTHERN EUROPEAN JOURNAL OF LANGUAGE TECHNOLOGY 4: Paper 1. 9 p. (2016), 2016

Tamás Váradi, Ivett Benyeda, Péter Koczka: Automatic Lexicon Creation to Support the Digital Vitality of Endangered Uralic Languages, Proceedings of the HrTAL2016 conference (submitted for publication), 2016

Events of the project

2017-10-09 14:21:43

Change of participants

2017-06-08 16:15:26

Change of participants

2016-10-25 11:52:09

Change of participants

2015-11-12 09:04:57

Change of participants
