Finno-Ugric Digital Natives: Linguistic support for Finno-Ugric digital communities in generating online content
Details of project
Identifier | 107885
Type | FNN
Principal investigator | Váradi, Tamás
Title in Hungarian | Finnugor nyelvű közösségek nyelvtechnológiai támogatása online tartalmak létrehozásában
Title in English | Finno-Ugric Digital Natives: Linguistic support for Finno-Ugric digital communities in generating online content
Keywords in Hungarian | finnugor nyelvek, párhuzamos korpusz, párbeszédszemantika, diskurzus a közösségi médiában, digitális közösségek nyelvtechnológiai támogatása
Keywords in English | Finno-Ugric languages, comparable corpora, dialogue semantics, language technology for digital communities, social media discourse
Discipline | Linguistics (Council of Humanities and Social Sciences) | 100 % | Ortelius classification: Computational linguistics
Panel | Linguistics
Department or equivalent | HUN-REN Hungarian Research Centre for Linguistics
Participants | Benyeda, Ivett Zsuzsanna; Ferenczi, Zsanett; Héja, Enikő; Koczka, Péter; Kuti, Judit; Lendvai, Piroska; Ludányi, Zsófia; Mittelholcz, Iván; Oszkó, Beatrix; Simon, Eszter; Tóth, Bianka
Starting date | 2013-09-01
Closing date | 2018-02-28
Funding (in million HUF) | 40.212
FTE (full time equivalent) | 16.73
State | closed project
Summary
Summary of the research and its aims for experts
The objective of the project is to provide linguistically based support for several small Finno-Ugric digital communities in generating online content, thereby promoting multilingualism and helping to revitalize the digital functions of endangered Finno-Ugric languages. The project will be based on comparable corpora collected from the web as well as during fieldwork. (i) We will generate proto-dictionaries for several Finno-Ugric language pairs and deploy the enriched lexical material on the web within the collaborative dictionary project Wiktionary. (ii) We will model social media communication and test the models for Finno-Ugric languages in the Wikitalk application (Jokinen and Wilcock 2012).
For (i), we will build on partially existing language resources for Komi, Komi-Permyak, Udmurt, Mari, and Sami, and on proto-dictionary generation technology (Héja 2010) applied to comparable corpora that the project will collect and annotate. On this basis we will implement a workflow for creating and deploying freely accessible online lexical resources, with the goal of enabling the translation of collaboratively created encyclopedia content such as that found in Wikipedia. For (ii), to analyze and characterize language use in social media, we will collect data from dialogue-related genres such as online discussions, web forum posts, and blog comments, and annotate them on levels ranging from grammar to discourse phenomena. This will make it possible to learn novel forms of language use, which can then be used for displaying or conveying information in Wikitalk for Finno-Ugric language communities.
What is the major research question?
Speakers of the smaller Finno-Ugric languages would benefit strongly from language support related to new media. The project aims to devise a corpus-linguistics-based workflow and resources that provide online, interlinked linguistic support for Finno-Ugric language communities in translating articles that already exist in some language editions of Wikipedia.
The project will create datasets of annotated comparable corpora based on data from Wikipedia and social media, compiled in less-supported Finno-Ugric languages as well as in relatively well-supported languages (English, Russian, Finnish, Hungarian). A semi-automatic lexicographic procedure will generate proto-dictionaries for the various language pairs. The dictionaries will be recast as lexico-grammatical data in Wiktionary and linked in a standardized way across languages. The project will empirically test an NLP-supported workflow for crowdsourcing, since any community member can further edit, enrich, and interlink any entry in this collaboratively created lexical database.
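The summary does not spell out the proto-dictionary procedure itself (it cites Héja 2010), so the following is only a minimal illustrative sketch of the general idea of proposing translation candidates from text pairs, not the project's actual method. It assumes sentence-aligned pairs, which is a simplification of the comparable corpora described above, and it scores word pairs with the Dice coefficient of their co-occurrence. The function name `proto_dictionary`, the thresholds, and the invented English-Hungarian toy sentences in `SAMPLE_PAIRS` are all hypothetical.

```python
from collections import Counter
from itertools import product

# Toy, invented English-Hungarian sentence pairs (illustration only).
SAMPLE_PAIRS = [
    ("the dog sleeps", "a kutya alszik"),
    ("the dog eats", "a kutya eszik"),
    ("the cat sleeps", "a macska alszik"),
]


def proto_dictionary(aligned_pairs, min_count=2, threshold=0.5):
    """Propose (source, target, score) translation candidates.

    Assumption: each item is a (source_sentence, target_sentence) pair that
    is already aligned. Scores are Dice coefficients over sentence-level
    co-occurrence; this is a generic baseline, not the cited method.
    """
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned_pairs:
        # Count each word once per sentence (presence, not frequency).
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))

    candidates = []
    for (src, tgt), joint in pair_freq.items():
        if joint < min_count:
            continue  # too rare to trust
        dice = 2 * joint / (src_freq[src] + tgt_freq[tgt])
        if dice >= threshold:
            candidates.append((src, tgt, round(dice, 3)))
    # Best score first; ties broken alphabetically for a stable order.
    return sorted(candidates, key=lambda c: (-c[2], c[0], c[1]))


if __name__ == "__main__":
    for src, tgt, score in proto_dictionary(SAMPLE_PAIRS, threshold=0.9):
        print(f"{src}\t{tgt}\t{score}")
```

On the toy data with `threshold=0.9`, the surviving candidates are dog/kutya, sleeps/alszik, and the/a. In a semi-automatic setting such as the one described above, a list like this would be a starting point for manual validation by a lexicographer or community member rather than a finished dictionary.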
Learning to model novel forms of language use from social media discourse is hypothesized to be beneficial for communicating information on these forums in small language communities. The Finno-Ugric digital community will be mobilized and revitalized by means of the Wikitalk application (Jokinen and Wilcock 2012), which can be used to increase the visibility of, and interest in, small Finno-Ugric languages. Wikitalk will be adapted to Finno-Ugric (FU) languages and tested in chats about interesting topics based on Wikipedia articles, experimenting with the coherence and consistency of presentation.
What is the significance of the research?
Targeting cross-domain, multilingual linguistic processing is not only a cutting-edge issue in language technology; it also holds immediate significance for educating the interested general public and promoting the survival of their language in the digital age. In Finno-Ugric language communities, the availability of interesting interactive systems, together with opportunities to play, look for information, and use information provided in smaller languages, is expected to increase interest in the language concerned. This will contribute to the culture itself, help the communities adapt to the globalized world through their own language, and empower them with digital skills and knowledge.
Based on contributions from the two research teams, the proposed topic will be an important testbed whose output will enhance several lines of scholarly research in theoretical and application-oriented domains. Most importantly, the results are intended to demonstrate the potential of language technology in revitalizing digitally endangered Finno-Ugric languages, and to enable community building as well as linguistically supported crowdsourcing. We aim to produce project results (workflow, research environment, technology, resources) that can be included directly in training programs to enhance the competence and expertise of young Finnish and Hungarian researchers as well.
The project draws on several complementary workflows, tools, resources, and areas of expertise available at the two research sites. During the project, these will be adapted to specific discourse and thematic areas, serving the needs of the target FU language communities in generating online content. Linguistic support for these communities will be provided within a standardized, linked infrastructure instead of standalone applications, to allow for interoperability, machine reading, and sustainability. Beyond serving the project’s goals, these novel resources will also be an asset for creating language technology software for digitally endangered FU languages, ranging from spelling and grammar checkers to interactive personal assistants on smartphones.
Summary and aims of the research for the public (translated from the Hungarian-language summary)
Today the internet is one of the most important sources for obtaining and sharing information. The generation now growing up often relies exclusively on web content for information, so from the point of view of language use it matters whether that content is available in the mother tongue or in a foreign language. This is especially important when a community is trying to preserve its mother tongue alongside an expanding majority language that is the language of the state. In this research we aim to provide language technology support for smaller, endangered Finno-Ugric languages in creating online content. First, we will offer multilingual dictionary information generated automatically from aligned parallel texts, which native speakers can use to develop community content on the web (for example, Wikipedia articles). Second, we will offer a dialogue system called Wikitalk, with which they can converse in their mother tongue, for example about the content of Wikipedia articles. We will extend the system's communication strategies with models, created by the project, of the new types of discourse appearing on social media sites. We expect that the databases and tools developed can play a significant role in revitalizing the smaller Finno-Ugric languages. The languages targeted for support are Sami, Mari, Udmurt, Komi, and Komi-Permyak.
Summary and aims of the research for the public
The digital revolution of our era has a dramatic impact on nearly all aspects of society. Today, information needs are typically met from online, collaboratively edited material, of which Wikipedia is a prominent example. In more personal spheres of life, a novel phenomenon is that interaction is conducted via social media platforms and applications. Language communities are highly sensitive to new paradigms in communication technology. The new concepts brought to versatile, small language communities, such as speakers of Finno-Ugric languages, will affect their linguistic landscape to a significant extent, shifting new segments of native language use towards 'globalized', non-native language use.
Members of the smaller Finno-Ugric language communities would benefit strongly from linguistic support related to new media. The project will investigate how modern language technology and corpus-based linguistic research can make a significant contribution to meeting this challenge. Based on large amounts of parallel texts, we will semi-automatically generate online dictionaries to support translating online content, such as Wikipedia articles, from English, Russian, Finnish, and Hungarian into endangered languages such as Komi, Komi-Permyak, Udmurt, Mari, and Sami. The project also plans to mobilize these small online communities via the Wikitalk application, an automated conversational agent with which users can chat about interesting topics in their native language. We expect that the novel workflow and resources generated by the project will help revitalize digitally endangered languages and help raise Finno-Ugric digital natives.
List of publications
Zsanett Ferenczi, Iván Mittelholcz, Eszter Simon: Automatic Generation of Wiktionary Entries for Finno-Ugric Minority Languages. In: Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages. Association for Computational Linguistics, 2018. pp. 39-50.
Eszter Simon, Iván Mittelholcz: Evaluation of Dictionary Creating Methods for Under-Resourced Languages. In: Kamil Ekštein, Václav Matoušek (eds.): Text, Speech and Dialogue: 20th International Conference, TSD 2017, Prague. Cham: Springer, 2017. pp. 246-254.
Eszter Simon, Ivett Zs. Benyeda, Péter Koczka, Zsófia Ludányi: Automatic creation of bilingual dictionaries for Finno-Ugric languages. In: Proceedings of the First International Workshop on Computational Linguistics for Uralic Languages, Tromsø, 2015. pp. 119-131.
Benyeda Ivett, Koczka Péter, Ludányi Zsófia, Simon Eszter, Váradi Tamás: Finnugor nyelvű közösségek nyelvtechnológiai támogatása online tartalmak létrehozásában. In: Tanács Attila, Varga Viktor, Vincze Veronika (eds.): XI. Magyar Számítógépes Nyelvészeti Konferencia, SzTE, Szeged, 2015. pp. 133-144.
Simon Eszter: Finnugor nyelvű közösségek támogatása online tartalmak létrehozásában. In: Édes Anyanyelvünk (ISSN: 0139-0457) 36(5): 14. (2014)
Ivett Benyeda, Péter Koczka, Tamás Váradi: Creating seed lexicons for under-resourced languages. In: Proceedings of the GLOBALEX 2016 workshop. ELRA, 2016. pp. 52-56. http://www.lrec-conf.org/proceedings/lrec2016/workshops/LREC2016Workshop-GLOBALEX_Proceedings-v2.pdf
Zsanett Ferenczi, Iván Mittelholcz, Eszter Simon, Tamás Váradi: Evaluation of Dictionary Creating Methods for Finno-Ugric Minority Languages. In: Proceedings of LREC2018 (accepted for publication), 2018.
Simon Eszter, Mittelholcz Iván, Ferenczi Zsanett: Lexikai erőforrások automatikus előállítása kisebbségi finnugor nyelvekre. In: Vincze Veronika (ed.): XIV. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2018). Szeged: Szegedi Tudományegyetem Informatikai Tanszékcsoport, 2018. pp. 260-271.
Simon Eszter, Mittelholcz Iván, Ferenczi Zsanett: Automatikus szótárépítés kisebbségi finnugor nyelvekre. In: Pletl Rita, Kovács Gabriella (eds.): Trans-Linguistica – Multilingualism and Plurilingualism in Europe. EME-Scientia Publishing House, Cluj-Napoca (accepted for publication), 2018.
Benyeda Ivett, Koczka Péter, Ludányi Zsófia, Simon Eszter: Automatikus szótárgenerálás finnugor nyelvekre. In: 25. Magyar Alkalmazott Nyelvészeti Kongresszus, book of abstracts, p. 2, 2015.
Ivett Benyeda, Eszter Simon, Péter Koczka: How Can Language Technology Fight Against Language Death? In: Book of Abstracts for the Second Digital Humanities Benelux Conference, University of Antwerp, 2015. pp. 53-54.
Ivett Benyeda, Péter Koczka, Zsófia Ludányi, Eszter Simon: Language technology support for Finno-Ugric digital communities. In: Congressus Duodecimus Internationalis Fenno-Ugristarum, Oulu 2015. Book of Abstracts. pp. 326-327.
Tommi A. Pirinen, Francis M. Tyers, Eszter Simon, Veronika Vincze: Guest editors’ note. Acta Linguistica Academica 64(3), pp. 325–326, 2017.
Tommi A. Pirinen, Eszter Simon, Francis M. Tyers, Veronika Vincze: Report on the Second International Workshop on Computational Linguistics for Uralic Languages. Finno-Ugric Languages and Linguistics 5(1), pp. 1-5, 2016.
Tommi A. Pirinen, Eszter Simon, Francis M. Tyers, Veronika Vincze: Proceedings of the Second International Workshop on Computational Linguistics for Uralic Languages. Szeged: Szegedi Tudományegyetem, 2016.
Tommi A. Pirinen, Trond Trosterud, Francis M. Tyers, Veronika Vincze, Eszter Simon, Jack Rueter: Foreword to the Special Issue on Uralic Languages. Northern European Journal of Language Technology 4: Paper 1, 9 p., 2016.
Tamás Váradi, Ivett Benyeda, Péter Koczka: Automatic Lexicon Creation to Support the Digital Vitality of Endangered Uralic Languages. Proceedings of the HrTAL2016 conference (submitted for publication), 2016.