Acquiring Speech Resources: Advantages and Disadvantages of Different Approaches

Authors

Andreja Bizjak
University of Maribor, Faculty of Electrical Engineering and Computer Science
https://orcid.org/0009-0003-4951-5806

Keywords:

speech resource acquisition, remote recording, spontaneous speech, spoken Slovenian, digital humanities

Abstract

The study systematically examines different approaches to acquiring speech resources for Slovenian, from laboratory and field recordings to more recent web-based forms such as GWAPs, Collect4NLP, crowdsourcing, and citizen science. Through a comparative analysis based on the criteria of audio quality, cost, scalability, and speech spontaneity, it identifies the advantages, limitations, and implementation complexity of each approach. It includes practical recommendations for researchers and developers of speech technologies who face the challenges of remote recording. By relating global trends and foreign practices to conditions in Slovenia, it contributes to the development of digital speech resources and to the strengthening of language technologies for less-resourced languages.

Author biography

Andreja Bizjak, University of Maribor, Faculty of Electrical Engineering and Computer Science

Assistant Andreja Bizjak, M.Sc., is employed at the Faculty of Electrical Engineering and Computer Science, University of Maribor, where she teaches in the Media Communications study programme and, as a senior researcher, takes part in research projects in the field of speech and language technologies. Her research interests include corpus linguistics, speech resources, and the methodology of speech data collection. She has published several scientific articles and took part in developing the ARTUR speech database, thereby contributing to the development of Slovenian in the digital environment.

Maribor, Slovenia. Email: andreja.bizjak1@um.si

References

Arhar Holdt, Š., Logar, N., Pori, E., & Kosem, I. (2021). “Game of Words”: Play the Game, Clean the Database. EURALEX XIX. Retrieved from XIX-Euralex-Proceedings-Lexicography-for-Inclusion.pdf

Awan, S. N., Shaikh, M. A., Awan, J. A., Abdalla, I., Lim, K. O., & Misono, S. (2023). Smartphone recordings are comparable to “gold standard” recordings for acoustic measurements of voice. Journal of voice.

Bizjak, A. (2025). Vloga občanske znanosti pri množičnem zbiranju govornih virov v slovenščini. Slovenščina 2.0: empirične, aplikativne in interdisciplinarne raziskave, 13(1), 58-103.

Bonney, R., Phillips, T. B., Ballard, H. L., & Enck, J. W. (2016). Can citizen science enhance public understanding of science? Public understanding of science, 25(1), 2–16. doi: 10.1177/0963662515607406

Čibej, J., Robida, N., & Krek, S. (2024). Nadgradnja Digitalne slovarske baze za slovenščino in Slovenskega oblikoslovnega leksikona Sloleks s podatki o govorjeni slovenščini: načrti in cilji. In M. Krajnc Ivič (Ed.), Stanje in perspektive uporabe govornih virov v raziskavah govora (pp. 27–39). Univerza v Mariboru, Univerzitetna založba. https://press.um.si/index.php/ump/catalog/book/898/chapter/46

Dorn, A., Stocker, R., & Stöckle, P. (2024). Old dialect words through the ages–the ABCs of dialect project. ARPHA Proceedings, 6, 213-216.

Entringer, N., Gilles, P., Martin, S., & Purschke, C. (2021). Schnëssen. Surveying language dynamics in Luxembourgish with a mobile research app. Linguistics Vanguard, 7(s1), 20190031.

Eskenazi, M., Levow, G. A., Meng, H., Parent, G., & Suendermann, D. (2013). Crowdsourcing for speech processing: Applications to data collection, transcription and assessment. John Wiley & Sons.

Fiumara, J., Cieri, C., Wright, J., & Liberman, M. (2020). LanguageARC: Developing language resources through citizen linguistics. In Proceedings of the LREC 2020 Workshop Citizen Linguistics in Language Resource Development (CLLRD 2020).

Garaus, C., Garaus, M., & Wagner, U. M. (2024). Getting users involved in idea crowdsourcing initiatives: An experimental approach to stimulate intrinsic motivation and intention to submit. IEEE Transactions on Engineering Management, 71, 3700–3711. Retrieved from https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10387737

Gershenfeld, N. (2011). Physics of the future: how science will shape human destiny and our daily lives by the year 2100. Physics Today, 64(10), 56–56.

Heinisch, B., Oswald, K., Weißpflug, M., Shuttleworth, S., Belknap, G. (2021). Citizen Humanities. In: Vohland, K., et al. The Science of Citizen Science. Springer, Cham. https://doi.org/10.1007/978-3-030-58278-4_6

Heinisch, B. (2022). The Influence of Intrinsic and Extrinsic Motivation on the Creation of Language Resources in a Citizen Linguistics Project about Lexicography. In Proceedings of the 2nd Workshop on Novel Incentives in Data Collection from People: models, implementations, challenges and results within LREC 2022 (pp. 58-63).

Hutin, M., & Allassonnière-Tang, M. (2022). Crowd-sourcing for less-resourced languages: Lingua libre for Polish. In 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages (SIGUL 2022).

Kaufmann, N., Schulze, T., & Veit, D. (2011). More than fun and money. worker motivation in crowdsourcing–a study on mechanical turk. https://opus.bibliothek.uni-augsburg.de/opus4/frontdoor/deliver/index/docId/45694/file/More_than_fun_and_money_Worker_Motivation_in_Crowd.pdf

Knight, D., Morris, S., Fitzpatrick, T., Rayson, P., Spasić, I., & Thomas, E. M. (2020). The national corpus of contemporary Welsh: Project report | Y corpws cenedlaethol Cymraeg cyfoes: adroddiad y prosiect. Retrieved from https://arxiv.org/abs/2010.05542

Koreinik, K., Mandel, A., Pilvik, M. L., Praakli, K., & Vihman, V. A. (2024). Outsourcing teenage language: A participatory approach for exploring speech and text messaging. Linguistics Vanguard, 9(s4), 389-398.

Labov, W. (1973). Sociolinguistic patterns (No. 4). University of Pennsylvania press.

Lindén, K., Jauhiainen, T., Lennes, M., Kurimo, M., Rossi, A., Kurki, T., & Pitkänen, O. (2022). Donate Speech: Collecting and Sharing a Large-Scale Speech Database for Social Sciences, Humanities and Artificial Intelligence Research and Innovation. In CLARIN: the infrastructure for language resources (Digital Linguistics; Vol. 1). De Gruyter. doi: 10.1515/9783110767377-019

Lyding, V., Nicolas, L., & König, A. (2022). About the applicability of combining implicit crowdsourcing and language learning for the collection of NLP datasets. In Proceedings of the 2nd Workshop on Novel Incentives in Data Collection from People: models, implementations, challenges and results within LREC 2022 (pp. 46–57). Retrieved from https://aclanthology.org/2022.nidcp-1.8.pdf

Mlinar, Z. (2021). Kaj nam prinašata koncept in gibanje občanska znanost/Citizen Science? Uveljavljanje raziskovanja kot sestavine vsakdanjega življenja. Časopis za kritiko znanosti, domišljijo in novo antropologijo (Journal for the Critique of Science, Imagination & New Anthropology), 49(282).

Neale, S., Spasic, I., Needs, J., Watkins, G., Morris, S., Fitzpatrick, T., ... & Knight, D. (2017). The CorCenCC crowdsourcing app: A bespoke tool for the user-driven creation of the national corpus of contemporary Welsh. In Corpus Linguistics Conference, Birmingham. Retrieved from https://www.birmingham.ac.uk/Documents/college-artslaw/corpus/conference-archives/2017/general/paper273.pdf

Nisbett, A. (2003). Sound Studio: Audio Techniques for Radio, Television, Film and Recording. Routledge.

Poesio, M., Chamberlain, J., & Kruschwitz, U. (2017). Crowdsourcing. Handbook of linguistic annotation (pp. 277–295).

Robinson, L. D., Cawthray, J. L., West, S. E., Bonn, A., & Ansine, J. (2018). Ten principles of citizen science. In Citizen science: Innovation in open science, society and policy (pp. 27-40). UCL Press.

Rutten, M., Minkman, E., & van der Sanden, M. (2017). How to get and keep citizens involved in mobile crowd sensing for water management? A review of key success factors and motivational aspects. Wiley Interdisciplinary Reviews: Water, 4(4), e1218. Retrieved from https://wires.onlinelibrary.wiley.com/doi/pdfdirect/10.1002/wat2.1218

Szabó, G., Fazakas, N., Kocsis, Z., Krizsai, F., & Vargha, F. S. (2025). Researching dialects with high school students: a citizen science approach. Linguistics Vanguard, (0).

Tondello, G. F., Wehbe, R. R., Diamond, L., Busch, M., Marczewski, A., & Nacke, L. E. (2016). The gamification user types hexad scale. In Proceedings of the 2016 annual symposium on computer-human interaction in play (pp. 229–243). Retrieved from https://dl.acm.org/doi/pdf/10.1145/2967934.2968082

Ueberwasser, S., & Stark, E. (2017). What’s up, Switzerland? A corpus-based research project in a multilingual country. Linguistik online, 84(5).

Van Leeuwen, D. A., Hinskens, F., Martinovic, B., Van Hessen, A., Grondelaers, S., & Orr, R. (2016). Sprekend Nederland: A heterogeneous speech data collection. Computational Linguistics in the Netherlands Journal, 6, 21–38. Retrieved from https://www.clinjournal.org/clinj/article/view/62/55

Verdonik, D., & Gostenčnik, J. (2024). Smernice za zbiranje podatkov za govorne vire. Univerza, Fakulteta za elektrotehniko, računalništvo in informatiko. https://mezzanine.um.si/rezultati/#tehni%C4%8Dne-smernice

Verdonik, D., & Maučec, M. S. (2017). A speech corpus as a source of lexical information. International Journal of Lexicography, 30(2), 143–166.

Verdonik, D. (2008). Označevanje vrste diskurznih označevalcev. In T. Erjavec & J. Žganec Gros (Eds.), Zbornik šeste konference Jezikovne tehnologije, 16.–17. oktober 2008, Ljubljana (Vol. 12, p. 25). Retrieved from https://nl.ijs.si/isjt08/IS-LTC08-Proceedings.pdf#page=33

Vohland, K. (2021). The Science of Citizen Science.

Wertheim, S. (2006). Cleaning up for company: Using participant roles to understand fieldworker effect. Language in Society, 35(5), 707-727.

Zhang, C., Jepson, K., & Chuang, Y. Y. (2024). Investigating differences in lab-quality and remote recording methods with dynamic acoustic measures. arXiv preprint arXiv:2404.17022.

Published

12.12.2025

Details about the monograph

THEMA Subject Codes (93)

C

ISBN-13 (15)

978-961-299-093-0

COBISS.SI ID (00)

Date of first publication (11)

12.12.2025

How to cite

Bizjak, A. (2025). Pridobivanje govornih virov: Prednosti in pomanjkljivosti različnih pristopov. Univerzitetna založba Univerze v Mariboru. https://doi.org/10.18690/um.feri.11.2025