Pridobivanje govornih virov: Prednosti in pomanjkljivosti različnih pristopov (Collection of Speech Resources: Advantages and Limitations of Different Approaches)

Authors

Andreja Bizjak
University of Maribor, Faculty of Electrical Engineering and Computer Science
https://orcid.org/0009-0003-4951-5806

Keywords:

speech resource collection, remote speech recording, spontaneous speech, spoken Slovenian, digital humanities

Synopsis

Collection of Speech Resources: Advantages and Limitations of Different Approaches. This study systematically examines approaches to collecting speech resources for Slovenian, ranging from laboratory and field recordings to more recent online methods such as GWAPs (games with a purpose), Collect4NLP, crowdsourcing, and citizen science. A comparative analysis covering sound quality, cost, scalability, and speech spontaneity identifies the advantages, limitations, and implementation challenges of each approach. The work offers practical recommendations for researchers and developers of speech technologies who face the challenges of remote recording. By relating global trends and international practice to the Slovenian context, it contributes to the development of digital speech resources and to the advancement of language technologies for under-resourced languages.

Author Biography

Andreja Bizjak, University of Maribor, Faculty of Electrical Engineering and Computer Science

Asist. mag. Andreja Bizjak is employed at the Faculty of Electrical Engineering and Computer Science, University of Maribor, where she teaches in the Media Communications study program and works as a researcher on projects in the field of speech and language technologies. Her research interests include corpus linguistics, speech resources, and methodologies for speech data collection. She has published several scientific papers and contributed to the development of the ARTUR speech database, thereby supporting the advancement of the Slovenian language in the digital environment.

Maribor, Slovenia. E-mail: andreja.bizjak1@um.si

Published

December 12, 2025

Details about this monograph

THEMA Subject Codes (93)

C

ISBN-13 (15)

978-961-299-093-0

COBISS.SI ID (00)

Date of first publication (11)

2025-12-12

How to Cite

Bizjak, A. (2025). Pridobivanje govornih virov: Prednosti in pomanjkljivosti različnih pristopov. University of Maribor Press. https://doi.org/10.18690/um.feri.11.2025