Extending the Digital Dictionary Database of Slovene and the Sloleks Morphological Lexicon of Slovene with Spoken Slovene Data: Plans and Goals
Jaka Čibej, Nejc Robida, Simon Krek
Synopsis
Extending the Digital Dictionary Database of Slovene and the Sloleks Morphological Lexicon of Slovene with Spoken Slovene Data: Plans and Goals. This paper presents the plans and goals for extending language resources such as the Digital Dictionary Database of Slovene and the Sloleks Morphological Lexicon of Slovene with data on spoken Slovene, in particular vocabulary typical of speech, for use in language technologies (e.g. speech recognition and synthesis). After a brief overview of related work, we present the material we will use for this purpose (the GOS and JANES corpora) and the main challenges of incorporating non-standard vocabulary into existing resources that have so far been intended mainly for written standard Slovene. In addition to the choice of canonical forms (e.g. lavfati/laufati), we address non-standard phonemes ([ˈɡɾɔːza] vs. [ˈɦɾɔːza]), non-standard pronunciations of standard word forms (mislim [ˈmiːslim] → [ˈmiːsləm]), and non-standard morphology (Mihatov, opravičavam). The challenges will be described within the framework of the MEZZANINE project, and the solutions will be documented in guidelines that enable the systematic extension of existing language resources with lexis typical of speech.