How is the data used for speech recognition collected and prepared?


As far as I can tell, speech recognition implementations rely on binary files that contain acoustic models of the language they are trying to 'recognize'.

So how do people compile these models?

One could transcribe lots of speech manually, which takes a lot of time. Even then, when given an audio file containing speech and a full transcription of it in a text file, the individual word pronunciations still need to be separated somehow. Matching the parts of the audio that correspond to the text still requires speech recognition.

So how is the data gathered? If one were handed thousands of hours' worth of audio files and full transcriptions (disregarding the problem of having to transcribe them manually), how can the audio be split at the right intervals where one word ends and the next begins? Wouldn't the software producing these acoustic models already have to be capable of speech recognition?

So how do people compile these models?

You can learn about the process by going through the CMUSphinx acoustic model training tutorial.
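To make the shape of the training data concrete, here is a minimal Python sketch of preparing the two index files a SphinxTrain-style setup reads: a .fileids list of utterance IDs and a .transcription file pairing each ID with its text. The data/ folder of paired .wav and .txt files is a hypothetical layout, not something the tutorial mandates.

    # Sketch: build trainer index files from a hypothetical folder of
    # paired recordings: data/utt_0001.wav + data/utt_0001.txt, ...
    from pathlib import Path

    data_dir = Path("data")  # hypothetical folder of wav/txt pairs

    with open("train.fileids", "w", encoding="utf-8") as fileids, \
         open("train.transcription", "w", encoding="utf-8") as trans:
        for wav in sorted(data_dir.glob("*.wav")):
            utt_id = wav.stem
            text = (data_dir / (utt_id + ".txt")).read_text(encoding="utf-8")
            text = text.strip().lower()
            fileids.write(utt_id + "\n")
            # SphinxTrain-style transcription line: the text wrapped in
            # sentence markers, followed by the utterance id in parentheses.
            trans.write(f"<s> {text} </s> ({utt_id})\n")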

One could transcribe lots of speech manually, which takes a lot of time.

This is correct; model preparation takes a lot of time. Speech is transcribed manually. You can also take transcribed speech from movie subtitles, transcribed lectures, or audiobooks and use them for training, as sketched below.
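As an illustration of mining subtitles, here is a small self-contained Python sketch that parses a hypothetical movie.srt file into (start, end, text) utterances, which could then be cut out of the movie's audio track:

    # Sketch: turn an SRT subtitle file into (start, end, text) utterances.
    # "movie.srt" is a hypothetical input file.
    import re

    def srt_time(t: str) -> float:
        """Convert '00:01:02,500' to seconds."""
        h, m, s = t.replace(",", ".").split(":")
        return int(h) * 3600 + int(m) * 60 + float(s)

    utterances = []
    blocks = open("movie.srt", encoding="utf-8").read().strip().split("\n\n")
    for block in blocks:
        lines = block.splitlines()
        # lines[0] is the cue number, lines[1] the timing, the rest is text
        match = re.match(r"(\S+) --> (\S+)", lines[1])
        start, end = srt_time(match.group(1)), srt_time(match.group(2))
        utterances.append((start, end, " ".join(lines[2:])))

    print(utterances[:3])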

Even then, when given an audio file containing speech and a full transcription of it in a text file, the individual word pronunciations still need to be separated somehow. Matching the parts of the audio that correspond to the text still requires speech recognition.

You need to separate the speech into sentences 5-20 seconds long, not into words. Speech recognition training can learn to model sentences, called utterances, and can segment them into words automatically. The segmentation is done in an unsupervised way, with clustering; it does not require the system to recognize speech. It detects chunks of similar structure in the sentence and assigns phones to them. This makes speech training much easier than it would be if you had to train on separate words.
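To illustrate that cutting a recording into utterances needs no recognizer at all, here is a crude, self-contained Python sketch that splits a hypothetical speech.wav (16-bit mono PCM) into roughly 5-20 second chunks at low-energy frames. The trainer's own phone-level clustering is far more sophisticated; this only shows the principle.

    # Sketch: energy-based splitting of a long recording into
    # utterance-sized chunks at silences, with no speech recognition.
    import wave
    import numpy as np

    with wave.open("speech.wav", "rb") as w:  # hypothetical input
        rate = w.getframerate()
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

    frame = int(0.03 * rate)  # 30 ms analysis frames
    n = len(audio) // frame
    energy = (audio[: n * frame].astype(np.float64) ** 2
              ).reshape(n, frame).mean(axis=1)
    silent = energy < 0.1 * np.median(energy)  # crude silence threshold

    # Cut at silent frames once a chunk reaches ~5 s; force a cut at ~20 s.
    chunks, start = [], 0
    for i in range(n):
        length = (i - start) * frame / rate
        if (silent[i] and length >= 5.0) or length >= 20.0:
            chunks.append((start * frame / rate, i * frame / rate))
            start = i
    print(chunks)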

So how is the data gathered? If one were handed thousands of hours' worth of audio files and full transcriptions (disregarding the problem of having to transcribe them manually), how can the audio be split at the right intervals where one word ends and the next begins? Wouldn't the software producing these acoustic models already have to be capable of speech recognition?

You need to initialize the system with a manually transcribed recording database of 50-100 hours in size. You can read about examples here. For many popular languages such as English, French, German, and Russian, such databases already exist; for others, collection is in progress in a dedicated resource.

Once you have the initial database, you can take a large set of videos and segment them using the existing model. This helps create databases of thousands of hours. One example is a database trained from TED talks, which you can read about here.
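The usual filter in this lightly supervised setup is to decode each chunk with the existing model and keep it for training only when the hypothesis nearly matches the known transcript. Here is a Python sketch, where recognize() stands in for any hypothetical existing decoder; the word error rate is computed with a standard Levenshtein distance over words.

    # Sketch: keep an audio chunk only if the existing model's hypothesis
    # is close enough to the reference transcript.
    def word_errors(ref: list[str], hyp: list[str]) -> int:
        """Levenshtein distance over words (subs, insertions, deletions)."""
        d = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, d[0] = d[0], i
            for j, h in enumerate(hyp, 1):
                prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                       prev + (r != h))
        return d[-1]

    def keep_chunk(audio_chunk, transcript: str, recognize) -> bool:
        ref = transcript.lower().split()
        hyp = recognize(audio_chunk).lower().split()  # hypothetical decoder
        wer = word_errors(ref, hyp) / max(len(ref), 1)
        return wer < 0.1  # accept near-perfect matches only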

