February 5: Research presentation, Zhe-chen Guo

Extended high frequency information improves phoneme recognition: Evidence from automatic speech recognition
Speech information in the extended high-frequency (EHF) range (>8 kHz) is often overlooked in hearing assessments and automatic speech recognition (ASR). However, accumulated evidence has demonstrated that EHFs contribute to speech perception, although it remains unclear whether such benefit can arise from improved phoneme recognition. We addressed this question by testing how higher-frequency content affects ASR performance in simulated spatial listening situations. English speech from the VCTK corpus was resynthesized with head-related transfer functions to create spatial audio in which target speech was masked by a competing talker separated by 20°, 45°, 80°, or 120° azimuth at target-to-masker ratios (TMRs) from +3 to −12 dB. A CNN-BiLSTM phoneme decoder was trained on cochleagram representations of the resynthesized speech, which was either broadband or low-pass filtered at 6 or 8 kHz. In quiet, phoneme recognition was no more accurate for broadband speech than for low-pass filtered speech. Yet, in the presence of a masker, higher-frequency energy improved recognition across degrees of spatial separation, particularly at TMRs ≤ −9 dB. Furthermore, removing EHFs disproportionately increased errors for consonants over vowels. These findings demonstrate EHFs’ role in phoneme recognition under adverse conditions, highlighting the importance of EHFs in audiometric evaluations and ASR development.
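
For readers unfamiliar with target-to-masker ratios, the following is a minimal sketch of how a masker might be scaled to reach a given TMR before being added to the target. It assumes TMR is defined on broadband RMS levels, which the abstract does not state, and it uses random noise in place of speech. The function names `rms` and `mix_at_tmr` are hypothetical, and the head-related transfer function spatialization used in the study is not modeled here.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def mix_at_tmr(target, masker, tmr_db):
    """Scale the masker so that 20*log10(rms(target)/rms(masker)) equals
    tmr_db, then add it to the target sample by sample.

    Assumption: TMR is computed from broadband RMS levels.
    """
    gain = rms(target) / (rms(masker) * 10 ** (tmr_db / 20))
    scaled_masker = [gain * m for m in masker]
    mixture = [t + m for t, m in zip(target, scaled_masker)]
    return mixture, scaled_masker


# Stand-in signals: Gaussian noise instead of real speech recordings.
random.seed(0)
target = [random.gauss(0.0, 1.0) for _ in range(16000)]
masker = [random.gauss(0.0, 1.0) for _ in range(16000)]

# Mix at -9 dB TMR, one of the more adverse conditions in the abstract.
mixture, scaled_masker = mix_at_tmr(target, masker, -9.0)
achieved_tmr_db = 20 * math.log10(rms(target) / rms(scaled_masker))
print(f"achieved TMR: {achieved_tmr_db:.2f} dB")
```

A negative TMR means the masker is more intense than the target, so at −9 dB the competing talker's RMS level is roughly 2.8 times that of the target.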
