February 12: Research presentation, Abhijit Roy (CSD)

The Statistical Implications of Auditory Spectra
Auditory spectra are a core element of speech, hearing, and language research, underpinning representations of hearing ability, frequency characteristics in stimuli, and microphone responses, among other measurements. However, comparing these spectra presents unique statistical challenges due to the distinct properties of the human auditory system and of acoustic spectra. In particular, non-linear frequency resolution and unequal bandwidths across center frequencies complicate straightforward bin-by-bin comparisons. Further, non-linear loudness perception and power-law relationships (e.g., Stevens’ power law, Fletcher–Munson curves) mean that spectra appearing numerically similar can still sound perceptually different. Correlations among neighboring frequency bins, often introduced by harmonic signals and formant structures, add another layer of complexity, yielding high-dimensional, correlated distributions that cannot be treated as independent and identically distributed. This study explores the statistical characteristics of auditory spectra with two main objectives: (1) evaluating differences in hearing ability among individuals, and (2) comparing distinct auditory spectra. We review standard statistical methods commonly used in hearing and speech sciences and propose enhancements that streamline spectral comparisons, ultimately increasing validity and enabling more detailed interpretation of auditory data.
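As an illustration of the frequency-scale issue raised above, the sketch below compares two power spectra on an ERB-rate (perceptual) scale rather than bin by bin on a linear Hz axis. This is a minimal example of the general idea, not the analysis from the talk: the Glasberg and Moore ERB-rate formula is standard, but the resampling grid, the dB-domain RMS distance, and all function names are illustrative choices.

```python
# Minimal sketch: comparing two spectra on an ERB-rate (perceptual) frequency
# scale instead of bin by bin on a linear Hz axis. Function and variable names
# are illustrative, not from the study.
import numpy as np

def erb_rate(f_hz):
    """Glasberg & Moore (1990) ERB-rate (Cams) for frequency in Hz."""
    return 21.4 * np.log10(1.0 + 0.00437 * f_hz)

def to_erb_grid(freqs_hz, power, n_points=64):
    """Resample a power spectrum onto an equally spaced ERB-rate grid."""
    erb = erb_rate(freqs_hz)
    grid = np.linspace(erb.min(), erb.max(), n_points)
    return grid, np.interp(grid, erb, power)

def spectral_distance_db(freqs_hz, power_a, power_b):
    """RMS level difference (dB) between two spectra on the ERB-rate grid."""
    _, a = to_erb_grid(freqs_hz, power_a)
    _, b = to_erb_grid(freqs_hz, power_b)
    db_a = 10 * np.log10(a + 1e-12)
    db_b = 10 * np.log10(b + 1e-12)
    return np.sqrt(np.mean((db_a - db_b) ** 2))

# Toy example: two synthetic spectra from 100 Hz to 8 kHz
freqs = np.linspace(100, 8000, 512)
spec1 = 1.0 / freqs            # roughly -3 dB/octave roll-off
spec2 = 1.2 / freqs            # same shape, about 0.8 dB higher level
print(f"ERB-scale RMS difference: {spectral_distance_db(freqs, spec1, spec2):.2f} dB")
```

Resampling onto the ERB-rate grid gives low-frequency regions (where auditory filters are narrow) proportionally more weight than a linear-frequency average would, which is one simple way to respect the unequal bandwidths mentioned above.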

February 5: Research presentation, Zhe-chen Guo

Extended high frequency information improves phoneme recognition: Evidence from automatic speech recognition
Speech information in the extended high-frequency (EHF) range (>8 kHz) is often overlooked in hearing assessments and automatic speech recognition (ASR). However, accumulating evidence demonstrates that EHFs contribute to speech perception, although it remains unclear whether this benefit arises from improved phoneme recognition. We addressed this question by testing how higher-frequency content affects ASR performance in simulated spatial listening situations. English speech from the VCTK corpus was resynthesized with head-related transfer functions to create spatial audio in which target speech was masked by a competing talker separated by 20°, 45°, 80°, or 120° azimuth at target-to-masker ratios (TMRs) from +3 to −12 dB. A CNN-BiLSTM phoneme decoder was trained on cochleagram representations of the resynthesized speech, which was either broadband or low-pass filtered at 6 or 8 kHz. In quiet, phoneme recognition was no more accurate for broadband speech than for low-pass filtered speech. Yet, in the presence of a masker, higher-frequency energy improved recognition across degrees of spatial separation, particularly at TMRs ≤ −9 dB. Furthermore, removing EHFs disproportionately increased errors for consonants over vowels. These findings demonstrate EHFs’ role in phoneme recognition under adverse conditions, highlighting the importance of EHFs in audiometric evaluations and ASR development.
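For concreteness, the sketch below shows one way to implement the band-limiting manipulation described above: removing EHF content by low-pass filtering at 8 kHz. The filter design (an 8th-order zero-phase Butterworth via SciPy) and the toy signal are assumptions for illustration, not the resynthesis pipeline used in the study.

```python
# Minimal sketch of the band-limiting manipulation: stripping extended
# high-frequency (EHF) energy by low-pass filtering at 8 kHz. Filter design
# and parameters are illustrative assumptions, not the authors' implementation.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def remove_ehf(signal, fs, cutoff_hz=8000, order=8):
    """Low-pass filter a waveform to attenuate energy above cutoff_hz."""
    sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, signal)  # zero-phase filtering preserves timing

# Toy example: 1 s of white noise at 48 kHz, before and after EHF removal
fs = 48000
noise = np.random.default_rng(0).standard_normal(fs)
lowpassed = remove_ehf(noise, fs)

# Quick check: energy above the 8 kHz cutoff before vs. after filtering
freqs = np.fft.rfftfreq(len(noise), d=1 / fs)
ehf_band = freqs > 8000
before = np.abs(np.fft.rfft(noise))[ehf_band].sum()
after = np.abs(np.fft.rfft(lowpassed))[ehf_band].sum()
print(before, after)  # the second value should be much smaller
```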

January 15, 2025: Research presentation, Jennifer Cole

Categories and gradience in intonation
Differences in intonation among languages and dialects are readily noticed but less easily described. What is the ‘shape’ of phrasal pitch contours, analyzed in terms of their component phonological features or in acoustic F0 measurements? How does intonation function to mark the structure of phrases and larger discourse units, or distinctions in semantic or pragmatic meaning? The goal of a linguistic theory of intonation is to establish a framework in which the form and functions of intonation can be analyzed and compared across languages and speakers. This is a surprisingly difficult task. Analyzing the function of intonational expressions calls for preliminary decisions about segmentation, measurement, and encoding: which interval(s) of a continuous pitch signal are associated with a particular meaning or structure, which aspects of the dynamic F0 signal encode that function, and what are the features of encoding? Even for English, arguably one of the most studied intonation systems, there is ongoing debate over these very questions, resulting in a knowledge bottleneck that stymies scientific progress on intonation and its communicative function.
In this talk I report on my recent work addressing this central challenge for American English: What are the characteristics of phrasal pitch patterns that are reliably perceived and produced as distinct and interpreted differently by speakers of the language? I present work (with Jeremy Steffman, U Edinburgh) from a series of studies that examine intonational form through imitations of 16 intonational “tunes” of English, under varying task conditions that tap memory representations of model tunes presented auditorily. Analyses of dynamic F0 patterns from five experiments converge on a primary dichotomy between high-rising and low-falling tunes, with secondary distinctions in meaning corresponding to F0 shape variation within the two primary tune classes. Time allowing, I will briefly discuss related findings from parallel streams of research in my lab investigating intonational form and its pragmatic function related to interpretations of asking/telling and scalar ratings of speaker surprise (work with Thomas Sostarics and Rebekah Stanhope). I discuss the implications of the joint findings from these studies for a theory of categorical and gradient associations between intonational form and function.
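As a purely illustrative aside (not the authors' analysis), the sketch below shows one simple way to probe for a primary dichotomy in dynamic F0 patterns: time-normalize each contour to a fixed length and cluster the resulting vectors with k-means. The toy rising and falling contours, the 30-point normalization, and the choice of k = 2 are all assumptions made for the example.

```python
# Illustrative sketch only: clustering time-normalized F0 contours to check
# whether they separate into two primary classes. Data and parameters are toy
# assumptions, not the analysis reported in the talk.
import numpy as np
from sklearn.cluster import KMeans

def time_normalize(f0_track, n_points=30):
    """Resample an F0 track (Hz values over time) to a fixed length."""
    old = np.linspace(0, 1, len(f0_track))
    new = np.linspace(0, 1, n_points)
    return np.interp(new, old, f0_track)

# Toy data: 20 rising and 20 falling contours with added noise
rng = np.random.default_rng(1)
rises = [np.linspace(180, 280, 40) + rng.normal(0, 5, 40) for _ in range(20)]
falls = [np.linspace(220, 120, 40) + rng.normal(0, 5, 40) for _ in range(20)]
contours = np.array([time_normalize(c) for c in rises + falls])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(contours)
print(labels)  # expect the first 20 and last 20 items to fall into separate clusters
```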

November 15, 2023: Abhijit Roy

Considering language-specificity in hearing aid prescription algorithms

Current standards in hearing aid signal processing are not language-specific. A language-aggregated long-term average speech spectrum (LTASS) forms the core of much of the reasoning behind hearing aid amplification protocols and clinical procedures. More recent studies have called this reasoning into question. Various recording procedures (among other factors) can introduce spectral coloration into the signal, and the aggregated LTASS currently in use may suffer from such colorations as well. Here, a language-aggregated LTASS was derived from the ALLSTAR corpus and from the GoogleTTS AI speech corpus, and the results were compared with the original aggregated LTASS. The impact of recording decisions on the expected speech spectrum is also discussed.
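For readers unfamiliar with the measure, the sketch below shows a bare-bones way to estimate an LTASS from a set of recordings by averaging Welch power spectral densities across files. The file names, sample rate, and analysis settings are placeholders; they do not reproduce the corpus-specific procedure discussed in the talk.

```python
# Minimal sketch of an LTASS estimate: average the power spectral density
# across recordings, then express it in dB. Paths and analysis settings are
# illustrative placeholders.
import numpy as np
import soundfile as sf
from scipy.signal import welch

def ltass_db(wav_paths, fs_expected=16000, nfft=1024):
    """Average Welch PSDs across recordings; return frequencies and dB levels."""
    psds = []
    for path in wav_paths:
        x, fs = sf.read(path)
        assert fs == fs_expected, f"unexpected sample rate in {path}"
        if x.ndim > 1:                      # mix multichannel files to mono
            x = x.mean(axis=1)
        f, pxx = welch(x, fs=fs, nperseg=nfft)
        psds.append(pxx)
    mean_psd = np.mean(psds, axis=0)
    return f, 10 * np.log10(mean_psd + 1e-12)

# Usage (file names are hypothetical):
# freqs, levels = ltass_db(["talker01.wav", "talker02.wav"])
```

Because the result is an average over talkers and recordings, any systematic coloration introduced by microphones or recording chains propagates directly into the estimated spectrum, which is the concern raised above.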

November 8, 2023: Lisa Davidson

The phonetic details of word-level prosodic structure: evidence from Hawaiian

Previous research has shown that the segmental and phonetic realization of consonants can be sensitive to word-internal prosodic and metrical boundaries (e.g., Vaysman 2009, Bennett 2018, Shaw 2007). At the same time, other work has shown that prosodic prominence, such as stressed or accented syllables, has a separate effect on phonetic implementation (e.g., Cho and Keating 2009, Garellek 2014, Katsika and Tsai 2021). This talk focuses on the word-level factors affecting glottal and oral stops in Hawaiian. We first investigate whether word-internal prosodic or metrical factors, or prosodic prominence such as stressed syllables, accounts for the realization of glottal stop, and then we extend the same analysis to the realization of voice onset time (VOT) in oral stops. The data come from the 1970s-80s radio program Ka Leo Hawaiʻi. Using a variant of Parker Jones’ (2010) computational prosodic grammar, we automatically coded stops for (lexical) word position, syllable stress, syllable weight, and Prosodic Word position. Results show that word-internal metrical structures do condition phonetic realization, but prosodic prominence does not for either kind of stop. Rather, what are often taken to be “stronger” articulations (i.e., full closure in glottal stops and longer VOT in oral stops) are instead associated with word-internal boundaries or other prosodically weak positions, which may reflect the recruitment of phonetic correlates to disambiguate or enhance potentially less perceptible elements in Hawaiian. (Work in collaboration with ‘Ōiwi Parker Jones)

October 4, 2023: Midam Kim

Trusting Unreliable Genius

Broad availability of Large Language Models is revolutionizing how conversational AI systems can interact with humans, yet the factors that influence user trust in conversational systems, especially systems prone to errors or ‘hallucinations’, remain complex and understudied. In this talk titled “Trusting Unreliable Genius”, we delve into the nuances of trust in AI, focusing on trustability factors like competency, benevolence, and reliability. We begin by examining human conversation dynamics, including the role of interactive alignment and Gricean Maxims. These principles are then juxtaposed with Conversational AI interactions with several state-of-the-art LLM chatbots, offering insights into how trust is cultivated or eroded in this context. We also shed light on the necessity for transparency in AI development and deployment, the need for continuous improvement in reliability and predictability, and the significance of aligning AI with user values and ethical considerations. Building trust in AI is a multifaceted process involving a blend of technology, sociology, and ethics. We invite you to join us as we unravel the complexities of trust in Conversational AI and explore strategies to enhance it.