December 10: Research presentation, Seung-Eun Kim

Exploring the perceptual similarity space for human and TTS speech
Recent studies by Chernyak et al. (2024) and Kim et al. (2025) introduced a novel approach to estimating holistic acoustic similarity between speech samples using a pre-trained self-supervised machine learning model. These studies demonstrated that an L2 English talker's perceptual distance from L1 English talkers (i.e., how similar or dissimilar the L2 talker's speech is to that of L1 talkers) accounts for variation in L2 talker intelligibility.
In this talk, I present two sets of follow-up analyses that extend this line of work (presented as time allows). The first examines the internal structure of the model's perceptual similarity space by evaluating each L2 talker's distance from other L2 talkers and asking how these distances relate to intelligibility. The second expands the framework to include synthetic voices generated by text-to-speech (TTS) systems, allowing us to test whether TTS-generated speech can serve as a viable tool for modeling intelligibility and how the acoustic properties of TTS voices compare with those of human talkers.
Together, these analyses provide a deeper understanding of what the model's similarity space captures and how it can be used to investigate both human and machine speech.
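
For orientation, below is a minimal sketch of how a talker's distance in an embedding space derived from a pre-trained self-supervised model can be computed. The encoder (a wav2vec 2.0 bundle from torchaudio), the mean-pooling, and the cosine distance are illustrative assumptions, not necessarily the choices made by Chernyak et al. (2024) or Kim et al. (2025).

    # Illustrative sketch only: encoder, pooling, and distance metric are
    # assumptions, not necessarily those of the cited studies.
    import torch
    import torchaudio

    bundle = torchaudio.pipelines.WAV2VEC2_BASE  # assumed stand-in encoder
    model = bundle.get_model().eval()

    def talker_embedding(wav_paths):
        """Mean-pooled frame embeddings, averaged over a talker's utterances."""
        utt_vecs = []
        for path in wav_paths:
            wav, sr = torchaudio.load(path)
            wav = wav.mean(dim=0, keepdim=True)  # downmix to mono
            wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
            with torch.inference_mode():
                feats, _ = model.extract_features(wav)
            utt_vecs.append(feats[-1].mean(dim=1).squeeze(0))
        return torch.stack(utt_vecs).mean(dim=0)

    def perceptual_distance(emb_a, emb_b):
        """Cosine distance as one plausible (dis)similarity measure."""
        return 1 - torch.nn.functional.cosine_similarity(emb_a, emb_b, dim=0)

    # An L2 talker's distance from the L1 group, here via the L1 centroid
    # (averaging pairwise distances is another reasonable convention):
    # l1_centroid = torch.stack([talker_embedding(p) for p in l1_talkers]).mean(0)
    # d = perceptual_distance(talker_embedding(l2_talker_paths), l1_centroid)

In a setup of this general shape, a TTS voice can be embedded exactly like a human talker, which is what makes the extension in the second set of analyses natural.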

November 12: Practice presentations (Psychonomics)

Three Phonatics members will practice their presentations for the upcoming Psychonomics conference.

Seung-Eun Kim, Ann Bradlow [presenter], Matt Goldrick: Generalization of perceptual adaptation across second-language (L2) English talkers: The (limited) roles of talker intelligibility and training-test talker similarity.
Seung-Eun Kim [presenter], Ann Bradlow, Matt Goldrick: Exposure Conditions Facilitating Perceptual Adaptation to a Second-Language Talker in Noise. 
Jennifer Dibbern [presenter], Tamar Gollan, Dalia Garcia, Jessie Quinn, Matt Goldrick: Automated Analysis of Code-Mixed Speech: Investigating Costs of Language Mixing in Fully Connected Speech. 

October 29: Practice presentations (NWAV)

Four members of our Phonatics community will practice their presentations for the upcoming NWAV (New Ways of Analyzing Variation) conference.

Jenn Dibbern (Ling): Community Ideologies and the Perception of Speech: Localized Social Meaning in Bilingual Language Processing
Ke Lin (Ling): Bilinguals Differ in Weighing Social and Acoustic Cues During L2 Speech Processing: Evidence from Eye-Tracking
Raef Khan (Ling): How “Extreme” is Extreme Pain?: Patient Identity and Intensifier Usage Affect Physicians’ Perceptions of Pain
Emma Wilkinson (Ling): The social and linguistic landscape of Chicago-area Japanese Americans

February 12: Research presentation, Abhijit Roy (CSD)

The Statistical Implications of Auditory Spectrum
Auditory spectra are a core element of speech, hearing, and language research, underpinning representations of hearing ability, frequency characteristics of stimuli, and microphone responses, among other measurements. However, comparing these spectra presents unique statistical challenges due to the distinct properties of the human auditory system and of acoustic spectra. In particular, non-linear frequency resolution and unequal bandwidths across center frequencies complicate straightforward bin-by-bin measurements. Further, non-linear loudness and power-law relationships (e.g., Stevens’ power law, Fletcher–Munson curves) mean that spectra appearing numerically similar can still sound perceptually different. Correlations among neighboring frequency bins, often introduced by harmonic signals and formant structures, add another layer of complexity, yielding high-dimensional, correlated distributions that cannot be treated as independent and identically distributed. This study explores the statistical characteristics of auditory spectra with two main objectives: (1) evaluating differences in hearing ability among individuals, and (2) comparing distinct auditory spectra. We review standard statistical methods commonly used in hearing and speech sciences and propose enhancements that streamline spectral comparisons, ultimately increasing validity and enabling more detailed interpretation of auditory data.
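
To make the banding and correlation issues concrete, here is a minimal sketch of one common style of auditory-scale spectral comparison: band powers on an ERB-spaced frequency grid, compared with a covariance-aware distance. The ERB banding and the Mahalanobis comparison are illustrative choices, not necessarily the methods proposed in this talk.

    # Illustrative sketch: ERB-spaced band powers plus a covariance-aware
    # distance; not necessarily the methods proposed in the talk.
    import numpy as np
    from scipy.signal import welch

    def erb_band_edges(f_lo=50.0, f_hi=8000.0, n_bands=32):
        """Band edges equally spaced on the ERB-number scale (Glasberg & Moore)."""
        erb = lambda f: 21.4 * np.log10(1 + 0.00437 * np.asarray(f))
        inv = lambda e: (10 ** (e / 21.4) - 1) / 0.00437
        return inv(np.linspace(erb(f_lo), erb(f_hi), n_bands + 1))

    def auditory_spectrum(x, fs, n_bands=32):
        """Band powers (dB) in ERB-spaced bands: wider bands at high frequencies."""
        f, pxx = welch(x, fs=fs, nperseg=2048)
        edges = erb_band_edges(f_hi=0.45 * fs, n_bands=n_bands)
        bands = [pxx[(f >= lo) & (f < hi)].mean()
                 for lo, hi in zip(edges[:-1], edges[1:])]
        return 10 * np.log10(np.asarray(bands) + 1e-12)

    def spectral_distance(spec_a, spec_b, cov):
        """Mahalanobis distance: respects correlations among neighboring bands
        that a naive bin-by-bin (Euclidean) comparison ignores."""
        d = spec_a - spec_b
        return float(np.sqrt(d @ np.linalg.solve(cov, d)))

Here the covariance matrix would be estimated from repeated measurements; its role is precisely to handle the harmonic- and formant-induced correlations the abstract describes.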

February 5: Research presentation, Zhe-chen Guo

Extended high-frequency information improves phoneme recognition: Evidence from automatic speech recognition
Speech information in the extended high-frequency (EHF) range (>8 kHz) is often overlooked in hearing assessments and automatic speech recognition (ASR). However, accumulating evidence demonstrates that EHFs contribute to speech perception, although it remains unclear whether this benefit arises from improved phoneme recognition. We addressed this question by testing how higher-frequency content affects ASR performance in simulated spatial listening situations. English speech from the VCTK corpus was resynthesized with head-related transfer functions to create spatial audio in which target speech was masked by a competing talker separated by 20°, 45°, 80°, or 120° azimuth at target-to-masker ratios (TMRs) from +3 to −12 dB. A CNN-BiLSTM phoneme decoder was trained on cochleagram representations of the resynthesized speech, which was either broadband or low-pass filtered at 6 or 8 kHz. In quiet, phoneme recognition was no more accurate for broadband speech than for low-pass filtered speech. In the presence of a masker, however, higher-frequency energy improved recognition across degrees of spatial separation, particularly at TMRs ≤ −9 dB. Furthermore, removing EHFs disproportionately increased errors for consonants over vowels. These findings demonstrate EHFs’ role in phoneme recognition under adverse conditions, highlighting the importance of EHFs in audiometric evaluations and ASR development.
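
A minimal sketch of the band-limiting and mixing manipulations described above, for orientation only: the filter type and order and the RMS-based mixing convention are assumptions, not the study's exact settings.

    # Illustrative sketch: filter design and mixing convention are assumed,
    # not taken from the study.
    import numpy as np
    from scipy.signal import butter, sosfiltfilt

    def lowpass(x, fs, cutoff_hz, order=8):
        """Zero-phase Butterworth low-pass, removing energy above the cutoff."""
        sos = butter(order, cutoff_hz, btype="low", fs=fs, output="sos")
        return sosfiltfilt(sos, x)

    def mix_at_tmr(target, masker, tmr_db):
        """Scale the masker so the target-to-masker ratio (RMS) equals tmr_db."""
        rms = lambda s: np.sqrt(np.mean(np.square(s)))
        gain = rms(target) / (rms(masker) * 10 ** (tmr_db / 20))
        n = min(len(target), len(masker))
        return target[:n] + gain * masker[:n]

    # e.g., the 8 kHz low-pass condition at TMR = -9 dB (requires fs > 16 kHz):
    # y = mix_at_tmr(lowpass(target, fs, 8000), masker, -9)

(The spatialization step, convolving each source with head-related transfer functions before mixing, is omitted here.)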

January 15, 2025: Research presentation, Jennifer Cole

Categories and gradience in intonation
Differences in intonation among languages and dialects are readily noticed but less easily described. What is the ‘shape’ of phrasal pitch contours, analyzed in terms of their component phonological features or in acoustic F0 measurements? How does intonation function to mark the structure of phrases and larger discourse units, or distinctions in semantic or pragmatic meaning? The goal of a linguistic theory of intonation is to establish a framework in which the form and functions of intonation can be analyzed and compared across languages and speakers. This is a surprisingly difficult task. Analyzing the function of intonational expressions calls for preliminary decisions about segmentation, measurement, and encoding: which interval(s) of a continuous pitch signal are associated with a particular meaning or structure, which aspects of the dynamic F0 signal encode that function, and what the features of that encoding are. Even for English, arguably one of the most studied intonation systems, there is ongoing debate over these very questions, resulting in a knowledge bottleneck that stymies scientific progress on intonation and its communicative function.
In this talk I report on my recent work addressing this central challenge for American English: What are the characteristics of phrasal pitch patterns that are reliably perceived and produced as distinct, and interpreted differently, by speakers of the language? I present work (with Jeremy Steffman, U Edinburgh) from a series of studies that examine intonational form through imitations of 16 intonational “tunes” of English, under varying task conditions that tap memory representations of model tunes presented auditorily. Analyses of dynamic F0 patterns from five experiments converge on a primary dichotomy between high-rising and low-falling tunes, with secondary distinctions in meaning corresponding to F0 shape variation within the two primary tune classes. Time allowing, I will briefly discuss related findings from parallel streams of research in my lab investigating intonational form and its pragmatic function, in relation to interpretations of asking/telling and scalar ratings of speaker surprise (work with Thomas Sostarics and Rebekah Stanhope). I close by discussing the implications of the joint findings from these studies for a theory of categorical and gradient associations between intonational form and function.
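
For orientation, analyses of dynamic F0 patterns like those described above typically start from a time-normalized pitch contour per utterance. The sketch below uses librosa's pyin tracker with illustrative settings; it is not the authors' analysis pipeline.

    # Illustrative sketch: tracker settings and normalization choices are
    # assumptions, not the authors' pipeline.
    import numpy as np
    import librosa

    def f0_contour(path, n_points=30):
        """F0 in semitones (re: utterance median), resampled to n_points."""
        y, sr = librosa.load(path, sr=16000)
        f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                     fmax=librosa.note_to_hz("C6"), sr=sr)
        f0 = f0[voiced & ~np.isnan(f0)]            # keep voiced frames only
        st = 12 * np.log2(f0 / np.median(f0))      # semitones re: median
        # Linear time-normalization so contours of different lengths align.
        return np.interp(np.linspace(0, 1, n_points),
                         np.linspace(0, 1, len(st)), st)

Contours in this form can then be compared or clustered across imitations; this is the kind of representation in which a high-rising versus low-falling dichotomy would surface.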