Exploring the perceptual similarity space for human and TTS speech
Recent studies by Chernyak et al. (2024) and Kim et al. (2025) introduced a novel approach to estimating holistic acoustic similarity between speech samples using a pre-trained self-supervised machine learning model. These studies demonstrated that an L2 English talker's perceptual distance from L1 English talkers (i.e., how (dis)similar the L2 talker's speech is to that of L1 talkers) accounts for variation in L2 talker intelligibility.
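To make the approach concrete, the sketch below shows one way such a holistic distance could be computed: each recording is passed through a pre-trained self-supervised model, the frame-level hidden states are mean-pooled into a single embedding per talker, and a talker's distance from a reference group (e.g., L1 talkers) is taken as the average cosine distance to the group's embeddings. The specific model (wav2vec 2.0 via the HuggingFace transformers library), the pooling step, and the cosine metric are illustrative assumptions, not necessarily the configuration used by Chernyak et al. (2024) or Kim et al. (2025).

```python
# Minimal sketch: holistic distance of one talker from a reference group of talkers.
# Model choice, mean-pooling, and cosine distance are illustrative assumptions.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def talker_embedding(wav_path: str) -> torch.Tensor:
    """Mean-pooled hidden-state embedding for one recording."""
    waveform, sr = torchaudio.load(wav_path)
    waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)  # mono, 16 kHz
    inputs = extractor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0)             # pool over time -> (dim,)

def distance_from_group(target_wav: str, reference_wavs: list[str]) -> float:
    """Mean cosine distance between one talker and a reference group (e.g., L1 talkers)."""
    target = talker_embedding(target_wav)
    refs = torch.stack([talker_embedding(p) for p in reference_wavs])
    cos = torch.nn.functional.cosine_similarity(target.unsqueeze(0), refs, dim=1)
    return float((1.0 - cos).mean())
```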
In this talk, I present two series of follow-up analyses that extend this line of work (presented as time allows). The first examines the internal structure of the model's perceptual similarity space by evaluating each L2 talker's distance from the other L2 talkers and how these distances relate to intelligibility. The second expands the framework to include synthetic voices generated by text-to-speech (TTS) systems, allowing us to test whether TTS-generated speech can serve as a viable tool for modeling intelligibility and to examine how the acoustic properties of TTS voices compare with those of human talkers.
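Under the same assumptions, and reusing talker_embedding() from the sketch above, the following outline illustrates the two analyses in schematic form: a pairwise distance matrix among L2 talkers (to which TTS voices could be appended in the same way), and a rank correlation between each talker's mean distance and an intelligibility score. The file names and intelligibility values are placeholders, and the use of cosine distance and Spearman correlation is an assumption for illustration only.

```python
# Schematic sketch of the two follow-up analyses; inputs are placeholders.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.stats import spearmanr

l2_wavs = ["l2_talker_01.wav", "l2_talker_02.wav", "l2_talker_03.wav"]  # hypothetical paths
intelligibility = np.array([0.82, 0.74, 0.91])                          # hypothetical scores

embeddings = np.stack([talker_embedding(p).numpy() for p in l2_wavs])

# (1) Internal structure: full pairwise cosine-distance matrix among L2 talkers.
l2_distances = squareform(pdist(embeddings, metric="cosine"))

# (2) Each talker's mean distance from the other L2 talkers vs. intelligibility.
mean_dist = l2_distances.sum(axis=1) / (len(l2_wavs) - 1)
rho, p = spearmanr(mean_dist, intelligibility)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")

# TTS voices could be added to the same list of recordings, so that their
# embeddings and distances are directly comparable with those of human talkers.
```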
Together, these analyses provide a deeper understanding of what the model's similarity space captures and how it can be used to investigate human and machine speech.
