Interspeech 2025 Keynotes
Prof. Roger K. Moore
ISCA Medalist 2025, Chair of Spoken Language Processing, Head of Speech & Hearing Research Group (SpandH), Vocal Interactivity Lab (VILab), Sheffield Robotics
School of Computer Science, University of Sheffield
Title: From Talking and Listening Devices to Intelligent Communicative Machines (ISCA Medal for Scientific Achievement Keynote)
Abstract: Having been 'in the business' of speech technology for over 50 years, I've had the pleasure of witnessing (and being involved first-hand in) many of the astounding developments that have led to the incredible solutions we have today. Indeed, my involvement in the field of spoken language has been something of a love affair, and it's been a huge honour and privilege to have been working with so many excellent researchers on "the most sophisticated behaviour of the most complex organism in the known universe"! Although I've always been heavily committed to the establishment of machine learning approaches to spoken language processing - including publishing one of the first papers on the application of artificial neural networks to automatic speech recognition - my approach has always been one of attempting to uncover the underlying mechanisms of 'intelligent' (speech-based) interaction, on the basis that living systems are remarkably data-efficient in their learning. This talk will both look back (rather a long way) and look forward, asking how we got here and where we are going. I hope that some of my insights may inspire others to follow a similar path.
Biography: Prof. Moore (http://staffwww.dcs.shef.ac.uk/people/R.K.Moore/) has over 50 years’ experience in Speech Technology R&D and, although an engineer by training, much of his research has been based on insights from human speech perception and production. He studied Computer & Communications Engineering at the University of Essex and was awarded the B.A. (Hons.) degree in 1973. He subsequently received the M.Sc.(Res.) and Ph.D. degrees from the same university in 1975 and 1977 respectively, both theses being on the topic of automatic speech recognition. After a period of post-doctoral research in the Phonetics Department at University College London, Prof. Moore was recruited in 1980 to establish a speech recognition research team at the Royal Signals and Radar Establishment (RSRE) in Malvern. As Head of the UK Government's Speech Research Unit from 1985 to 1999, he was responsible for the development of the Aurix range of speech technology products and the subsequent formation of 20/20 Speech Ltd. Since 2004 he has been Professor of Spoken Language Processing at the University of Sheffield, and also holds Visiting Chairs at Bristol Robotics Laboratory and University College London Psychology & Language Sciences. Since joining Sheffield, his research has focused on understanding the fundamental principles of speech-based interaction, and in 2017 he initiated the first in the series of international workshops on ‘Vocal Interactivity in-and-between Humans, Animals and Robots' (VIHAR).
As President of both the European Speech Communication Association (ESCA) and the Permanent Council of the International Conference on Spoken Language Processing (PC-ICSLP) from 1997, Prof. Moore pioneered their integration to form the International Speech Communication Association (ISCA). He was subsequently General Chair for INTERSPEECH-2009 and ISCA Distinguished Lecturer during 2014-15. He has received several awards, including the UK Institute of Acoustics Tyndall Medal for "distinguished work in the field of speech research and technology", the NATO RTO Scientific Achievement Award for "repeated contribution in scientific and technological cooperation", the LREC Antonio Zampolli Prize for "Outstanding Contributions to the Advancement of Language Resources & Language Technology Evaluation within Human Language Technologies", and the ISCA Special Service Medal for "Service in the establishment, leadership and international growth of ISCA". Prof. Moore is the current Editor-in-Chief of Computer Speech & Language, and Associate Editor for Speech Communication, Languages, the Journal of Future Robot Life, and Frontiers in Robotics and AI (Computational Intelligence in Robotics).
Prof. Dr. Alex Waibel
Director of InterACT, Carnegie Mellon University & Interactive Systems Labs, Institute for Anthropomatics and Robotics, Karlsruhe Institute of Technology (KIT)
Title: TO BE ANNOUNCED
Dr. Judith Holler
Donders Centre for Brain, Cognition and Behaviour, Radboud University & Max Planck Institute for Psycholinguistics
Title: Using and comprehending language in face-to-face conversation
Abstract: Face-to-face conversational interaction is at the very heart of human sociality and the natural ecological niche in which language has evolved and is acquired. Yet, we still know rather little about how utterances are produced and comprehended in this environment. In this talk, I will focus on how hand gestures, facial and head movements are organised to convey semantic and pragmatic meaning in conversation, as well as on how the presence and timing of these signals impact utterance comprehension and responding. Specifically, I will present studies based on complementary approaches, which feed into and inform one another. These include qualitative and quantitative multimodal corpus studies showing that visual signals indeed often occur early, and experimental comprehension studies, based on and inspired by the corpus results, that implement controlled manipulations to test for causal effects of visual bodily signals on comprehension processes and mechanisms. These experiments include behavioural and EEG studies, most of them using multimodally animated virtual characters. Together, the findings provide evidence for the hypothesis that visual bodily signals form an integral part of semantic and pragmatic meaning communication in conversational interaction, and that they facilitate language processing, especially due to their timing and the predictive potential they gain through their temporal orchestration.
Biography: Judith Holler is Associate Professor at the Donders Institute for Brain, Cognition & Behaviour, Radboud University, where she leads the research group Communication in Social Interaction, and a senior investigator at the Max Planck Institute for Psycholinguistics. Her research program investigates human language in the very environment in which it has evolved, is acquired, and is used most: face-to-face interaction. Within this context, Judith focuses on the semantics and pragmatics of human communication from a multimodal perspective, considering spoken language within the rich visual infrastructure that embeds it, such as manual gestures, head movements, facial signals, and gaze. She uses a combination of methods from different fields to investigate human multimodal communication, including quantitative conversational corpus analyses, in-situ eye-tracking, and behavioural and neurocognitive experimentation using multimodal language stimuli involving virtual animations. Her research has been supported by a range of prestigious research grants from funders including the European Research Council (EU), the Dutch Science Foundation (NWO), Marie Curie Fellowships (EU), the Economic & Social Research Council (UK), Parkinson's UK, the Leverhulme Trust (UK), the British Academy (UK), the Volkswagen Stiftung (Germany) and the German Science Foundation (DFG, Mercator Fellowships).
Prof. Carol Espy-Wilson
Electrical & Computer Engineering Department, Institute for Systems Research, University of Maryland College Park
Title: Speech Kinematic Analysis from Acoustics: Scientific, Clinical and Practical Applications
Abstract: Much of my research has involved studying how small changes in the spatiotemporal coordination of speech articulators affect variability in the acoustic characteristics of the speech signal. This interest in speech variability ultimately led me to develop a speech inversion (SI) system that recovers articulatory movements of the lips, tongue tip, and tongue body from the speech signal. Recently, we were able to extend the SI system to provide information about the velopharyngeal port opening (nasality) and will soon investigate a methodology to uncover information about the tongue root and the size of the glottal opening. Our SI system has proven to be speaker independent and generalizes well across acoustic databases. In this talk, I will explain how we developed the SI system, and ways in which we have used it to date: for clinical purposes in mental health and speech disorder assessment, in scientific analysis of cross-linguistic speech patterns, and for improving automatic speech recognition.
Biography: Carol Espy-Wilson is a full professor in the Electrical and Computer Engineering Department and the Institute for Systems Research at the University of Maryland, College Park. She received her BS in electrical engineering from Stanford University and her MS, EE and PhD degrees in electrical engineering from the Massachusetts Institute of Technology. Dr. Espy-Wilson is a Fellow of the Acoustical Society of America (ASA), the International Speech Communication Association (ISCA) and the IEEE. She was recently elected Vice President-Elect of the ASA and to the ISCA Advisory Board. She currently serves on the Editorial Board of Computer Speech and Language. She has been Chair of the Speech Communication Technical Committee of the ASA, an elected member of the Speech and Language Technical Committee of the IEEE, and an Associate Editor of the Journal of the Acoustical Society of America. Finally, at the National Institutes of Health, she has served on the Advisory Councils of the National Institute on Deafness and Other Communication Disorders and the National Institute of Biomedical Imaging and Bioengineering, on the Medical Rehabilitation Advisory Board of the National Institute of Child Health and Human Development, and as a member of the Language and Communication Study Section.
Carol directs the Speech Communication Lab, where her team combines digital signal processing, speech science, linguistics and machine learning to conduct research in speech communication. Current research projects include speech inversion; mental health assessment based on speech, video and text; speech recognition for elementary school classrooms; entrainment based on articulatory and facial gestures in unstructured conversations between neurotypical and neurodiverse participants; and speech enhancement. Her laboratory has received federal funding (NSF, NIH and DoD) and industry grants, and she holds 13 patents.
Interspeech 2025
PCO: TU Delft Events
Delft University of Technology
Communication Department
Prometheusplein 1
2628 ZC Delft
The Netherlands
Email: pco@interspeech2025.org
X (formerly Twitter): @ISCAInterspeech
Bluesky: @interspeech.bsky.social
Interspeech 2025 operates under the privacy policy of TU Delft.