
Slot 2 Tutorials (12:00-15:00)

  • Option 1: Invited Tutorial: "Creating sound with Praat", by Paul Boersma (University of Amsterdam), author of Praat

    • Short description: This tutorial gives an overview of the many ways in which you can generate sounds, especially speech sounds, in the Praat software; you are invited to bring your laptop to try them out, including several recently added methods. Sounds can be generated both from partial or full specifications and from other (speech) sounds. The most straightforward method is to create a sound from a formula; formulas can range from easy to intricate, especially when combined (a minimal example appears at the end of this option). Speechlike sounds can be created from partial specifications, such as a pitch contour, glottal pulses, a spectrum, a spectrogram, an intensity curve, or combinations of those. Speech can also be created from full specifications, which can involve articulatory synthesis, acoustic (e.g. Klatt) synthesis, and text-to-speech methods; the latter give especially good results if you install an external large language model that Praat will use in the background. Finally, speech can be created by manipulating existing sounds; I can discuss the manipulation of pitch, duration, formants, loudness, voicing, creakiness, breathiness, jitter, and shimmer. You can send me (Paul) questions and requests before the tutorial, so that I can adapt the subjects to your wishes.

    • About the tutorial organizer:

      • Paul Boersma received an MSc in physics from the University of Nijmegen in 1988 and a PhD in linguistics from the University of Amsterdam in 1998. Since 2005 he has been Professor of Phonetic Sciences at the University of Amsterdam. His research focuses on modelling and simulating the acquisition, evolution and typology of the production and comprehension of phonology and phonetics. For this he developed a bidirectional model of phonology and phonetics (BiPhon), in which the speaker and listener traverse the same morphological, phonological and phonetic levels of representation, connected by symmetric constraints that are weighted or ranked, or by symmetric neural network connections. His further research involves the history of the Franconian tone systems. Boersma is also the designer and main author (with David Weenink) of Praat, the world’s most used computer program for the analysis and manipulation of speech.
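
    • As a minimal sketch of the "sound from a formula" method mentioned above, using the praat-parselmouth Python bindings (an assumption here; the tutorial itself uses Praat's own interface and scripting language). The formula and parameter values are purely illustrative.

      ```python
      # Minimal sketch: Praat's "Create Sound from formula" command,
      # invoked through the praat-parselmouth Python bindings.
      # The formula (a 440 Hz tone with a decaying envelope) is illustrative.
      from parselmouth.praat import call

      # Arguments: name, channels, start time (s), end time (s),
      # sampling frequency (Hz), formula in Praat's formula language
      sound = call("Create Sound from formula", "tone", 1, 0.0, 1.0, 44100,
                   "0.4 * sin(2*pi*440*x) * exp(-3*x)")
      call(sound, "Save as WAV file", "tone.wav")
      ```

      In Praat itself, the same command is available from the New menu or as a one-line script; combining formulas (e.g. adding noise terms or modulating the frequency) yields the more intricate cases the tutorial covers.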

  • Option 2: "A Journey through Emerging Speech Research with NVIDIA NeMo", by Piotr Zelasko (NVIDIA), Nithin Rao Koluguri (NVIDIA), Ante Jukic (NVIDIA), Subhankar Ghosh (NVIDIA), Travis Bartley (NVIDIA), Elena Rastorgueva (NVIDIA), Taejin Park (NVIDIA)

    • Short description: This tutorial presents a comprehensive overview of recent developments in NVIDIA NeMo, a popular open-source framework recognized for its state-of-the-art performance in speech processing tasks. The tutorial is structured in two segments that bridge classical and emerging speech technologies. The first segment introduces significant architectural improvements in speech recognition through the FastConformer architecture with label looping and CUDA graph acceleration, alongside novel approaches to speech enhancement using score-based diffusion models, flow matching, and Schrödinger bridge techniques. It also presents pioneering work in high-quality neural audio codecs. The second segment explores developments in multi-task speech modeling through Canary-1B, end-to-end speaker diarization with the Sortformer Diarizer, and the integration of speech capabilities with Large Language Models via the SALM and BESTOW frameworks. A particular emphasis is placed on training efficiency innovations, including dynamic bucketing and batch size optimization techniques that achieve up to 2x faster training. By combining theoretical foundations with practical implementation guidance, this tutorial serves as a valuable resource for both researchers and practitioners in speech technology, offering hands-on experience with state-of-the-art tools and techniques (a brief usage sketch follows the organizer bios below).

    • About tutorial organizers:

      • Piotr Zelasko received his B.S. and M.Sc. degrees in Acoustic Engineering and his Ph.D. in Electronic Engineering (2019) from AGH-UST in Poland. He worked with several companies and held a research scientist position at JHU’s CLSP. At present, he is a research scientist at NVIDIA NeMo, building multitask and multimodal models and efficient training infrastructure. Piotr is a co-author of the next-generation Kaldi toolkit (k2).

      • Nithin Rao Koluguri received his MS in Electrical Engineering from USC, Los Angeles. He worked as a researcher at USC SAIL and IISc SPIRE Labs. Currently, he is a research scientist on the NVIDIA Conversational AI team, focusing on advancing speech and speaker recognition models. As a key contributor to the NVIDIA NeMo toolkit, he plays a vital role in enhancing features for conversational AI model development.

      • Ante Jukić received his Dipl.-Ing degree in Electrical Engineering from the University of Zagreb, Croatia, and his Ph.D. degree in Engineering from the University of Oldenburg, Germany. Currently, he’s with NVIDIA’s Conversational AI team, working on generative models for speech and audio.

      • Subhankar Ghosh received his M.S. in Statistics from the University of Illinois at Urbana-Champaign and a Bachelor’s degree in Computer Science from NIT Rourkela, India. He has previously worked at Microsoft and Google. Currently, he is a Research Scientist on the NVIDIA Conversational AI team, working on speech synthesis, LLMs for speech, and speech-to-speech technology.

      • Travis M. Bartley is a PhD candidate at the City University of New York’s Graduate Center. He received his B.A. in English and Linguistics from the University of California, Berkeley. He has held teaching positions at Medgar Evers and Baruch Colleges. He is currently a Deep Learning Engineer on the NVIDIA Riva Speech and Translation AI team.

      • Elena Rastorgueva received her B.A. and M.Eng. degrees in Engineering from the University of Cambridge. She is currently an applied research scientist on the NVIDIA Conversational AI team, focusing on speech-to-text models and improving their performance in streaming scenarios.

      • Taejin Park is a Senior Research Scientist at NVIDIA NeMo Speech AI. His research focuses on deep learning for speech processing, including context-aware speaker diarization and multi-speaker automatic speech recognition (ASR). He received his Ph.D. in Electrical and Computer Engineering and M.S. in Computer Science from the University of Southern California (USC) in 2021, where he was part of the Signal Analysis and Interpretation Laboratory (SAIL). Before that, he earned his B.S. and M.S. in Electrical Engineering and Computer Science from Seoul National University (SNU), South Korea. Prior to joining NVIDIA, he worked as a researcher at the Electronics and Telecommunications Research Institute (ETRI) and held internships at Microsoft, Amazon Alexa Speech, and Capio Inc., where he contributed to advancements in federated continual learning, ASR, and speaker diarization. He has published extensively in signal-processing venues such as ICASSP, Interspeech, and IEEE Signal Processing Letters.
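
    • As a minimal usage sketch of the kind of NeMo model discussed above (the checkpoint name below is an assumption; any NeMo ASR checkpoint exposing the same API would work the same way):

      ```python
      # Minimal sketch: load a FastConformer ASR model from NVIDIA NeMo
      # and transcribe an audio file. The checkpoint name is assumed.
      import nemo.collections.asr as nemo_asr

      asr_model = nemo_asr.models.ASRModel.from_pretrained(
          model_name="nvidia/stt_en_fastconformer_transducer_large"
      )
      # transcribe() takes a list of paths to 16 kHz mono WAV files
      transcripts = asr_model.transcribe(["sample.wav"])
      print(transcripts[0])
      ```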

  • Option 3: "Tutorial on Speech Watermarking", by Patrick O’Reilly (Northwestern University), Bryan Pardo (Northwestern University)

    • Short description: While speech synthesis systems have numerous benefits, they can also be used to impersonate the voices of individuals, serving as tools for blackmail, fraud, and misinformation. As a result, developers and providers of speech synthesis systems need reliable methods for identifying the synthetic audio these systems produce. One promising and widely adopted method is watermarking, which hides an imperceptible identifying signal in audio produced by a speech synthesis system to facilitate detection. In this tutorial, we provide an overview of the speech watermarking literature and walk attendees through step-by-step implementations of both traditional signal-processing and recent neural network-based watermarks (a minimal example of the former appears after the organizer bios below). We explore methods for improving the perceptual transparency and robustness of watermarks, evaluate their effectiveness under "attacks" that attempt to remove them, and discuss future directions for improving watermark performance. Participants will leave this tutorial with a strong knowledge of speech watermarking methods, including the current state of the art. Through hands-on experience implementing and evaluating speech watermarks, participants will gain a practical understanding of important techniques in the literature and an appreciation for the challenges faced in the development and deployment of speech watermarks.

    • About tutorial organizers:

      • Patrick O’Reilly (https://oreillyp.github.io/) is a fifth-year PhD student at Northwestern University. Patrick is the lead author of "Maskmark: Robust Neural Watermarking for Real and Synthetic Speech", which proposed a neural network-based speech watermark with state-of-the-art robustness and was selected for oral presentation at ICASSP 2024. Patrick is currently conducting research in collaboration with Adobe to develop novel watermarking methods for speech and general audio, which will serve as the focus of his dissertation.

      • Bryan Pardo (https://bryan-pardo.github.io/) is Professor of Computer Science at Northwestern University, where he leads the Interactive Audio Lab. Bryan has given over 70 invited talks at universities and conferences (e.g. Princeton, the University of Michigan, UC Berkeley, the Audio Engineering Society) and has served in editorial roles at venues such as IEEE Transactions on Audio, Speech, and Language Processing and the International Society for Music Information Retrieval, along with committee and conference chair positions.
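
    • As a minimal sketch of the traditional signal-processing approach mentioned above, here is a classical additive spread-spectrum watermark (illustrative only; the key, embedding strength, and function names are assumptions, not the tutorial's actual implementation):

      ```python
      # Minimal sketch: additive spread-spectrum audio watermarking.
      # Embed: add a low-amplitude pseudorandom +/-1 sequence derived from
      # a secret key. Detect: correlate the signal with the same sequence.
      import numpy as np

      KEY, ALPHA = 1234, 0.01  # secret key and embedding strength (assumed)

      def embed(audio: np.ndarray, key: int = KEY, alpha: float = ALPHA) -> np.ndarray:
          chip = np.random.default_rng(key).choice([-1.0, 1.0], size=audio.shape)
          return audio + alpha * chip

      def detect(audio: np.ndarray, key: int = KEY) -> float:
          chip = np.random.default_rng(key).choice([-1.0, 1.0], size=audio.shape)
          # normalized correlation: near ALPHA if watermarked, near 0 otherwise
          return float(np.dot(audio, chip) / len(audio))

      if __name__ == "__main__":
          t = np.arange(16000) / 16000               # 1 s at 16 kHz
          host = 0.3 * np.sin(2 * np.pi * 220 * t)   # test tone as host signal
          print(detect(embed(host)))  # ~0.01 (watermark present)
          print(detect(host))         # ~0.0  (watermark absent)
      ```

      A deployed system would shape the sequence perceptually (e.g. keeping it below a masking threshold) and use a statistical decision threshold robust to editing and compression; those refinements are among the topics the tutorial covers.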