My research sits at the intersection of Natural Language Processing (NLP) and speech processing. I have focused on identifying the role of prosodic information in speech and using this knowledge to produce more realistic Text-to-Speech Synthesis (TTS) systems; to detect many types of speaker state, including classic emotions such as anger, disgust, fear, happiness, sadness, and surprise, and derived states such as confidence and uncertainty, deception, trust, and charisma; and to study human-machine and human-human behavior in Spoken Dialogue Systems (SDS) and Human-Computer Interaction (HCI).
After many years working on TTS and HCI at Bell Labs and AT&T Labs Research, I moved to academia at Columbia University in 2002. Here my research group has made major contributions to the automatic identification of emotional speech (confidence, uncertainty, and the classic emotions of anger, joy, surprise, sadness, disgust, and fear), improving over previous work by using higher-level prosodic information as well as low-level features derived from pitch, speaking rate, and energy. I have also conducted numerous studies of the cross-cultural perception of charismatic speech, identifying important similarities and differences in the prosodic factors correlated with perception of charisma by American English, Palestinian Arabic, and Swedish listeners. Work by my colleagues and myself on deceptive speech has produced two cleanly recorded corpora of deceptive and non-deceptive speech, the Columbia SRI Colorado Corpus and the Columbia Cross-Cultural Corpus (CxC), totaling about 130 hours of speech. Machine learning algorithms trained on acoustic-prosodic, lexical, personality, gender, and native-language features of the CxC corpus have produced the best results to date on deception detection from spoken language alone. This work has also uncovered individual differences in deceptive behavior and in the ability to detect deception, based on culture, gender, and native language, among speakers of American English and Mandarin Chinese. Currently, we are working to identify the vocal and lexical features that characterize “trusted” voices, as well as oral indicators that one speaker trusts another.
Understanding Dialogue Systems
My lab’s work on dialogue systems has been based largely on prosodic analyses of the Columbia Games Corpus, which we collected. This work has focused on the identification of turn-taking behaviors; the detection of human corrections, “inappropriate” system responses, and likely Automatic Speech Recognition (ASR) errors; and the role of prosodic entrainment (the propensity of conversational partners to begin behaving like each other, thus appearing more likeable, intelligent, and knowledgeable). Our turn-taking work has shown that prosody reliably signals when speakers are preparing to give up the turn, and that backchannels (e.g., ok, yeah) can be distinguished from actual turn-taking behaviors by their prosody. These findings are critical for developing SDS that anticipate turn-endings via prosodic cues rather than waiting for long pauses to occur, thus speeding up the dialogue, and SDS that do not interpret backchannel indications of continued attention as attempts to take the turn. Work on corrections and ASR errors has identified new prosodic features that can be used to detect speaker attempts at correcting the system, clarification questions that a system has asked incorrectly, and speaker input that is more likely to be misrecognized – all important to improving system performance. Our studies of entrainment in conversation have shown not only that prosodic entrainment is ubiquitous, with major similarities across several different cultures (American, Chinese, Slovak, and Argentine), but also that SDS that entrain to their users can operate in real time and are preferred by their users. This work has sparked numerous new research efforts internationally on prosodic entrainment in other languages and on the possibility of creating avatars and robots that engender trust by entraining to their human partners.
Research on Low Resource Languages: Generation and Analysis
In addition to these projects, the Columbia Speech Lab is continuing work on the automatic identification of classic emotions and also on positive and negative sentiment (valence) from speech, particularly in Low Resource Languages (LRLs). LRLs are languages with few existing computational resources to aid in building ASR and TTS systems, parsers, or machine translation systems. Of the approximately 6,500 languages spoken in the world today, many are LRLs and many are spoken by millions, including languages such as Bengali, Hausa, Swahili, Telugu, Tagalog, Amharic, and Turkish. Our lab is building TTS systems for LRLs such as Turkish and Amharic from “found” data. Commercial TTS systems require very expensive recordings to be created and annotated for each new language (about $1 million per voice). However, there are vast quantities of data in many languages available on the web (e.g., audio books, news broadcasts, Bibles, and Korans) or collected for non-TTS purposes (e.g., ASR). This data, if properly filtered, can be used to create intelligible TTS systems for millions of people who speak LRLs and have no access to SDS in their own language or to synthesized web text on their phones.
Another project arising from our lab’s multicultural interests and diversity is work on detecting code-switching in text and speech. Code-switching occurs when bilinguals, in conversation or on social media, switch naturally between one language and another, either within or across sentences. It wreaks havoc with NLP tools, which are typically trained on a single language, so methods of detecting code-switch points are critical to technologies such as ASR and machine translation: they tell these systems when to switch to models trained on the other language and when to switch back. Much of this work has revealed interesting synergies, including findings of acoustic-prosodic and lexical entrainment both in the deception interviews and in code-switching behavior. Altogether, these projects not only advance the state of the art in speech and NLP research but also provide stimulating research foci for graduate and undergraduate students working together on challenging questions whose solutions will be useful to researchers and technology users alike. The 20 graduate and undergraduate students now working in the lab come from seven different countries; eleven are women, six of whom are Ph.D. students.
About the author
Julia Hirschberg is Percy K. and Vida L. W. Hudson Professor and chair of computer science at Columbia University. She previously worked at Bell Laboratories and AT&T Labs, where she created the HCI Research Department. Hirschberg served on the Association for Computational Linguistics executive board (1993-2003), the International Speech Communication Association board (1999-2007; 2005-2007 as president), the International Conference on Spoken Language Processing board (since 1996), the NAACL executive board (2012-2015), the CRA Executive Board (2013-2014; 2018–), and the AAAI Council (2012-2015). She has been editor of Computational Linguistics and Speech Communication, and is a fellow of AAAI, ISCA, ACL, ACM, and IEEE. She is also a member of the National Academy of Engineering. Hirschberg received the IEEE James L. Flanagan Speech and Audio Processing Award and the ISCA Medal for Scientific Achievement. She currently serves on the IEEE Speech and Language Processing Technical Committee, is co-chair of the CRA-W Board, and has worked to improve diversity at Bell Labs, AT&T Labs, and Columbia University.