The following Great Innovative Idea is from Philip Cohen, Professor (adj) of Data Science and Artificial Intelligence at Monash University (Melbourne, Australia) and President of Multimodal Interfaces, LLC. Cohen was one of the winners of the Computing Community Consortium (CCC)-sponsored Blue Sky Ideas Track Competition at AAAI-20. His winning paper is called Back to the Future for Dialogue Research.
Current so-called “conversational assistants” provide minimal assistance and are therefore of limited utility. Even when they succeed in engaging in a dialogue, current systems typically perform only the transactions that have been explicitly requested of them. But in our everyday human interactions, we expect people not only to understand what we literally say we want, but also to fit those desires and intentions into larger plans of action and to respond to those plans. Thus, in asking “when is Dunkirk playing tonight at the Forum theater?” you may not merely want to know what time a particular movie is playing; you probably (before the days of COVID-19) want to go to that theater in order to watch it, and will therefore need tickets. Unfortunately, current assistant systems do not attempt to infer the obvious plans underlying the user’s utterances, so they cannot proactively debug them (no tickets available) and suggest alternatives.
A second limitation of existing conversational assistants is that they can interact with only a single person at a time, rather than serve the needs of a group of people, such as a family. In order to do so, such systems will need to distinguish among different people’s beliefs, desires, intentions, etc. But these systems are too representationally weak to make such distinctions. For example, they cannot represent that a person knows her or his child’s birthdate, or that person A is trying to convince person B to join them at dinner.
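The representational distinction at issue can be illustrated with a minimal sketch: attitudes (believes, desires, intends) held by named agents toward propositions, where a proposition may itself be another agent's attitude. Everything here (the class names, the string encoding of propositions) is an invented toy, not a rendering of any particular system's logic:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Prop:
    """An atomic proposition, encoded here simply as a string."""
    content: str  # e.g., "birthdate(child) = 2014-05-01"

@dataclass(frozen=True)
class Attitude:
    """An agent's propositional attitude toward a Prop or a nested Attitude."""
    agent: str
    kind: str            # "believes", "desires", or "intends"
    proposition: object  # a Prop, or another Attitude (nesting)

# "A person knows her child's birthdate" (knowledge modeled as belief here)
alice_knows = Attitude("alice", "believes",
                       Prop("birthdate(child) = 2014-05-01"))

# "A is trying to convince B to join them at dinner": A intends that
# B come to intend to join. The nesting keeps A's and B's mental
# states distinct while still relating them to one another.
a_persuading_b = Attitude("A", "intends",
                          Attitude("B", "intends", Prop("join(A, dinner)")))

print(a_persuading_b.agent)              # the outer attitude holder: A
print(a_persuading_b.proposition.agent)  # the agent of the embedded attitude: B
```

The point of the nesting is exactly what the paragraph above demands: a system that collapses all users into one "the user" cannot even state the difference between what A believes and what A believes B intends.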
Conversational assistants should determine a meaning representation of the user’s utterance and infer what dialogue action(s) the user is performing, such as requesting, confirming, etc. On the basis of those actions, the system can hypothesize the mental state the user was intending to convey: what the user believes, desires, and intends (Cohen and Levesque, 1990). More generally, the system will also need to keep track of the joint mental states (mutual beliefs and joint intentions) that support collaborative interactions (Cohen and Levesque, 1991; Grosz and Sidner, 1990). Based on these states and the rational balance among them, the system would attempt to infer/recognize the plan that led to the user’s utterance, to debug the plan in order to find obstacles (e.g., the theater is closed, no tickets are available), to alert the user to those obstacles, and to suggest alternative ways for the user to achieve higher-level goals (e.g., to see that movie, or simply to be entertained for the evening). People learn such helpful behavior at a very young age (Warneken and Tomasello, 2006), expect others to behave this way, and would expect their assistants to do likewise. Systems will need to engage in semantic parsing in order to uncover the utterance’s meaning, perhaps using a deep learning approach. In order to engage in plan recognition, systems will need knowledge about what people generally do and about the typical preconditions and effects of those actions. The needed knowledge about actions can be gathered via crowd-sourcing and general knowledge sources, such as textual corpora. These are old ideas (Allen and Perrault, 1980; Cohen and Perrault, 1979; Cohen et al., 1982), but ones we can now fold into next-generation conversational systems.
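The recognize-then-debug loop described above can be caricatured in a few lines. The tiny plan library below (four actions with invented preconditions and effects for the movie example) is purely illustrative, not the cited systems' machinery: from an observed dialogue act, the sketch chains to actions it plausibly enables and to actions that supply still-needed preconditions, then checks which preconditions remain unmet in the current world state:

```python
# Toy plan library: action -> (preconditions, effects). All names invented.
PLAN_LIBRARY = {
    "ask_showtime":  (set(),                           {"knows_showtime"}),
    "buy_tickets":   ({"tickets_available"},           {"has_tickets"}),
    "go_to_theater": ({"knows_showtime"},              {"at_theater"}),
    "watch_movie":   ({"at_theater", "has_tickets"},   {"entertained"}),
}

def recognize_plan(observed):
    """Hypothesize the plan behind an observed action by chaining forward
    (the plan's effects enable an action) and backward (an action supplies
    a precondition the plan still needs)."""
    plan = [observed]
    changed = True
    while changed:
        changed = False
        produced = set().union(*(PLAN_LIBRARY[a][1] for a in plan))
        needed = set().union(*(PLAN_LIBRARY[a][0] for a in plan))
        for act, (pre, eff) in PLAN_LIBRARY.items():
            if act in plan:
                continue
            if (pre & produced) or (eff & (needed - produced)):
                plan.append(act)
                changed = True
    return plan

def debug_plan(plan, world):
    """Return each plan action's preconditions that neither the world state
    nor any plan action can supply -- the obstacles to alert the user to."""
    achievable = set(world).union(*(PLAN_LIBRARY[a][1] for a in plan))
    return {act: PLAN_LIBRARY[act][0] - achievable
            for act in plan if PLAN_LIBRARY[act][0] - achievable}

# "When is Dunkirk playing tonight?" -> observed dialogue act: ask_showtime.
plan = recognize_plan("ask_showtime")
# Suppose the system then learns the show is sold out (world lacks tickets):
obstacles = debug_plan(plan, world=set())
print(plan)       # includes go_to_theater, watch_movie, and buy_tickets
print(obstacles)  # buy_tickets is blocked: tickets_available is unmet
```

A real system would of course score competing plan hypotheses and reason over the mental-state representations discussed above rather than flat sets of tokens, but the shape of the computation (recognize the plan, find its obstacles, then propose repairs) is the one the paragraph describes.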
A system that can reason as above can not only become a collaborative conversational assistant for a single user, but can also serve the needs of a group of users, enabling it to track and participate in conversations among multiple participants. More generally, it could function as the core of a robotic assistant that decides how and when to be helpful, using conversational interaction as one means of doing so. Because a plan stands behind its utterances, such a system can explain why it performed its actions, including its speech actions. And because it represents, first, what it believes the user is trying to accomplish with his or her utterances, it is able to protect its own mental states: it does not simply believe what the user tells it, nor do whatever the user wants it to do. With adequate representational, decision-making, and analytical capabilities, it could be less manipulable by deceptive users.
For many years, I have been engaged in research on intelligent agents, human-computer dialogue, and multimodal interaction. My dialogue research at Monash, and previously at Voicebox Technologies, attempted to build systems that could engage in collaborative conversations. Currently I am engaged in dialogue research to help emergency call-takers identify cardiac arrest calls. A few years ago, I started a company (Adapx Inc.) that worked on multimodal interaction. Though it no longer exists, its multimodal (speech and sketch, digital pen) interaction systems are deployed by numerous corporations for field data collection, and by the Norwegian government for simulations. My most recent publishing project was co-editing (with S. Oviatt, B. Schuller, D. Sonntag, G. Potamianos, and A. Krüger) the Handbook of Multimodal-Multisensor Interfaces, Vols. 1-3, ACM Press/Morgan & Claypool, 2017-2019.
I am currently a Professor (adj) of Data Science and Artificial Intelligence at Monash University (Melbourne, Australia), and President of Multimodal Interfaces, LLC. Previously, I was Professor of Artificial Intelligence at Monash, Chief Scientist and Senior VP at Voicebox Technologies, Founder and EVP at Adapx Inc., Professor at the Oregon Health and Science University, and Senior Computer Scientist/Program Manager of the Natural Language Program in SRI International’s Artificial Intelligence Center. I am a Fellow of the Association for Computing Machinery, a Fellow of the Association for the Advancement of Artificial Intelligence, winner of the 2017 Sustained Accomplishment Award of the International Conference on Multimodal Interaction, and co-awardee (with H. Levesque) of an inaugural Influential Paper Award (2006) from the International Foundation for Autonomous Agents and Multi-Agent Systems for “Intention is Choice with Commitment,” Artificial Intelligence 42(3), 1990. Finally, I am the originator of the project at SRI International that developed the Open Agent Architecture, which eventually became SRI’s (and then Apple’s) Siri.
Allen, J. F. and Perrault, C. R. Analyzing intention in utterances, Artificial Intelligence 15, 1980, 143-178.
Cohen, P. R. and Perrault, C. R. Elements of a plan-based theory of speech acts, Cognitive Science 3(3), 1979.
Cohen, P. R. and Levesque, H. J. Intention is choice with commitment, Artificial Intelligence 42(3), 1990.
Cohen, P. R., Perrault, C. R., and Allen, J. F. Beyond question answering, in Strategies for Natural Language Processing, W. Lehnert and M. Ringle (eds.), 1982.
Cohen, P. R. and Levesque, H. J. Teamwork, Noûs 25(4), 1991.
Grosz, B. J. and Sidner, C. L. Plans for discourse, in Intentions in Communication, P. R. Cohen, J. Morgan, and M. E. Pollack (eds.), MIT Press, 1990.
Oviatt, S., Schuller, B., Cohen, P. R., Sonntag, D., Potamianos, G., and Krüger, A. (eds.). The Handbook of Multimodal-Multisensor Interfaces, Vols. 1-3, ACM Press/Morgan & Claypool, 2017-2019.
Warneken, F. and Tomasello, M. Altruistic helping in human infants and young chimpanzees, Science 311(5765), 2006, 1301-1303.