By H. V. Jagadish
I study how data and people interact. For more than a decade, I have been studying how to help humans access and manage information. While there is a lot of good work on human-computer interaction and on data visualization, much less work exists on “human-data interaction.” Why can anyone use Google to get information of interest while it is so difficult to get useful information from a structured database? The difference lies in the specificity of the request. A web search engine receives your request and tries to guess your intention. You know that it has a limited understanding of your need, and are happy to have it get you into “the zone,” from where you can explore for yourself. On the other hand, a traditional database query engine can give you complete answers to complex questions but requires that you precisely specify your query. If you make a small mistake, you are out of luck. Wouldn’t it be helpful to devise database query mechanisms that you can actually use, and that return reasonable results even if your query is not posed exactly right? Complementarily, can the system help you ask a better question in the first place? Similar concerns also apply to the creation of a database, and to helping users manage their data.
With these motivations, my research has addressed a number of specific issues: how to summarize a complex schema for a user; how to design better web forms; how to give users a hint of the answer as they are specifying a query; how to fill in missing parts of a query; how to understand queries specified in English; and how to generalize from user-provided examples. To address these challenges, there are three main sources of information we can exploit: the schema, or the structure of the database; the actual data, such as specific values and their statistics; and the log, namely queries previously asked by this user or others. We use all three sources of information when appropriate.
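To make the three sources concrete, here is a minimal sketch of how a system might rank completions for a partially typed column name. It is an illustration only, not the method used in the research described here: the toy schema, data values, query log, scoring weights, and the function name `suggest_columns` are all assumptions invented for this example.

```python
from collections import Counter

# Hypothetical toy inputs; none of these names come from the article.
SCHEMA = {
    "employees": ["name", "department", "salary"],
    "departments": ["dept_name", "budget"],
}
DATA = {"department": {"Sales", "Engineering"}, "name": {"Ann", "Bob"}}
QUERY_LOG = ["salary", "department", "salary", "budget"]  # columns used in past queries


def suggest_columns(prefix, typed_value=None, limit=3):
    """Rank schema columns as completions for a partially typed column name."""
    log_counts = Counter(QUERY_LOG)
    scored = []
    for table, columns in SCHEMA.items():
        for column in columns:
            # Schema: only columns whose name starts with the prefix qualify.
            if not column.startswith(prefix.lower()):
                continue
            # Log: prefer columns that earlier queries asked about.
            score = log_counts[column]
            # Data: boost a column that actually contains the value typed so far.
            # The weight of 10 is an arbitrary assumption for this sketch.
            if typed_value is not None and typed_value in DATA.get(column, set()):
                score += 10
            scored.append((score, f"{table}.{column}"))
    # Highest score first; ties broken alphabetically for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


print(suggest_columns("d"))
print(suggest_columns("", typed_value="Ann"))
```

In this sketch the schema decides which completions are possible at all, while the log and the data only reorder them. A real system would likely weigh these signals more carefully, but the division of labor among the three sources is the point of the example.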
Recently, Big Data has become a popular term, along with Data Science. As a result, there has been a large increase in the number of people managing and analyzing data, many of whom possess little training in information management. These people cannot be successful unless we have systems that adequately support human-data interaction. As such, there is a greater need than ever for research in human-data interaction, and also a greater appreciation of it.
When we think of human-data interaction, we usually focus on how to facilitate this interaction. Indeed, most of my work has been in this area. However, in the last couple of years, I have started thinking about a completely different aspect of humans’ interaction with data: how Big Data impacts humans. The first thing one thinks of in this context is, of course, privacy. Also, fairness has recently started to get some attention. But many other research and societal issues warrant our consideration. For example, what constitutes a representative set of training data? How do we make it difficult for people to fudge data or game the system? What is the impact of data science methods on diversity in education, employment, and other areas? How do we think about the oversight of research, such as through an institutional review board, when the “experiment” is to perform analysis on data that have already been collected? I blog about these issues at http://bigdatadialog.com.
Most importantly, I feel strongly that technical people should speak up about the ethical issues surrounding data. They are the only ones with a technical understanding of Data Science algorithms and processes. If they do not speak up, societal decisions will get made with less understanding of the technology. Therefore, technical people should address their work’s consequences. To enable data scientists to do this, there must be education about ethics and responsible data analysis as part of every Data Science degree. To facilitate such learning, I have developed a MOOC and made it available with a Creative Commons license. The course material is at https://www.edx.org/course/data-science-ethics-michiganx-ds101x. Please feel free to peruse this modular study material and to adapt or reuse it in your own teaching.
About the Author
H. V. Jagadish is the Bernard A. Galler Collegiate Professor of Electrical Engineering and Computer Science, and a distinguished scientist at the Institute for Data Science, at the University of Michigan in Ann Arbor. Previously, he was director of the Software Systems Research Laboratory at the University of Michigan from 2007 to 2014.
Jagadish has written approximately 200 major papers about information management and has 37 patents. He has served on the board of the Computing Research Association since 2009, and was an author of the recent CRA report on data science. He has participated in several National Academies efforts, including its panel on improving federal statistics using multiple data sources. He is an ACM fellow, and his many awards include the ACM SIGMOD Contributions Award in 2013 and the University of Michigan’s David E. Liddle Research Excellence Award in 2008.