The Yale University Librarian Rutherford D. Rogers once said, “We’re drowning in information and starving for knowledge.” Acquiring data is only the first step toward gaining knowledge; processing and analysis must follow to reveal any information embedded in a dataset. However, over the last two decades, digitized data (both automated and manually collected) has accumulated at a rate that has surpassed the capabilities of our current processing methods. Researchers, policy makers, analysts, and investigators demand new analysis methods.

If granted, this proposal will meet that demand by merging two areas of research: Visual Analytics and Bayesian Statistics. Visualization techniques are useful in that they provide unique, low-dimensional views of complex datasets that illuminate hidden structure or trends and promote human-data interaction. However, current visualizations do not incorporate the notions of uncertainty that are vital for accurate data interpretation. Statistical methods, on the other hand, use mathematical models to reveal important characteristics of the data and provide model uncertainty assessments so that data-informed hypotheses may form; however, assessing statistical results quickly and thoroughly enough to make fast, hard decisions can be a challenge. Therefore, the fundamental deliverable of this grant is Bayesian visualization software that presents probabilistic, low-dimensional representations of data.

Intellectual merit: Current visualizations display inflexible, deterministic transformations of data that inherently separate data visualization from visual synthesis. Namely, analysts cannot manipulate displays to inject domain-specific knowledge and formally assess the merger of their expert judgment with the data. However, if the nature of the data transformation changes from deterministic to probabilistic, manipulations of a display become quantitatively interpretable. Thus, a new visualization model called Malleable Visualization is developed that relies on editable representations to promote bidirectional flow between the analyst and the data.

With the use of probabilistic data transformations, one data display may represent information from multiple sources and/or on different scales. In this proposal, Bayesian meta-analysis techniques are developed to merge multi-source and multi-scale data for visualization. However, in some cases, the Bayesian machinery might fail due to computational limitations and the size of the dataset. Thus, as part of this project, an algorithm called Stratified Uniform Bayesian Sampling (SUBS) is developed that can be applied in any field and enables the assessment of large datasets based on several manageable subsets.
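The proposal does not spell out the SUBS algorithm, but the general idea it names — drawing uniform samples within strata of a large dataset and assessing the whole through several manageable subsets — can be sketched as follows. This is a hypothetical illustration; the function names and the simple averaged-estimate combination are invented here, not the authors' implementation:

```python
import random

def stratified_uniform_subsets(data, n_strata, subset_size, seed=0):
    """Partition sorted data into equal strata, then draw a uniform
    random sample from each stratum (illustrative sketch of the SUBS idea)."""
    rng = random.Random(seed)
    data = sorted(data)
    stratum_len = len(data) // n_strata
    subsets = []
    for i in range(n_strata):
        stratum = data[i * stratum_len:(i + 1) * stratum_len]
        subsets.append(rng.sample(stratum, min(subset_size, len(stratum))))
    return subsets

def combined_mean(subsets):
    """Combine per-subset estimates into a single overall estimate."""
    means = [sum(s) / len(s) for s in subsets]
    return sum(means) / len(means)

large_data = list(range(100_000))   # stand-in for a large dataset
subsets = stratified_uniform_subsets(large_data, n_strata=10, subset_size=50)
estimate = combined_mean(subsets)   # close to the full-data mean of 49999.5
```

Because each stratum is sampled uniformly, the averaged subset estimates track the full-data summary while each individual analysis touches only a small, manageable slice.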

Broader Impact: While our proposed research will immediately impact how analysts discover new information in very large datasets, our methodology will transition smoothly into the classroom environment. In many situations, the Bayesian paradigm is presented as a complex mathematical framework that takes many years to master. In fact, Bayesian courses are typically reserved for advanced students and are often overlooked at the undergraduate level. While we promote learning the fundamental mathematical theory, we also promote learning Bayesian statistics at an intuitive level. Our combination of Bayesian methodology and visual analytics will serve as a launch pad for understanding the role of prior information. Students may assess the impact of various levels of prior information on probability models through visual representations and our sense-making feedback loop. In turn, students with varying backgrounds, interests, and talents, including undergraduates and underrepresented groups with limited academic history, will now have the opportunity to learn 21st century statistics. 
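The effect of “various levels of prior information” that students would explore visually can be illustrated with the simplest conjugate case, a Beta-Binomial model. This is an illustrative sketch of the underlying arithmetic, not the proposed software:

```python
# Beta-Binomial: posterior mean after observing k successes in n trials,
# under Beta(a, b) priors encoding different amounts of prior information.
def posterior_mean(k, n, prior_a, prior_b):
    return (prior_a + k) / (prior_a + prior_b + n)

k, n = 7, 10                              # observed: 7 successes in 10 trials
weak   = posterior_mean(k, n, 1, 1)       # flat prior: stays near 7/10
strong = posterior_mean(k, n, 50, 50)     # strong prior centered at 0.5
```

A flat Beta(1, 1) prior leaves the posterior mean near the observed rate, while a strong Beta(50, 50) prior pulls it toward 0.5; this is exactly the kind of contrast a visual feedback loop makes immediate for students without formal Bayesian training.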

We are in the midst of an “information revolution” [Champkin, 2011]. In 2000, researchers at the University of California, Berkeley reported that the world produces one to two billion gigabytes of “text, numbers, images, sounds, and other forms of information that are deemed important by humans for different purposes” [Thomas et al., 2001]. Since then, the amount of data available globally is speculated to have grown exponentially, yet many U.S. college graduates do not have the skills to learn from it.

Innovations in pedagogy and curriculum development for introductory, undergraduate Data Analytics (DA) have not paralleled the dramatic advancements in data collection technology. To learn from data, it takes both 1) DA skills to access, process, summarize, and interpret large, unruly datasets and 2) comprehensive critical thinking skills to compartmentalize large problems into manageable pieces, formulate and evaluate solutions with quantitative and/or qualitative rigor, make judgements that assimilate current information with new data, and reflect upon the objectivity and/or constraints of those judgements. If this proposal succeeds, first- and second-year college students with varying backgrounds, interests, and talents will have the opportunity to gain all of the skills necessary to learn from data.

Intellectual Merit: A new course is proposed, called “Critical Thinking with Data Visualization” (CTDV). This course is unique in that it uses novel interactive data visualization software (and the techniques therein) as a platform for students to build from what they know and construct their understanding of 1) how to think critically, 2) the role of data in critical thinking, and 3) the mathematical and computational methods taught in class to summarize high-dimensional data. Crucially, DA and critical thinking are taught in tandem so that students do not need to master complex quantitative methods before experiencing how to use data for problem solving.

The interactive data visualization software will be designed, developed, and tested for usability. The software is based on methods developed by the PIs of this proposal called Bayesian Visual Analytics (BaVA). BaVA enables domain experts, e.g., students, to incorporate their domain knowledge within quantitatively rigorous data characterizations, without technical training in statistics or computer science. For this reason, BaVA fosters creative, critical thinking with data.

Broader Impacts: The impacts of this proposal are profound. First, the units within CTDV are designed so that any STEM professor or secondary education teacher may select one or more to incorporate into their curricula. Although CTDV is designed for first and second year college students, the CTDV units are easy to adjust for varying target audiences.
Second, CTDV focuses on four case studies (one is shared in the proposal) to provide real-world problems for which students may suggest solutions. The intent is to engage students with diverse backgrounds and academic histories. For example, most students have opinions (voiced or unvoiced). The case studies draw on those opinions and promote conversation among students and professors so that students, regardless of their personal backgrounds, cultures, mathematical talents, and computational experiences, gain 1) the confidence to engage in technical discussions and 2) the motivation to learn complex DA methods. With this in mind, CTDV provides an excellent opportunity to recruit and retain students in STEM disciplines.

Finally, students of today are tomorrow’s doctors, government officials, and industrial leaders. Students who graduate having taken CTDV will make informed, data-supported decisions that will impact our country. As pointed out by the 2009-2010 U.S. Director of the Office of Management and Budget, Peter Orszag, “Robust, unbiased data (analyses) are the first step toward addressing our long-term economic needs and key policy priorities.”

The goal of this project is to promote creative, interactive data exploration through the novel combination of the physical and virtual worlds. Our fundamental assertion is that data exploration, and the necessary understanding of the complex analytical methods behind it, is stymied by its traditional strict confinement to the virtual world. The important concepts and insights are veiled behind small screen portals and simplistic interaction mechanics. Lakoff and Nunez point out that human understanding of mathematical concepts is rooted in physical, embodied interactions. How can we bring complex virtual concepts related to data analytics, such as dimensionality reduction of high-dimensional data, into the physical interactive world?

We begin, in particular, by targeting educational scenarios of students learning to explore complex data, with the hope of expanding towards more advanced data analytic scenarios in the future. We find that, from the beginning, students think simplistically about data and data exploration due to a lack of appropriate physical and interactive metaphors. In classroom analytical exercises, we observe that many students lack “cognitive high-dimensionality”, and focus on one dimension of the data at a time. Conceptualizing high-dimensional data is not easy; it is inherently subjective and uncertain. Yet, we have also observed that there is hope, that with appropriate interactive mechanisms students can learn to conceptualize multiple features in datasets.

Our approach is to combine the physical and virtual to enable students to “be the data”. Students take an egocentric perspective on the data, interacting with each other in collaborative groups to explore interactions among data points. We exploit a unique new physical space, the ICAT Cube, novel interactive media for augmented reality, advanced interactive technologies for physical-virtual cross-overs, large display systems, and our recent ground-breaking research in direct manipulation of high-dimensional models.

Our primary objective is to design prototype systems in which students can enter a physical space and embody virtual data points. Students can then physically move about the space to explore relationships among their data points. Their movement in the space is tracked so as to apply these physical interactions to the virtual mathematical models. Their data points are also displayed dynamically in the space (e.g., on the floor or other large display area), and students can chase their data points around the space as the mathematical models update. The real space is augmented with virtual information about relationships between data points and dimensions. We then conduct formative experimental usability studies to evaluate the impact of these mixed physical/virtual environments on students’ understanding of complex analytical models, such as dimensionality reduction.

Gaining big insight from big data requires big analytics, which poses big usability problems. Analyses of big data often rely on several computational and statistical models that operate on multiple levels of data scale to discover and characterize latent data structure. The models work jointly or in sequence to filter, group, summarize, and visualize big data so that analysts may assess the data. As a simple example in big text analytics, massive text is first sampled for relevant or representative words, then further reduced by topic modeling, then visualized by applying a dimension reduction algorithm. As the size of data increases, so does the number of models and, likewise, the need for human interaction in the analytical process. By interacting, humans inject expert judgment into the analytical process, and efficiently explore and make sense of big data from varying perspectives. However, because of complex low-level parameters and enforced premature formality, interacting with any individual model is difficult, and now, there is a need to interact with a growing number of models. In this proposal, current human-computer-interaction research is merged with complex statistical methods and fast computation to make big data analytics usable and accessible to professional and student users.
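The three-stage text pipeline described above — sample relevant words, reduce by topic modeling, then project for visualization — can be sketched with toy stand-ins for each stage. Real systems would use retrieval indices, a topic model such as LDA, and MDS or PCA; all function names here are hypothetical simplifications:

```python
from collections import Counter

def sample_relevant_words(docs, top_k):
    """Stage 1 stand-in: reduce massive text to its most frequent terms."""
    counts = Counter(w for doc in docs for w in doc.split())
    return [w for w, _ in counts.most_common(top_k)]

def doc_term_vectors(docs, vocab):
    """Stage 2 stand-in: represent each document over the reduced vocabulary
    (a real pipeline would fit a topic model here)."""
    return [[doc.split().count(w) for w in vocab] for doc in docs]

def project_2d(vectors):
    """Stage 3 stand-in: crude 'dimension reduction' to two summary axes
    (total term count, vocabulary coverage) for plotting."""
    return [(sum(v), sum(1 for x in v if x > 0)) for v in vectors]

docs = ["big data needs big analytics",
        "topic models summarize text data",
        "visual analytics of text"]
vocab = sample_relevant_words(docs, top_k=5)
points = project_2d(doc_term_vectors(docs, vocab))   # one 2-D point per doc
```

The point of the sketch is the composition: each stage consumes the previous stage's reduced output, so the number of chained models — and the number of places a user may want to intervene — grows with the data.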

Our solution is to scale up Visual to Parametric Interaction (V2PI) to a new framework called Multi-scale V2PI (MV2PI). V2PI currently supports usable small-data analytics, and enables users to adjust model parameters by interacting directly with data in a visualization. That is, V2PI interprets visual interactions quantitatively to update parameters and produce new visualizations. MV2PI is a new interactive framework that links together multiple models operating at multiple levels of data scale in a unified interactive space. Model results are combined into a common visual representation. Directly manipulating the small-scale visual representation propagates to larger scale models by inverting the models to update their parameters, ultimately producing a new output result. In the text analytics example, if the user drags several data points together to hypothesize a cluster, the inverted dimensionality reduction model computes updated dimension weights, queries relevant new hits at the large scale, identifies changed topics, and updates the layout to show big-data support for the new cluster. This approach enables users to interactively explore large-scale data and complex inter-relationships between models in real time, and in a usable fashion that directly supports their natural cognitive sensemaking process. Intellectual merits are in the fundamentally novel approach to interactively combining multiple statistical data models across levels of data scale to enable usable big-data analytics. 
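A minimal sketch of the "model inversion" step — inferring per-dimension weights from a user's dragged layout — might look like the following, under the assumption that each dimension's weight is fit by least squares against the distances the user implied by dragging. The function name and the fitting rule are illustrative, not the published V2PI algorithm:

```python
def infer_weights(X, target_sq_dists, pairs):
    """X: high-dimensional points; pairs: (i, j) index pairs the user arranged;
    target_sq_dists: desired squared distance for each pair after dragging.
    For each dimension k, fit a non-negative weight so that the dimension's
    squared differences best track the user-specified distances, then
    normalize the weights to sum to 1."""
    n_dims = len(X[0])
    weights = []
    for k in range(n_dims):
        diffs = [(X[i][k] - X[j][k]) ** 2 for i, j in pairs]
        num = sum(d * t for d, t in zip(diffs, target_sq_dists))
        den = sum(d * d for d in diffs) or 1.0
        weights.append(max(num / den, 0.0))      # keep weights non-negative
    total = sum(weights) or 1.0
    return [w / total for w in weights]

# Points agree closely in dimension 0 but spread widely in dimension 1.
# The user drags them into a layout matching the small dim-0 spacing, so
# dimension 1 (which separates them) should be down-weighted.
X = [[0.0, 0.0], [0.1, 5.0], [0.2, 10.0]]
pairs = [(0, 1), (1, 2), (0, 2)]
w = infer_weights(X, target_sq_dists=[0.01, 0.01, 0.04], pairs=pairs)
```

In the full MV2PI pipeline, weights inferred this way at the visualization scale would then drive queries and re-weighting in the larger-scale models upstream.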
This research will (1) create the conceptual MV2PI pipeline, and identify alternatives for communication flow between models, visualization, and interaction, including possible shared parameters; (2) establish several new useful models, covering different levels of scale, that support the V2PI model inversion approach to machine learning and can operate within the new pipeline; (3) develop new computational methods for high-performance updates to inverted models in support of real-time interaction with MV2PI; and (4) evaluate the usability of MV2PI and measure its impact on human sensemaking in big data analytics. Broader impacts stem from bringing attention to the critical role of usability in big data analytics. The outcomes of this research include (1) clear impacts of making big-data analytics accessible to end users who are experts in various data domains, but not in advanced statistical data models and algorithms; (2) development of educational programs in support of pedagogy for exploratory analytical thinking in the context of big data; (3) establishing a workshop focused on usability in big-data analytics to increase awareness and promote collaboration between computational and usability researchers; (4) outreach to government agencies with needs in big text analytics, through our involvement in DHS VACCINE and the national laboratories; and (5) involvement of diverse student populations in the research project as evidenced by our strong track record in diversity and undergraduate research.

This project is supported by NSF BigData grant IIS-1447416, Usable Big-Data Analytics via Multi-Scale Visual Interaction.