Sirius: Visualization of Mixed Features as a Mutual Information Network Graph
Jane L. Adams, Todd F. DeLuca, Yuhang Zheng, Konstantin Anastasakis, Boyoon Choi, Alison Min, Michael Bessey, Christopher M. Danforth, Peter Sheridan Dodds
Data scientists across disciplines are increasingly in need of exploratory analysis tools for data sets with a high volume of features of mixed data type (quantitative continuous and discrete categorical). We introduce Sirius, a novel visualization package for researchers to explore feature relationships among mixed data types using mutual information. The visualization of feature relationships aids data scientists in finding meaningful dependence among features prior to the development of predictive modeling pipelines, which can inform downstream analysis such as feature selection, feature extraction, and early detection of potential proxy variables. Using an information theoretic approach, Sirius supports network visualization of heterogeneous data sets (consisting of continuous and discrete data types), and provides a user interface for exploring feature pairs with locally significant mutual information scores. Mutual information algorithm and bivariate chart types are assigned on a data type pairing basis (continuous-continuous, discrete-discrete, and discrete-continuous). We show how this tool can be used for tasks such as hypothesis confirmation, identification of predictive features, suggestions for feature extraction, or early warning of data abnormalities. All code and supplemental materials can be accessed at https://osf.io/pdm9r.
We compare Sirius to other tools for exploratory analysis of high-dimensional data along our criteria: (C1) support mixed data types, (C2) show dimension space and data item space, (C3) do not distort raw data, (C4) ensure system modularity, and (C5) open source project. We show that while existing methods address some of these criteria, our users needed a system that met all of them, and therefore Sirius is an important addition to the ecosystem of tools for high-dimensional exploratory analysis.
In this paper, we contribute a novel tool for handling mixed data types in a high-dimensional exploratory data analysis pipeline without binning or projection. Sirius is a proof of concept implementation for this novel visual rendering approach, which relies on automated selection of 1) mutual information algorithm and 2) subsequent bivariate chart type, based on data type. Sirius is one of the first systems to handle both discrete and continuous data for extremely high-dimensional data, and the first to not discretize continuous data. By keeping continuous data continuous and not projecting data points into lower-dimensional space, we enable data analysts to retain an un-obfuscated view of raw record-level data, while providing an overview of feature dependencies in a graph-aware manner.
We couch our system within the technique classification, visualization pipeline, and user interaction paradigms of existing high-dimensional analysis tools.
There are four main views in the Sirius interface: A. The network graph of features is displayed, with a dropdown for selecting the network layout algorithm. B. When a user hovers over an edge in (A), the bivariate chart for that particular feature pair is shown in (B). C. The alpha significance chart shows metadata for the significance value, the number of connected components, and the size ratio of node count for the first and second largest components. D. The matrix view shows the mutual information values of pairs of features, sparsified by the alpha significance value and sorted using hierarchical clustering.
T1 Confirmatory Analysis: A bivariate plot showing House Style vs. Second Floor Square Footage, indicating a relationship between multi-level homes and higher second floor square footage. T2 Suggested Feature Extraction: A bivariate plot is displayed showing “Year Built” compared to the “Garage Year Built”. T3 Identify Predictive Potential: Here we see a suggested feature plot from Sirius comparing “Garage Cars” to “Garage Area”. T4 Raise Anonymization Warnings: A screenshot from Sirius showing a bivariate plot of “PoolQC” versus “ID”. “PoolQC” is a quality measure which indicates ’Excellent’ (’Ex’), ’Fair’ (’Fa’), or ’Good’ (’Gd’).
The mutual information algorithm is chosen based on the feature types.
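As a minimal sketch of this type-based dispatch, the Python below classifies a feature pair into one of the three pairings and, for the discrete-discrete case only, computes a plug-in mutual information estimate from counts. The function names `pairing` and `discrete_mi` and the type labels are hypothetical and are not taken from the Sirius code base; the estimators used for the pairings that involve continuous data are not shown here.

```python
from collections import Counter
from math import log


def pairing(type_a, type_b):
    """Return the data type pairing for two features (hypothetical labels)."""
    kinds = {type_a, type_b}
    if not kinds <= {"continuous", "discrete"}:
        raise ValueError(f"unknown feature type in {sorted(kinds)}")
    if kinds == {"continuous"}:
        return "continuous-continuous"
    if kinds == {"discrete"}:
        return "discrete-discrete"
    return "discrete-continuous"


def discrete_mi(xs, ys):
    """Plug-in mutual information, in nats, between two discrete sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) combination
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log(c * n / (px[x] * py[y]))
    return mi
```

Two perfectly dependent binary features give an estimate of log 2 nats, and two independent ones give 0, which is a quick sanity check for any estimator plugged into the discrete-discrete branch.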
The value of the backbone method as a dynamic thresholding sparsification approach, as compared to static thresholding, is that backbone sparsification is network-aware. In real-world data contexts, often certain features will be densely connected to their neighbors with respect to mutual information, because they are near-identical matches for one another. For example, in a healthcare setting, “hemoglobin apache” might simply be a value that is computed from “hemoglobin 1a” and “hemoglobin 1b”. Therefore, thresholding the network graph by the mutual information value alone at a static threshold would result in the visual prioritization of extremely dense clusters, but miss the statistically interesting feature relationships. We use backbone thresholding to assign an alpha significance value to each edge based on the relative weights of other edges in the network, in order to sparsify the matrix in a more graph-aware manner.
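The contrast with static thresholding can be illustrated with a short Python sketch. It assumes the backbone method is the disparity filter (Serrano, Boguñá, and Vespignani), in which an edge of weight w at a node with degree k and total incident weight s receives alpha = (1 - w/s)^(k-1); the text above does not spell out the formula, so this choice is an assumption. The function names and the toy weights are hypothetical.

```python
from collections import defaultdict


def edge_alphas(weights):
    """Map each undirected edge (u, v) to a disparity-filter alpha value.

    weights: dict mapping (u, v) tuples to positive edge weights.
    """
    strength = defaultdict(float)  # total incident weight per node
    degree = defaultdict(int)      # number of incident edges per node
    for (u, v), w in weights.items():
        strength[u] += w
        strength[v] += w
        degree[u] += 1
        degree[v] += 1

    def node_alpha(node, w):
        k = degree[node]
        if k < 2:
            return 1.0  # assumed convention: a lone edge cannot be tested
        return (1.0 - w / strength[node]) ** (k - 1)

    # An edge takes the smaller (more significant) of its two endpoint values.
    return {
        (u, v): min(node_alpha(u, w), node_alpha(v, w))
        for (u, v), w in weights.items()
    }


def backbone(weights, threshold):
    """Keep only edges whose alpha is below the significance threshold."""
    alphas = edge_alphas(weights)
    return {edge: weights[edge] for edge, a in alphas.items() if a < threshold}


# Toy network: a, b, c form a dense, near-redundant cluster; d is weakly
# tied to a and b but comparatively strongly tied to c.
toy = {
    ("a", "b"): 0.9, ("a", "c"): 0.9, ("b", "c"): 0.9,
    ("a", "d"): 0.05, ("b", "d"): 0.05, ("c", "d"): 0.3,
}
kept = backbone(toy, threshold=0.1)
```

In this toy network a static cutoff on weight (say 0.5) would keep only the three edges inside the dense cluster, whereas the alpha threshold of 0.1 keeps only the c-d edge, because that edge accounts for most of d's total weight.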
When the alpha significance is at 0, no nodes are connected, and thus the number of components C is 0 (α1). As the alpha threshold for the network is adjusted upwards, the number of edges increases, creating more connected components. At a certain alpha threshold (here, α3), the number of connected components in the graph reaches a maximum (here, nC = 5 at α3), before decreasing as discrete components become connected to one another, finally resulting in a single connected component encompassing the entire graph (α5).
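The rise and fall of the component count can be reproduced with a small Python sketch that sweeps the threshold and counts components with a union-find structure. Following the caption's convention, only nodes that have at least one retained edge are counted, so the count is 0 when no edges are kept; that reading of the convention, the function names, and the toy alpha values are all assumptions made for illustration.

```python
def component_sizes(edges):
    """Return component sizes (largest first) among nodes touched by edges."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v  # merge the two components

    sizes = {}
    for node in list(parent):
        root = find(node)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


def sweep(alphas, thresholds):
    """For each threshold, report (threshold, component count, sizes)."""
    rows = []
    for t in thresholds:
        kept = [edge for edge, a in alphas.items() if a < t]
        sizes = component_sizes(kept)
        rows.append((t, len(sizes), sizes))
    return rows


# Hypothetical per-edge alpha values for a six-node graph.
toy_alphas = {
    ("a", "b"): 0.01, ("c", "d"): 0.02, ("e", "f"): 0.03,
    ("b", "c"): 0.5, ("d", "e"): 0.9,
}
rows = sweep(toy_alphas, [0.0, 0.05, 0.6, 1.0])
```

For these toy values the component count goes 0, 3, 2, 1 as the threshold rises, and the ratio of the two largest component sizes (for example 4 to 2 at threshold 0.6) can be read directly from the returned size lists.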
We show the four main components of the Sirius exploratory analysis dashboard: A) The mutual information feature network, in which discrete and continuous features are both represented as nodes in a graph with edges weighted by mutual information and sparsified according to their statistical significance. B) The bivariate chart view, which renders a chart showing the raw data for two features of interest based on the selected edge from the network graph in (A). A major contribution of this system is the automatic selection of one of three visual encodings, informed by the data types chosen. C) The alpha significance explorer, which allows users to review information related to the statistical significance of mutual information edges in the network graph and, optionally, to customize the significance threshold applied to the network. D) The matrix representation of the feature network shown in (A), shown as a heatmap where color corresponds to the mutual information values and features are sorted by information gain using hierarchical clustering.