Know Your Data
Documentation

Introduction

Know Your Data is a tool to help researchers, engineers, product teams and policy teams explore datasets, improve data quality and mitigate bias issues.

KYD aims to answer the following questions:

  • Is my data corrupted? (e.g. broken images, garbled text, bad labels)
  • Is my data sensitive? (e.g. does it contain humans or explicit content?)
  • Does my data have gaps? (e.g. a lack of daylight photos)
  • Is my dataset balanced across various attributes?
NOTE
The initial external release of Know Your Data supports image datasets available through the TensorFlow Datasets API. We plan to support text and tabular datasets soon.
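For reference, these datasets can also be loaded directly with the standard TFDS API. The snippet below is a minimal sketch using coco_captions (the dataset used in the Relations examples later in this documentation); it relies only on plain TFDS calls, not on any KYD API.

```python
import tensorflow_datasets as tfds

# Minimal sketch: load a KYD-compatible image dataset via the TFDS API.
# "coco_captions" is just an example; any image dataset listed in the
# Datasets tab can be loaded the same way.
dataset, info = tfds.load("coco_captions", split="train", with_info=True)

print(info.features)             # the source (ground-truth) features exposed by TFDS
for example in dataset.take(1):
    print(list(example.keys()))  # feature names of a single item
```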

UI Overview

The UI is organized in tabs.

The header bar

The header shows the different tabs:

  • Datasets – A list of all the datasets that are visualized by Know Your Data
  • Stats – Global statistics of the features for a given dataset
  • Item – Metadata associated with an individual item in the dataset
  • Relations – Correlations between two different features

The header also shows the currently selected dataset name, as well as links on the right-hand side to dataset information, a feedback form, and this documentation.

The main panel (anything below the header) splits into two vertical sections:

  • The left side shows the contents of the currently selected tab.
  • The right side shows the Item Browser, an infinitely scrollable list of the currently selected items.

Datasets tab

The Datasets tab shows a list of all the datasets that are currently available in Know Your Data, along with a few sample thumbnails and the dataset size.

Stats tab

The Stats tab shows histograms of all the source features (also referred to as ground-truth features) in the dataset. In addition to those, Know Your Data augments the dataset with derived features, such as Cloud Vision labels and image similarity clusters.

The main goal of the Stats tab is to give a quick overview of the distribution of feature values and to allow the user to filter the data and explore subsets.

Clicking on a specific histogram bar applies a filter for that value on the dataset. When a filter is applied, all other histograms, as well as the Item Browser, react to that filter. Multiple filters can be chained together.

The same action can be achieved by clicking the “Add Filter” button.

Add filter button
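To make the filter-chaining behaviour concrete, here is a conceptual sketch that assumes chained filters combine as a logical AND over feature values; the feature names, values, and item representation are hypothetical and do not reflect KYD internals.

```python
# Conceptual sketch only: assumes chained filters act as a logical AND.
# Feature names, values, and the item representation are hypothetical.
dataset = [
    {"similarity/cluster": 7, "caption_words_gendered": ["girl"]},
    {"similarity/cluster": 3, "caption_words_gendered": ["man"]},
]

filters = [
    lambda item: item["similarity/cluster"] == 7,
    lambda item: "girl" in item["caption_words_gendered"],
]

# An item is shown in the Item Browser only if it satisfies every active filter.
selected = [item for item in dataset if all(f(item) for f in filters)]
print(selected)  # -> the first item only
```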

Item browser

The Item Browser, located on the right side, is an infinitely scrollable list of item thumbnails.

The control bar located at the top of the Item Browser lets you run various queries on the dataset:

  • Select and show metadata
  • Draw annotations
  • Group items that share the same feature
  • Sort the results by a feature value
TIP
One of the derived features in Know Your Data is similarity→cluster, which groups visually similar images together.

Item tab

The Item tab shows all (source & derived) features associated with a specific item in the dataset. Clicking on a thumbnail in the item browser switches you to the Item tab and shows the metadata associated with that item.

Relations

In addition to browsing individual signals, KYD allows you to explore the relation between two different signals. For example, we can measure the correlation between Cloud Vision Labels and caption_words_gendered in the Coco Captions dataset.

There are two ways to open the Relations table:

  • Using the menu in the stats card
  • Selecting features in the Relations tab

Each cell indicates either a positive (blue) or negative (orange) correlation between two specific signal values, along with the strength of that correlation. The metric we use is inspired by the research of Aka et al., AIES '21 and is closely related to pointwise mutual information (PMI), which tells us whether two feature values co-occur more or less often than expected by chance.

To explain the values in the table, let's hover over a specific cell:

Based on the color of this cell, we can see that this dataset lacks images that have both the word girl in the caption and the Cloud Vision label Baseball park. In particular, it shows:

  • 1061 images with Baseball park (row counter).
  • 4057 images with girl (column counter).
  • 11 images with Baseball park & girl, but expected 112 by chance, which is 10.18x less than expected.

The expected count is computed based on the global counts of Baseball park and girl assuming no correlation between those signals.
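The sketch below illustrates this computation under the independence assumption. The dataset total is a made-up placeholder, so the resulting numbers only approximate the 112 and 10.18x shown above.

```python
# Sketch of the expected-count computation, assuming the two signals are
# independent. `total_items` is a hypothetical dataset size, not the actual
# size of Coco Captions.
def expected_count(count_a: int, count_b: int, total_items: int) -> float:
    """Co-occurrence count expected by chance if the signals were independent."""
    return count_a * count_b / total_items

def strength(observed: int, expected: float) -> float:
    """PMI-style ratio: >1 means more co-occurrence than chance, <1 means less."""
    return observed / expected

expected = expected_count(count_a=1061, count_b=4057, total_items=40_000)
print(round(expected, 1))                # expected co-occurrences by chance
print(round(strength(11, expected), 2))  # well below 1: co-occurs less than chance
```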

Sort Order & Searching

Since the table can have many rows (e.g. thousands for Cloud Vision labels), the UI shows the first 100 rows and provides a search box at the top of the page to search for specific rows. The rows are automatically sorted by the strongest correlation in any of their cells.
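One way to read "strongest correlation in any of their cells" is to rank each row by the most extreme ratio it contains (a ratio of 1.0 meaning no correlation). The sketch below illustrates that interpretation with made-up values; the actual ranking metric is the adjusted correlation strength described in the statistical-significance section further below.

```python
import math

# Illustrative sketch: rank rows by the most extreme correlation ratio in any
# of their cells. Row names and ratios are made up.
rows = {
    "Kitchen":       [3.20, 0.40, 1.10],
    "Baseball park": [0.90, 1.05, 1.20],
}

def row_strength(ratios):
    # Distance from 1.0 on a log scale, so 2x over- and 2x under-representation count equally.
    return max(abs(math.log(r)) for r in ratios)

ordered = sorted(rows, key=lambda name: row_strength(rows[name]), reverse=True)
print(ordered)  # ['Kitchen', 'Baseball park']
```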

The search box is useful for testing a specific hypothesis. For example, we can search for images with the Kitchen label, which shows overrepresentation with the word woman and underrepresentation with the word man:

Interactive exploration

The Relations table is interactive. We can click on a row, a column, or a cell and immediately see the images behind the derived correlation. For example, we could click on the Kitchen row and see images that Cloud Vision labeled as Kitchen, while visually inspecting for any correlation with perceived gender. We can also click on the Kitchen & man cell to see images in that intersection:

Statistically significant correlations

To avoid surfacing spurious correlations, KYD uses a 95% confidence interval and computes a conservatively adjusted correlation strength. We compute this by taking the upper or lower 95% bound of the expected count, whichever one results in a weaker correlation. The table does not show this number directly. Instead, the adjusted correlation strength is used to color the cell and to sort the rows. Therefore cells with strong correlations that are not statistically significant won't be colored strongly, and their rows won't be ranked highly in the table.
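As a rough sketch of that adjustment: given the 95% bounds on the expected count (however KYD derives them, which is not covered here), the adjusted strength is the ratio obtained from whichever bound lies closer to "no correlation". All numbers below are illustrative.

```python
import math

# Hedged sketch of the conservative adjustment. Given the 95% bounds on the
# expected count, use whichever bound yields the weaker correlation, i.e.
# the ratio closest to 1.0 ("no correlation"). Bounds below are made up.
def adjusted_strength(observed: int, expected_low: float, expected_high: float) -> float:
    candidates = (observed / expected_high, observed / expected_low)
    return min(candidates, key=lambda ratio: abs(math.log(ratio)))

# Many observations -> narrow bounds -> the strong correlation survives.
print(round(adjusted_strength(observed=11, expected_low=100.0, expected_high=125.0), 2))  # ~0.11 (≈9x less)

# Few observations -> wide bounds -> the adjusted ratio stays near 1.0, so the
# cell is not colored strongly (cf. the 4-image example below).
print(round(adjusted_strength(observed=4, expected_low=2.6, expected_high=4.1), 2))       # ~0.98
```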

The example below showcases this: the Computer desk & female cell is not colored blue because there were only 4 observed images, and the correlation strength of 1.24x is not statistically significant: