Know Your Data
Frequently Asked Questions

How does KYD work?

KYD lets users explore a dataset through information that wasn’t originally part of it. The tool annotates the existing data with signals from machine learning models, such as Cloud Vision labels and Cloud Vision face detection, along with general image quality metrics (e.g. sharpness and brightness). We compute these annotations in an Apache Beam pipeline, then load the resulting database into a web server to enable interactivity. Statistics are computed on the fly each time you click.
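
The two stages described above (annotate offline, then compute statistics per interaction) can be illustrated with a minimal sketch. This is not KYD's code: the metric definitions (mean intensity for brightness, variance of a Laplacian response for sharpness), the function names, and the toy records are all assumptions chosen for illustration, and plain Python stands in for the Apache Beam pipeline and the web server.

```python
from collections import Counter
from statistics import mean, pvariance


def brightness(image):
    """Assumed metric: mean intensity of a grayscale image (rows of 0-255 values)."""
    return mean(pixel for row in image for pixel in row)


def sharpness(image):
    """Assumed metric: variance of a 4-neighbour Laplacian over interior pixels."""
    responses = [
        image[y - 1][x] + image[y + 1][x] + image[y][x - 1] + image[y][x + 1]
        - 4 * image[y][x]
        for y in range(1, len(image) - 1)
        for x in range(1, len(image[0]) - 1)
    ]
    return pvariance(responses)


def annotate(record):
    """Offline step: attach signals that were not in the original record."""
    return {
        **record,
        "brightness": brightness(record["image"]),
        "sharpness": sharpness(record["image"]),
    }


def label_counts(records, predicate):
    """Interactive step: recompute label counts for the current filter."""
    return Counter(r["label"] for r in records if predicate(r))


def flat(value, size=4):
    """Hypothetical test image with a single uniform intensity."""
    return [[value] * size for _ in range(size)]


def checkerboard(size=4):
    """Hypothetical test image alternating between 0 and 255."""
    return [[255 if (x + y) % 2 else 0 for x in range(size)] for y in range(size)]


# Toy dataset standing in for an image dataset with class labels.
dataset = [
    {"label": "cat", "image": flat(10)},
    {"label": "dog", "image": checkerboard()},
    {"label": "cat", "image": flat(200)},
]

# Annotations are computed once, ahead of time.
annotated = [annotate(record) for record in dataset]

# Each "click" is a new filter; the counts are recomputed from the annotations.
bright_counts = label_counts(annotated, lambda r: r["brightness"] > 100)
```

Because the annotations are stored alongside each record, changing the filter only requires re-running the cheap counting step, not the annotation step.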

Can I run Know Your Data on my own data?

Not yet. For now, we’re only serving Know Your Data for image-based datasets supported by the TensorFlow Datasets API.

Will KYD work for other types of data?

Yes. We are actively working on other modalities, including text datasets.

What about model analysis?

We are actively thinking about model analysis in KYD.

How do I report an issue?

For any bugs and feature requests related to the application, please file a bug. For any other questions, issues or concerns, email us at knowyourdata-feedback@google.com.

Will KYD add more signals?

KYD will continue to develop over time and we are very interested in making datasets more understandable by adding more signals. If you have a specific signal in mind, please let us know.

Why doesn’t KYD use automated signals for protected attributes (for example, perceived gender expression, age or skin tone)?

Although we are aware that signals on protected attributes can help tackle fairness issues in datasets, labeling individuals in datasets could lead to undesirable outcomes. We seek to avoid creating or reinforcing unfair bias and are looking into trusted, responsible approaches to adding more signals before proceeding.

Why aren’t all of the datasets supported by the TFDS API enabled in KYD?

KYD features datasets whose licenses allow us to serve them. We accept Apache 2.0, MIT and Creative Commons. If you don’t see a TFDS-supported dataset enabled in KYD, it likely does not have one of these licenses. If you are a dataset author interested in visualizing your dataset in KYD, please let us know.

How can I learn more about the performance of KYD’s machine learning models?

A key design principle of KYD is to always show the underlying data points to earn user trust. In addition, we offer details on KYD’s machine learning (ML) models. To understand their overall performance and limitations, you can consult the publicly available model cards.