One issue with Machine Learning in general, and Deep Learning in particular, is that the data a model is trained on has a significant impact on the resulting model. The problem comes when we don’t know the provenance (i.e. where it came from) of that data.

This can introduce difficult-to-detect bias if the training set has built-in bias. For example, if we were to crawl the web and use that as a training set, we would disproportionately include the viewpoints of folks who publish content on the web. These people are disproportionately members of the Global North (example).

There are suites of tools, like Apache Atlas, that help by tracking data lineage: how data is gathered, used, and transformed.
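
To make the idea concrete, here is a minimal, hypothetical sketch of what lineage tracking looks like: each derived dataset carries a record of where its input came from and which operation produced it, so the chain back to the original source can always be reconstructed. This is a toy illustration of the concept, not the Apache Atlas API; the class names, field names, and URL are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional

@dataclass
class LineageRecord:
    """One step in a dataset's history: what was done, to what, and when."""
    source: str                      # where the input came from (URL, path, upstream dataset name)
    operation: str                   # e.g. "crawl", "deduplicate", "filter-language"
    timestamp: str
    parent: Optional["LineageRecord"] = None  # the step that produced the input, if any

@dataclass
class Dataset:
    name: str
    records: list
    lineage: LineageRecord

    def transform(self, operation: str, fn: Callable[[list], list]) -> "Dataset":
        """Apply a transformation and append a new lineage step pointing back at this dataset."""
        return Dataset(
            name=f"{self.name}/{operation}",
            records=fn(self.records),
            lineage=LineageRecord(
                source=self.name,
                operation=operation,
                timestamp=datetime.now(timezone.utc).isoformat(),
                parent=self.lineage,
            ),
        )

def provenance(dataset: Dataset) -> list:
    """Walk the lineage chain from the latest step back to the original source."""
    chain, step = [], dataset.lineage
    while step is not None:
        chain.append(f"{step.operation} <- {step.source}")
        step = step.parent
    return chain

# Example: a web crawl followed by an English-only filter. The skew introduced
# by each step stays visible in the provenance chain.
raw = Dataset(
    name="web-crawl-2024",
    records=[{"lang": "en", "text": "..."}, {"lang": "sw", "text": "..."}],
    lineage=LineageRecord(
        source="https://example.org/crawl",  # hypothetical source URL
        operation="crawl",
        timestamp=datetime.now(timezone.utc).isoformat(),
    ),
)
english_only = raw.transform("filter-language", lambda rs: [r for r in rs if r["lang"] == "en"])
print(provenance(english_only))
```

Real lineage systems such as Atlas store these records centrally and make them searchable, but the core idea is the same: every derived dataset should be able to answer "where did this come from, and what was done to it?"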