One issue with Machine Learning in general, and Deep Learning in particular, is that the data a model is trained on has a significant impact on the resulting model. The problem comes when we don’t know the provenance (i.e. where it came from) of that data.

This can introduce difficult-to-detect bias if the training set has built-in bias. For example, if we were to crawl the web and use that as a training set, we would disproportionately include the viewpoints of folks who publish content on the web. These people are disproportionately members of the Global North (example).

There are suites of tools, like Apache Atlas, that help by tracking data lineage: how data is gathered, used, and transformed.
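
To make the idea concrete, here is a minimal, hypothetical sketch of what lineage tracking looks like: each derived dataset carries a record of where its input came from and which operation produced it, so the chain back to the original source can always be reconstructed. This is a toy illustration of the concept, not the Apache Atlas API; the class names, field names, and URL are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional

@dataclass
class LineageRecord:
    """One step in a dataset's history: what was done, to what, and when."""
    source: str                      # where the input came from (URL, path, upstream dataset name)
    operation: str                   # e.g. "crawl", "deduplicate", "filter-language"
    timestamp: str
    parent: Optional["LineageRecord"] = None  # the step that produced the input, if any

@dataclass
class Dataset:
    name: str
    records: list
    lineage: LineageRecord

    def transform(self, operation: str, fn: Callable[[list], list]) -> "Dataset":
        """Apply a transformation and append a new lineage step pointing back at this dataset."""
        return Dataset(
            name=f"{self.name}/{operation}",
            records=fn(self.records),
            lineage=LineageRecord(
                source=self.name,
                operation=operation,
                timestamp=datetime.now(timezone.utc).isoformat(),
                parent=self.lineage,
            ),
        )

def provenance(dataset: Dataset) -> list:
    """Walk the lineage chain from the latest step back to the original source."""
    chain, step = [], dataset.lineage
    while step is not None:
        chain.append(f"{step.operation} <- {step.source}")
        step = step.parent
    return chain

# Example: a web crawl followed by an English-only filter. The skew introduced
# by each step stays visible in the provenance chain.
raw = Dataset(
    name="web-crawl-2024",
    records=[{"lang": "en", "text": "..."}, {"lang": "sw", "text": "..."}],
    lineage=LineageRecord(
        source="https://example.org/crawl",  # hypothetical source URL
        operation="crawl",
        timestamp=datetime.now(timezone.utc).isoformat(),
    ),
)
english_only = raw.transform("filter-language", lambda rs: [r for r in rs if r["lang"] == "en"])
print(provenance(english_only))
```

Real lineage systems such as Atlas store these records centrally and make them searchable, but the core idea is the same: every derived dataset should be able to answer "where did this come from, and what was done to it?"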