Data Cleaning: DIY or outsource for reducing bias in AI?
“Data’s value hinges on diversity in both the sets of information that make it up and the perspectives necessary to comprehensively interpret it.” – Srujana Kaddevarmuth, Data Science & Value Realization Director at Walmart Labs
“The most challenging part of building a new AI system isn’t the algorithms or the models but rather collecting the right data and labeling it correctly so that a machine can begin training with and learning from it.” – Amy Webb, Futurist @Future Today Institute
Netflix’s popular documentary The Social Dilemma painted social media’s experimental algorithms as the villain, but AI biases take root before any algorithm runs. They originate in the data sets used to train machine learning models.
In a data science lifecycle, data collection (followed by data cleaning, or scrubbing) takes up a significant amount of production time, and this largely undesirable task is typically delegated to junior data engineers, analysts or data scientists.
- 79% of data science is collection and cleaning
- 69% of data scientists are men
- Globally, men outnumber women in AI research authorship
Furthermore, in a study accepted by the Navigating the Broader Impacts of AI Research workshop at the 2020 NeurIPS machine learning conference, the researchers conclude that biased predictions are mostly caused by imbalanced data, but that the demographics of engineers and AI research authors also play a role.
These facts set the context for a series of questions data science teams should ask before deciding whether to outsource data cleaning and annotation or keep it in-house.
- What was the original source of your data set?
- Who on your team collected or curated it? A few select individuals or hundreds?
- Is your data set representative of racial, gender and cultural diversity?
- Is the makeup of your data science team representative of racial, gender and cultural diversity?
If you answered “a few” or “no” to any of the questions above, chances are that both the data set and the team responsible for collecting and cleaning it are unintentionally infusing human bias into the data science input process and, consequently, into your AI project outcome.
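One way to start answering the representativeness question is to measure how each group is distributed in the data before any cleaning begins. The Python sketch below is only an illustration under stated assumptions: the `gender` field, the sample records and the 20% cutoff are all hypothetical, and a real audit would choose its own attributes and thresholds.

```python
from collections import Counter


def group_shares(records, field):
    """Return each group's share of the records for one demographic field."""
    counts = Counter(r.get(field, "unknown") for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: n / total for group, n in counts.items()}


def underrepresented(shares, min_share=0.2):
    """List groups whose share falls below min_share (an arbitrary cutoff)."""
    return sorted(g for g, s in shares.items() if s < min_share)


# Hypothetical sample records, for illustration only.
sample = (
    [{"gender": "female"}]
    + [{"gender": "male"}] * 8
    + [{"gender": "nonbinary"}]
)

shares = group_shares(sample, "gender")
flagged = underrepresented(shares, min_share=0.2)
print(shares)
print(flagged)
```

A skewed result here does not prove that the resulting model will be biased, but it flags where the imbalance described above may enter the pipeline.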
Consider outsourcing data annotation and cleaning tasks to an objective third-party company that offers a diverse staff and independent tools to help your company reduce AI bias before you begin data visualization and model selection, training, scoring and deployment.
Below are a few companies offering solutions and services to proactively mitigate AI bias within the earliest stages of the data science lifecycle.
Additionally, the firms below go one step further in the lifecycle and help identify & root out unfair bias in model predictions with tools and monitoring platforms.