Incident 12: Common Biases of Vector Embeddings

Description: Researchers from Boston University and Microsoft Research, New England, demonstrated gender bias in the most common techniques used to embed words for natural language processing (NLP).

Alleged: Microsoft Research, Boston University and Google developed an AI system deployed by Microsoft Research and Boston University, which harmed Women and Minority Groups.

Incident Stats

Incident ID
12
Report Count
1
Incident Date
2016-07-21
Editors
Sean McGregor

CSETv1 Taxonomy Classifications

Taxonomy Details

Harm Distribution Basis

sex

Sector of Deployment

professional, scientific and technical activities

CSETv0 Taxonomy Classifications

Taxonomy Details

Full Description

The most common techniques used to embed words for natural language processing (NLP) show gender bias, according to researchers from Boston University and Microsoft Research, New England. The primary embedding studied was a 300-dimensional word2vec embedding of words from a corpus of Google News texts, chosen because it is open-source and popular in NLP applications. After demonstrating gender bias in the embedding, the researchers show that the bias is associated with several geometric features of the embedding, and that these features can be used to define a bias subspace. This finding allows them to develop several debiasing algorithms.
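
To make the idea of a bias subspace more concrete, the following is a minimal sketch, not the researchers' implementation. It assumes a one-dimensional subspace estimated as the top principal component of centered differences between gendered word pairs, and a debiasing step that projects that direction out of a word vector. The 3-dimensional vectors, the word list, and the function names bias_direction and neutralize are all hypothetical and chosen only for illustration; the embedding studied has 300 dimensions.

```python
import numpy as np


def bias_direction(emb, pairs):
    """Estimate a single bias direction from word pairs (illustrative only)."""
    offsets = []
    for a, b in pairs:
        # Center each pair on its midpoint so only the pair difference remains.
        center = (emb[a] + emb[b]) / 2.0
        offsets.append(emb[a] - center)
        offsets.append(emb[b] - center)
    # The first right-singular vector is the top principal component.
    _, _, vt = np.linalg.svd(np.array(offsets), full_matrices=False)
    return vt[0] / np.linalg.norm(vt[0])


def neutralize(vec, direction):
    """Remove the component of vec along direction, then renormalize."""
    projected = vec - np.dot(vec, direction) * direction
    return projected / np.linalg.norm(projected)


# Hypothetical toy vectors, invented for this example (not real word2vec values).
emb = {
    "he": np.array([1.0, 0.2, 0.0]),
    "she": np.array([-1.0, 0.2, 0.0]),
    "man": np.array([0.9, 0.0, 0.3]),
    "woman": np.array([-0.9, 0.0, 0.3]),
    "programmer": np.array([0.4, 0.5, 0.6]),
}

g = bias_direction(emb, [("he", "she"), ("man", "woman")])
debiased = neutralize(emb["programmer"], g)

# Projection of the word onto the bias direction before and after debiasing.
before = float(np.dot(emb["programmer"], g))
after = float(np.dot(debiased, g))
```

In this toy setup the estimated direction lies along the first coordinate, so the debiased vector for "programmer" has no remaining component along it. The sign of the direction returned by the decomposition is arbitrary, so only the magnitude of a projection is meaningful.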

Short Description

Researchers from Boston University and Microsoft Research, New England, demonstrated gender bias in the most common techniques used to embed words for natural language processing (NLP).

Severity

Unclear/unknown

Harm Distribution Basis

Sex

AI System Description

Machine learning algorithms that create word embeddings from a text corpus.

Relevant AI functions

Unclear

AI Techniques

Vector word embedding

AI Applications

Natural language processing

Location

Global

Named Entities

Microsoft, Boston University, Google News

Technology Purveyor

Microsoft

Beginning Date

2016-01-01T00:00:00.000Z

Ending Date

2016-01-01T00:00:00.000Z

Near Miss

Unclear/unknown

Intent

Unclear

Lives Lost

No

arxiv.org · 2016

The blind application of machine learning runs the risk of amplifying biases present in data. Such a danger is facing us with word embedding, a popular framework to represent text data as vectors which has been used in many machine learning…
