Amazon Built an AI to Hire the Best People. It Taught Itself to Reject Women.
Part of “The AI You Don’t See” series by Akshay A. Walimbe
In 2014, Amazon decided to do what Amazon does best: automate.
(Based on reporting by Jeffrey Dastin, Reuters, October 10, 2018)
The company built an AI-powered recruiting tool. The idea was simple and, on paper, brilliant. Feed the machine ten years of resumes submitted to Amazon. Let it learn what a “good” candidate looks like. Then point it at new applicants and let it rate them from one star to five, just like Amazon product reviews.
The goal was to find the best talent faster than any human recruiter could. The machine would scan resumes, identify patterns that predicted success, and deliver a shortlist. No human bias. No gut feelings. Just data.
By 2015, the team had a problem. The machine had learned something from a decade of resumes that nobody had explicitly taught it.
It had taught itself that men were better candidates than women.
Let me walk you through how this happened, because it is not obvious, and that is precisely what makes it dangerous.
Amazon’s tech workforce, like most of the technology industry globally, has been predominantly male. The tech industry overall is roughly 70 per cent male, and has been for decades. So when the AI analysed ten years of resumes from people who were hired and performed well at Amazon, the overwhelming majority of those “successful” resumes came from men.
The machine did what it was designed to do. It found patterns. And the pattern it found was: successful candidates tend to be male.
But here is the critical part. Nobody told the machine to look at gender. Gender was not an input field. The machine was never given a column labelled “male” or “female.” It figured it out on its own.
How?
According to Reuters, the system crawled through approximately 50,000 key terms and attributes across 500 different computer models, looking for patterns that predicted a five-star candidate. In that ocean of data, it found proxy variables: terms and signals that correlated with gender without directly stating it.
The word “women’s” became a penalty. If your resume mentioned “women’s chess club captain,” you were downgraded. If it said “women’s rugby team,” downgraded. The word itself was neutral (it could appear in any context), but in the training data it correlated with female candidates, so the system learned to treat it as a negative signal.
Graduates of two specific all-women’s colleges (whose names were never publicly disclosed by Reuters’ sources) were systematically scored lower. Not because the education was worse. But because attending a women’s college correlated with being female, and being female correlated with not being in the historical pool of “successful” Amazon hires.
Reuters also reported that the system learned to favour certain verbs. Words like “executed” and “captured,” the kind of action verbs more common on male engineers’ resumes, became positive signals. Not because these words indicate better engineering. But because they appeared more frequently in the resumes of the people who had historically been hired, and those people were mostly men.
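To make the mechanism concrete, here is a toy sketch in Python. It is not Amazon's system, whose code was never published. The eight resumes, their hire labels, and the names `HISTORY`, `term_weights`, and `score` are all invented for illustration, and the scoring method (a smoothed log-odds weight per term) is a stand-in chosen for simplicity. Gender never appears as a field, yet the weight learned for “women’s” comes out negative simply because the word shows up more often among the historically rejected resumes.

```python
import math
from collections import Counter

# Made-up toy history: (resume text, 1 if historically hired else 0).
# Gender is never recorded; the skew lives only in who was hired.
HISTORY = [
    ("executed migration captured requirements chess club captain", 1),
    ("executed rollout rugby team", 1),
    ("captured metrics executed launch", 1),
    ("led project chess club captain", 1),
    ("led migration executed rollout women's college graduate", 1),
    ("led project women's chess club captain", 0),
    ("built service women's rugby team", 0),
    ("built pipeline women's college graduate", 0),
]


def term_weights(history, smoothing=1.0):
    """Smoothed log-odds of each term appearing in hired vs. rejected resumes."""
    hired, rejected = Counter(), Counter()
    n_hired = n_rejected = 0
    for text, label in history:
        terms = set(text.split())  # count each term once per resume
        if label:
            hired.update(terms)
            n_hired += 1
        else:
            rejected.update(terms)
            n_rejected += 1
    weights = {}
    for term in set(hired) | set(rejected):
        p_hired = (hired[term] + smoothing) / (n_hired + 2 * smoothing)
        p_rejected = (rejected[term] + smoothing) / (n_rejected + 2 * smoothing)
        weights[term] = math.log(p_hired / p_rejected)
    return weights


def score(text, weights):
    """Sum the learned weights of the distinct terms in a new resume."""
    return sum(weights.get(term, 0.0) for term in set(text.split()))


w = term_weights(HISTORY)
print("weight for women's:", round(w["women's"], 2))
print("weight for executed:", round(w["executed"], 2))
print("chess club captain:", round(score("chess club captain", w), 2))
print("women's chess club captain:", round(score("women's chess club captain", w), 2))
```

Two resumes that differ by a single word receive different scores, and the word that makes the difference is the one that happens to track gender in the history.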
Amazon’s engineers were not naive. They spotted the problem, and they tried to fix it.
A dedicated team at Amazon’s Edinburgh engineering hub, which grew to about a dozen people, went to work. Their first approach was direct: tell the machine to ignore the biased terms. Make “women’s” a neutral word. Remove the penalty for women’s colleges. Strip out the gendered verb patterns.
It did not work.
Here is why, and this is the part that should genuinely worry you.
With 50,000 features in the model, gender is not encoded in one place. It is woven through the data like a thread through fabric. Pull out one thread, and the pattern shifts but does not disappear. Remove the word “women’s,” and the system finds another proxy. Maybe it starts weighing certain hobbies differently. Maybe certain writing styles. Maybe certain career trajectory patterns that happen to differ between men and women.
The engineers tried to make the system gender neutral. But the data it was trained on was not gender neutral. The history it was learning from was not gender neutral. The entire ten-year record of who Amazon hired and promoted reflected the existing gender imbalance in tech. And no amount of surgical term removal could extract that bias from the bloodstream of the data.
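A small Python sketch shows why deleting one term is not enough. Everything in it is an assumption made for illustration: the eight rows of data, the extra proxy terms “netball” and “coordinated,” the group labels (visible only to the auditor, never to the scorer), and the helper names `learn_weights` and `mean_score_gap`. The scorer is a simple smoothed log-odds model, not Amazon's. It is trained twice, once with every term and once with “women’s” banned, and the average score gap between the two groups is compared.

```python
import math
from collections import Counter

# Made-up toy data: (terms on the resume, hired?, group).
# The group label is used only to audit the result; the scorer never sees it.
ROWS = [
    ({"executed", "rugby"}, 1, "m"),
    ({"executed", "chess"}, 1, "m"),
    ({"captured", "rugby"}, 1, "m"),
    ({"captured", "chess"}, 1, "m"),
    ({"women's", "netball", "coordinated"}, 0, "f"),
    ({"women's", "chess", "coordinated"}, 0, "f"),
    ({"netball", "coordinated"}, 0, "f"),
    ({"women's", "executed", "netball"}, 1, "f"),
]


def learn_weights(rows, banned=frozenset()):
    """Smoothed log-odds per term, ignoring any term listed in `banned`."""
    hired, rejected = Counter(), Counter()
    n_hired = sum(1 for _, label, _ in rows if label)
    n_rejected = len(rows) - n_hired
    for terms, label, _ in rows:
        (hired if label else rejected).update(terms - banned)
    weights = {}
    for term in set(hired) | set(rejected):
        p_hired = (hired[term] + 1) / (n_hired + 2)
        p_rejected = (rejected[term] + 1) / (n_rejected + 2)
        weights[term] = math.log(p_hired / p_rejected)
    return weights


def mean_score_gap(rows, weights):
    """Average score of group 'm' minus average score of group 'f'."""
    def average(group):
        scores = [
            sum(weights.get(term, 0.0) for term in terms)
            for terms, _, g in rows
            if g == group
        ]
        return sum(scores) / len(scores)
    return average("m") - average("f")


gap_before = mean_score_gap(ROWS, learn_weights(ROWS))
gap_after = mean_score_gap(ROWS, learn_weights(ROWS, banned=frozenset({"women's"})))
print("score gap, all terms:      ", round(gap_before, 2))
print("score gap, women's removed:", round(gap_after, 2))
```

In this toy data the gap shrinks when the word is removed but stays well above zero, because the other terms that travel with it still carry the signal.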
As the American Civil Liberties Union (ACLU) put it in their analysis of the case: “If you simply ask software to discover other resumes that look like the resumes in a ‘training’ data set, reproducing the demographics of the existing workforce is virtually guaranteed.”
By the start of 2017, Amazon’s executives had seen enough. They lost confidence that the tool could ever be made gender neutral. The Edinburgh team was disbanded.
Reuters broke the story in October 2018, based on interviews with five people familiar with the project. By then, the recruiting tool had been reduced to a “much watered down version” that did little more than cull duplicate profiles from databases. The star rating system, the resume screening, the automated shortlisting: all of it was gone.
Amazon had spent years and significant resources building a tool that was supposed to remove human bias from hiring. Instead, it automated human bias at scale.
To be fair, Amazon did the right thing in shutting the tool down. Reuters reported that a new team was subsequently formed in Edinburgh to attempt automated employment screening again, this time with a focus on diversity, though details of that effort have not been publicly disclosed.
Now, I want you to think about what this means outside of Amazon.
Amazon had something most companies do not: the technical talent to diagnose the problem, the resources to spend years trying to fix it, and the integrity to shut it down when they could not.
What about the companies that do not have a dozen-person Edinburgh team? What about the startups deploying off-the-shelf AI hiring tools without the engineering capability to audit them? What about the HR departments using automated resume screeners right now, today, without any awareness that the tool might be penalising candidates for attending a women’s college or listing a women’s sports team on their resume?
Because Amazon is not the only company that has built a hiring algorithm. It is just the one that got caught, investigated, and publicly reported. The Reuters investigation exists because insiders talked. How many other systems are running right now with similar biases that nobody has tested for?
There is a deeper lesson here that goes beyond hiring.
The Amazon story reveals a fundamental truth about AI: the machine does not know what is fair. It knows what is frequent. It does not understand justice. It understands patterns. And when the patterns in your historical data reflect decades of inequality, the machine will learn inequality and call it accuracy.
The AI was not malicious. It did not hate women. It did not have opinions about gender. It had training data. And that training data, ten years of resumes from a male-dominated industry, was a mirror. The machine looked at Amazon’s past and predicted Amazon’s future as more of the same.
This is what bias looks like in the age of AI. It does not wear a label. It does not announce itself. It hides inside 50,000 features and proxy variables and statistical correlations. It penalises you not for being a woman, but for using a word, attending a college, or writing your resume in a style that correlates with being a woman. And the end result is the same.
Rachel Goodman of the ACLU called it plainly: “These tools are not eliminating human bias; they are merely laundering it through software.”
Amazon shut their tool down. That was the right call. But shutting it down is not the same as solving the problem.
Across India and around the world, AI-powered hiring tools are being deployed at a pace that far outstrips our ability to audit them. Companies adopt them because they are efficient. They screen thousands of resumes in seconds. They promise objectivity. They promise to remove the biases of individual human recruiters.
But if the training data carries bias, the tool carries bias. And unlike a human recruiter, whose biases can be challenged, trained out, or held accountable, an AI system’s bias is invisible until someone specifically tests for it.
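What might “specifically testing for it” look like? One common starting point is to compare shortlisting rates across groups. The Python sketch below is a minimal illustration, not a complete audit. The sample counts and the names `AUDIT`, `selection_rates`, and `impact_ratio` are invented, and it assumes the auditor knows each applicant's group even though the screening tool does not. The 0.8 threshold is the “four-fifths” rule of thumb from US employment-selection guidelines; treat it as a prompt to investigate, not as proof that a tool is fair or unfair.

```python
from collections import defaultdict


def selection_rates(decisions):
    """Map each group to the share of its applicants who were shortlisted.

    `decisions` is an iterable of (group, shortlisted) pairs.
    """
    totals, picked = defaultdict(int), defaultdict(int)
    for group, shortlisted in decisions:
        totals[group] += 1
        picked[group] += int(shortlisted)
    return {group: picked[group] / totals[group] for group in totals}


def impact_ratio(decisions):
    """Lowest group selection rate divided by the highest."""
    rates = selection_rates(decisions)
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else float("nan")


# Made-up audit sample: 100 applicants per group, as scored by some screener.
AUDIT = (
    [("men", True)] * 30 + [("men", False)] * 70
    + [("women", True)] * 15 + [("women", False)] * 85
)

ratio = impact_ratio(AUDIT)
print("selection rates:", selection_rates(AUDIT))
print("impact ratio:", round(ratio, 2))
if ratio < 0.8:
    print("Below the four-fifths rule of thumb: investigate further.")
```

A check like this only works if someone collects the group information and runs it, which is exactly the step that is easy to skip.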
Which brings me to the question I cannot stop thinking about.
Right now, somewhere in India, an AI tool is screening resumes for a job you might apply for. It was trained on historical hiring data. That data reflects years of patterns: who got hired, who got promoted, who stayed. Those patterns reflect every structural inequality that exists in the workforce: gender, caste, class, the type of school you attended, the language you write in, the PIN code on your address.
The machine is learning from all of it. It is finding patterns you and I cannot see. And it is making decisions.
How do you test if your AI is fair?
I have written a book about exactly this: how AI and automated systems make decisions about your life, where accountability disappears, and what we can do about it. If you want to know more about the book or order a copy, you can do so here: https://akshaywalimbe.com/beyond-bias/