Akshay Walimbe

500 Million Indians Are Invisible to AI

By Akshay A. Walimbe

There is a woman in a village in Jharkhand. She is sixty years old. She has spent her life as a farm worker: planting rice, harvesting maize, carrying loads that most of us cannot imagine. Her hands show every year of that work. The fingerprint ridges that you and I take for granted have been sanded down by decades of soil, brick, and rough grain. Smooth. Almost featureless.

When her state government rolled out Aadhaar-linked biometric authentication for food rations, she walked to the local fair price shop, pressed her thumb against the scanner, and was denied. Authentication failed. She tried again. Failed. She tried with a different finger. Failed. The machine could not read what her life had worn away.

The system had a backup. An OTP (a one-time password) could be sent to her mobile phone. But she does not own a mobile phone. The bypass for the biometric failure requires a device that 500 million rural Indians do not have.

She walked home without rations.

This is not an isolated story. According to Indian government data and field research by economists Jean Dreze and Reetika Khera (published in the Economic and Political Weekly), the biometric failure-to-match rate in Jharkhand was 49 percent. Nearly half of all attempts to authenticate failed. In Rajasthan, it was 37 percent. Field researchers found that biometrics “consistently failed for people who depend on manual labour for a living.” The elderly. The people who wash dishes, lay bricks, handle chemicals, work salt fields. The people whose bodies carry the evidence of hard work. The machine reads that evidence as absence.

Between 2015 and 2018, the Right to Food Campaign investigated 57 starvation deaths across nine Indian states. At least 19 were directly linked to Aadhaar problems. An 11-year-old girl named Santoshi Kumari, from Simdega district in Jharkhand, died of starvation in September 2017 after her family’s ration card was cancelled because it was not linked to Aadhaar. The local dealer refused them rations for six months. In the twelve months after her death, at least 37 more starvation deaths were recorded. Thirteen were linked to Aadhaar authentication failures.

Let me be clear about something. Aadhaar has also delivered genuine benefits. It has helped reduce leakages and corruption in subsidy delivery. Direct benefit transfers linked to Aadhaar have saved the government significant amounts by eliminating ghost beneficiaries. UIDAI has since introduced face authentication as an alternative to fingerprints, and exception handling mechanisms exist on paper. The system was not built with bad intent.

But in practice, those exception handling mechanisms are often unavailable at rural ration shops. The backup OTP system requires a mobile phone. The face authentication requires infrastructure that many villages lack. The system was built to prevent fraud. For the most vulnerable, it ended up preventing food.

The Map With Half the Country Missing

According to the IAMAI Kantar Internet in India 2024 report, India has 886 million active internet users. That number sounds impressive until you look at who is not included. Approximately 630 million Indians remain offline. And 500 million of those offline Indians live in rural areas.

Let me paint the picture with numbers.

According to TRAI data, urban internet penetration in India exceeds 111 percent. That is not a typo. It means many urban Indians have multiple internet connections. Rural internet penetration is between 35 and 45 percent, depending on which study and measurement methodology you trust — TRAI reports 44.99 percent based on subscriber data, while other estimates using active user data come closer to 35 percent. The gap is not a crack. It is a canyon.

And the people on one side of that canyon are almost entirely absent from the data that trains AI systems.

Think about what AI training data looks like. It comes from internet activity: searches, posts, comments, reviews, transactions, clicks, locations, conversations. Every major AI model in the world was trained primarily on internet text. The people who generate that text are overwhelmingly urban, English-speaking, relatively affluent, and digitally literate. In India’s specific context, researcher Nithya Sambasivan and colleagues at Google Research documented (in a paper published at ACM FAccT 2021) that AI training data comes disproportionately from “middle class Indian men who have internet access,” while over half the population (primarily women, rural communities, and tribal communities) lacks access to the internet entirely.

If you are online, you exist to the algorithm. Your preferences, your language patterns, your behaviours, your needs are all captured, processed, learned. If you are offline, you are invisible. You contribute no data. You shape no model. You train no algorithm. And yet, increasingly, algorithms are being deployed to make decisions about your life.

This is not a technology problem. It is a representation problem. And in a democracy of 1.4 billion people, it is a crisis.

When Invisible People Meet Visible Algorithms

Here is where the data gap becomes dangerous.

India’s government and private sector are deploying AI systems at breathtaking speed. AI in healthcare. AI in agriculture. AI in lending. AI in welfare distribution. AI in policing. AI in education. These systems do not wait for data equity before they go live. They deploy now, with the data they have. And the data they have is a portrait of urban, connected India masquerading as a portrait of the whole country.

Consider healthcare AI. If a diagnostic model is trained primarily on data from urban hospitals — patients with access to imaging equipment, electronic health records, specialist consultations — how does it perform when deployed in a rural primary health centre? The disease presentations are different. The available diagnostic tools are different. The patient demographics are different. The nutritional baselines, the occupational exposures, the environmental factors — all different. A model that has never seen the patterns of rural India will misread them. Not out of malice. Out of ignorance. You cannot recognise what you have never been taught.

Consider agricultural AI. India is an agricultural country where, according to World Bank data, roughly 42 percent of the workforce is in farming. AI tools that advise on crop selection, pest management, irrigation timing, and market prices are being developed rapidly. But the training data for these tools comes largely from digitally connected farms — larger operations with smartphones, internet access, and the literacy to use apps. The smallholder farmer with two acres, no smartphone, and generations of local knowledge about soil and weather? Invisible. The algorithm will give advice tuned for a reality that is not theirs.

Consider credit scoring. While the Jan Dhan Yojana scheme has brought hundreds of millions of Indians into the formal banking system, a significant portion of those accounts remain inactive or barely used. The World Bank Global Findex 2021 data showed that while roughly 78 percent of Indian adults had accounts, about 35 percent of account holders had inactive accounts — meaning access to banking on paper does not equal meaningful financial inclusion in practice. CIBIL maintains credit records for over 600 million individuals, but those records primarily cover people in the formal economy — salaried workers with bank accounts and documented repayment histories. When fintech companies try to fill the gap with “alternative credit scoring” powered by AI, they use proxy data — phone model, PIN code, app usage, digital transaction patterns. Every one of these proxies is biased against rural, offline populations. If your digital footprint is thin because you live in a village without reliable internet, the algorithm reads that as risk. Not as the absence of opportunity.

The Numbers Behind the Silence

Here is what the data gap actually looks like.

Only 27 percent of rural internet users in India are digitally literate. That means even among the rural Indians who are online, three out of four struggle to use the internet effectively for the digital services that generate the data AI learns from.

Fifty-eight percent of female internet users in rural India are shared-device users. They do not have their own phones. They borrow. That means their usage is fragmented, inconsistent, and often invisible to the algorithms tracking individual user behaviour.

According to multiple digital India reports, the poorest states (Bihar at roughly 43 percent internet penetration, Uttar Pradesh at roughly 46 percent, Jharkhand at roughly 50 percent) are the same states where welfare exclusion through Aadhaar is worst, where healthcare access is most limited, and where the need for well-functioning, AI-mediated government services is greatest. Compare that with Kerala at 72 percent or Goa at 71 percent. The places that need good AI the most are the places with the least data to build it on.

There is reason for hope in the numbers. According to the IAMAI Kantar report, rural India now leads in absolute numbers of new internet users: it has been adding more new users than urban India for four consecutive years. The smartphone gap is closing. The raw access gap is narrowing. Women now make up 47 percent of all internet users, the highest proportion to date.

But the quality and depth of digital engagement, the kind that generates meaningful training data, remains overwhelmingly urban. Watching a YouTube video on a shared phone is not the same as maintaining a digital financial history, generating search data, or producing the text and transactions that AI models feed on.

The gap is not just about who is online. It is about who is online in a way that the machines can learn from.

The Compounding Problem

Here is what makes this not just unfair but dangerous.

When AI systems trained on urban data are deployed for everyone, they perform worse for rural populations. That poor performance reduces trust. Reduced trust means rural populations engage less with digital systems. Less engagement means less data from rural populations in the next training cycle. Less rural data means the next generation of models is even more skewed toward urban patterns.

This is a feedback loop. And like all feedback loops, it accelerates over time. The gap does not stay the same. It widens.
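To see why the gap widens rather than holds steady, the loop can be sketched as a toy simulation. This is a minimal illustration, not a model of any real system: the function name, the starting data share, the population share, and `trust_sensitivity` are all hypothetical, chosen only to show the direction of the effect.

```python
def simulate_data_share(cycles, rural_data_share, rural_population_share, trust_sensitivity):
    """Return the rural share of training data after each training cycle.

    Toy model with hypothetical parameters; not fitted to any real data.
    """
    shares = [rural_data_share]
    share = rural_data_share
    for _ in range(cycles):
        # Under-representation: how far the data share falls short of the population share.
        gap = max(0.0, rural_population_share - share)
        # Worse performance lowers trust, so rural users generate less data next cycle.
        engagement = max(0.0, 1.0 - trust_sensitivity * gap)
        rural_volume = share * engagement
        urban_volume = 1.0 - share  # assumed unchanged between cycles
        share = rural_volume / (rural_volume + urban_volume)
        shares.append(share)
    return shares


if __name__ == "__main__":
    # Hypothetical starting point: rural users are half the population
    # but supply only 30 percent of the training data.
    for cycle, share in enumerate(simulate_data_share(5, 0.30, 0.50, 0.5)):
        print(f"cycle {cycle}: rural share of data = {share:.3f}")
```

In this sketch the rural share falls in every cycle as long as it starts below the population share, while a group that starts at parity stays there. Real systems are far messier, but the direction is the point.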

We saw this with Aadhaar. The biometric system was designed with fingerprint and iris recognition trained on data that worked well for urban, younger, white-collar populations. When deployed to rural manual labourers whose fingerprints were worn down, it failed at rates approaching 50 percent. Instead of rebuilding the system to work for its hardest-to-serve users, the workaround relied on OTPs, which required mobile phones that the most excluded populations did not have. The exception handling mechanism excluded the same people the primary system failed.

This is what happens when invisible populations meet systems built by people who have never been invisible. The engineers are not cruel. They are simply building for the world they know. And the world they know has reliable internet, smartphones, and fingers that scan cleanly.

The Question That Should Keep Us Up at Night

India is building its digital future at an extraordinary pace. Digital public infrastructure (Aadhaar, UPI, DigiLocker, ONDC) is the backbone of a vision that places India at the forefront of the digital world. AI is the next layer of that vision. AI for governance, AI for commerce, AI for health, AI for education, AI for justice. And the government’s investments in rural connectivity (BharatNet, PM-WANI, subsidised smartphones) are real and meaningful efforts to close the access gap.

But every AI system is only as good as the data it learns from. And right now, that data disproportionately represents the connected urban half of the country. The half that already has access, already has voice, already has power. The gap is narrowing, but the AI systems being built today are being built with the data that exists today — not the data that will exist in five years.

What about the other half?

Five hundred million Indians who are offline. Millions more who are online but digitally illiterate. Women who share devices. Manual labourers whose bodies fail the machines designed to identify them. Entire communities whose languages, needs, patterns, and realities are absent from the datasets that train the systems being built to serve them.

We are building the infrastructure of an AI-powered nation. We are building it fast. We are building it with ambition and energy and genuine intent.

But we are building it with a map that has half the country missing.

If you are not in the data, you do not exist to the algorithm.

I have written a book about exactly this: how AI and automated systems make decisions about your life, where accountability disappears, and what we can do about it. If you want to know more about the book or order a copy, you can do so here: https://akshaywalimbe.com/beyond-bias/
