What’s in an address? Shakespeare might have pondered his famous question differently had he been a data scientist in the age of artificial intelligence and machine learning. Especially so if he were a data scientist at Flipkart. Businesses operating in the e-commerce and last-mile logistics sectors in India may empathize with this problem statement, for inaccurate and inconsistently formatted addresses result in massive revenue setbacks every year. In an interview with Flipkart Stories published in December 2017, in which he articulated Flipkart’s “AI For India” vision, Flipkart co-founder and executive chairman Sachin Bansal exhorted practitioners working in the domains of artificial intelligence, including deep learning and machine learning, to rise to the challenge of solving complex and uniquely Indian problems. Not surprisingly, among the plethora of problem statements challenging data scientists and AI-ML engineers is that posed by the complexity of addresses in India.
In a 2015 research paper titled “Geographical address classification without using geolocation coordinates”, archived at the ACM (Association for Computing Machinery) Digital Library, Dr T Ravindra Babu and other data scientists at Flipkart recognize that addressing systems in India do not yet have “an organized form where they can directly be correlated with a geolocation”. As far as published literature goes, they argue, this is a problem unique to India.
“In the countries where there exists a structure in addresses both in terms of definitions and the way people that follow them, the solution to the problem is relatively less challenging,” the authors write. However, they add, in countries where people do not follow the structure, owing to diversity and literacy levels, the challenge grows immensely. It was against the background of such challenges that this team of data scientists working at Flipkart developed and successfully demonstrated a working system for address classification.
The significance of their work goes beyond the realms of academia into applied business practice. By integrating novel contributions from their research — which included probabilistic separation of compound words, data-dependent machine learning-based dictionary models, and methods of detecting and eliminating fraudulent addresses — the team of Flipkart data scientists arrived at a robust working address classification system that would vastly improve the efficiency of last-mile delivery in logistics besides detecting and eliminating address-based fraudulent practices. Arguably, this AI innovation for India has far-reaching consequences.
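To make the idea of probabilistic separation of compound words concrete, here is a minimal sketch in Python. It is not the method from the paper: it assumes a simple unigram model and picks the highest-probability split with dynamic programming. The word counts, the unknown-word penalty and the function names are all hypothetical, chosen only for illustration.

```python
import math

# Hypothetical unigram counts; a real system would estimate these
# from a large corpus of delivery addresses.
WORD_COUNTS = {
    "hsr": 50, "layout": 120, "main": 200, "road": 300,
    "cross": 150, "near": 80, "temple": 60,
}
TOTAL = sum(WORD_COUNTS.values())


def word_logprob(word):
    """Log-probability of a single candidate word under the unigram model."""
    count = WORD_COUNTS.get(word)
    if count is None:
        # Assumed penalty for unknown strings; it grows with their length.
        return math.log(1.0 / (TOTAL * 10 ** len(word)))
    return math.log(count / TOTAL)


def split_compound(text, max_word_len=12):
    """Return the most probable segmentation of text into words."""
    n = len(text)
    # best[i] holds (best score for text[:i], start index of the last word).
    best = [(0.0, 0)] + [(-math.inf, 0)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            score = best[start][0] + word_logprob(text[start:end])
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

With these toy counts, a run-together token such as "hsrlayout" is separated into "hsr" and "layout", while a string the model has never seen is left in one piece.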
PIN codes are inadequate for Digital India
By definition, postal addresses are required to be consistent to enable correct and efficient delivery of mail or shipments, including such important parcels as passports, bank statements, utility bills, domestic gas connections, etc. In an ideal world, addresses should be structured in a hierarchical manner; i.e. by country, state, district, city/town, locality, and so on. However, postal addresses in India are neither consistent nor do they accurately convey or represent geolocation.
Locating an address in India can be a bit of an adventure in itself. In the absence of a formal geospatial classification based on latitude-longitude coordinates, postal addresses are organized by Postal Index Number, or PIN, colloquially known as pincodes or PIN codes. However, each individual PIN code may correspond to an area as large as 50 square kilometers. The PIN system, which was implemented in 1972, is inadequate to serve consumers who access services over the internet and cellular data networks.
MapMyIndia, the location-based technology solutions company in which Flipkart has invested, attempts to overcome the problems of PIN code-based addresses with a digital address system called eLoc. This is a six-character digital address that uniquely identifies any location in India based on digital parameters and deep mapping technology. MapMyIndia claims that the last-mile accuracy of this system is more precise than that offered by commonly used consumer mapping services.
One address, many voices, multiple versions
Even as technology helps close the gap, it has a long way to go before it can solve the problem of non-standard address structures. Street addresses in Lutyens’ Delhi, for instance, follow a different structure compared to those in other localities of the city such as Chandni Chowk or Hauz Khas Village. Some areas of Bengaluru, for example, use labels like “blocks”, “mains” and “crosses” to designate street addresses while others in the same city follow nomenclature such as “phases” and “sectors”, with or without house numbers, house names or street names. When new suburbs or villages are absorbed into the ever-expanding urban meshwork, they may not necessarily adhere to traditional naming conventions.
The majority of Indian place names are translated phonetically into English from local languages. Official place names may differ from conventional names, and the resulting address may contain highly variable spelling patterns. In addition, they may include typographical errors, or clerical errors introduced by errant gazetteers.
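One common way to cope with such spelling variation is to match each user-typed name against a list of known names and accept the closest one above a similarity threshold. The sketch below uses Python's standard `difflib` module for this. The locality list, the 0.8 cutoff and the function name are assumptions made for illustration, not details of Flipkart's system.

```python
import difflib

# Hypothetical list of canonical locality names.
KNOWN_LOCALITIES = ["koramangala", "indiranagar", "whitefield", "jayanagar"]


def normalize_locality(raw, cutoff=0.8):
    """Map a user-typed locality to its closest known name, or None."""
    # Lowercase and drop spaces so "Indira Nagar" and "indiranagar" compare equal.
    key = "".join(raw.lower().split())
    matches = difflib.get_close_matches(key, KNOWN_LOCALITIES, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

Under these assumptions, a misspelling such as "Kormangala" resolves to "koramangala", while an unrelated string returns `None` rather than a forced match. The cutoff trades missed matches against wrong ones and would need tuning on real data.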
A headache for last-mile logistics
To cut a long story short, in the absence of clear and consistent naming or zoning conventions, correctly resolving street addresses poses a considerable challenge for data scientists. It is also a frustrating headache for businesses whose competitiveness and profitability hinge on the speed and efficiency of last-mile logistics.
Consider the effect that inconsistent addresses have on e-commerce. For one, delayed delivery erodes customer satisfaction. Improper addresses cause rerouting and return of shipments, placing a high load on the logistics network, not to mention incurring a high cost on logistics, transport, customer support and complaint resolution. This additional cost is borne by the business, and it eats into profits.
In the Flipkart context, delivery addresses are user-generated. Customers input addresses by filling in the required online forms during the user registration process on the Flipkart platform. A chief address determinant is the user’s PIN code, which is captured and mapped using internal systems. However, owing to a lack of clarity, customers may enter incorrect PIN codes. Or they may commit spelling errors while filling in other form fields. In many cases, the addresses entered are incomplete, with one or more form fields missing.
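Some of these input problems can be flagged before an address ever reaches a classifier. As a minimal sketch, the check below reports missing fields and PIN codes that are not six digits beginning with a non-zero digit. The field names and the function are hypothetical and do not reflect Flipkart's actual forms.

```python
import re

# Indian PIN codes are six digits and do not start with zero.
PIN_PATTERN = re.compile(r"[1-9][0-9]{5}")

# Hypothetical form fields, chosen for illustration only.
REQUIRED_FIELDS = ("name", "street", "locality", "city", "pin_code")


def find_address_problems(form):
    """Return a list of problems found in a submitted address form."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = form.get(field) or ""
        if not str(value).strip():
            problems.append(f"missing {field}")
    pin = str(form.get("pin_code") or "").strip()
    # Only check the format when a PIN was actually supplied.
    if pin and not PIN_PATTERN.fullmatch(pin):
        problems.append("malformed pin_code")
    return problems
```

A format check of this kind catches only malformed input. A well-formed PIN code that belongs to the wrong area passes it, which is part of why the address text itself has to be classified.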
There are sociological factors to take into account here. Adult literacy rates vary widely across India. The majority of transactions on digital applications are conducted in English. Literacy in the mother tongue is a limiting factor in on-boarding the majority of Indians onto digital commerce, while the much lower percentage of Indians who are literate in English adds a further layer of inconsistent data.
Addressing the problem with ML
To get to the heart of the matter, the Flipkart data scientists visited the company’s warehouses and delivery hubs to observe how personnel at the sorting facility understood and classified addresses. They noted a deep application of tacit knowledge in the process. The sorting personnel played a role in identifying addresses intended for each Delivery Hub based on PIN code and location information. Further, the eKart wishmasters at each Delivery Hub, who are responsible for last-mile delivery of a shipment to the customer, also had a strong familiarity with the addresses on their routes.
“Machine learning had to understand how their unique knowledge could be converted into a model,” says Dr Ravindra Babu, adding that at the outset the brief was rather vague. “My engineering leader only told me we needed to do something about addresses.”
Here was the challenge that lay before the data scientists: Can machine learning be used to understand and make sense of these complex and highly inconsistent addresses to improve the delivery efficiency of last-mile logistics? Further, could this model be applied to detect and eliminate address fraud?
Flipkart’s data scientists set out to approximate the mental model (the labels that the sorting personnel employed to classify an address correctly) and build a machine learning model that would attempt to learn incrementally from these labels and patterns to reduce the error rate.
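The general shape of such a model, one that learns from labeled addresses and can be updated one example at a time, can be sketched with a small naive Bayes classifier over address tokens. This is an illustration only: the article does not say which algorithm the team used, and the class name, hub labels and whitespace tokenization are assumptions.

```python
import math
from collections import Counter, defaultdict


class TokenNaiveBayes:
    """Multinomial naive Bayes over address tokens, updated one example at a time."""

    def __init__(self):
        self.label_counts = Counter()             # addresses seen per label
        self.token_counts = defaultdict(Counter)  # token frequencies per label
        self.vocab = set()

    def learn(self, address, label):
        """Incorporate a single labeled address into the counts."""
        tokens = address.lower().split()
        self.label_counts[label] += 1
        self.token_counts[label].update(tokens)
        self.vocab.update(tokens)

    def predict(self, address):
        """Return the most probable label, or None if nothing has been learned."""
        tokens = address.lower().split()
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            score = math.log(count / total)
            # Add-one smoothing so unseen tokens do not zero out a label.
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In use, each address labeled by sorting staff would be passed to `learn` along with its delivery hub, and `predict` would then suggest a hub for a new address. Because the model is just a set of counts, a corrected label can be added at any time without retraining from scratch.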
The machine learning solution that the Flipkart data scientists and engineers built was, as Dr Ravindra Babu says, “our own innovation to solve a very Indian problem.” Working with the field executives and supervisors at the hub, the data scientists collected and validated the data sets. Then came the process of labeling the addresses.
“The way the human mind perceives these things is different from the way a machine understands data, so that was the challenge before us,” says Dr Ravindra Babu. “Moreover, any machine learning model will have inaccuracies.”
The challenges, Dr Ravindra Babu adds, included identifying representative features, encoding a field executive’s domain knowledge, designing the pre-processing stages (which themselves contained a few models at work), and choosing an appropriate blend of supervised and unsupervised learning approaches.
Cleansing the datasets was an enormous task but the pace picked up once the machine was trained with clean data. “It took four months to build a machine learning model and, within a couple of months, the engineering solution was built,” says Dr Ravindra Babu.
Protecting customers from address fraud
If machine learning can be implemented to play good cop, it can also play bad cop to deter fraudulent practices.
The Flipkart shopping platform is a consumer retail platform; in other words, it intends to make quality products accessible and affordable to end customers, not middlemen or resellers. In the normal course, a retail customer is limited to purchasing a certain number of high-value items (such as a mobile phone or electronic appliance) during a single shopping session in order to enable more customers to have access to these coveted products.
Such guardrails are put in place to deter resellers and ensure that retail customers have access to quality products at affordable prices. However, some customers, who are potential resellers, exploit this system for dishonest gains. Reseller fraud is a kind of malpractice where small business owners pose as online shoppers. They purchase articles such as mobile phones at a discount from online shopping sites and then sell them in the retail market at a profit. This is illicit, and tracking down the perpetrators of such fraudulent practices is a challenge for e-commerce businesses.
“We have a machine learning model in place to identify and blacklist resellers,” says Dr Ravindra Babu, explaining that the machine has been trained to capture the signals used to identify a reseller. Similar models, he says, have been deployed to tackle different kinds of fraud and they have helped the company plug potential revenue setbacks amounting to millions of dollars.
AI & ML problems whet the appetite of data scientists
The impact of inconsistent addresses on last-mile logistics is a problem unique to India. In the absence of standard textbook solutions, Flipkart’s data scientists rose to the occasion. Their efforts helped Flipkart to develop a solution to classify complex and inconsistent addresses with an accuracy rate of 98% (read the published paper for details), besides being able to identify signals that could detect potential address-related fraud.
So, what inspired these data scientists? Simply put, it was the novelty of the problem, and the availability of real data.
In contrast to research institutes, Flipkart provides interesting practical challenges unique to India where data scientists need to arrive at formal mathematical problem statements from broad business directions. Subsequently, they need to define the patterns and features, and develop an effective machine learning model.
“Are these unseen and native problems? Can I solve them from first principles?” asks Dr Ravindra Babu, noting that such questions get a data scientist’s blood up. “And that’s what makes it thrilling.”
And what is the mettle of the data scientist who can live up to Flipkart’s expectations? Typically, they would hold doctoral or master’s degrees from reputed universities and would have worked deeply in one field, solving problems in depth and breadth. The litmus test, says Dr Ravindra Babu, is whether they can solve problems that they have never solved before, and whether they have an appetite for solving native problems. That is the opportunity that Flipkart can offer.
Additional inputs from Sumanta Dey. Photographs and infographic: Arjun Paul