In 2018, when the world was different and COVID-19 could have been the name of a video game, speaking to your Flipkart app in Hindi would have sounded like something out of a sci-fi film. But now, thanks to the newly launched Flipkart voice assistant feature, you can talk to the app as naturally as you would to your local shopkeeper. Don’t believe us? Try it out yourself!
Two years ago, Flipkart was laying the groundwork for an acquisition that would enable this paradigm shift in Indian e-commerce. Liv.ai, an Indian voice-recognition startup founded in 2015 by three IIT-Kharagpur graduates, was on the verge of joining the Flipkart family. One of the first Indian companies to develop a highly accurate speech-to-text platform in 10 major Indian languages, they were pioneering work in the field of AI-driven natural language processing.
But what does Artificial Intelligence have to do with the Flipkart voice assistant? To understand this, one has to take a closer look at India’s shopping culture.
The next 200 million customers
“Flipkart, over the last 10 years, has done an excellent job of building an e-commerce platform for the first set of users that had the luxury of access to the internet on mobile devices and also that extra income to spend on technology and e-commerce. After these people try out e-commerce and realize they get excellent value and selection, the word-of-mouth spread. Now, Flipkart is a global company,” explains Meet Jariwala, Director of Product at Flipkart.
Currently in his second stint at Flipkart, after a career spanning over two decades building products in small and large-sized companies across multiple countries, Meet has seen the technological landscape evolve to become more natural and accessible. According to him, although Flipkart’s rise was rapid, there was still a large group of potential customers the company was not serving.
“The reach was still limited to 150-200 million users, which is not a full buy-in from everyone who has an internet-capable device. So, naturally, the second question we asked ourselves was this — what about the other internet users who are not shopping online, and why are they not shopping online?” says Meet. As this Flipkart team chased the answers to these questions, they discovered that e-commerce wasn’t as convenient to everyone as they thought it was.
“We found that from a user’s perspective, there were many issues that they faced while interacting with our app. For starters, some of our pages can be too overwhelming in terms of content,” explains Pooja Dhaka, User Experience Researcher at Flipkart.
“A lot of our users were also not totally comfortable in English, so it could be a challenge for these people to find a product on Flipkart, especially if they have to search for the name of the product in English,” she says. Although Flipkart has since made this particular process easier for users by introducing support for regional language content and keyboards, the user research conducted by Pooja and her colleagues revealed interesting insights into the shopping habits of these users — something that called for an entirely new solution — the Flipkart voice assistant.
A billion ways to make a bill
“What we’re seeing around the world currently is a broader shift in computing. The big shift towards mobile from desktop and even from cell-phones to smartphones — from typing to swiping — was a huge shift in the computing paradigm,” says Nandini Stocker, Senior Product Design Manager at Flipkart, adding, “Voice is the next step in that shift.” A veteran in the field of voice technology, she has closely followed the industry as it has matured.
“Sometimes it’s just to do something quick and cut through something that you would otherwise tap or type, because it’s simply faster. Or it could just be an accessibility feature that you could use to navigate. Either way, the overarching theme here is that it’s an enabling technology. It’s something that we naturally know from birth,” she explains.
“You don’t have to teach people how to talk,” adds Meet.
“The team already had a limited prototype for a voice interface,” explains Pooja. “We conducted a concept study to understand grocery buying behavior and explore how this interface would work for online grocery shopping. Our initial assumption was that people would use voice more than other modes of interaction like touch, but it didn’t happen that way; they used a combination of both. We concluded that both these modes of interaction should be heavily integrated,” she says.
Soon, the team realized that in India, diversity didn’t stop at the number of languages. Even within a single language, there were dozens of regional dialects, and within these dialects, there were often different ways of expressing the same idea. Pooja illustrates this with an example. “At the end of a purchase, we assumed everyone would use the command ‘Check Out’ because we had a button named ‘check out’ in the app. But we observed that people didn’t use this particular command. They said many different things, from ‘Bill bana dijiye’ to ‘Mera ho gaya, ab kya karoon?’”
To deal with this diversity of input, the Flipkart voice assistant would need to be correspondingly flexible and intelligent.
The technology behind the Flipkart voice assistant
“There was a big shift in AI that came in 2012. The whole perspective on how AI could work changed completely with deep neural networks which came into prominence in 2012 and 2013,” explains Kishore Mundra, Director of Engineering at Flipkart and one of the co-founders of Liv.ai.
“Soon after that, in 2014, we realized that language understanding was a problem that only the top few companies had solved. The size of the problem was enormous,” he elaborates.
It was in 2014 that Kishore started Liv.ai with two batchmates at IIT-Kharagpur. Their tagline was to “Give voice to a billion people”. Soon, they developed an AI-driven, in-house speech-to-text solution that provided accurate results in 10 Indian languages. The key to their success was their use of deep neural networks, a solution they pioneered in the Indian market.
“A neural network is inspired by how the human brain works, which is to say, it’s very good at recognizing patterns, even up to the millionth parameter level. You don’t have a rule-based system in your mind,” explains Kishore.
“We don’t think about how many signals we’re getting or what rules to follow when we’re experiencing life with our sensory inputs. The number of signals you get, the storing capability of your mind automatically takes care of it. Similarly, in a neural network, you don’t have to worry about the storage. You’re only responsible for the input parameters. Your neural network will figure out, after being exposed to millions of inputs, what the correct output should be for a particular input. For example, if you’re feeding it audio samples, it will eventually be able to understand through pattern recognition at a phonetic level what words are used in what context,” he explains.
In other words, a neural network, through its powers of pattern recognition, can be trained to understand what a human is saying to it, if given enough examples to learn from. This means that unlike a simple database searching application, a neural network can also — given enough training data — figure out different languages, dialects, regional slang, and turns of phrase, something that is invaluable in the Indian context.
Rules to conversation: a “natural” evolution
“Natural language processing means that you understand real user input. ‘Real’ in this case means that it can be an unstructured input. In databases like SQL, you need to give a command in a very structured way. In natural language processing (NLP), you should be able to speak in the most colloquial way possible. In other words, there should be no need for the user to adapt to your product,” explains Himanshu Agarwal, Tech Lead – Software Development Engineer and Flipkart’s in-house NLP expert.
To support free-flowing user input that need not be syntactically correct or stick to just one language, the team preferred ML-based approaches over rule-based approaches. “We took a deep-learning-based approach where the NLP models were trained with a small number of colloquial phrases and natural language sentences. Based on the learnings from these training examples, the models were able to generalise to a much wider variety of queries obtained from real users. Moreover, our NLP system was inherently designed to handle mixed Hindi and English input without the need to specify input language,” he further explains.
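To make the contrast with a rule-based system concrete, here is a minimal sketch of learning intents from a handful of example phrases rather than from hand-written rules. It is not Flipkart's system and it is not deep learning: as a deliberately simple stand-in it uses character trigram counts and cosine similarity, which is enough to show how a model can match phrasings it has never seen. Only "Bill bana dijiye" and "Mera ho gaya, ab kya karoon?" come from the user research quoted above; the other training phrases, the intent names, and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical training data: a few colloquial, mixed Hindi-English
# phrases per intent. Not Flipkart's data.
TRAINING = {
    "checkout": [
        "check out",
        "bill bana dijiye",
        "mera ho gaya ab kya karoon",
        "payment karna hai",
    ],
    "search": [
        "mujhe pencil box dikhao",
        "show me running shoes",
        "chawal dikhao",
        "atta search karo",
    ],
}


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of the lowercased, padded text."""
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_centroids(training):
    """Sum the n-gram counts of every example phrase for each intent."""
    centroids = {}
    for intent, phrases in training.items():
        total = Counter()
        for phrase in phrases:
            total.update(char_ngrams(phrase))
        centroids[intent] = total
    return centroids


def classify(text, centroids):
    """Return the intent whose summed n-gram vector is closest to the text."""
    vector = char_ngrams(text)
    return max(centroids, key=lambda intent: cosine(vector, centroids[intent]))


CENTROIDS = build_centroids(TRAINING)

# Neither phrase below appears in the training data.
print(classify("bill banao", CENTROIDS))
print(classify("pencil dikhao", CENTROIDS))
```

Because the comparison works on fragments of words rather than exact commands, the caller never has to declare a language or use a fixed phrase. A production system would replace the trigram counts with a trained neural model, but the shape of the problem (examples in, intent out) stays the same.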
This evolution away from prescriptive restrictions on user input is also part of the broader paradigm shift in the world of computing mentioned by Nandini. “Context preservation is very important when we design an online shopping experience. From asking for a pair of shoes in a store to window shopping in a mall, we want to anchor these design decisions in the real world. It’s an ongoing metaphor,” she explains.
In other words, if you were to go shopping for groceries at your local store, you wouldn’t have to be precise in your instructions about what brand you’re looking for or what items you want. You’d have a conversation with the shopkeeper, where you would express your needs but where the shopkeeper would also anticipate your needs in context. The Flipkart voice assistant is trying to recreate that informal feeling of going to the shop while shopping online.
Meet illustrates with a simple example. “Let’s say I go to a stationery shop to buy some school supplies for my kids. I can say: mujhe pencil box ka saara samaan dikhao. But when I’m using an e-commerce app, how do I type a statement like this? I’d have to sit there and break down what kind of things are usually found in a pencil box, how many of those I need, etc. But with the Flipkart voice assistant, I can just say this, and let the machine do the work and show me the options,” he explains.
Taking cues from how people shop offline doesn’t stop at recognizing users’ needs and demands; it also means being able to speak to users in a way that makes them comfortable. From the get-go, Flipkart was determined to shape a warm personality for its voice assistant: a voice that was friendly, professional and uniquely Indian. To achieve this, Flipkart created its own in-house text-to-speech (TTS) software, now in its final launch stages.
“Currently available text-to-speech systems are well trained in English, but when they speak in Hindi, it’s usually with an accent. We wanted a voice that was comfortable in both English and Hindi,” says Anupam Singh, Engineering Manager at Flipkart. Under Anupam’s stewardship, Flipkart’s in-house text-to-speech team worked hard to ensure that the Flipkart voice assistant speaks in the correct tone, professional yet friendly.
In the process of creating this uniquely Indian voice, the team applied their core knowledge about Indic languages to inform how the TTS system performed, even when rendering English words. “When we were working on the Flipkart TTS, we leveraged the fact that Hindi is an inherently phonetic language: what you write is exactly what you speak. And we find that neural networks learn a better pronunciation model in less time when trained on Indic script. So we trained our TTS system using the original Devanagari script so you can give it a mixed language with both Hindi and English words and it can speak both quite well,” he explains. In other words, if the Flipkart voice assistant were a person, her first language could be Hindi!
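The article does not describe how mixed-script text reaches the TTS model. One plausible first step, shown here purely as an assumption, is to tag each word by script so that Latin-script words can be identified before any further handling. The sketch relies only on the fact that the Unicode Devanagari block spans U+0900 to U+097F; the function names are hypothetical.

```python
def script_of(token):
    """Label a token by the script of the letters it contains."""
    has_devanagari = any("\u0900" <= ch <= "\u097f" for ch in token)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_devanagari and has_latin:
        return "mixed"
    if has_devanagari:
        return "devanagari"
    if has_latin:
        return "latin"
    return "other"


def tag_tokens(text):
    """Split on whitespace and pair each token with its script label."""
    return [(token, script_of(token)) for token in text.split()]


for token, script in tag_tokens("मुझे pencil box दिखाओ"):
    print(token, script)
```

What happens to the Latin-script words afterwards (for example, rendering them in Devanagari) is outside what the source states, so the sketch stops at tagging.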
The Flipkart voice assistant: salesman or a tech-savvy friend?
In fact, the voice assistant’s “personality” is one of the key aspects of the design process. It is a personality that has been developed based on a uniquely Indian shopping behavior. “In India, there’s a culture of tapping a friend or neighbor on the shoulder and asking for help when it comes to online shopping,” explains Meet. In the research leading up to the creation of the Flipkart voice assistant, one of the key observations was that Indian shoppers preferred to shop in the company of a perceived expert.
These people weren’t “experts” in the sense that they knew more about the products than the users themselves. But they had a better understanding of how to navigate the website and how to seek out better deals and prices. According to Nandini, this is a design choice the team had to spend a lot of time deliberating while trying to replicate that experience online.
“We had to take a call on which side of the counter the voice assistant was on, metaphorically. Do we want it to play the role of the shopkeeper or someone who is on the customer’s side who is helping them with the purchase?” she explains.
Eventually, they went with the second option because it was closer to what the research showed the customers wanted. “It’s not an expert on e-commerce, but it certainly knows something about the inner workings of this huge site. Now, with one voice command a customer can filter their options by affordability, and effectively create a whole new store that’s entirely affordable to them,” she says.
Flipkart customers can expect the voice assistant to be like a tech-savvy friend. Someone who is looking out for your best interests and can help you hunt down the best deals and offers.
“Our vision is to transform digital marketplaces with conversational commerce,” says Meet on Flipkart’s focus on building a relationship of trust with the customers. “Conversation is at the heart of everything we do. We believe this has the power to completely transform how users interact with e-commerce and once they do that, they can be confident that Flipkart understands them. With this transformation, customers will be able to see the breadth of selection and value that we can offer them,” he adds.
In Nandini’s view, the Flipkart voice assistant could democratize e-commerce further in India by making it more inclusive, something she sums up with a powerful metaphor. “Voice isn’t there to replace every other way of interacting with our platform,” she says. “Let’s say you have a building, and the only way you can get in is through the stairs. If 80% of people can climb the stairs, that’s fine, but what about the 20% who can’t get up there? These are people who might be pushing a baby stroller, or people who walk with a cane, or people who use a wheelchair. They could be pulling some luggage or could simply be tired. If we introduce an elevator or a ramp, we’ve suddenly opened the door to so many people who would have never got there before. Voice is similar in that sense: it’s an enabling interface for so many people.”