May 20, 2019

Episode 4: Realistic Approaches To Machine Learning At Gojek

Intro: Welcome to GO FIGURE. My name is Nadiem Makarim, CEO and founder of GOJEK, Southeast Asia's first Super App. GOJEK does ride hailing, food delivery, payments, even on-demand massages. You name it, we do it. GO FIGURE is a podcast dedicated to exposing the inner workings of ambitious tech companies in the emerging world. We like to talk about things we like and talk about things we don't like. There are a lot of myths out there that we want to dispel. So keeping it real is kind of our mantra. Hope you enjoy it.

Nadiem: Hey guys, welcome to GO FIGURE. Thanks for being here. I'm here with Andrew and Maneesh today. Just a quick introduction: Andrew is the Head of our Marketplace Product Group, which basically manages all of the different incentives and processes between demand and supply, in our case our drivers and our customers, across multiple services. So a huge amount of automation and machine learning gets employed within Andrew's product group. Andrew recently joined us from Uber. And Maneesh is our Head of Data Science, who manages the functional organization of data science across GOJEK. Thanks for being here, guys. So let's talk about buzzwords. There are so many buzzwords out there. There are conferences, there are entire fields of content built around this. There are words like AI and machine learning, and all of these, I think, have an element of reality. Some people think it's the cure for all diseases in tech. Some people think it's the beginning of a complete transformation in how we operate as a society. How do we separate myth from reality here? So what I wanted to throw to you guys first: what are some of the misconceptions about machine learning, AI, and data science in general that you have discovered as you're actually executing it?

Andrew: I'll go first, and then we'll turn it over to data science next. I still remember the first machine learning model that I worked on. We were doing churn prediction, this is years ago, with a very simple neural network. When I started reading a lot about it, I found out that the techniques most current machine learning and data science are employing are actually quite old. To me it had seemed like something that just exploded onto the scene, and when we talk about buzzwords, it is a newer buzzword, something that puts robots and all these kinds of crazy things in your mind. But the way we think about it, and the basics of it, are actually 50-plus years old. Right before we got on air, Maneesh was talking about one of the advents that has really facilitated this explosion: increasing computational power. With companies like GOJEK and tech in general, you've got massive amounts of data and the ability of computers to let us crunch it.

Nadiem: So there was nothing actually new about the concept of these automated algorithms. They just didn't have the computing power back then to make them as useful as they are now.

Maneesh: There is a certain element which is pushing the boundaries of knowledge. On that I'll say the core fundamentals really are something which existed 30, 40 years back, along with the computational hardware which is able to crunch all these massive amounts of data. There is an element of how we now make a feedback loop on the decisions being made out of these algorithms and how we can improve, right? Some of these were not things we could tackle a few decades ago, and now we are in a position where we have these real-time streams of data. We can take an online decision, manipulate it in real time, see how that model fares out in the open, and then self-correct. So there is an element of self-learning as well, really pushing the boundaries, in something called reinforcement learning. And that's probably the future which everyone is really scared about, that AI will take jobs, we'll all get automated, and so on and so forth.
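
To make that feedback loop concrete, here is a minimal sketch in Python, with fabricated data and a plain online gradient update; nothing GOJEK-specific is implied. The model makes a decision, observes how far off it was, and self-corrects.

```python
import numpy as np

# A minimal online-learning sketch: a linear model that predicts,
# observes how far off it was, and nudges its weights to self-correct.
# Feature values and the learning rate are illustrative only.

rng = np.random.default_rng(0)
weights = np.zeros(3)          # one weight per input feature
learning_rate = 0.01

def predict(features):
    return weights @ features

for step in range(1000):
    features = rng.normal(size=3)                        # a new observation streams in
    actual = 2.0 * features[0] - features[1] + rng.normal(scale=0.1)

    guess = predict(features)                            # 1. make an online decision
    error = guess - actual                               # 2. feedback: how far off were we?
    weights -= learning_rate * error * features          # 3. self-correct (one SGD step)

print("learned weights:", weights.round(2))  # approaches [2, -1, 0]
```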

Andrew: You touched on being scared. One of the other big changes is that businesses and customers have become more comfortable with predictive data, with data predicting everything in their life. One of the reasons this didn't take off a long time ago was this concept of a black box. One of the things in business that we struggle with, with all sorts of machine learning, is you give it a set of parameters and it gives you an answer, but how it came up with that answer is actually a black box. There has been a big shift, with people becoming much more comfortable with the notion that the answer given to you from this black box is right, and it's probably better than what you were going to do on your own. That change of mindset has certainly allowed this to expand massively.

Nadiem: Well then, how do you say that you are actually utilizing machine learning? There are so many companies that say they are leveraging machine learning. But in your definition, when does it get to the point where you say, okay, that's legit machine learning? What qualifies?

Maneesh: Machine learning has to push the business metrics in some sense, because otherwise it's just an intellectual exercise for everyone, where you just take data, you take a few models, you throw them at it, and you get scores, but it's not driving business impact, right? The moment we get into the domain where we are really talking about impactful data science is where we can use these models to impact the experience of our customers, or drive the growth of the company, or bring some efficiency to the way we allocate budgets or spend. That's how we know these are successful examples, when machine learning is really creating a big difference for many of these companies. Many of the companies I personally know who really struggle are the ones that have large data science teams but are not effective, because they are not going in the direction of really integrating with the business. The real art here is actually the problem formulation from the business perspective, right? Models exist everywhere. You can plug in data, connect models to it, and you get an output. But how that output really integrates with your business processes, with your particular localized domains, and then creates an impact is where the real difference comes in, and that's how you become successful.

Andrew: Let's break down an example of that, a pipeline, right? You talked about models, and we've talked about some more buzzwords, but let's go from, say, algebra on the left, A plus B and a human making all the decisions, to machine learning. We start by identifying something. One of the prerequisites for using models and machine learning is a very complex problem with lots of inputs, something that is not easily solvable. The first step is we tackle a model, and the first model we make will be about the simplest one we can do. To create a model, the basic premise is we give it a ton of data and we give the model an objective function, whether that's providing a score or something else. Say it's the three of us and we're trying to predict who's the tallest. Maybe we would take how tall our parents were as inputs, what kind of diet we had growing up, where we were born for socioeconomic reasons. The model could take those, create a kind of black-box methodology, score each of those factors, and give each of the three of us a predicted height score. Then we would take that predicted height score, look at what the actual heights were, and say that model was good or that model was bad. That's how we first iterate on the creation of a model of any sort. From there we optimize it, and eventually plug it into machine learning proper, where it will optimize itself, pick up new inputs, and solve a more complex objective function. That's the pipeline we go through, and this is something Maneesh and I do a lot at work: here is a problem, what sort of model would we have to tackle it, and then how do we ultimately automate it through machine learning.
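
To make that pipeline concrete, here is a minimal sketch of the height example in Python. The feature values, the actual heights, and the choice of a simple linear regression are all fabricated for illustration, not a real production pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# A toy version of the pipeline: features in, predicted height out,
# judged against the actual heights. All numbers are made up.

# Columns: parent height (cm), diet quality score, region index
X_train = np.array([
    [170, 7, 0],
    [160, 5, 1],
    [180, 8, 0],
    [165, 6, 2],
])
y_train = np.array([175, 162, 183, 168])   # actual heights (cm)

model = LinearRegression().fit(X_train, y_train)

# Score "the three of us" with (made-up) feature values
X_new = np.array([[172, 7, 0], [168, 6, 1], [178, 8, 0]])
predicted = model.predict(X_new)
actual = np.array([174, 169, 181])

print("predicted:", predicted.round(1))
# "Was the model good or bad?" -- one simple measure: mean absolute error
print(f"MAE: {mean_absolute_error(actual, predicted):.1f} cm")
```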

Nadiem: There seem to be two themes that we keep coming across. The first theme is prediction: the concept of predicting a specific outcome, hopefully in real time. So it doesn't count as machine learning if it's not really predictive, right? That's the first one. Then there's this other theme of self-learning, or self-optimization. Do these two criteria need to be met for it to be considered a legitimate machine learning model?

Maneesh: Not necessarily. There is also this myth, and you touched upon what kinds of things people are really confused about when it comes to machine learning. Not every algorithm we deploy in production is self-learning. Let me revert back to Andrew's example on predicting the heights of people. Unless there is a mechanism where you predict a number, your height for example, and you have a feedback loop, where you predict, you see how much it was off by, and you make that a feedback mechanism to the model itself so that it self-corrects using that piece of information to go in the right direction and minimize that error, you can't call it self-learning. Because then you can leave it out in the open, it can encounter various kinds of data, and it keeps moving towards that optimized objective, the error function we want to formulate here. Many people also confuse this with supervised learning, wherein you know what you are going to predict. In this case we know we are going to predict heights, but there can be situations where you do not know what the outcome is going to be. One classic example: if a company wants to make t-shirts of various sizes, how does it come up with the sizes of t-shirts it should make? It takes, say, a sample population of a particular city, and there are unsupervised techniques for mapping that to what would serve most of the population's use cases. You can club similar groups together into buckets; that's just called clustering, and it allows you to bundle similar patterns into one bucket, because when you start off, you don't know how many would actually fall into each bucket. That is something which falls in the realm of unsupervised learning, where you do not know exactly what your outcome is going to be. So that's classical machine learning. The stuff which is really picking up pace right now is reinforcement learning, which I was touching on, where you really interact with an environment. There has been a lot of research where bots have been able to identify their environment and navigate it with little or no training from before. They really interact with the environment, just like a human does, try to see what the optimal path of navigation would be, make mistakes, learn from those mistakes, and then auto-correct. So that's more like reinforcement learning and self-learning.
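
And a small sketch of that unsupervised t-shirt example: k-means clustering over fabricated body measurements, where the cluster centers suggest which sizes to make. The measurements and the choice of three clusters are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Unsupervised learning sketch: we have no "true" size label for anyone;
# we just club similar body measurements together and read candidate
# t-shirt sizes off the cluster centers. All measurements are fabricated.

rng = np.random.default_rng(42)
# Sample population: columns are chest (cm) and height (cm)
population = np.vstack([
    rng.normal([90, 165], [4, 5], size=(100, 2)),   # smaller builds
    rng.normal([100, 175], [4, 5], size=(100, 2)),  # medium builds
    rng.normal([110, 185], [4, 5], size=(100, 2)),  # larger builds
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(population)
for center in kmeans.cluster_centers_:
    print(f"make a shirt for chest ~{center[0]:.0f} cm, height ~{center[1]:.0f} cm")
```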

Nadiem: And in all honesty, even at GOJEK, at the size that we are today, we're still quite far away from the ability to do that.

Maneesh: Yeah. There are only very few companies which can actually claim that they use reinforcement learning in a production environment, maybe a few top tech companies, and in very few use cases, I would say. Classical machine learning is what is getting adopted much more and is really driving business impact. There is a lot of research which happens on reinforcement learning, but it's not really at a stage where we can say it's readily adoptable by a lot of companies.

Nadiem: There is this super crazy concept that I've only learned since I joined GOJEK, about certain machine learning models going stale, or somehow having a life of their own and changing their own identity and their own processing features organically. I'm very interested in that, the fact that these things cannot just self-optimize without constant love and care from the team. I found that a very interesting, almost life-like characteristic of the algorithms within machine learning models. Can you explain a little bit how that works, how algorithms can go rogue, for example, and maybe share some examples of what you've experienced?

Andrew: There's certainly no shortage of these, and I think the root of this problem is a pretty classic startup issue, which is the launch-and-forget mindset. One of the things about machine learning and algorithms is that they automate something a person was usually doing previously, and at hyper-growth companies there are so many things to do that great employees, as soon as the thing they're working on is no longer the biggest problem, will move to the next problem. One of the downsides in practice of algorithms in general, and certainly of predictive models, is that you spend all this time creating a very good predictive model based on the data of, let's say, last year or last month, and it works, and you look at it the next month, and then maybe the month after that. But at some point it becomes natural human tendency to just stop looking at it, and that's the root cause of what happens here frequently. There are some examples I certainly remember, some even at GOJEK. We've created models to make sure that we are allocating our trips efficiently, in ways that we know will complete the highest number, and for the two months that we tested it, it did a great job at that. But the longer we let it run, we noticed that some of the ways it did this had negative impacts on the population. And so it could, for perhaps, you know...

Nadiem: So it was making its own trade-offs based on its own limited logic...

Andrew: ...and the way that it made the trade-offs looked great in the small sample we examined, during the period we were closely monitoring it. But as the company grew, some of the downsides that we didn't properly measure got bigger and bigger and bigger, and eventually it became a problem. When it came back onto the radar of the people who created it, very quickly did we realize what was wrong with it. But this launch-and-forget was really what happened.

Maneesh: Yep. Anytime you build these models, you are only working with a limited scope of the data set, and you have little or no control going forward. There might be an entire shift in the distribution of the data. There could be a structural change in the market which we are not really capturing with whatever variables are embedded in the model right now. That's an uncertainty which never really goes away. So I don't think there are models which can claim that they are really bias-free. That's a big debate in the whole machine learning community: how do you actually make models which are very, very generalizable? It's always a constant challenge, because you don't know what happens once these models are out in the open. Going back to the example of allocations: it is entirely possible that a structural change in city planning could change the flow of traffic. Roads could become one-way. But were we accounting for that when we created the model? Probably not. Do we even have the information about that, given it has just changed? Not necessarily. These will eventually result in some kind of bias which is probably not in our control even at the time of modeling. There are certain ways you can account for that bias, and it really depends on how many a priori and a posteriori checks we do to quantify the margin of error we have once we go into production.

Nadiem: I remember we had this constant issue whereby our systems were not good enough at handling rain, right? Rain was a consistent issue that kept coming up, and for some reason the models were ill-prepared to come up with the best possible outcome for a rain scenario. It required some additional modeling for them to actually adapt and work later on. So contrary to what a lot of people believe, these are not just self-sustaining systems that don't require care. They are very, very needy organisms that require constant updating, constant love and nurturing and monitoring. And just to give some of our listeners a sense of scale about how we do this: we do millions of transactions a day on the GOJEK platform. I wanted to ask Andrew, when we book a car or a motorcycle on the GOJEK app, how many variables do we actually factor in before deciding which driver to give this order to?

Andrew: Oh, I think this is a great question, because the answers yesterday, today, and tomorrow are almost exponentially increasing numbers. The first iteration, and the way a lot of point-to-point logistics companies started, is some measure of distance, just trying to get the closest driver to the customer, right? That can be ETA distance, straight-line distance, whatever it is. That's the most simplistic, and it makes tons of sense why everything starts that way. Then you move to a slightly more complex model, where maybe it's distance plus one or two other things. Where we are today is distance plus, I believe, and you may know the exact number, north of 10 different characteristics, and even within those characteristics we may have subcategories that change a little bit depending on the exact situation and on how we do this across all of our different services. The things you want to consider for a food order and a GO-RIDE order may be entirely different as well. And as we look to the future, and we're currently solving today's and tomorrow's problems, we're going to be looking at potentially hundreds if not more characteristics for individual orders. Something we touched on earlier was computational power and the load on our systems: if we're doing millions of trips a day, and each one, just for the current iteration of how we allocate, involves tens of characteristics, you can do the math. We're getting close to billions of computational assessments just on how we allocate, let alone pricing and routing and everything that comes downstream.
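
As a toy illustration of scoring candidates on multiple characteristics, here is a hedged sketch. The characteristics, weights, and the simple linear scoring are hypothetical stand-ins, not GOJEK's actual allocation model.

```python
from dataclasses import dataclass

# A deliberately simplified allocation sketch: score each candidate driver
# on several weighted characteristics and pick the best. Features and
# weights here are invented for illustration.

@dataclass
class Candidate:
    driver_id: str
    eta_minutes: float       # distance/ETA, the classic first-generation input
    acceptance_rate: float   # 0..1
    completion_rate: float   # 0..1

WEIGHTS = {"eta_minutes": -0.5, "acceptance_rate": 2.0, "completion_rate": 3.0}

def score(c: Candidate) -> float:
    return (WEIGHTS["eta_minutes"] * c.eta_minutes
            + WEIGHTS["acceptance_rate"] * c.acceptance_rate
            + WEIGHTS["completion_rate"] * c.completion_rate)

candidates = [
    Candidate("driver-a", eta_minutes=3.0, acceptance_rate=0.95, completion_rate=0.97),
    Candidate("driver-b", eta_minutes=2.0, acceptance_rate=0.60, completion_rate=0.80),
]
best = max(candidates, key=score)
print("allocate to:", best.driver_id)   # driver-a wins despite the longer ETA
```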

Maneesh: Yup. And there's a huge element of personalization here as well, which we try to factor in while we optimize the whole marketplace. Imagine there are certain places in a city where it's very, very hard to pick up, right? We might have seen historically that some of these places are hard to discover, and the ETAs usually become quite high. What can we do to optimize allocation over there? Are there drivers who really get confused and end up cancelling, and how do we give them orders which can be completed, to maximize the expectations and experiences of our users? These are some of the little nuances which we can also address using some of these newer algorithms. And as Andrew mentioned, the number of dimensions and variables, we can pretty much blow it up as much as we want, and the more granular we go in terms of the features we use for these models, the more personalized the experiences we can give to our users. Deviating a little bit onto the recommendation side of things: what we do for the food users is always try to cater to their historical patterns of usage and keep recommending to them.

Nadiem: You're referring to us recommending to our GO-FOOD users which restaurant or which dish they might like, right? And personalizing it to their specific tastes. Just to clarify that for the listeners.

Maneesh: Right. In terms of what kind of food preferences they have, what kinds of cuisines they like, whether they really want fast food, and there's the ETA bit of it too. You try to tailor the whole app based on a user's flavors and choices, and the more granular the personalization you add, the better the experience. That is the aha moment which everyone is trying to optimize for.

Nadiem: I find that fascinating. The analogy that I heard you use is that our data science initiative on personalizing dish and restaurant recommendations is almost akin to trying to replicate the human tongue, right? Essentially replicating the human tongue and the human hunger reaction within an algorithm, because you're tracking two different things. You're not only predicting affinity to a particular taste or a category of food, which is why our tagging initiative in GO-FOOD is one of the biggest things, how we tag the data down to the specific type of dish, but also that impetus: am I hungry now? What time of day is it? What do I feel like during that particular time? What did I have yesterday? What do I generally like? And then hopefully the next step of the evolution is beginning to target things like: what kind of health program am I trying to achieve these days? Am I feeling healthier today or not? Am I feeling more crispy today, or more spicy? I find that food is a perfect microcosm for personalization because it's much easier to explain. How would you be able to do that without machine learning? Essentially the constraint you hit is that the number of variables that goes into why I decide to buy a particular food is so immense that the correlative and causative effects of those variables on my decision are way too complex for the human brain to determine. We have to rely on multivariable models, exactly as you said. So where does that leave us in terms of the applicability of this technology, guys? Everyone's talking about this, not just tech companies; traditional companies out there are all jumping on the bandwagon of hiring data scientists and having data engineering teams. When is machine learning not necessary?

Andrew: We can definitely talk through some examples of when it's not necessary, but I think the value proposition that machine learning and algorithmically driven decisions provide companies is both the automation piece, which can do a lot of things faster than humans can, and the improvement of key business metrics. I love the food example, because one of the things we're solving for is trying to remove cognitive load from the customer. If you've sat down at a restaurant and opened the menu... I always say I actually love restaurants that have 10 or 15 items, because I don't need to be flipping pages to decide. We're fundamentally trying to improve the customer experience by hitting their need before they know what it is. So we have this improvement in experience, and when we talk about some of the allocation pieces, matching supply and demand, we're talking about efficiency improvements. There's definitely not a set list of what can and can't be there, but the reason this is increasingly talked about is that there are very tangible ways businesses can save money and make more money through some of these efficiency-driven pieces, as well as through actual experience improvement. Not every model does both; usually a model will do one or the other. But in that bullpen of experience improvement, and every business has customers, and efficiency improvement, there's a whole lot of things that fit.

Nadiem: Just on that point you mentioned, that not all models can be applied to everything else, right? In the beginning stages of our data science journey, as a CEO I was super frustrated at why we couldn't have a single master model. Hey, if we are matching willing users with drivers, why can't you use that same engine to match willing users and restaurants? Just have a generic matching algorithm that takes into account different variables and maybe has different output metrics. For the food part, the output metric would be that I ended up converting on that restaurant, and for our drivers and customers, that the ride was taken and completed. Those are the key goals and objectives. I was very quickly corrected by the team: hold on a second, generalizable models can be disastrous and, generally speaking, are not recommended, especially when you are in the beginning stages of this evolution. I don't know what you guys think about that.

Andrew: So we are still early, but that's the end state. That's where we want to get to; the fewer models, the better. Maneesh can talk about some of the technical issues with it.

Maneesh: The very first thing startups need to realize, when they are at the start of the journey... I know there is a lot of focus on jumping onto machine learning right at the start, right? But what do you really need? Historical data, and historical data with clear signals and patterns, before you can even embark on the journey.

Nadiem: All right, so you need that. You need a substantial enough amount of historical data that you can access. And that's very different; it's not just data that you store, right? The amount of data engineering work we spent just trying to make all of that data accessible and pullable, forget about real time, just accessible the first time, it was super hard. Real time was even harder. So you're saying that should be the first requirement: you need to have a lot of it.

Andrew: The one other piece, beyond storing tons of it, is data quality. I don't know what our data creation per minute is, but per hour it has to be gigabytes, and daily maybe even terabytes. But it also has to be representative of what you're trying to solve. When businesses are in hypergrowth, the line you hear in business is that past performance is not a good predictor of future performance. So you need to get to some point in your growth curve where your data is somewhat representative. When you're growing five x month on month, you just can't train models on that data, because it's too volatile.

Nadiem: Going from 100,000 orders per day to a stage where you're doing millions per day, everything changes.

Andrew: Everything changes: your customers, what sort of customers they are, how deep into the acquisition curve you are. So your customers change entirely as well, the number of restaurants you have in the food example, the whole game changes. You trained the model all the way back here, and we're still making decisions on it, but the things you trained it on would make it no longer useful.

Nadiem: Okay, so you need a sufficient amount of data, that's the first part, data that is still useful and, more importantly, still relevant to the practice. What else do you need before you can even start embarking on this?

Maneesh: You were touching on the point of the relevance of the data you are actually tapping, right? Imagine you are trying to solve the food personalization problem, but you do not store what users click, what they search for, and eventually what they add to the cart and check out. Then the whole thing is not something the data science team can even start to tackle. So that's the very first thing: having the relevant features. After you have that, you go through an exploratory exercise where you try to define the problem. Like I said, that's the very first start: what is it that you're actually trying to solve? That basically defines the premise as to what goes into the models and what the expected outcome is. Then there's this whole exploratory exercise where we see whether the models are really giving us any useful patterns that are in the data, or whether it's just a random number generator predicting like the flip of a coin. We have to make some conscious calls on whether there is an underlying pattern in the data or not, and then we can see the efficacy of these models, whether they really start to show us tangible results.
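
One concrete way to make that "pattern or coin flip?" call is to benchmark a model against a random baseline. Here is a sketch on synthetic data; the model choice and every number are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Exploratory sanity check: does the model beat a coin flip?
# Synthetic data: the label depends (noisily) on the first feature only.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
y = (X[:, 0] + rng.normal(scale=1.0, size=2000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
coin = DummyClassifier(strategy="uniform", random_state=0).fit(X_tr, y_tr)

print("model AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
print("coin-flip AUC:", roc_auc_score(y_te, coin.predict_proba(X_te)[:, 1]))
# If the model's AUC hovers near the coin's ~0.5, there is likely no
# usable pattern in the data for this formulation of the problem.
```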

Andrew: One of the things Nadiem asked that we haven't answered is when we don't use models: times where we've gone through the training, looked at it, and just said this was no good at all, and things we don't even look at as data-science-solvable problems. I can give one example, which hopefully has gotten better: churn. The ability to predict churn is a very common tech company problem across all e-commerce customer bases, but a lot of the time you can predict churn pretty well while the ability to do anything about it is nonexistent. So you've got this model that tells you what happened yesterday, but that is actually not all that helpful. That's a personal favorite. Do you have any others?

Maneesh: Yeah, going back to that point: machine learning is often treated as a black box. It just gives you a score; it doesn't really tell you what's happening on the inside, right? That's where it becomes very important, depending on the problem, to ask whether you need any causal understanding, causal reasoning as to why something is happening. It can't just be a black box; it depends on the problem you are trying to solve and how you formulate it.

Nadiem: So you do see teams spending a huge amount of investment and time trying to predict metrics where they don't understand the root cause, and therefore don't know how to move the metric ultimately, right? And they only realize that after months and months of investment and spending. So I guess the lesson learned is that whatever metric you're trying to predict, you need to already have a general sense of its causative relationships with other factors, so you can do something about it. Once I predict x, there has to be an action associated with it, to either prevent it or increase the probability of it happening, right?

Maneesh: Let me give you an example. Say we are trying to minimize ETAs for our customers, and imagine we have no idea about the actual road distance between the point of pickup and where the driver is. We are not in the best position to optimize that, because we don't have that piece of information. What we end up using is a bird's-eye distance, a haversine distance between the two points, which is very theoretical. So that's a missing piece which the model can't optimize for. It's very important that the feature set we look at while building the model really captures the whole dynamics. That's one of the more essential pieces.
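
For reference, the haversine distance Maneesh mentions is the great-circle, bird's-eye distance between two latitude/longitude points. A minimal implementation; the Jakarta coordinates in the example are illustrative.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle ("bird's-eye") distance between two lat/lon points.

    This is the theoretical straight-line distance over the globe; it
    knows nothing about one-way streets, traffic, or the road network.
    """
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Two points in central Jakarta (illustrative coordinates)
print(f"{haversine_km(-6.2088, 106.8456, -6.2297, 106.8295):.2f} km")
```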

Andrew: Sorry, just one more that I think about too: most companies are still creating budget allocations in Excel from scratch, with a whole bunch of C-suite members and business heads, bottom-up budgets, right? This is something very analytically driven, so it's the type of thing that feels like it would be great for machine learning and algorithms, but very few companies are doing serious predictive budget allocation for their annual planning. And this is not so much about the issues of whether you trust last year's data to make this year's decisions, or the data quality piece, but just the size and impact of the problem. As a CEO, if someone gave you a budget and said our best data scientists worked on this algorithm and that's the budget, you would probably still have someone scrub the entire thing bottom-up. It's just too important not to have the ins and outs be something you can really control, and that loss of control is something data science is still tackling for business leaders who don't have a fantastically deep grasp of data science principles.

Nadiem: I think that's a fantastic take on the importance of the output of the data science problem you're trying to solve. A lot of people assume that data science is usually only related to, for example in technology companies, consumer-facing issues where there's a huge amount of inefficiency: acquiring users, keeping users, really complex stuff like credit scoring, all of which, in the GOJEK universe, we do or will eventually do. But what we don't think about enough are some of the biggest allocation planning tools, because again, keyword: predictive. Companies could better predict their budget overspends or underspends year to year, and they actually have a huge amount of data. Every company has very strong accounting; they can calculate the return on every incremental dollar spent, and so on. Why aren't machine learning tools being deployed for non-tech purposes, like the two most important resources in a technology company? Number one is its talent, its people, and how to deploy that most important and scarce resource. The second thing is money. Essentially a technology company today is a combination of these two things, right? Though I would say people are the slightly more important factor in resource allocation. So: predicting the quality and performance of people, from your recruitment funnel, to when they're inside, to how they will track over time, and predicting churn in people. Why are we only predicting churn in our ecosystem players, why not the most valuable players we have, our internal employees? Why aren't we predicting internal churn?

Andrew: When we spend so much time trying to predict how customers will interact with our app, how they will go, if their first experience with GOJEK was GO-PAY, how long it will take for them to take a GO-RIDE and then a GO-FOOD trip. But the people one is another area that is largely devoid of serious data science, and not just at GOJEK but anywhere, with maybe the exception of LinkedIn: predicting what sort of development curve your employees will be on, let alone predicting churn and burnout. And the same with how you hire. Hiring is still today, and you've hired hundreds if not thousands of employees, an exceedingly manual process. There are some objective measures, scores at the end of the interview experience, but in terms of really leveraging all that tech has to offer, both finance and internal people development are areas that are lacking in data science at just about every company that I know of.

Nadiem: We might add some more work streams to this. No, but I think that's something where you've got to start thinking about what matters the most and how data science can help you solve those problems. For example, we could be improving our recruitment efficiency by so much more if we suddenly realized that, hey, people who come from these companies, of this age, with this education background, and with at least these organizational experiences have an 80% higher probability of performing and being promoted in GOJEK versus others. I consider myself an okay, a pretty good, interviewer and recruiter, and I couldn't even juggle two or three variables in my brain at the same time. I don't remember who I hired before. I don't remember where they worked and all the details of their lives. An individual's historical experience captured through their CV is such a rich data point that no one actually ever thinks about. And on the other side, on the budgeting side, I'm super fascinated by your comment as well, because it's not just the planning, not just the predicting part. Even more importantly, I think data science can add value in the allocation part, because right now, big company or small company, just adding more analysts to the budget planning process doesn't make it more scientific, right?

Maneesh: Also, there is a component of whether that budget is really realistic, right? There are infeasible solutions, and we need to discover those. Looking at historical conversions, we know how much money you spend on some marketing channel and how many conversions and sign-ups you would eventually get. So we have a rough idea of how the company is going to grow depending on how much money you put into the demand or supply side, and that's where you can leverage a bunch of features to really optimize your whole budgeting process. And we have all seen this kind of input put to great advantage: how you can break your overall targets down to a very small level, let's say even cities or districts, and then optimize at that level.

Nadiem: Going a little bit outside the tech sector itself, because this is a topic that really interests me: I am constantly amazed and surprised at how little governments leverage data science in the governance of a province, a city, or a country for that matter, when it seems to me that the problems data science solutions solve are exponentially more important for a public institution or a government institution. The more I learn about machine learning and data science, the more it seems like the most powerful tool for governments and public institutions to scientifically find the optima for whatever it is they're trying to solve. I don't know what you guys think about that.

Andrew: A very topical example now is China and credit scoring, or citizen scoring of people, which must be an algorithmically driven calculation based on a ton of inputs. But earlier Maneesh outlined what's required, and we can touch on those again: high-quality data, voluminous, stored efficiently, a way to access that data, an ecosystem that is relatively open and easy to get into, and then high-quality data scientists. And something we didn't touch on is an environment where you can experiment, because part of what I love about the science in data science is that it is very basic and you need to experiment; sometimes your first iteration of models is really good and sometimes it's terrible. So if you look at governments: how great is the data storage? How accessible is it, even internally, with the need to keep different sectors of the government separate? It's well known that governments are not great at cross-collaboration and cross-communication. Those are real structural challenges. I think some countries do better than others, but it's certainly a challenge.

Maneesh: Yeah, I think data is the real bottleneck there, because many government functions do not do enough to really track all the variables that they would want to keep in control and feed into the models. And we touched upon explainability. Now, when we talk about nations using data science and allocating budgets, if you have these models out there which really decide the outcomes of a country, it starts to take the shape of a very strategic power, a very strategic advantage. It's almost termed the second industrial revolution, right? AI is that wave, and now governments want to tap in, and making those strides really gives them that superpower status. It has that potential, and in the near future I'm sure we will see a certain discriminatory power: some countries made that stride today, took that leap of faith, and invested heavily in data science and machine learning initiatives in government.

Nadiem: I completely agree on the transformative effect. It's just that I'm sometimes amazed, like, how do you actually optimize a public transportation model without data science? How can you...

Maneesh: That's the point. If you don't collect that data, if the public buses do not have something which tracks GPS, you would not be able to.

Nadiem: Exactly. Another perfect example is chronic illness and disease management in the health care system, right? If you're not collecting all of the information from clinics, from medicine being dispensed, how do you figure out whether your interventions are working, whether the capacity of your doctor system is sufficient, or whether certain health behaviors are contributing to certain chronic illnesses versus others? It somewhat frustrates me. If I were in government and had 100% of my recruiting budget, I would definitely spend upwards of 20-30% on data engineers and data scientists. First, get the data: create software that will allow, even compel, people to contribute that data, in a privacy-sensitive way obviously, in a very anonymized way, and be able to collect it. Then have the data scientists start, at least preliminarily, screening these data sets to find some correlative, if not causative, relationships. And there's one distinction I want to separate out here. At some point I learned that there were two separate kinds of functions and departments: one is data engineering, and one is data science. In the beginning of this journey I finally realized, oh, those are two different things. So can you share a little bit about why these two groups are important to have, and what the difference is between them?

Maneesh: We have been talking about how a model gets built, and the fundamental thing you really need to have is the right amount of data, in the right shape, with the right availability. You could have tons of data, but if it's not something you can easily pull into whatever language you are coding in, it literally becomes a bottleneck in the process. Imagine you have streams of data, offline data, a variety of sources of data. All of that really needs to be sanitized and put in a data store where it's easy to play around with; otherwise it becomes a huge bottleneck and hurdle. And then you will always have these constant challenges: hey, we are not tapping these certain variables, and someone has to really pick that up. For example, how users scroll on our app tells us a lot about the discovery aspects of the various services we have. We really need to keep building those pipelines so that there is a constant flow of data, and then, as we talked about, we can have these adaptable models which can be real time. If you do not have that fundamental asset, which comes from the data engineering team, it becomes a huge bottleneck for the data science team to be effective.

Nadiem: Just to give listeners a bit of context on how long it took us: even for a company like GOJEK, at our scale, I would say it took us about a year, a full year, to really get data engineering correct. So if I'm the founder of a newly growing startup, should I not even think about hiring data scientists until I have a very strong data engineering team? Would you agree with that?

Maneesh: Yeah. I personally know a lot of startups which have made the mistake of hiring the data science team first, given the hype, whereas it should always be preceded by the whole data engineering setup, which needs to be in a certain shape; only then can a DS team be formed and be really effective at what they do. So yes.

Andrew: I sometimes think of data engineering and data architecture as a massive mall full of stores and departments, where each store sells something a little bit different. You've got your clothes, you've got your food, and in your food store you may have thousands or hundreds of thousands of products. Data science, and business intelligence more broadly, gets to go into that mall and pick what they need to do whatever it is they do when they leave this proverbial mall. Without that structure, with everything neat and organized... could you imagine if everything in a massive mall were entirely on the floor all at once, and the job of the data scientist was to go and sift through it and find things? Then you have the complexity of real-time data streams like Kafka: what needs to be real time, what do I need to store, and when I store things, how long do we keep them? We have some government regulations where parts of the store, some data points, are kept for 10 years, maybe ad infinitum, and some things we may only store for weeks or months and then let go. And the architecture and the costs associated with it are no small feat.

Nadiem: So just don't jump into it without doing the research, right? These are some of the most expensive talent, some of the most scarce talent, especially in Southeast Asia, and in Asia, period. Actually, they're scarce talent everywhere in the world.

Andrew: Quality data is at the core of just about every company these days, and if it's not, that company is going to struggle in the long term. Getting your engineering set up in a way that you can store the right data for the right amount of time is critical to getting off the ground.

Maneesh: That's an important point; we actually have not touched on data quality. It's always garbage in, garbage out, even if you throw the best state-of-the-art models at it. If your data quality is really...

Andrew: Let's give an example of data quality, because this is something I think is hard to see. If you're a customer, you might think: I've got a phone, my phone talks to GOJEK, the data's there. How does GOJEK have an issue with data quality? Let's give an example of data quality issues.

Maneesh: Think about aggregation in feature engineering. Feature engineering is where we take the raw data set and try to build something which is, in some sense, aggregated and transformed, and then used in the model. Thinking of an example: imagine that you are trying to find someone's location.

Andrew: Location data is a great one for this.

Nadiem: Location of the user or the driver?

Maneesh: Either one. But imagine that signal is very jumpy. It just keeps moving here and there, right?

Nadiem: Because of GPS chip inaccuracy in, say, lower-end Android phones. Yeah.

Andrew: And in big cities with tall buildings. If you're at street level, your GPS is pinging satellites way up in the sky, and the signals actually bounce off buildings, right? By the time they go up into the sky, they're off. So there are error rates on how accurate all of these things actually are.

Nadiem: And the system just assumes it's perfect. Right? The system doesn't know it's flawed. So that's your point about garbage.

Maneesh: And part of the data science process, I think I missed this part earlier, is really to assess the quality of the data. Because if you do not do that, then you are always trying to make sense out of garbage.

Nadiem: So the analogy would be that the data science model sees that suddenly this driver's location pings are popping around, boom, boom, boom, and the deviation of some of these points is so high that the algorithm starts excluding the data points that just don't make sense, the outliers that are off. And if the model doesn't do that, then our allocation system gets completely ruined, because we're allocating based on raw data instead of accurate data.
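
A hedged sketch of that kind of outlier screen; here it drops pings implausibly far from the median of a recent trace, a simplified variant of the deviation test described above, with made-up coordinates and cutoff.

```python
import numpy as np

# Toy screen for a driver's GPS trace: measure each ping's distance from
# the (robust) median of the recent window and drop anything implausibly
# far away. Coordinates, window, and cutoff are purely illustrative.

METERS_PER_DEGREE = 111_000  # rough conversion near the equator (Jakarta)

def drop_jumpy_pings(trace, cutoff_m=500.0):
    pings = np.asarray(trace)                   # rows of (lat, lon)
    center = np.median(pings, axis=0)           # robust "where the driver is"
    dists_m = np.linalg.norm(pings - center, axis=1) * METERS_PER_DEGREE
    return pings[dists_m <= cutoff_m]

trace = [
    (-6.2088, 106.8456),
    (-6.2089, 106.8457),
    (-6.2087, 106.8455),
    (-6.1500, 106.9000),   # a ping that "popped" several km away
    (-6.2088, 106.8458),
]
print(drop_jumpy_pings(trace))   # the jumpy ping is excluded
```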

Andrew: For any of the listeners who want to experience this: on WhatsApp you can hit "share my location", and it actually shows your location but also tells you the error distance. So if I'm sitting at my desk in Pasaraya, it gives me plus or minus 60 meters. If you're thinking about a pickup street-side or curbside, 60 meters is a lot of meters. We're not even just across the street or down the block; sometimes I'm a couple of blocks around the corner on the other side of a building. And that's the raw data that comes to us. That isn't a GOJEK problem; that's an infrastructure problem of where the satellites are, tall buildings, phones, everything.

Maneesh: And just to summarize, I think the important point is to separate the noise from the signal in the data science process. That's really one of the fundamental things to me.

Nadiem: So garbage in, garbage out. Related to that point: for the listeners that don't know, we are invested in by Google, and it's been one of the most productive partnerships we've ever done, both from a product integration perspective and from a knowledge perspective. How do you guys feel about the future of our capabilities now that a variety of product integrations on the back end are happening with Google? How do you think that will impact both our data science and our machine learning capabilities? That's super interesting and exciting, and I'm sure some of our listeners are very interested in finding out more.

Andrew: I think Google's role in the explosion of point-to-point services globally is missed, but it will one day be a case study. To anyone who asks how that's the case: the opening up of mapping data, high-quality mapping data covering almost the entire world. There are point-to-point services in every major city in the world, and I don't know what percent, but a lot of them run on top of Google Maps. Google has made this wonderful mapping system available at pretty good quality everywhere. I think that's the foundation, and what it means for us at GOJEK is that as we look to the future, we get to skip the generation of having to build our own mapping data, as do many companies. So then, in point-to-point logistics, and this is something I think a lot about, when the mapping data GOJEK uses is the same Google Maps data a lot of other people are using, where do we find our advantages? We've touched a lot on improving the customer experience and customer data, and I think increasingly it means we get to focus our data science efforts, and our product management and engineering efforts, on leveraging the GOJEK-specific data we own to improve the experience, not on trying to beat Google Maps on the quality of ETAs and distances and those sorts of things, which is ultimately going to be a challenge for anybody. So the answer to that question, for me, is customer data. I'm sure Maneesh has a take.

Maneesh: Yeah. Think about it from the perspective of optimizing this whole demand and supply situation. Speaking from the customer experience perspective, we have our own personalization models which we optimize, building a score on how customers want to take trips. But the strategic advantage Andrew was mentioning is also about how we smoothen their experiences. How do we use traffic information, how do we use much bigger distance matrices to actually sort and find you the best driver available at a certain point? And internally we have a lot of personalization, which we touched upon earlier, in how we optimize user experiences, even preferences on trips. I think that collaboration gives us a synergy where we can go much further.

Andrew: And I'll give an example, and we're not there yet, but something I dream about is, you know, not going to speak for you and your transportation needs, but I assume you probably take a GOJEK to and from work most days that you're in Jakarta every day,

Nadiem: Every day.

Andrew: And you probably leave about the same time and go home about the same time, but sometimes there may not be cars or bikes where you live. So in a world where we're able to use our historical data to predict where we will need cars and where demand will be at a future point, one hour in the future, one minute in the future, then when we make decisions right in this moment we can take into account: oh, but we will actually need a car in that area in the next five minutes. So while we were going to offer a trip that was 10, 15 minutes away, it's actually a better experience for both customers if we offer that trip to someone else who's closer, because we know what's going to happen in a few minutes. So it's using this voluminous historical data to predict, and then improving the experience through predicting.

Nadiem: Yeah, I can't think of a more powerful combination from a data perspective than GOJEK and Google in this region. I think that's one of the things I can confidently say is one of the most transformative things we can do. Just going back to the public service model: a transportation model in, say, Jakarta, which is so complex and has so many infrastructure changes. The amount of data, and the ability to actually turn that data into insights, between Google and GOJEK, just for a public transit system: those are sufficient data points with which to give really important recommendations to governments, institutions, public transport, and cities.

Andrew: We haven't even touched on that, but the value that GOJEK can bring to governments all over Southeast Asia is instrumental: helping them plan future things as well as identifying current problems in infrastructure. We can highlight intersections where traffic accidents happen too much, or where people have to stop very quickly, all of these things. We have a big role to play in that in the future as well.

Maneesh: Also, as this whole ride-sharing market is on the rise, many users are now choosing convenience over some of these public transportation systems. What that means is that the whole city is fundamentally changing its traffic patterns. Are governments really at the stage where they can respond? Because some of the city has to change with this change in behavior, right? For example, certain buildings have a much higher peak of cars coming into the pickup zone, and maybe that needs an extension, or they need multiple pickup points, those kinds of changes. All of that information currently doesn't really get utilized. And at a very fundamental level, how the whole city moves, and how you can optimize across all these various modes of transport, is something which adds a lot of value to government.

Nadiem: Absolutely. Well, let's hope that in the future we have more time to help not only our own company but also other public institutions, because with the sheer capabilities that we have, it would be a shame if they weren't shared with the rest of the world and were used just for our company's success, which does help a lot of people. But thinking beyond that, I think, is going to be one of the biggest sources of motivation for our employees and for the partnerships that we strike. Guys, it's been a fascinating conversation. Thanks so much for being here.

Andrew: Thanks for having us.

Maneesh: Thank you.

Outro: Hey guys, hope you enjoy the podcast. If you liked it, please hit like, subscribe and follow us on social media. Thanks so much for tuning in.
