My friends and colleagues know I have long been fascinated by machine learning technology. Although I am not a researcher, I have been applying machine learning technologies to business and consumer problems for almost 20 years, mostly in early-stage startups. In this two-part post, I hope to share my experiences building machine learning startups from the ground up. In sharing my experiences, I intend to highlight some of the big challenges that I have run into repeatedly so that other entrepreneurs can be prepared for these pitfalls if they come up.
What is Machine Learning?
For those of you who aren’t familiar with machine learning, there are many great online resources available to learn about it. In brief, here’s a definition from Wikipedia:
Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks.
Progress in the field of ML in the last five years or so has been both astonishing and exciting. I remember borrowing a book on neural networks from a friend and mentor almost 20 years ago and thinking, “wow, this is pretty cool, and makes intuitive sense.” Upon reporting this to my friend, he was quick to tell me, “ah, nobody uses those things, their time has passed.”
But in this field what was once old often becomes new again. Today, the machine learning technologies built upon neural networks are amazing. Every morning I am excited (and distracted) to look at my personalized news feeds filled with articles on the latest advancements in computer vision, natural language processing and their supporting tools and concepts such as CNNs, contextualized language models, GANs, autoencoders, transformers, transfer learning techniques, and an endless stream of new toolkits and pipelines, including MLOps, ML in the cloud, and ML at the edge. It seems everywhere you look, machine learning technology is transforming our lives with prediction, intelligence and automation.
For developers, entrepreneurs, and those interested in joining the movement, machine learning has become increasingly democratized with each passing year. With today’s wide range of accessible toolkits, it's increasingly tractable for students and practitioners to interact with and apply ML technologies to all kinds of new problems and domains. Not surprisingly, ML-driven technology startups abound.
But there is a catch: I learned a long time ago that quite often, machine learning companies don’t actually do much with machine learning technologies. How can this be? The core of the product and technology is more often about data. The simplest analogy is that an engine can’t run without energy, no matter how high tech it may be. A car engine, for example, must run off of some kind of fuel, whether it’s gasoline or energy stored in a battery. It turns out that procuring data for your machine learning application and company can often be quite challenging, more so than building the actual machine learning technologies that make sense of or extract valuable patterns from that data.
Show Me the Data
When I joined Madrona Venture Labs as CTO in late 2016, I was already a bit fatigued from being pitched big machine learning startup ideas that lacked a data story. The dialog was often as simple as:
Them: So here’s my idea...
Me: Wow, that sounds like a cool idea...how do you plan on doing that?
Them: Deep learning
Me: OK, sure, the latest advancements in DNNs (Deep Neural Networks) should be helpful in solving that problem. What I meant was, where are you going to get the data to power the solution? And if you can get it, can you get enough of it? And, how will you label it (assuming a supervised learning problem), etc.?
Them: Uh...
The ideas I heard that fit this pattern were from many industries. I heard a lot of ideas in healthcare, involving sensitive patient data. I heard ideas that would require collecting data across an industry or a consortium of companies cooperating to share data for the benefit of the greater collective. I heard ideas that would require getting a ton of data from future customers before there was ever a product, service or value to offer them.
It’s not just talking with other entrepreneurs that made me aware of these challenges - I, too, have been through many iterations of data challenges in my own startups and the kind of machine learning companies I work on in my day job. I have tried to get data from sources as varied as scraping proprietary data off of commercial websites, antiquated airline reservation systems, highly sensitive electronic health record systems, and corporate communication data. I have sometimes failed to achieve my objectives, but most times it just took much longer (and chewed up more resources) than I anticipated.
My experiences have shaped my perspective on what starting an ML company really entails. It’s mostly about data. After coming to this realization, I have made it a point to educate other entrepreneurs on this important factor, including baking it into our startup workshops, presentations to aspiring entrepreneurs, and when advising individual entrepreneurs and startups.
In this post, I am going to go deeper into the topic of the role data plays in machine learning startups, the different types and sources of data, and the challenges you may face working with them.
Building a startup is not for the faint of heart, and data acquisition for a machine learning startup can be particularly difficult. But challenges can be overcome, especially if you know how to make strategic decisions about how you build and grow your company.
Know Your Moat
One of the first things to realize and remember about your machine learning application is your data may be your moat, not your ML technology. One of the amazing characteristics about the latest generations of ML technology is that it tends to become commoditized pretty quickly. Oftentimes, not long after a new academic paper is published describing a novel technique that yields a new high mark of performance, the authors (and sometimes large tech companies) make the source code and the training data available. Or, they release a pre-trained model that can be customized or fine-tuned with transfer learning techniques.
This isn’t to say you can’t come up with your own technology and defend it with a patent, but it is more likely that most startups will try to leverage the current state-of-the-art ML techniques available in open source and apply it to a unique business problem and customer. Your access to data, your head start on amassing data, and the cycle of improving your product through continuous collection of data may end up being your protection from your competitors.
Not All Data Are Created Equal
Let’s look at a few sources of data commonly used to power machine learning startups and review some of the challenges, and their corresponding pros and cons.
My Customer Data
“My customer” data refers to the startup’s own data about their customers, usually the behavior your customers exhibit by interacting with your product. Although this could be any type of behavior as measured by your telemetry and user metrics technologies, it is often the purchasing behavior of your customers, assuming you have an e-commerce site of some sort. Once amassed over time, this data is rich and powerful.
For proof of this power, just think about the insight and predictions Amazon, Google and Facebook are able to glean about their customers based upon a continuous cycle of collecting and analyzing this behavioral data over time. These companies have also been at the forefront of leveraging machine learning technologies to extract the incredible value that is inherent to this data.
Interestingly, some of these companies have also been extremely generous with releasing their ML technologies to the community. One might wonder, why give that technology away to potential future competitors? The answer points to their current state-of-the-art machine learning technology being ultimately less valuable than the data they have amassed and the data moat they’ve created.
For a startup, it can be hard to call yourself an ML company when you can’t apply ML technology until you have accumulated enough historical data on your users and customers. Investors will recognize this chicken-and-egg problem and also that every startup these days is calling itself an ML company. I sometimes call these startups “tech-later,” meaning that once any company has been in business long enough, there will be data and analytics value creation opportunities, but it might take a long time to get there.
In order to get to that mass of data, where you can start applying machine learning technologies to give your product that extra edge, you’re going to need to build momentum by some other means. The application of collaborative filtering and recommendation technology, for example, can’t help you until you have enough observations of your users’ behavior. Or perhaps you’re developing a proprietary matching algorithm that will power your two-sided marketplace application. Again, you’ll need plenty of data for it to be effective. Before you jump into this arena, think about how you're going to bootstrap your application before you have a critical mass of customers and data.
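To make the cold-start point concrete, here is a minimal sketch of item-to-item collaborative filtering on purchase data. It is an illustration only, not how any particular company does it: the function names, the min_support threshold, and the toy purchase histories are all assumptions made up for this example. The idea it shows is that with only a handful of observations, no item pair has enough evidence to support a recommendation.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def item_similarities(purchases, min_support=2):
    """Cosine similarity between items from binary purchase data.

    purchases: dict mapping a user id to the set of items that user bought.
    Item pairs bought together by fewer than min_support users are dropped,
    because a similarity built on one or two observations is mostly noise.
    (min_support is a hypothetical knob for this sketch.)
    """
    item_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for items in purchases.values():
        for item in items:
            item_counts[item] += 1
        for a, b in combinations(sorted(items), 2):
            pair_counts[(a, b)] += 1

    sims = defaultdict(dict)
    for (a, b), together in pair_counts.items():
        if together < min_support:
            continue  # not enough evidence yet
        score = together / sqrt(item_counts[a] * item_counts[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims


def recommend(user_items, sims, top_n=3):
    """Rank items the user does not own by summed similarity to owned items."""
    scores = defaultdict(float)
    for owned in user_items:
        for other, score in sims.get(owned, {}).items():
            if other not in user_items:
                scores[other] += score
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Made-up purchase histories: a brand-new product vs. one with some traction.
early = {"u1": {"tent", "stove"}, "u2": {"tent"}, "u3": {"boots"}}
later = {
    "u1": {"tent", "stove"},
    "u2": {"tent", "stove"},
    "u3": {"tent", "stove", "lantern"},
    "u4": {"tent", "lantern"},
    "u5": {"boots"},
}

early_recs = recommend({"tent"}, item_similarities(early))
later_recs = recommend({"tent"}, item_similarities(later))
print(early_recs)
print(later_recs)
```

With the early data the recommender has nothing to say to someone who bought a tent; with the later data it can suggest a stove and then a lantern. Until you reach that point, the product has to earn its users some other way.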
Operations Data
This kind of data could be a customer’s server log files or it could be their communication data (chat, email, calendar), their financial operations data or any other data collected via the use of SaaS core systems of record applications. This is a really exciting type of data that in the past may have just been lying around, serving a single purpose (e.g. for communications or operations). Enterprises are just starting to realize there is value in this data that can be extracted with the latest machine learning technologies, especially unstructured data and NLP technologies. These days, almost everything a company does as part of its operations leaves a trail of data somewhere that may contain value if it can be collected, analyzed, and presented to a user as actionable insights, answering questions like, “how can we run our company more efficiently?”
One of the biggest challenges of leveraging this kind of data for your product is data sensitivity and security. You may convince a few pilot customers that you have the machine learning know-how to extract actionable insights from their “exhaust” data. But do they trust you and your startup to handle and secure their data appropriately? And does the mere fact that someone or something is looking at this data feel like surveillance? Imagine that the CISO at a Fortune 500 company you’re hoping to land as a pilot customer (one that may unlock your seed round) agrees to engage with you, but then sends you a 200-item questionnaire that makes your heart stop, asking among other things, “please describe the composition of your change control committee” (my personal favorite). This can be quite scary.
You may be a 5-person startup with minimal cloud infrastructure and resources to create a highly secured environment, but building security into your culture from day one is very important. Having worked on several of these types of startups, I know these security questions and scrutiny will come. SOC 2 compliance audits are becoming an increasingly common request of vendors for these reasons. Get ahead of these requirements and don’t be caught off guard. If your startup is going to play in this data space, plan on (and budget for) getting a SOC 2 certification (offered by MVL company, Strike Graph) in your first year of operations and make sure your initial infrastructure is simple and secure, including adhering to best cultural security practices (MFA, no data outside the cloud on laptops, strong password disciplines, encrypted data at rest, etc.).
One more tip on operations data. If you land pilot customers, and you get through their security audit, you might think you’re good to go. Not quite. The data may be behind a number of secure APIs that you will need to build “connectors” to access. Not the end of the world, but if there are many types of data sources you need to access, your connector strategy may become complex and resource intensive. Given limited resources, think carefully and strategically about the first few connections you build (or buy) given your early pilot customers’ requirements.
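One way to keep a connector strategy from sprawling is to have every connector emit the same record shape, so that everything downstream is written once. The sketch below is only an illustration of that design choice: Record, Connector, the two in-memory connectors, and ingest are hypothetical names invented here, and the in-memory classes stand in for real API clients that would need authentication, paging, and rate limiting.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Record:
    """Common shape every connector emits (hypothetical for this sketch)."""
    source: str
    kind: str
    body: str


class Connector(ABC):
    """One subclass per customer data source."""
    name: str

    @abstractmethod
    def fetch(self) -> Iterator[Record]:
        """Yield records from the underlying system."""


class InMemoryChatConnector(Connector):
    """Stand-in for a real chat API client."""
    name = "chat"

    def __init__(self, messages: Iterable[str]):
        self._messages = list(messages)

    def fetch(self) -> Iterator[Record]:
        for text in self._messages:
            yield Record(source=self.name, kind="message", body=text)


class InMemoryCalendarConnector(Connector):
    """Stand-in for a real calendar API client."""
    name = "calendar"

    def __init__(self, event_titles: Iterable[str]):
        self._event_titles = list(event_titles)

    def fetch(self) -> Iterator[Record]:
        for title in self._event_titles:
            yield Record(source=self.name, kind="event", body=title)


def ingest(connectors: Iterable[Connector]) -> List[Record]:
    """Pull from every connector into one uniform list of records."""
    records: List[Record] = []
    for connector in connectors:
        records.extend(connector.fetch())
    return records


records = ingest([
    InMemoryChatConnector(["ship it friday?", "need sign-off first"]),
    InMemoryCalendarConnector(["Release review"]),
])
print([(r.source, r.kind) for r in records])
```

Each new data source then costs one new subclass rather than a change to the whole pipeline, which makes it easier to see what your first few pilot customers will actually require you to build.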
Industry Data
I’ll forever think of airline data when I think of this data type. I was involved early in a startup called Farecast, where we predicted changes (future price direction) in airfares. We relied heavily on a complex stew of data from individual airlines, some of which was filtered through a 3rd party organization and ultimately computed into a priced airfare a consumer could buy. We collected a lot of this data over time in order to have an historical view of price changes.
It turned out that there were only a few companies in the world who had the knowledge, expertise, and compute resources to cook up the amount of data we needed daily. As a defensive measure, we ended up spending a lot of time and effort building our own expertise in the construction of airline data and pricing in order to make the most of the data we were purchasing and to engineer appropriate features for our ML algorithms.
At Farecast, we ended up having a number of data vulnerabilities that we hadn't fully considered at the outset of our journey: 1) procuring the data, particularly in the volume we needed it, 2) the cost of the data (I’ll just say the cost was non-trivial and a significant portion of our operating budget), and 3) the need to build up an historical cache of data, which takes time and may need to cover at least a few cycles of whatever signals you are trying to predict or forecast.
Be careful with this type of data. If you’re able to get your hands on it, which might be more difficult if you are an industry outsider, remember that your competitors will probably be able to get it too. The fact that you’ve managed to crack the data procurement challenge will be a signal to other startups that they’ll be able to do it as well.
Public (or Quasi Public) Data
There is a ton of public “open source” data out there that is released regularly by governments and public and private institutions alike. The last 10 years have seen a deluge of this kind of data released to the public (e.g. crime data, census data, medical research data, etc.). What is nice about this kind of data is it is typically free, and there can be large volumes of it available. The challenge can be that not only is the data accessible to you, but it is also accessible to your competitors. Remember you need a moat. With this kind of data, your edge may be your cleverness, perhaps figuring out how to combine or mash up multiple data sources (even mixing in proprietary data) in a unique and powerful way, and then applying your ML technology solution.
One experience I had with this type of data was at a startup I co-founded called Medify. At Medify, we mined medical research literature with NLP/text-mining technologies to extract signals about what treatments and conditions were studied, the patient demographics, and the outcome of the studies. We rolled up this data across thousands of studies and were able to present the user with a powerful tool to see aggregate information on treatments, their applicability and efficacy across a wide swath of patient cohorts. The data was freely available from PubMed (as were complex medical taxonomies supported by the NIH), was semi-structured, and provided in lovely XML format. The challenge was ultimately deciphering the unstructured parts, which was natural medical language. That’s where the technology challenge came in.
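To show where the easy part ends and the hard part begins, here is a small sketch that pulls the structured fields out of a PubMed-style XML record with Python's standard library. It is not Medify's actual pipeline. The element names follow PubMed's XML export format as I understand it, and the sample record (PMID, title, abstract, and MeSH term) is entirely made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up record in the style of a PubMed XML export (not real data).
SAMPLE_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <ArticleTitle>Example trial of drug A for condition B</ArticleTitle>
        <Abstract>
          <AbstractText Label="METHODS">We enrolled 120 adults aged 40 to 65.</AbstractText>
          <AbstractText Label="RESULTS">Symptoms improved in the treatment group.</AbstractText>
        </Abstract>
      </Article>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName>Hypothetical Condition B</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_articles(xml_text):
    """Extract the structured fields from each article in the XML."""
    root = ET.fromstring(xml_text)
    articles = []
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        # Abstract sections keyed by their label; unlabeled text gets a default key.
        abstract = {
            node.get("Label", "UNLABELED"): "".join(node.itertext()).strip()
            for node in citation.iterfind("Article/Abstract/AbstractText")
        }
        mesh_terms = [
            node.text
            for node in citation.iterfind("MeshHeadingList/MeshHeading/DescriptorName")
        ]
        articles.append({
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("Article/ArticleTitle"),
            "abstract": abstract,
            "mesh_terms": mesh_terms,
        })
    return articles


articles = parse_articles(SAMPLE_XML)
print(articles[0]["title"])
print(articles[0]["abstract"]["METHODS"])
```

Getting this far is the lovely part. The sentence "We enrolled 120 adults aged 40 to 65" is still just a string, and turning it into a cohort size and an age range across thousands of differently worded studies is the text-mining problem described above.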
Again, try to come up with creative mashups of multiple data sources (perhaps some public and some proprietary) that ultimately give your data a proprietary edge. If you’re developing a supervised ML solution, the proprietary part may be your labeling strategy or process. This won’t hold off competitors forever, but may help give you a head start.
Personal Health Information (PHI)
Unfortunately, it’s difficult to enumerate all of the different kinds of data you may encounter and need as you build your startup, but one more type that’s important to mention is healthcare data, especially PHI data (protected health information or personal health information). In some respects, this is the holy grail of data.
Healthcare system data - doctor visits, medical tests, procedures, etc. - is all recorded somewhere. And no doubt, if you could get your hands on a lot of it, you could likely build some very helpful machine learning applications for humanity. But this type of data can be very difficult to procure. And it is not just other people’s PHI that is difficult to get. Our own PHI data can be difficult to obtain, even though we are legally entitled to copies of it if we so choose via the HIPAA and HITECH Acts.
Although HIPAA was meant to protect and ensure the privacy of individuals, it has certainly played a role making access to PHI data challenging and difficult. Of course there are other complications and friction in the healthcare industry that contribute to the lack of accessibility of data for technology applications, including the motivations of the two largest EHR companies and arguably the healthcare industry culture in general.
I have wandered into the healthcare space before and saw a myriad of problems I could solve with the latest and greatest technologies. But like many other technologists before and after me, I found obtaining compelling data to be very challenging and time consuming. I’ll also say that it’s not impossible to get sensitive PHI data. I know of startups that have gotten over this hump. But it can take time and perseverance to build relationships and trust inside of organizations that can facilitate the access to the data you need. If you’re going to build a machine learning app in this space, be prepared to make your resources last so you can have the time you need to get access to the data. Also, more so in healthcare than other verticals, you should seriously consider having a co-founder with domain expertise who can facilitate the key relationships you’ll need to build in order to get your foot in the door.
Before finishing up part I of this post, I want to mention that procuring data is usually not a one-and-done endeavor. Startups should be thinking about a virtuous cycle of data, as articulated by Soma Somasegar and Daniel Li of Madrona Venture Group. The idea is that given your product is data dependent, the more data you collect, the better your product will be. The better your product is, the more popular it will become, and the more data you will collect.
This concept might be easiest to understand in the context of the “My Customer” data type described above, assuming you have a consumer application (although the concept can apply to enterprise applications as well). By iteratively incorporating new data into your product, making it a better product at each turn, you are building layers of value that can be part of your moat and defensibility story. Be sure to articulate to your investors how you plan to exploit the virtuous cycle of data to the benefit of your product and customers.
Your data strategy will be of paramount importance as you consider starting your ML startup. In part II, I’ll dig into some experiences you may run into once you are able to procure your first bits of data and the challenges you may encounter when you run your first algorithms and get your initial results.