In 2020 I authored a blog post entitled “Anatomy of an ML startup.” Drawing on my own experience building data-driven Artificial Intelligence/Machine Learning (AI/ML) products and startups, and advising many AI/ML companies over the years, my premise was that budding technology entrepreneurs are often inspired to solve problems and build products and companies with the latest and greatest AI/ML technology. What they often miss is that data is likely the central factor in the technology and product; the actual AI/ML technology plays a supporting role.
That may be surprising to hear, so let me explain. New AI/ML ideas and technology tend to spread quickly, often becoming a commodity in short order. Access to the latest models and algorithms (toolkits, code, weights, etc.) democratizes rapidly.
Further, training your own AI/ML solutions requires large amounts of data (less so when fine-tuning models), and procuring and massaging the data needed for machine learning can be quite challenging. But data has an even more significant role in venture-capital-backed startups beyond just training models. If you are building a startup that you hope will be backed by venture capital, the defensibility of your solution must be a central factor in your strategy.
Given that machine learning technology typically becomes commoditized relatively quickly through the AI community, academic research papers, and open source software (OSS) (often contributed by big technology companies like Google, Meta, Microsoft, and Uber), AI/ML technology by itself won’t be a very effective moat for your technology and company. This is even more true with today’s large language models (LLMs), commercial and open source alike, available to the masses. As I often find myself saying, rapidly disseminating AI/ML advances lift all boats.
In the new world of pre-trained models, foundation models, LLMs, etc., most investors agree that proprietary data is still, for the most part, the foundation of a moat and a defensibility strategy.
Before delving into the different shapes and forms of data that AI/ML startups leverage, let me say a little more about defensibility. Inexperienced entrepreneurs often look to a patent strategy for defensibility, but patents have not been a solid defensive moat for software startups for as long as I can remember. If a large company appears to infringe on a startup’s ideas and patents, the legal costs required to bring a claim are often insurmountable for a small startup. It comes down to who has deeper pockets.
Also, your defensibility strategy is not an all-or-nothing proposition. Typically, a defensible moat is built over time and consists of several components or layers. An investor will want to hear a convincing story and strategy for building your moat. The sooner the moat is built, and the bigger it is, the better.
Looking back on that original post, much of it is still sound and valid in the rapidly changing worlds of AI/ML and startups. But the marriage of data and defensibility is perhaps a little more complicated than it was just a few years ago.
LLMs have been astonishing both the technorati and the general public with their capabilities for over a year. For a technology entrepreneur, LLMs offer a myriad of powerful product possibilities that would not have been possible even a year ago. But, like previous generations of machine learning technologies, the introduction of LLMs to the technology ecosystem benefits companies of all sizes. For many types of intriguing AI/ML applications, what might have required the resources of a well-funded startup or even a large tech company just a few years ago can often be implemented by a couple of college students in a weekend hackathon.
In the new world of LLMs, the strategy for your moat may need layers: a “moat sandwich,” if you will. As I mentioned, the accumulation of proprietary data has been, and continues to be, one of the most important layers of your strategy. But in this new world, you have to think harder about the other layers of your moat.
In addition to data, one layer of your moat can be domain knowledge and expertise. If your startup is in a space that is just starting to adopt AI/ML, and the domain requires special expertise that can take years to accumulate and master, you may have an advantage in building a compelling initial product. That product can get your customer traction going, which is yet another layer of the moat.
Even with a powerful commercial LLM like ChatGPT at your disposal, it can be a significant challenge to coax the desired output from the model. The black art of prompt engineering is still an evolving practice. In a specialized domain, your expertise and investment in domain-specific prompt engineering may help you get better value from existing models than your competitors can, giving you another slice of your moat.
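To make the prompt-engineering point concrete, here is a minimal sketch of how hard-won domain rules can be encoded into a reusable prompt template. The domain (lease abstraction), the guideline text, and all names are purely illustrative, not from any real product:

```python
# Illustrative sketch: a domain-specific prompt template. The domain rules
# below stand in for expertise that takes years to accumulate.

DOMAIN_GUIDELINES = """You are an assistant for commercial-lease abstraction.
- Quote clause numbers verbatim; never paraphrase legal definitions.
- If a term is not present in the excerpt, answer "not specified".
"""

def build_prompt(question: str, document_excerpt: str) -> str:
    """Combine domain rules, the source text, and the user's question."""
    return (
        f"{DOMAIN_GUIDELINES}\n"
        f"Lease excerpt:\n{document_excerpt}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "What is the renewal notice period?",
    "Section 4.2: Tenant shall give 90 days' written notice to renew.",
)
```

The template itself is trivial; the moat, if any, lives in the accumulated guidelines and in knowing which instructions actually steer the model in your domain.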
As has been the case with pre-trained models for several years, LLMs can be fine-tuned with relatively small amounts of additional training data (using techniques such as RLHF, Reinforcement Learning from Human Feedback) to produce proprietary models with better performance. Smaller, specialized, fine-tuned models will become commonplace and will require your technology team's honed expertise. This pattern is similar to the older AI/ML startup playbook. There is still some uncertainty, though. As I write this, it isn't clear whether fine-tuning with RLHF, in-context learning techniques, or clever prompt engineering gives more bang for the buck. Surely, it depends on what you're trying to accomplish. The toolkits and techniques for fine-tuning, RAG (Retrieval-Augmented Generation), in-context learning, and prompting are constantly evolving, further leveling the playing field.
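To illustrate the RAG mechanic mentioned above, here is a toy sketch of the retrieval step: score stored passages against a query and prepend the best ones to the prompt. Real systems use embedding indexes and vector search; the keyword-overlap scoring and all names here are illustrative stand-ins:

```python
# Toy sketch of RAG's retrieval step. Production systems use embeddings and
# a vector database; simple word overlap here just shows the mechanic.

def overlap_score(query: str, passage: str) -> int:
    """Count words the query and passage share (case-insensitive)."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query: str, passages: list[str], k: int = 2) -> list[str]:
    """Return the top-k passages ranked by overlap with the query."""
    return sorted(passages, key=lambda p: overlap_score(query, p), reverse=True)[:k]

corpus = [
    "Renewal notice must be given 90 days before lease expiry.",
    "The security deposit equals two months of base rent.",
    "Pets are permitted with a one-time fee.",
]
context = retrieve("when is the renewal notice due", corpus, k=1)
augmented_prompt = (
    "Context:\n" + "\n".join(context)
    + "\n\nQuestion: when is the renewal notice due"
)
```

The proprietary-data angle shows up here directly: the corpus being retrieved from is exactly the kind of data asset competitors cannot easily replicate, even when everyone has access to the same base model.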
Since data can come in many forms, shapes, and sizes, in part II, I will discuss some of the major types I have encountered and worked with in my career, along with some of the challenges and pitfalls they may bring.