DOING ML EFFECTIVELY AT STARTUP SCALE

Hagakure for MLOps: The Four Pillars of ML at Reasonable Scale

MLOps without too much Ops — Episode 3

Ciro Greco
Towards Data Science
10 min read · Dec 6, 2021


with Andrea Polonioli and Jacopo Tagliabue


“Do not use a bow and arrow to kill a mosquito.”

Confucius

In the previous episode of this series, we explored what we called the "Reasonable Scale" (RS), giving a deliberately loose definition meant to capture the condition in which, we believe, plenty of companies find themselves right now as far as ML is concerned.

In light of the many significant challenges RS companies face, this article articulates a framework based on four philosophical pillars, grounding each principle in real-world problems.

Before putting forward our practical and spiritual guide for the ML practitioner in the ever-changing landscape of modern MLOps, our Hagakure for ML at Reasonable Scale, let us briefly recall the main challenges RS companies face.

The constraints we described in our previous post have countless ramifications when it comes to making actual choices in real companies. This is partially due to the heterogeneous nature of those constraints. An RS company must always factor in a combination of questions about engineering resources, data volume, data quality, team size, use cases, and the monetary impact of those use cases. The exact picture inside a specific organization depends on many factors, and we should expect it to be somewhat unique to that organization, which makes it difficult to come up with a single recipe for success.

Things are not made any simpler by the bewildering number of MLOps vendors out there (see here and here). Although some companies are leaning towards end-to-end solutions that can be adopted across the whole ML stack, such as Databricks, the industry has recently been witnessing a quick and relentless movement towards functional consolidation: a growing number of companies (mostly startups at this point) are developing more mature solutions for each step of the ML cycle, from data warehousing to model deployment. The development of such a landscape is, on the whole, contributing a coherent set of solutions that can be integrated with one another, but the truth is that most players in this space move independently, leaving ML practitioners with the daunting task of choosing their tools à la carte and putting them together in a coherent way.

So, once again, like in every quest for wisdom worthy of the name, we are tasked with developing a principled framework rather than pointing at one specific solution. And since we wholeheartedly subscribe to the idea that spiritual guides are only as good as they are useful in practice, we want to make sure that our framework tackles the real-world problems of RS companies.

The four pillars of ML at Reasonable Scale


Data is superior to modeling.

Log then transform.

PaaS & FaaS is preferable to IaaS.

Vertical cuts deeper than distributed.

一 Data is superior to modeling

The greatest marginal gain for an RS company is always in having clean and accessible data: good data matters more than relative improvements in modeling and model architecture.

In general, the industry seems to have entered a new phase where model capabilities are increasingly commoditized, possibly because we have become so much better at modeling that out-of-the-box models are now compelling enough. This matters because building very competitive models in house has become a bigger investment for RS companies, and the ROI often does not justify it.

In this context, proprietary data flows are crucial from a strategic viewpoint, especially for RS companies. Take, for instance, the data-centric AI framework advocated by Chris Ré and, more recently, by Andrew Ng, which focuses on optimizing datasets to provide a solid foundation where data collection is constrained, training samples are inherently scarce, and iterations cannot be super fast (which, as you might recall from our second episode, are defining traits of RS).

The first consequence of this principle is that data ingestion is a first-class citizen of your MLOps cycle, and getting clean data should be the ultimate goal.

Simply put, data ingestion must happen through a standard: every event conforms to an agreed-upon structure. You can pick a domain-specific protocol that already exists (e.g. Google Analytics for browsing events) or come up with your own standard if your data is somewhat unique to your business.

The bottom line is the same though: two events that are meant to describe the very same thing can never have different structures. For instance, say we are building a recommender system and collecting add-to-cart actions from different e-commerce websites: under no circumstances should an add-to-cart event from Website A be shaped differently than an add-to-cart event from Website B. For every exception to this rule there will be a price to pay sooner rather than later.
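To make this concrete, here is a minimal sketch of what such a standard could look like as a JSON Schema validated in Python. The field names and the use of the jsonschema library are our illustrative choices, not a prescription:

```python
from jsonschema import validate

# An illustrative JSON Schema for add-to-cart events. The field names
# are hypothetical; the point is that every website emits the exact
# same structure, no exceptions.
ADD_TO_CART_SCHEMA = {
    "type": "object",
    "properties": {
        "event_type": {"const": "add_to_cart"},
        "timestamp": {"type": "string", "format": "date-time"},
        "session_id": {"type": "string"},
        "product_sku": {"type": "string"},
        "quantity": {"type": "integer", "minimum": 1},
    },
    "required": ["event_type", "timestamp", "session_id", "product_sku", "quantity"],
    "additionalProperties": False,
}

# An event from Website A must validate against the very same schema
# as an event from Website B.
event_from_website_a = {
    "event_type": "add_to_cart",
    "timestamp": "2021-12-06T10:15:00Z",
    "session_id": "a-123",
    "product_sku": "SKU-42",
    "quantity": 1,
}
validate(instance=event_from_website_a, schema=ADD_TO_CART_SCHEMA)  # passes silently
```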

Closely related to the point on standardization, we can make an ancillary point about strict validation. In general, events should never be silently dropped, even when they do not adhere to the agreed-upon standard format. The system should flag all ill-formed events, send them down a different path, and notify an alerting system. While it is of capital importance that ill-formed events do not end up in our tables, it is just as important to know when something is wrong.
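A sketch of what this routing could look like, with illustrative stand-ins (clean_events, dead_letter, notify) for your actual tables, queues and alerting backends:

```python
from jsonschema import ValidationError, validate

# Illustrative stand-ins for real storage and alerting backends.
clean_events: list[dict] = []
dead_letter: list[dict] = []

def notify(message: str) -> None:
    # Hypothetical alert hook: in practice this could page an
    # on-call system or post to a Slack channel.
    print(f"[ALERT] {message}")

def ingest(event: dict, schema: dict) -> None:
    """Route an event: valid records go to the clean table; ill-formed
    ones are flagged and sent down a dead-letter path instead."""
    try:
        validate(instance=event, schema=schema)
        clean_events.append(event)
    except ValidationError as err:
        dead_letter.append({"event": event, "error": err.message})
        notify(f"ill-formed event quarantined: {err.message}")
```

Nothing is lost: the clean tables stay clean, and the dead-letter path tells you when something upstream is wrong.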

二 Log then transform

A clear separation between data ingestion and processing produces reliable and reproducible data pipelines. A data warehouse should contain immutable raw records of each state of the system at any given time.

As a consequence, the data pipeline should always be built from raw events and implement a sharp separation between streaming and processing. The role of the former is to guarantee the presence of truthful snapshots of any given state of the system; the latter is the engine that prepares data for its final purpose.

The crucial distinction is that the output of streaming is immutable, while the output of processing can always be undone and modified. The idea is somewhat counterintuitive if you were raised with the notion that databases are about writing and transforming data, but it really is quite simple at its core: models can always be fixed, data cannot.

The north star of this principle is replayability. The main question is: can you change something in your data transformations, your queries or your models, and then replay all the data you have ingested since the beginning of time without any major problem (time aside)? If the answer is yes, you have successfully applied this principle to your data infrastructure. If the answer is no, you should really try to turn that into a yes, and where that is not possible, strive to get as close as possible.
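To make the test concrete, here is a deliberately simplified Python sketch of the pattern (in the next episode the same idea is expressed with Snowflake and dbt): the raw log is append-only, and every derived table is a pure function of it, so changing a transformation just means re-running it over the full log.

```python
from datetime import datetime, timezone

# The raw log is append-only: records are only ever added, never changed.
raw_events: list[dict] = []

def log_event(payload: dict) -> None:
    raw_events.append({"ingested_at": datetime.now(timezone.utc).isoformat(), **payload})

# Derived tables are pure functions of the raw log, so they can always
# be dropped and rebuilt ("replayed") from the beginning of time.
def build_cart_counts(events: list[dict]) -> dict[str, int]:
    counts: dict[str, int] = {}
    for event in events:
        if event.get("event_type") == "add_to_cart":
            sku = event["product_sku"]
            counts[sku] = counts.get(sku, 0) + event.get("quantity", 1)
    return counts

# Changed your mind about the transformation? Fix the function and replay.
cart_counts = build_cart_counts(raw_events)
```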

We will talk in more detail about data ingestion and the separation between streaming and processing in the next episode of this series, where we will also share an open-source implementation based on Snowflake + dbt.

三 PaaS & FaaS is preferable to IaaS

Focus is the essence of RS. Instead of building and managing every component of an ML pipeline, adopt fully managed services to run the computation.

The main thing that characterizes RS companies is that they are, in one way or another, resource-constrained when it comes to ML initiatives. When resources are limited, it is good practice to invest them in the most important business problem. For instance, if you are building a recommender system, you want to stay as focused as possible on providing good recommendations. As simple as that.

Of course, an approach that prefers fully managed services tends to be more expensive in terms of hard COGS, but in exchange data scientists can stop worrying about downtime, replication, auto-scaling and so on. All those things are necessary, but that does not mean they are central to the purpose of your ML application, and in our experience the benefit of keeping your organization uniquely focused on its core business problems very often outweighs the benefit of lower bills (within reasonable limits, of course). Maintaining and scaling infrastructure with dedicated people is expensive too, and in the end it can easily turn out to be much more costly than paying for and managing a handful of providers.

Moreover, the worst thing about building and maintaining infrastructure is that the actual costs are rather unpredictable over time. Not only is it extremely easy to underestimate the total effort required in the long run, but every time you create a team for the sole purpose of building a piece of infrastructure you introduce the quintessential unpredictability of the human factor.

How many people will eventually be needed? Will they be absorbed smoothly by the company a year from now? Are moats accidentally being created within the organization because of excessive separation of roles, or because some teams' only purpose is to keep a piece of infrastructure up and running?

Consumption bills have very few positive aspects, but one they certainly have is that they are easy to predict: if you want to project COGS at a larger scale, you can pretty much multiply your current bills by whatever number captures the next stage of your growth path.

四 Vertical cuts deeper than distributed

An RS company does not require distributed computing at every step. Much can be achieved with an efficient vertical design.

This is probably the most controversial statement we make, but stick with us, there is a reason for it. Distributed systems like Hadoop and Spark played a pivotal role in the big-data revolution. Only five years ago, Spark was pretty much the only option on the table for ML at scale. Startups all around the world used Spark (and we were no exception) for streaming, SparkSQL for data exploration, and MLlib for feature selection and building ML pipelines.

However, these systems are cumbersome to work with and hard to debug, and they force programming patterns that are unfamiliar to many scientists, with very negative impact on the ramp-up time of new hires.

The point we want to make is simple: if you are an RS company, the amount of data you deal with does not require ubiquitous distributed computing. With a good vertical design it is possible to do much of the work while significantly improving the developer experience.

The general idea is that your ML pipeline should be conceived as a DAG implemented through a number of modular steps. While some of them might require more computational firepower, it is important to abstract the computation away from the process and from the experience of ML development.

Distributed computing should be used for what it is really for: solving vast and intricate data problems. For everything else, it is more practical to run the steps of the pipeline in separate boxes, scaling up computation in your cloud infrastructure at, and only at, the steps that require it, as Metaflow allows you to do, for example.
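As a minimal sketch of this pattern with Metaflow (step bodies and resource numbers are placeholders, not a real pipeline): every step of the DAG runs as an isolated unit, and only the step that needs firepower asks for it.

```python
from metaflow import FlowSpec, resources, step

class RecommenderFlow(FlowSpec):
    """A toy DAG: only the step that needs firepower requests it."""

    @step
    def start(self):
        # Lightweight step: runs happily on a small default box.
        self.events = [{"product_sku": "SKU-42", "quantity": 1}]  # placeholder data
        self.next(self.train)

    @resources(memory=32000, cpu=8)  # illustrative numbers
    @step
    def train(self):
        # Only this step is scheduled on bigger hardware when the flow
        # runs against a cloud backend (e.g. AWS Batch).
        self.model = {"top_sku": self.events[0]["product_sku"]}  # placeholder "model"
        self.next(self.end)

    @step
    def end(self):
        print("trained:", self.model)

if __name__ == "__main__":
    RecommenderFlow()
```

Run it locally with `python recommender_flow.py run`; the same code can be pushed to the cloud (e.g. with `--with batch`) without the data scientist changing a line.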

In our experience, this shift in perspective has a massive impact on the developer experience of any ML team, because it will give data scientists the feeling of developing locally, without having to go through the hoops and hurdles of dealing with distributed computing all the time.

Love thy developer: a corollary about vertical independence.

The last two points aim at abstracting infrastructure away from ML developers as much as possible. This philosophy has a well-grounded raison d'être.

We want to encourage vertical independence of ML teams. Much of the work in ML depends heavily on the type of problem being solved, so data scientists need to be able to make reasonably independent choices about tooling, architecture and modeling depending on datasets, data types, algorithms and security constraints. Plus, ML systems are not deployed against static environments, so data scientists need to be aware of changes in the data, changes in the model, adversarial attacks, and so on. We favor vertical independence over highly compartmentalized organizations because excessive specialization in this context results in high coordination costs, low iteration rates and difficulty adapting to environmental changes. Finally, we are painfully aware that of all the resources RS companies have to budget for, the scarcest one is good engineers.

We believe that vertical independence makes a great deal of difference when it comes to attracting and retaining critical talent at RS companies. We often hear that data scientists still devote a sizeable portion of their time to low-impact tasks, such as data preparation, simple analyses, infrastructure maintenance and, more generally, jumping through the hoops of cumbersome processes.

In our experience, good engineers and data scientists get excited about doing cutting-edge work with the best tools. A bad developer experience, and not being able to see the impact of their work in production, are among the main reasons why data scientists get frustrated. And we all know that a frustrated engineer is much likelier to reply to recruiters' messages on LinkedIn.

Vertical independence pays off in speed of innovation, minimality, leanness and freedom: the most valuable currency for RS organizations. Give your data scientists the ability to own the entire cycle, from fetching the data to testing in production; make it as easy as possible for them to move back and forth along the phases of the whole end-to-end pipeline; empower them. Do it and they will amaze you by developing reliable applications with real business value.

Shameless cliffhanger for the next post

These are our principles for ML in production at Reasonable Scale. Adopting them allowed us to work successfully at scale while minimizing all the infrastructure headaches we would otherwise have had to deal with. Remember that everything we do is motivated by the constraints RS companies are subject to. The ultimate goal is always to do DataOps and MLOps without too much Ops.

In the next post, we will dive deeper into the first two principles and work out a concrete example of a fully replayable data ingestion and processing pipeline built with serverless functions, Snowflake and dbt.

Acknowledgments

This series wouldn’t be possible without the commitment of our open source contributors:

