Data-Centric AI — The Rise Again of the Data Engineer

Published in Walmart Global Tech Blog, Nov 18, 2021

DataOps vs MLOps

Around 2017, Maxime Beauchemin wrote The Rise of the Data Engineer, which covers the history of the data engineer and explains why the role matters more than it used to. I learned a lot from that post and highly recommend it, even though it is now four years old.

Earlier this year, Andrew Ng gave a talk, From Model-Centric to Data-Centric AI, which shows how important data is for machine learning (ML), arguably even more important than the ML model itself.

Data engineers build tools, infrastructure, frameworks and services, all of them centered on data, so data engineering will play a significant role in the data processing that data-centric AI depends on.

Agenda

Data-centric AI versus Model-centric AI
Big dataset versus Good dataset
DataOps versus MLOps
Data engineer versus ML engineer
Why data engineering and how it will impact the data engineer role for MLOps?
Summary

Data-centric AI versus Model-centric AI

Figure 1 — Source credit: MLOps: From Model-centric to Data-centric AI by Andrew Ng — Reference 1

Figure 1 shows the difference between model-centric AI and data-centric AI. In AI research, most of the effort goes into improving the model or applying different models to new areas. Now that many robust neural network models have been developed, it is no longer easy to improve performance through model tuning alone.

Figure 2 — Source credit: MLOps: From Model-centric to Data-centric AI by Andrew Ng — Reference 1

In Figure 2, the model-centric approach barely improves the performance metric, while the data-centric approach improves it noticeably more. So let’s focus on the data itself.

Big dataset versus Good dataset

I don’t prefer a big dataset over a good dataset, or the other way around. To me they complement each other: a big dataset gives you more to select or filter a good dataset from, and a good dataset reduces the amount of data the model needs.

In some areas of ML, more data always helps. In others, such as speech recognition, it does not, as Andrew mentions in the talk referenced above. High-quality data, on the other hand, always helps modeling.

That raises another question: how much data is enough for the model? I would say that data meeting the two conditions below is ready for a first stage of use.

The data should be well distributed, meaning it covers the cases the model needs to handle. When coverage is poor, more data is needed; a big dataset lets you select the data that provides the needed coverage.
There should be enough high-quality data that contains both the input x and the label y.
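The two conditions above can be turned into a quick check. The following is only a minimal sketch: the record layout, the `coverage_report` helper, the class names and the `min_per_class` threshold are all assumptions made for illustration, not something prescribed by the talk.

```python
from collections import Counter

def coverage_report(records, expected_classes, min_per_class=2, label_key="label"):
    """Summarize label presence and per-class coverage (hypothetical helper)."""
    # Condition 2: every usable example needs both an input x and a label y.
    labeled = [r for r in records if r.get(label_key) is not None]
    counts = Counter(r[label_key] for r in labeled)
    return {
        "total": len(records),
        "missing_labels": len(records) - len(labeled),
        # Condition 1: coverage, measured here as examples per expected class.
        "class_counts": {c: counts[c] for c in expected_classes},
        "underrepresented": [c for c in expected_classes if counts[c] < min_per_class],
    }

# Illustrative records; the field names and classes are made up.
records = [
    {"x": "clip_001", "label": "quiet"},
    {"x": "clip_002", "label": "quiet"},
    {"x": "clip_003", "label": "car_noise"},
    {"x": "clip_004", "label": None},
]
print(coverage_report(records, expected_classes=["quiet", "car_noise", "cafe_noise"]))
```

Classes that show up as underrepresented are the ones where collecting or selecting more data is likely to pay off.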

DataOps versus MLOps

Before we dive into DataOps versus MLOps, let’s walk through the end-to-end data flow of a data product.

Figure 3 shows three areas that the two kinds of data product have in common: data ingestion, data quality checks and data transformation/loading. These three areas make sure we get high-quality raw and catalog data. The data then flows into two branches: a data processing product and an ML product.
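As a rough illustration, the three shared stages can be sketched as three small functions. This is a toy example under stated assumptions: the CSV layout, the column names and the rule that a row is valid when `amount` parses as a number are all hypothetical, and a real pipeline would run on a proper ingestion and orchestration stack.

```python
import csv
import io

# Hypothetical raw input; row 2 has a missing amount.
RAW_CSV = """order_id,amount
1,19.99
2,
3,5.00
"""

def ingest(text):
    """Data ingestion: read raw rows as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(text)))

def quality_check(rows):
    """Data quality check: split rows into valid and rejected."""
    valid, rejected = [], []
    for row in rows:
        try:
            float(row["amount"])
            valid.append(row)
        except (TypeError, ValueError):
            rejected.append(row)
    return valid, rejected

def transform(rows):
    """Data transform/loading: cast to typed records ready for downstream use."""
    return [{"order_id": int(r["order_id"]), "amount": float(r["amount"])} for r in rows]

valid, rejected = quality_check(ingest(RAW_CSV))
clean = transform(valid)
print(len(clean), "clean rows,", len(rejected), "rejected")
```

Both branches, the data processing product and the ML product, would start from the same `clean` records.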

A data processing product has two more components, data denormalization and speed-layer loading, which both support high-performance analytics for end users. Once in production, these components need to be supported by DataOps.

DataOps (data operations) is an agile, process-oriented methodology for developing and delivering analytics.

An ML product has four stages: model training, model evaluation, model validation and model prediction. Once in production, these stages need to be supported by MLOps.

MLOps or ML Ops is a set of practices that aims to deploy and maintain ML models in production reliably and efficiently.

Data engineer versus ML engineer

Data engineers and ML engineers are both software engineers who specialize in data, and their work overlaps a lot.

A data engineer is focused on the infrastructure and workflows powering an organization’s data.
ML engineers do similar tasks to data engineers, but for ML models.

Both work on pipelines and infrastructure. Let’s take a closer look at each role.

From bottom to top, a data engineer normally has the following responsibilities:

Build the infrastructure for the whole data flow.
Build, optimize and maintain data pipelines.
Build the system to deliver the data product.
Data modeling.

From bottom to top, an ML engineer normally has the following responsibilities:

Build the infrastructure for the ML system.
Build, optimize and maintain ML pipelines for data science models.
Scale data science models to production level.

Both roles involve programming, monitoring, infrastructure, pipelines, generating high-quality data and more. Data engineers need to know more about data modeling for databases (structured or unstructured). ML engineers, on the other hand, need to know more about ML modeling and need more business acumen.

Why data engineering and how it will impact the data engineer role for MLOps?

Data-centric AI and MLOps are a new world. In the current industry, there are no boundaries.

If you are a data engineer, it seems natural to work on generating consistently high-quality data in a systematic way, monitoring data for concept drift and data drift, and so on. But is that true? The answer is fifty-fifty. High-quality data and data monitoring are indeed the data engineer’s responsibility, but there is a subtle difference between the data in data engineering and the data in data science modeling.

The data used in data science modeling contains the input x and the label y. Today, data engineers work mostly on preprocessing, which is part of input x. But getting or generating consistent labels is the other side of data quality.
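One simple way to look at label consistency is to find inputs that appear more than once with different labels. The sketch below assumes examples arrive as `(x, y)` pairs with hashable inputs; the `find_label_conflicts` helper and the sample file names are hypothetical.

```python
from collections import defaultdict

def find_label_conflicts(examples):
    """Return inputs that were given more than one distinct label."""
    labels_by_input = defaultdict(set)
    for x, y in examples:
        labels_by_input[x].add(y)
    return {x: sorted(ys) for x, ys in labels_by_input.items() if len(ys) > 1}

# Illustrative data: the same image was labeled twice, inconsistently.
examples = [
    ("scan_17.png", "invoice"),
    ("scan_17.png", "receipt"),
    ("scan_02.png", "receipt"),
]
print(find_label_conflicts(examples))
```

Conflicting inputs are good candidates for relabeling or for clearer labeling instructions.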

Data engineers who want to do more MLOps work shouldn’t stop at preprocessing the data; they also need to engage more with the whole ML system, as follows.

Dive deeper into the data quality of the data science model.
Create more reusable components that can be migrated into the ML system or pre-ML system.
Turn systems into microservices.
Learn basic ML models and systems if needed.
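To make the idea of reusable components concrete, here is a small sketch in which the same cleaning steps serve both a data engineering pipeline and an ML pipeline. Everything in it is an assumption for illustration: the step functions, the `run_steps` helper, the `SHARED_STEPS` list and the field names.

```python
def strip_whitespace(record):
    """Trim surrounding whitespace from every string field."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def lowercase_field(field):
    """Build a step that lowercases one string field, if present."""
    def step(record):
        out = dict(record)
        if isinstance(out.get(field), str):
            out[field] = out[field].lower()
        return out
    return step

def run_steps(records, steps):
    """Apply each step, in order, to every record."""
    result = []
    for record in records:
        for step in steps:
            record = step(record)
        result.append(record)
    return result

# One definition of the cleaning logic, shared by both systems.
SHARED_STEPS = [strip_whitespace, lowercase_field("category")]

warehouse_rows = run_steps([{"category": " Grocery ", "amount": 12.5}], SHARED_STEPS)
training_rows = run_steps([{"category": "GROCERY", "label": 1}], SHARED_STEPS)
print(warehouse_rows, training_rows)
```

Because both pipelines import the same steps, a fix to the cleaning logic reaches analytics and model training at the same time.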

The team gets the following benefits when the data engineer works on MLOps:

Closer connections between the data engineering system and the ML system, which reduces duplicated data work.
The data engineer’s strong programming and big data skills, which can improve the performance of data processing.
The data engineer’s strong system infrastructure skills for data processing.

Summary

Data engineers can contribute a lot to data-centric AI, DataOps and MLOps: not just data quality, monitoring and data processing, but also ML pipeline performance and ease of adoption.