Machine learning engineering: the next starting point for enterprise AI

2022-09-24 12:44:56
Tina
Summary: The conversation is finally no longer only about network models and feature engineering, the parts tied to machine learning algorithms themselves. Companies are starting to focus on transitioning machine learning from development environments to production environments, and on forming an effective set of processes that everyone can work around. The emergence of MLOps is an important step toward increasingly mature AI engineering.

The application of machine learning in industry is becoming increasingly popular, and it has become a regular part of software development. The industry's focus is gradually shifting from what machine learning can do to how to manage the delivery of machine learning projects effectively.


However, developing, deploying, and continuously improving such systems is more complicated than traditional software development, such as web services or mobile applications. The good news is that, through continuous practice, the industry has developed a set of agile engineering processes for continuous delivery that everyone can follow and reference.

1. Why do we need "machine learning engineering"?

Data is a real enterprise asset, and future enterprises should be data-driven. Riding the digital transformation trend, enterprises apply machine learning to all kinds of business scenarios for fear of falling behind their competitors. Yet making machine learning deliver value in real-world production is still a complex task: even though the industry can now make algorithm scientists' models work, several challenges remain.


Most systems that support enterprise machine learning today are complex subsystems isolated from the business systems. When people talk about machine learning, many imagine that business data adapts to changes in applications and that machine learning and enterprise software can be seamlessly combined. In reality, there is a large gap between that ideal and the current state: the industry has not yet found a way to integrate machine learning with business systems, and in many cases machine learning remains an external capability bolted onto the business's main processes.

Machine learning code is only a small part of a machine learning application. Whether in autonomous driving, credit card risk control, or image recognition, more than 90% of the work is engineering. Engineers need to deploy the algorithm scientists' models into specific software environments and know how to collaborate with those scientists. In practice, models handed over by scientists often fail when deployed in real-world environments because they cannot adapt to the dynamics of the environment or to changes in the data they were built to describe. Since productionizing machine learning models is a skill separate from algorithm science, a hybrid team is needed to succeed: a data scientist or ML engineer, a DevOps engineer, and a data engineer. Engineers need to maintain model quality in production, retrain production models frequently, and try out new implementations for generating models.


To overcome the challenges of this manual process, MLOps plays an important role in setting up CI/CD systems that quickly test, build, and deploy machine learning training pipelines. By engineering machine learning to automatically retrain and deploy new models, MLOps aims to bridge the gap between engineers and algorithm scientists.
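The test-build-deploy gate that such a CI/CD setup automates can be sketched in a few lines. Everything here is a hypothetical stand-in, not a real pipeline: the function names are invented for illustration, and a toy least-squares fit plays the role of "training a model":

```python
# Minimal sketch of an automated retrain-and-deploy gate (hypothetical
# function names; a real setup would run inside a CI/CD system and use a
# real training framework instead of this toy least-squares fit).

def load_training_data():
    # Stand-in for pulling the latest labelled data from a feature store.
    return [(x, 2 * x + 1) for x in range(100)]

def train_model(data):
    # Stand-in for model fitting: a least-squares fit of y = a*x + b.
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return (a, b)

def evaluate(model, data):
    # Worst-case absolute prediction error over the evaluation set.
    a, b = model
    return max(abs((a * x + b) - y) for x, y in data)

def run_pipeline(max_error=0.01):
    """Test, build, deploy: only models passing the quality gate ship."""
    data = load_training_data()
    model = train_model(data)
    if evaluate(model, data) <= max_error:
        return {"status": "deployed", "model": model}
    return {"status": "rejected", "model": model}
```

The point of the gate is that deployment is a mechanical consequence of passing an automated quality check, rather than a manual hand-off between the scientist and the engineer.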

2. Why is machine learning so difficult to "engineer"?

In traditional software development, "efficient collaboration" and "quality control" are pursued through a team approach, so that a project's quality, schedule, and cost stay under control. This is "engineering" capability. Conversely, when a project is insufficiently engineered, two kinds of situations arise.


The first is reliance on individual activity without collaboration with others, which is a sign of weak engineering. In fact, even in ordinary application development, some team members may emphasize their individual contribution and neglect teamwork, which is also a manifestation of weak engineering. Not all application development is engineered; whenever collaboration is weak and individualism is strong, it is a sure sign that engineering is lacking.


The second is exploratory activity driven by inspiration. Engineered activities are those where we already know where we are going: the thing has been done before, or we know of others in the industry who have implemented it, and we can accomplish it by following established steps and methods and breaking it down into smaller pieces. Exploratory activities are those where we do not know how things will be achieved, do not know the approximate process or sequence, need to keep trying and exploring, and are not even sure we can reach the final goal. Such a situation cannot be fully engineered; it is another manifestation of weak engineering.


Machine learning is a typical non-engineered scenario: it relies heavily on the individual capabilities of the data scientist, and the process is full of exploration. Machine learning requires constant experimentation with the data. Even though deep learning has reduced the need for feature engineering, hyperparameter tuning, the number of network layers, the choice of network architecture, and so on still rely heavily on the scientist's personal experience.

3. Why does machine learning engineering need a new system of tools?

A machine learning project differs from a typical software project, where the code and the data are separated. In a production environment, machine learning is a process of processing data, so code cannot be separated from data as in traditional software. Machine learning software looks as if it were made of code, but the data plays the more important role.


Combined with its weakly engineered nature, machine learning is more difficult from an engineering perspective. But over the past few years, we have taken an experimental discipline that seemed impossible to engineer and turned it into a routine operation that interoperates with other development methods, slowly turning it from an individual activity into a team activity, making its outcomes more and more controllable, and plugging it into the enterprise ecosystem. This is a very interesting process.


As in the traditional software engineering process, the first step in engineering machine learning is to build a collaborative development process. Version control ensures that engineers' artifacts are shared during construction and that there are clear deliverables that let teams hand over work when there is turnover. Many companies and organizations start engineering their development with strict traceability through version control.


The second step in engineering machine learning is to break down the isolation between the development and production environments, so that engineers know how to push data and model updates from development into production, and how to operate and maintain them there. As in traditional development, this turns the work into a collaborative team activity, although the content and tools involved are quite different.


Take version control as an example. In traditional software development, the amount of data under control is not large. With a small amount of code in text form, changes are easy to understand, so traditional version control treats the changed line of code as the unit of change.


But in machine learning this is not the case. Machine learning involves not only code but also models and data. It must focus on data first, and the data need not be plain text: it can be images, sounds, and so on, modified at the level of image pixels or sound clips. Machine learning data never stops changing, and we have no control over how it changes. So instead of defining a version for every change, we can treat the real data as a time-dependent stream and count, say, several days of data changes as one period. The data definition of a version is therefore very different from the code definition of a version, and that is why new tools are needed.
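The period-based versioning idea above can be sketched as content-addressed daily partitions: a dataset version is just the ordered list of hashes of the partitions it was built from. This is an illustrative toy (the record shapes and period granularity are assumptions); tools like DVC implement the idea at scale:

```python
import hashlib
import json

# Sketch: instead of versioning every record change, treat the data as a
# time-ordered stream and version it per period (here, per day). Each
# period's partition gets a content hash, so a dataset version is the
# ordered list of partition hashes it was trained on.

def partition_hash(records):
    """Content-address one period's worth of records."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def dataset_version(partitions_by_day):
    """A dataset version = ordered (day, hash) pairs of daily partitions."""
    return [(day, partition_hash(rows))
            for day, rows in sorted(partitions_by_day.items())]

# Hypothetical click-log records, partitioned by day.
day1 = [{"user": 1, "clicks": 3}]
day2 = [{"user": 1, "clicks": 5}, {"user": 2, "clicks": 1}]
v1 = dataset_version({"2022-09-01": day1, "2022-09-02": day2})
```

Because the hash is computed over the partition's content, identical data always yields the same version, while any change inside a period produces a new hash, which is exactly the coarser-grained notion of "version" the text describes.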

4. Why MLOps cannot simply be analogized to DevOps

Image Source: phData

The machine learning version control tool DVC began to emerge in 2018, and it has since prompted us to rethink data science workflows. Developing code or models is only the first step. The biggest effort goes into making every step work in production, covering version control, model serving and deployment, integration testing, experiment tracking, and more, so that they run repeatably and automatically with minimal intervention.
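For reference, DVC declares such a pipeline in a `dvc.yaml` file as a list of stages, each with a command, its dependencies, and its outputs; `dvc repro` then re-runs only the stages whose inputs have changed. The script names and file paths below are hypothetical:

```yaml
stages:
  prepare:
    cmd: python prepare.py data/raw.csv data/prepared.csv
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/prepared.csv
  train:
    cmd: python train.py data/prepared.csv model.pkl
    deps:
      - train.py
      - data/prepared.csv
    outs:
      - model.pkl
```

Declaring data and models as tracked outputs is what lets the whole chain of code, data, and model be reproduced and versioned together, rather than line by line as in traditional version control.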


Once all of the above engineering has been sorted out, the steps still need to be tied together tightly. This is where process-orchestration tools for continuous delivery come in. Thus the concept of MLOps emerged, making the whole process smoother and more continuous and allowing machine learning to become increasingly engineered.


MLOps is a cross-discipline of DevOps, data science, and software engineering: a set of practices and tools for deploying and reliably maintaining machine learning systems in production. But it is quite different from DevOps, and the two cannot simply be analogized.


In the traditional software development world, DevOps practices can deliver software to production and keep it running reliably within minutes. Before DevOps, software was developed and then handed over to the operations team, which kept an eye on the machines and the rest of the physical execution architecture. In the cloud era, software can start and control machines itself, so developers no longer need an operations person for this; this gave rise to the DevOps movement. DevOps primarily controls machines, and a software upgrade is just seen as a new version to install on a machine.


On the machine learning side, engineering revolves around the code and data used to train neural networks, focusing on tuning the network's parameters, which are then deployed to a production environment for experimentation. Data scientists start with sample data, working in Jupyter notebooks or with AutoML tools, to identify patterns and train models. But there may be no direct correlation between the training-side environment and the execution-side environment. When data science teams try to deploy models into production, they find that real-world data is different, and the same data and methods cannot cope with ever-changing data.


Therefore, developers need to constantly re-collect data, retrain the model or tweak its parameters, and then re-release it. The main problem MLOps addresses is getting trained results into the production environment. The people who design the neural network and the people who run it in production may not be the same, so MLOps is best seen as an automated process that ties everyone's work together.
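One small piece of that automated loop, the decision to retrain, can be sketched as a drift check of live data against the training distribution. The mean-shift test and the threshold below are deliberately simplified assumptions; production systems use richer statistics such as population-stability or Kolmogorov-Smirnov tests:

```python
import statistics

# Sketch of an automatic retraining trigger: flag retraining when the
# mean of live production data drifts too far from the training mean,
# measured in units of the training standard deviation.

def needs_retraining(train_sample, live_sample, threshold=2.0):
    """Return True when live data has drifted past the threshold."""
    mu = statistics.mean(train_sample)
    sigma = statistics.stdev(train_sample)
    drift = abs(statistics.mean(live_sample) - mu) / sigma
    return drift > threshold

# Hypothetical feature values observed at training time and in production.
train = [10.0, 11.0, 9.5, 10.5, 10.2, 9.8]
stable_live = [10.1, 9.9, 10.3]
drifted_live = [25.0, 26.0, 24.5]
```

Wiring such a check into monitoring is what turns "constantly re-collect and retrain" from a manual chore into a process that fires only when the data actually changes.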


MLOps is less concerned with what the machine looks like and more concerned with how the machine learning parameters are partitioned, whether the neural network structure needs adjusting, how the parameters are tuned, and so on.


MLOps cares little about the hardware environment in which it executes. The only real similarity to DevOps is the process of operating and maintaining the final product in production, which is similar in concept but very different in engineering practice.

5. MLOps is just the starting point

"Machine learning engineering is still in its early stages."

——Xu Hao, CTO of Thoughtworks China

The conversation is finally no longer only about network models and feature engineering, the parts tied to machine learning algorithms themselves. Companies are starting to focus on transitioning machine learning from development environments to production environments, and on forming an effective set of processes that everyone can work around. The emergence of MLOps is an important step toward increasingly mature AI engineering.


It is a new and exciting discipline whose tools and practices will likely evolve rapidly. This is still the beginning, not the end; according to Xu Hao, it shows that "we have domesticated another new technology in the engineering community, allowing AI to provide better solutions in specific areas."

 
