Training Production-Grade Machine Learning Pipelines

A few thoughts on how machine learning models can be scaled, stored, and used in production applications.

Choosing, training, and testing the right machine learning classifier is a difficult task: you have to preprocess and analyze your dataset’s features, possibly extract new features, tune hyperparameters, and perform cross-validation, just to name a few components of a typical machine learning problem. After you’ve trained and tested a reliable classifier, it’s ready to be deployed to serve new predictions at scale. Machine learning systems that are trained on a massive amount of data coming from a variety of sources can be hard to maintain and scale up. This post collects a few of my thoughts on deploying a machine learning architecture, specifically using Amazon Web Services.

The Multi-Model Architecture

Our machine learning system has to be capable of a few different tasks:

  • It needs to efficiently store data, as well as pull data from several different sources.
  • It should be capable of automatically re-training and testing itself. Since new data is always flowing to our system, it’s probably not a good idea to train our model only once on an initial dataset.
  • The time-consuming training phase should occur offline. When the model is trained, it should be deployed such that any arbitrary event can trigger it.
  • A user-friendly interface is essential for developers to manage the training, testing, and deployment phases of the machine learning system.

For the above reasons, I’ve found the tools and infrastructure offered by AWS to be very helpful. Specifically, I’ll be talking about how we can use EC2, RDS, S3, and Lambda to build out a production-grade architecture.

The Architecture

Our architecture is composed of many pieces that interact with each other to train, deploy, and store our machine learning models. Here’s an overview of how our architecture could work, with details to follow:


Let’s review this model piece by piece.

Storage Components

This model uses two storage components: RDS and S3. RDS (Relational Database Service) is a managed relational database in the cloud, and acts as our data warehouse: we can efficiently query for data when we are testing or training our model. S3 (Simple Storage Service) will store our machine learning models as serialized data transfer objects. We’ll send these objects to other components when they need to be used or updated. Here’s how a serializable Neural Network object could be represented - using C#’s DataContract paradigm:
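As a rough Python analogue of that idea (not the C# DataContract version), here is a minimal sketch of a model object that round-trips through JSON, which is the kind of payload that could be stored as an S3 object. The class name `NeuralNetworkModel` and all of its fields are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class NeuralNetworkModel:
    """Hypothetical data transfer object for a small feed-forward network."""
    name: str
    layer_sizes: List[int]            # e.g. [2, 1] for two inputs and one output
    weights: List[List[List[float]]]  # one weight matrix per layer
    biases: List[List[float]]         # one bias vector per layer

    def to_json(self) -> str:
        """Serialize to a JSON string that could be uploaded as one object."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "NeuralNetworkModel":
        """Rebuild the object from its JSON form."""
        return cls(**json.loads(payload))


model = NeuralNetworkModel(
    name="demo-net",
    layer_sizes=[2, 1],
    weights=[[[0.5, -0.25]]],
    biases=[[0.1]],
)
# Round trip: serialize, then deserialize into an equal object.
restored = NeuralNetworkModel.from_json(model.to_json())
```

Keeping the object to plain lists and numbers means any component that can read JSON can load the model, regardless of the language it is written in.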

Offline Training

Training highly accurate machine learning models on a lot of data can take a really long time. The training phase should occur offline (i.e., separate from our application’s use of the model) and on separate hardware. Training is typically a CPU/GPU-intensive process, so dedicated hardware can shorten training times while keeping the training concern separate from your application. Amazon EC2 (Elastic Compute Cloud) provides compute power in the cloud as a service - you can launch new instances when you need them, and terminate them when finished (such as when all your models are trained). This lets you scale your compute resources up and down quickly.

We can delegate the process of training our machine learning model to EC2. EC2 will be responsible for pulling data from RDS, training a model, testing and validating it, and sending that model to be stored in S3. Additionally, we’ll need to retrain our model as new data becomes available. To do this, we can use a popular queue-based paradigm to manage the training jobs we need to get done - this is the “Training Request Queue” in our model above. Requests for training or re-training a model can be generated by our application when enough new data becomes available. Here’s what a serializable request object might look like:
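A training request could be sketched in Python along these lines. Every field name here, and the `should_retrain` threshold policy, is an assumption for illustration; the point is only that the message carries enough information for a worker to find the data, pick a classifier, and know where to store the result.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Dict


@dataclass
class TrainingRequest:
    """Hypothetical message placed on the training request queue."""
    model_name: str       # which model to train or retrain
    classifier_type: str  # e.g. "neural_network"
    data_query: str       # query the worker runs against the database
    output_key: str       # where the trained model should be stored
    hyperparameters: Dict[str, float] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "TrainingRequest":
        return cls(**json.loads(payload))


def should_retrain(new_rows: int, threshold: int = 10_000) -> bool:
    """Assumed policy: ask for retraining once enough new rows have arrived."""
    return new_rows >= threshold


request = TrainingRequest(
    model_name="churn",
    classifier_type="neural_network",
    data_query="SELECT * FROM events WHERE label IS NOT NULL",
    output_key="models/churn.json",
    hyperparameters={"learning_rate": 0.01},
)
message = request.to_json()  # this string is what would be enqueued
```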

These requests are lined up in a queue, from which a pool of EC2 instances can pull. An instance then parses the training request, which involves obtaining the data needed from RDS and information about the particular type of classifier needed. After training, the instance sends the new object to S3, and is ready to pull another training request. If there are no more training requests, we can easily terminate the instance so as to not waste compute power.
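The worker loop described above might look roughly like this. To keep the sketch runnable on its own, an in-memory `queue.Queue` stands in for the request queue, a dict stands in for S3, and `fake_fetch` and `fake_train` are placeholder functions; none of these are the real services.

```python
import json
import queue
from typing import Callable, Dict, List


def run_worker(
    requests: "queue.Queue[str]",
    fetch_data: Callable[[str], List[dict]],
    train: Callable[[str, List[dict]], dict],
    store: Dict[str, str],
) -> int:
    """Drain the queue: parse each request, train, and store the result.

    Returns the number of requests handled. The loop exits once the queue is
    empty, which is the point where a real worker could shut itself down.
    """
    handled = 0
    while True:
        try:
            raw = requests.get_nowait()
        except queue.Empty:
            break
        request = json.loads(raw)
        rows = fetch_data(request["data_query"])
        model = train(request["classifier_type"], rows)
        store[request["output_key"]] = json.dumps(model)
        handled += 1
    return handled


def fake_fetch(query: str) -> List[dict]:
    # Placeholder for querying the database.
    return [{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": 4.0}]


def fake_train(classifier_type: str, rows: List[dict]) -> dict:
    # Placeholder "training": least-squares slope of a line through the origin.
    slope = sum(r["x"] * r["y"] for r in rows) / sum(r["x"] ** 2 for r in rows)
    return {"type": classifier_type, "slope": slope}


pending: "queue.Queue[str]" = queue.Queue()
pending.put(json.dumps({
    "classifier_type": "linear",
    "data_query": "SELECT x, y FROM samples",
    "output_key": "models/linear.json",
}))
bucket: Dict[str, str] = {}
handled = run_worker(pending, fake_fetch, fake_train, bucket)
```

Passing the fetch, train, and store pieces in as arguments keeps the loop itself independent of any particular database or storage client.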

Making Predictions at Scale with Lambda

We’ve discussed storing the relevant data and objects we need, as well as training our classifier using EC2. Now, it’s time to use our trained classifiers to serve prediction requests at scale. Lambda is a great option for this. Lambda employs a serverless architecture - you can run code without having to manage any servers or a backend service. All you have to do is upload your code and define when it should be executed, and Lambda will take care of the compute resources needed to run and scale your code.

Our Lambda function can simply be the relevant prediction function from our trained machine learning classifier - a function that takes our classifier’s weights, applies them to the input, and returns the predicted label. It’ll be responsible for loading the serialized model from S3, deserializing it, and outputting the prediction. If we’re training several different machine learning classifiers, we can deploy independent Lambda functions and invoke the relevant one. This way, each function represents a single model that solves a single problem.
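Here is a minimal sketch of such a function in Python, using the `(event, context)` handler shape that Lambda expects. The model format (a weight vector plus a bias, scored with a logistic function), the `MODEL_STORE` dict standing in for S3, and the `features` field on the event are all assumptions for illustration.

```python
import json
import math
from typing import Dict, List, Optional

# Stand-in for S3 in this sketch: maps object keys to serialized models.
# A deployed function would read the object from S3 instead.
MODEL_STORE: Dict[str, str] = {
    "models/demo.json": json.dumps({"weights": [0.5, -0.25], "bias": 0.1}),
}
MODEL_KEY = "models/demo.json"  # hypothetical key for this function's model

# Module-level cache, so repeated invocations in the same warm process
# do not deserialize the model again.
_cached_model: Optional[dict] = None


def load_model() -> dict:
    """Load and deserialize the model once, then reuse it."""
    global _cached_model
    if _cached_model is None:
        _cached_model = json.loads(MODEL_STORE[MODEL_KEY])
    return _cached_model


def predict(model: dict, features: List[float]) -> int:
    """Apply the stored weights to one input and return a 0/1 label."""
    z = sum(w * x for w, x in zip(model["weights"], features)) + model["bias"]
    probability = 1.0 / (1.0 + math.exp(-z))
    return 1 if probability >= 0.5 else 0


def handler(event, context):
    """Entry point: read the features from the event, return the label."""
    return {"label": predict(load_model(), event["features"])}


result_positive = handler({"features": [2.0, 1.0]}, None)
result_negative = handler({"features": [0.0, 4.0]}, None)
```

Note that the function only ever runs the forward pass; all fitting has already happened offline before the model was stored.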

Along with writing the code for our function, we’ll have to define triggers that invoke our function. These can be nearly anything - API requests, updates from S3, or explicit calls. This makes it easy to turn our machine learning applications into several reusable microservices.
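Since different triggers deliver differently shaped events, one function can inspect the event to work out what invoked it. The sketch below assumes three cases: an S3 notification (a `Records` list with an `s3` entry), an API Gateway proxy request (a JSON string under `body`), and an explicit call where the payload is the event itself. The function name `describe_trigger` and the returned dict layout are made up for illustration.

```python
import json


def describe_trigger(event: dict) -> dict:
    """Work out which kind of trigger produced the event."""
    records = event.get("Records")
    if records and "s3" in records[0]:
        s3 = records[0]["s3"]
        return {
            "source": "s3",
            "bucket": s3["bucket"]["name"],
            "key": s3["object"]["key"],
        }
    if "body" in event:
        # The API request body arrives as a string, so parse it first.
        return {"source": "api", "payload": json.loads(event["body"])}
    # Anything else is treated as an explicit call with the payload as-is.
    return {"source": "direct", "payload": event}


s3_event = {
    "Records": [
        {"s3": {"bucket": {"name": "ml-models"}, "object": {"key": "models/demo.json"}}}
    ]
}
api_event = {"body": json.dumps({"features": [2.0, 1.0]})}
direct_event = {"features": [0.0, 4.0]}

from_s3 = describe_trigger(s3_event)
from_api = describe_trigger(api_event)
from_direct = describe_trigger(direct_event)
```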

And that’s it! Having a well-defined machine learning infrastructure to use in production makes it easier to scale up, encapsulate different tasks, and quickly track problems when something’s not working. There’s definitely a lot more to doing machine learning at scale well - such as extracting the right features, preprocessing your dataset, and choosing the right classifier for the task. Thanks for reading!

Written on October 1, 2016