Mozrt, a Deep Learning Recommendation System Empowering Walmart Store Associates with a Personalized Learning Experience

Qixin Wang
Walmart Global Tech Blog
12 min read · Nov 4, 2021

Image by Fox Valley Symphony Orchestra

Business Context

Walmart employs nearly 1.6 million associates across more than 4,700 stores in the U.S. Each associate is responsible for a variety of tasks, which are frequently updated and revised based on their daily schedule and assignment. Protocols are set up for each task so that it can be completed professionally and accurately. Providing up-to-date, easily accessible, and relevant information is instrumental to the success of both associates and Walmart. We developed Mozrt, a deep learning recommendation system for the Walmart Academy App, the training content portal for Walmart store and Supply Chain associates. The Walmart Academy App is available on all company-managed mobile and desktop devices. Each time associates log in, Mozrt provides a series of recommendation carousels to help them find the right content at the right time while serving customers on the sales floor.

Figure 1. Mozrt on Walmart Academy App

Proposed Model Architecture

The overall system is a two-stage recommendation system, shown in the diagram below, with two major components: 1) content candidate generation and 2) a content ranking algorithm. Candidate generation quickly filters out contents with a low chance of being selected for the final recommendation and produces a short content list for the next step. The ranking algorithm takes this shortlist as input, re-ranks it, and generates the final recommendation carousel of contents.

A two-stage recommendation system is an effective way to handle very large content candidate pools. For example, with 10,000 contents and 10 milliseconds per prediction score, the ranking algorithm alone would need 100 seconds to rank all contents, which is unacceptable for a real-time application. With a two-stage system, candidate generation filters out approximately 9,980 contents, and the remaining 20 can be processed by the ranking algorithm in 200 milliseconds.

We use collaborative filtering and content-based similarity models as our candidate generation system, estimating each associate’s content needs from their historical views. The content candidates then flow into the deep learning ranking model, which combines associate information, such as job info, work area, and login time/date, with content embeddings from a skip-gram algorithm to predict with high accuracy what the associate needs every time they interact with the Walmart Academy App.

Figure 2. Mozrt model architecture
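
To make the two-stage flow concrete, here is a minimal Python sketch of the control flow. Every helper below is an illustrative stub, not Mozrt’s actual API; in production they are replaced by the real candidate generators and the DeepFM scorer.

```python
from typing import List

# Hypothetical stand-ins for the two candidate generators and the ranker.
def cf_candidates(associate_id: str) -> List[str]:
    return ["content_12", "content_45", "content_78"]   # from collaborative filtering

def nlp_candidates(associate_id: str) -> List[str]:
    return ["content_45", "content_90"]                 # from content-based similarity

def ranker_score(associate_id: str, content_id: str) -> float:
    return (hash((associate_id, content_id)) % 100) / 100.0  # placeholder for DeepFM

def recommend(associate_id: str, top_n: int = 3) -> List[str]:
    # Stage 1: merge the two candidate groups into one shortlist.
    shortlist = sorted(set(cf_candidates(associate_id)) | set(nlp_candidates(associate_id)))
    # Stage 2: score only the shortlist, so end-to-end latency stays real-time.
    return sorted(shortlist, key=lambda c: ranker_score(associate_id, c), reverse=True)[:top_n]

print(recommend("associate_001"))
```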

Model Training Step One: Candidate Generation

Collaborative filtering candidate generation

Item-to-item collaborative filtering is one of the most popular algorithms in the recommendation system space. In Mozrt, we treat each learning content as an “item” and use the click history generated by different associates to locate each item in vector space. We then find and store each item’s top K nearest neighbors. This algorithm selects the first group of content candidates as input for the deep learning ranking algorithm.
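
For illustration, here is a minimal item-to-item sketch with NumPy, assuming a toy item-by-associate click matrix; a production system would use sparse matrices and approximate nearest neighbor search instead of a dense similarity matrix.

```python
import numpy as np

# Rows are contents, columns are associates, cells count clicks (toy data).
clicks = np.array([
    [3, 0, 1, 0],   # content A
    [2, 0, 1, 1],   # content B
    [0, 4, 0, 2],   # content C
])

# Cosine similarity between content vectors in "associate space".
normed = clicks / np.linalg.norm(clicks, axis=1, keepdims=True)
sim = normed @ normed.T
np.fill_diagonal(sim, -1.0)               # exclude self-similarity

K = 1
top_k = np.argsort(-sim, axis=1)[:, :K]   # top-K nearest neighbors per content
print(top_k)                              # e.g., content A's nearest neighbor is content B
```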

Content-based (NLP) similarity candidate generation

Collaborative filtering may not find all contents that are valuable to a Walmart associate. For example, the user click matrix for newly created contents, or contents with little viewing history, is very sparse, which makes it challenging to find accurate locations for these contents in vector space. We therefore use a content-based similarity model to generate a second content candidate group as input for the deep learning ranking algorithm.

TextRank-IDF keywords extraction

For each learning content webpage, we extract a series of keywords with a hybrid text summarization technique (the TextRank-IDF algorithm) and store them.

We treat each word in the learning webpage text as a node in a graph, and any pair of words within a context window is connected by an undirected edge. We denote the context window as [W1, W2, …, Wn] and set the window size to four words, for example [W1, W2, W3, W4].

Figure 3. TextRank algorithm

We use the PageRank algorithm to iteratively calculate an importance score for each “word node,” then adjust the score by its inverse document frequency (IDF). This weighs down words that frequently appear in most learning contents (for example, Walmart, associate, etc.). The words with the highest adjusted importance scores are used as keywords.
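
A simplified sketch of this keyword step, assuming a networkx dependency and a toy two-document corpus; real tokenization, stop-word removal, and the full learning corpus are omitted.

```python
import math
import networkx as nx

# Two toy pre-tokenized "learning pages".
docs = [
    "price change process morning associate scan label shelf".split(),
    "deposit excess cash register afternoon associate count".split(),
]

def keywords(tokens, all_docs, window=4, top_n=3):
    # Connect any two words within a 4-word window [W1, W2, W3, W4].
    graph = nx.Graph()
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window, len(tokens))):
            graph.add_edge(tokens[i], tokens[j])
    scores = nx.pagerank(graph)                       # word-node importance
    # IDF adjustment: weigh down words that appear in most documents.
    n_docs = len(all_docs)
    df = {w: sum(w in d for d in all_docs) for w in scores}
    adjusted = {w: s * math.log(n_docs / df[w]) for w, s in scores.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)[:top_n]

print(keywords(docs[0], docs))   # "associate" appears in both docs, so it drops out
```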

Content similarity calculation

To calculate a content-based similarity score, we compare the keywords from one learning content page with those from a second page, then aggregate the similarity scores of all keyword pairs into one score. Keyword similarities are computed from embeddings generated by a word2vec model trained on the entire Walmart learning content corpus of more than 7,000 articles.

Figure 4. Content-based similarity by keyword pairs. Note: URL_keywords refers to the keywords of the content at the given URL.

Finally, we find and store the K most similar learning content pages for each page.
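
For illustration, a small sketch of the keyword-pair scoring, with toy 3-dimensional vectors standing in for embeddings from the word2vec model; the “mean of best matches” aggregation below is one reasonable choice, not necessarily the exact aggregation used in Mozrt.

```python
import numpy as np

# Toy embeddings; in practice these come from the word2vec model
# trained on the 7,000-article learning corpus.
emb = {
    "price":   np.array([0.9, 0.1, 0.0]),
    "label":   np.array([0.8, 0.2, 0.1]),
    "cash":    np.array([0.1, 0.9, 0.2]),
    "deposit": np.array([0.0, 0.8, 0.3]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def content_similarity(keywords_a, keywords_b):
    # For each keyword on page A, take its best match on page B, then average.
    best = [max(cosine(emb[a], emb[b]) for b in keywords_b) for a in keywords_a]
    return sum(best) / len(best)

print(content_similarity(["price", "label"], ["cash", "deposit"]))  # low: different topics
```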

Model Training Step Two: Ranking Algorithm

Mozrt uses a Deep Factorization Machine (DeepFM) as its ranking algorithm [6]. Training data is collected from associate engagement with the contents (clicks, views, time), and hyperparameters are determined by offline Bayesian optimization.

Input: Skip-gram content embedding

Any entity can be represented by a series of numbers, including learning content webpages. In machine learning terminology, this process is called embedding. Recently, many tech companies have developed algorithms to obtain item embeddings from user-item interaction data, such as Airbnb’s rental listing embeddings and Pinterest’s Pin embeddings in the Pin2vec algorithm [2][3]. Most of these algorithms build on the neural language model Word2vec [4][5].

The classic Word2vec model has two essential components: words and sentences. If two words appear in the same sentence and their distance is no greater than a certain length (the context window), we consider them “neighbors.” Word2vec randomly initializes each word’s embedding, then uses each word’s embedding as the input of a neural network and its neighbors’ embeddings as the output (skip-gram). Updating the word embeddings over many iterations yields a final embedding for each word.

We adopted a similar strategy to create the learning content inputs for Mozrt’s ranking algorithm, adapting the notions of word and sentence to associates’ viewing behavior. Each content webpage is assigned a unique content ID. We treat the content ID as a “word” and a sequence of associate clicks as a “sentence.” Click sequences have varying lengths, just like sentences in the Word2vec model. A content ID sentence is defined as the clicks by the same user where the time gap between consecutive webpages is no longer than 30 minutes.

Figure 5. Word embedding and content embedding

Afterwards, we built a skip-gram model to create an n-dimensional embedding for each learning content, representing its location in vector space, and stored the embeddings as content-related inputs for Mozrt’s ranking algorithm.
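
As a rough illustration, the following sketch sessionizes toy click logs with the 30-minute rule and trains a skip-gram model with gensim. Column names, data, and hyperparameters are illustrative, not the production configuration.

```python
from datetime import timedelta
import pandas as pd
from gensim.models import Word2Vec

# Toy click log: associate, content ID, timestamp.
clicks = pd.DataFrame({
    "associate_id": ["a1", "a1", "a1", "a2", "a2"],
    "content_id":   ["c10", "c11", "c12", "c10", "c13"],
    "ts": pd.to_datetime([
        "2021-10-01 09:00", "2021-10-01 09:05",   # same "sentence"
        "2021-10-01 11:00",                        # > 30 min gap: new "sentence"
        "2021-10-01 14:00", "2021-10-01 14:10",
    ]),
})

sentences = []
for _, g in clicks.sort_values("ts").groupby("associate_id"):
    # Start a new "sentence" whenever the gap between clicks exceeds 30 minutes.
    session = (g["ts"].diff() > timedelta(minutes=30)).cumsum()
    sentences += [list(s["content_id"]) for _, s in g.groupby(session)]

# sg=1 selects the skip-gram architecture; vector_size is the embedding dimension n.
model = Word2Vec(sentences, vector_size=16, window=3, sg=1, min_count=1)
print(model.wv["c10"])   # the 16-dimensional embedding for content c10
```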

Figure 6. DeepFM model input

Wide and Deep architecture

After the input data is prepared, we move on to the modeling step. Here, the largest hurdle is balancing the capability to generalize with the capability to memorize. Let’s unpack this a little.

Traditional deep learning algorithms generalize well. However, they sometimes fail to “remember” patterns in historical data. For example, we want the recommendation system to “remember” to show turkey to a buyer before Thanksgiving, but not to over-generalize from historical data and recommend turkey before another holiday such as Labor Day.

In 2016, Google researchers proposed the Wide & Deep architecture [1], combining a wide component with a deep component to improve a deep neural network’s memorization ability. Several companies and research institutions (Huawei, Facebook, etc. [6][7]) optimized this architecture over the next couple of years.

Algorithm: Deep Factorization Machine

Having selected a Wide & Deep architecture for our ranking model, with the wide component increasing its memorization ability, the next challenge is to estimate the parameters of the interaction features in the wide component.

Mozrt uses a Deep Factorization Machine [6], one of the best-performing algorithms in the wide & deep family, as its ranking algorithm. The structure of DeepFM is shown in Figure 7.

Figure 7. Deep Factorization Machine [6]

The wide component is a general linear regression responsible for memorizing historical information. Its major challenge is reducing the number of parameters (weights) for the large number of potential interaction features, given that every interaction feature has one parameter. For example, with 2,000 features, each of which might interact with every other feature, there are almost 2 million potential interaction features. We therefore use a factorization machine to estimate the interaction parameters: we assume each interaction weight can be factorized into two latent factor vectors and fit only those latent vectors. If the latent factor vectors have dimension four, we only need to fit 8,000 parameters. With the help of the “factorization machine,” DeepFM can automatically define interaction features in a large-scale recommendation system while dramatically reducing the number of parameters in the wide component.
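
This is the classic factorization machine identity, which also lets the pairwise term be computed in O(n·k) time instead of O(n²). A minimal NumPy sketch with the sizes from the example above:

```python
import numpy as np

n_features, k = 2000, 4
rng = np.random.default_rng(0)
V = rng.normal(size=(n_features, k))      # 2,000 x 4 = 8,000 latent parameters
x = rng.integers(0, 2, size=n_features)   # a sparse binary feature vector

# sum_{i<j} <v_i, v_j> x_i x_j
#   == 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
sum_sq = (V.T @ x) ** 2                   # (sum_i v_if * x_i)^2, per latent factor f
sq_sum = (V ** 2).T @ (x ** 2)            # sum_i v_if^2 * x_i^2, per latent factor f
interaction = 0.5 * float(np.sum(sum_sq - sq_sum))
print(interaction)                        # the wide component's pairwise interaction term
```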

The deep component is a feed-forward neural network responsible for generalization, exploring feature combinations that never appeared in historical data. A critical layer in the deep component, called dense embedding, transforms one-hot-encoded features into a dense, lower-dimensional matrix, similar to word embedding, turning more than 10,000 one-hot dimensions into around 100.
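
A minimal sketch of the dense embedding idea: a one-hot feature over 10,000 possible values becomes a lookup into a roughly 100-dimensional table, so downstream layers see a small dense vector instead of a huge sparse one. The table below is randomly initialized for illustration; in training it is learned.

```python
import numpy as np

vocab_size, embed_dim = 10_000, 100
table = np.random.default_rng(0).normal(size=(vocab_size, embed_dim))

feature_index = 4237          # index of the "hot" position in the one-hot vector
dense = table[feature_index]  # an embedding lookup is equivalent to one_hot @ table
print(dense.shape)            # (100,)
```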

Model Retraining Pipeline

We used Airflow to implement the re-training pipeline for the deep learning recommendation system. In this section, we review the rationale for the re-training frequencies, the data drift detector, and the model quality checker.

Re-train frequency for different model components

Figure 8. Structure of Airflow re-train pipeline

The DeepFM model has strong generalization ability: even if input content embeddings never appear in the training data, the algorithm can still predict accurately. Because re-training a deep learning model consumes a large amount of computing resources, we re-train it monthly.

If a newly created content does not have an embedding, it will not receive a prediction from the DeepFM algorithm, so the content embedding model needs to be re-trained more often than the DeepFM model. It is therefore scheduled to re-train weekly.

Candidate generation models are the foundation of the recommendation system, especially in Mozrt: if DeepFM does not produce predictions, the collaborative filtering and content-based similarity models serve as a “backup” and generate the final output. We therefore re-train these two models daily.

Table 1. Model retrain frequency
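
For illustration, a minimal sketch of what one of these re-training DAGs might look like. The DAG id, task body, and schedule are illustrative stubs, not the production pipeline; the real pipeline also wires in the drift and performance checks described below.

```python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def retrain_deepfm():
    ...  # pull training data, fit DeepFM, evaluate, register on Azure

with DAG(
    dag_id="mozrt_deepfm_retrain",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@monthly",   # "@weekly" for embeddings, "@daily" for candidates
    catchup=False,
) as dag:
    PythonOperator(task_id="retrain_deepfm", python_callable=retrain_deepfm)
```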

Data drift detection

We built a series of data drift detectors into the deep learning recommendation system to monitor anomalous outliers and unexpected input data distributions. If significant data drift is detected, the re-training process is suspended and an automatic email alert is sent to the development team.
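
One way such a detector can work is a two-sample test between a training-time reference distribution and fresh input data. The KS test and the 0.05 threshold below are illustrative choices, not necessarily the exact test used in Mozrt.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5000)   # a feature's values at training time
current = rng.normal(0.4, 1.0, 5000)     # the same feature in fresh data (shifted)

stat, p_value = ks_2samp(reference, current)
if p_value < 0.05:
    # Significant drift: suspend re-training and alert the development team.
    print(f"Data drift detected (KS={stat:.3f}); suspending re-training.")
```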

Model performance checker

We set up decision nodes that check, during the re-training process, whether all model components run properly and meet the performance criteria. For example, if the new DeepFM model’s AUC is below 0.7, the model will not be registered on Azure, and an automatic email alert is sent to the development team.
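
A minimal sketch of that AUC gate; the 0.7 threshold matches the criterion above, while the holdout labels and the registration/alert steps are stubs.

```python
from sklearn.metrics import roc_auc_score

# Holdout labels and the new model's scores (toy values).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.6, 0.3, 0.4, 0.7, 0.1]

auc = roc_auc_score(y_true, y_score)
if auc >= 0.7:
    print(f"AUC={auc:.3f}: register the new model version")        # e.g., on Azure
else:
    print(f"AUC={auc:.3f} < 0.7: keep previous model, email team")  # alert instead
```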

Recap

Mozrt has two candidate generation algorithms: collaborative filtering roughly selects learning content candidates from associates’ view history, while content-based similarity selects another group of candidates by keyword similarity.

Each content has a unique content ID. Treating content IDs as “words” and sequences of associate clicks as “sentences,” we run a skip-gram neural language model to acquire embeddings as content-related inputs for the deep learning ranking algorithm.

The DeepFM algorithm outputs the final ranking of the learning content candidates. The wide component, a general linear regression with factorization machine parameter estimation, is responsible for memorizing historical information; the deep component, a feed-forward neural network, provides strong generalization ability. Together, the two components predict with high accuracy the learning content the associate needs when they interact with the app.

A data drift detector and model performance checker are set up on the model re-training pipeline to guarantee each updated model version will work properly.

Brief Sample of Success

The Mozrt learning recommendation system can remember historical views and “predict” what content associates need. For example, Donna is a team associate. When she logs into the Walmart Academy App on Wednesday at 9 a.m., she will see the article “Complete Price Changes” from the deep learning recommendation system, because the price change process is usually carried out in the morning of a working day. If she logs in around 3 p.m., she will get the recommendation “Deposit Excess Cash,” because this process is usually needed in the afternoon.

Figure 9. Recommendations change at different times of day

Relationship to Other Recommendation Carousels

We orchestrated multiple carousels in the Walmart Academy App to provide associates with a comprehensive learning experience. While the “Based on your view history” carousel provides highly personalized recommendations from the deep learning recommendation system, the “Trending Now” and “Popular with your team” carousels display the most popular learning content webpages across the entire company and within an associate’s team, respectively. The “Based on customer feedback” carousel scans customer voice data to recommend learnings that improve the customer experience.

Thoughts on Implementing DeepFM

A recommendation system is not a single model but a complicated system of multiple models, data pipelines, and orchestrators, and each component plays a vital role in providing accurate recommendations. Developing each component sequentially is therefore not the best strategy: it delays the product release, reduces throughput, leads to oversight of hidden technical debt (a more detailed discussion is out of scope here), and over-emphasizes model hypotheses and complexity from the get-go. To reduce the dependencies among these components during development, we started building the whole system architecture with a hackathon event, inviting experts in data engineering, architecture, network security, machine learning engineering, and software development into a week-long, war-room-style discussion. The first release of Mozrt shipped with a couple of heuristic models (a frequency-based model and a taxonomy-based model). Once the DeepFM model was developed and tested, we quickly incorporated it into the existing system in the second release one quarter later. This strategy significantly shortened Mozrt’s development life cycle and improved throughput.

Acknowledgement: We are People Data Science, part of Data, Strategy and Insights in Walmart Global Technology. We build AI/ML solutions to enable digital, data-driven solutions for 2.2 million associates worldwide. Appreciation goes to the 10+ members of our Learning scrum team and our partners in Data Engineering, Associate Product, Enterprise Content Management and Learning Tech for making this happen.

References

[1] Cheng, H. T., Koc, L., Harmsen, J., et al. (2016). Wide & deep learning for recommender systems. Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. https://arxiv.org/pdf/1606.07792.pdf

[2] Grbovic, M., & Cheng, H. (2018). Real-time personalization using embeddings for search ranking at Airbnb. Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. https://www.kdd.org/kdd2018/accepted-papers/view/real-time-personalization-using-embeddings-for-search-ranking-at-airbnb

[3] https://medium.com/the-graph/applying-deep-learning-to-related-pins-a6fee3c92f5e

[4] Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. https://arxiv.org/pdf/1301.3781.pdf

[5] https://code.google.com/archive/p/word2vec/

[6] Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). DeepFM: A factorization-machine based neural network for CTR prediction. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17). https://www.ijcai.org/proceedings/2017/0239.pdf

[7] Naumov, M., Mudigere, D., Shi, H. J. M., et al. (2019). Deep learning recommendation model for personalization and recommendation systems. https://arxiv.org/pdf/1906.00091.pdf
