Stories From The Trenches: Part 2 of 3
In Part 1, we introduced a recent data science project with one of our Fortune 500 clients. They gave us the freedom to identify a high-value area of improvement, with the goal of initiating an internal data science practice. We started with a profit center: the sales pipeline. We quickly discovered there was a particularly leaky stage: Quoting. When talking with the lead solutions architect, we learned quote generation was a very complex and lengthy task and that new solutions engineers often made mistakes that reduced the quality of the quote and/or extended its turnaround time. We then defined several prospective technological solutions to the challenge, and we settled on an approach that would minimize burden on senior Solutions Architect personnel and instead capitalize on the large volume of historical requirements documentation. Our goal: Build an AI application that takes in Requirements Documentation and outputs a list of recommended products with corresponding confidence ratings and similar alternatives.
What is Requirements Documentation?
In our case, Requirements Documentation consists of a set of at least 6 and sometimes more than 20 files. The minimum 6 are considered “core” documentation, as they appear in every deal, while the others are only deemed required based on initial findings from the discovery. Some are more structured, containing tables or key-value pairs, which are easily extracted. Others are written more like prose, with requirements embedded in pages of natural language text.
Naturally, we’d prefer all data to be fully structured because that’s easy to interpret. But that’s not how requirements gathering works. Prospective clients share their needs with solution engineers, who use a variety of templates and ad hoc note-taking formats to capture the information, typically under time pressure. Some information fits into a standard questionnaire, but often the questionnaire merely surfaces the really important information, which is then captured as unstructured bullets or lengthy paragraphs of text. So we need to accommodate both structured and unstructured data.
An organization’s requirements documentation, together with the corresponding solutions and production data, is arguably a gold mine of the organization’s applied expertise.
So naturally, we want to see if we can use it to train artificial intelligence applications to automate and improve our business processes.
Dialing in the Deliverable
The relationship between Requirements and Solution that we choose to model and train determines what our AI can deliver. For example: given a “Requirements Documentation” input of 2 structured tables and 3 prose text documents, predict YES or NO that product XYZ appears in the quoted solution. If the product does appear in the quoted solution (meaning: the YES prediction exceeds some threshold of confidence), then predict the value of each configuration parameter or add-on. Finally, predict the quantities.
So the deliverable consists of (1) a list of product-configurations and their quantities, (2) confidence of each prediction, (3) if confidence is below some threshold, alternatives.
Alternatives? Well, given the many edge cases solution architects encounter, it is unlikely each recommendation will be spot on. So there must be recommended alternatives to further reduce the time spent by solution engineers finishing the quote before it’s sent to the customer.
In the future, a 4th item can be included in the deliverable: acceptable discounts. Discounts can be recommended based on historical success patterns, alongside either learned or manually-input (rules-based) profit models. Discount suggestions based on historical win/loss patterns are already a fairly common AI-assisted function in many CRM / CPQ tools.
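To make the deliverable concrete, here is a sketch of what one recommendation record might look like. The field names, product codes, configuration keys, and the 0.80 threshold are all hypothetical illustrations, not the client’s actual schema:

```python
# Hypothetical shape of one line item in the deliverable.
# Field names and values are illustrative only.
recommendation = {
    "product": "XYZ-200",
    "configuration": {"capacity": "500GB", "redundancy": "dual"},
    "quantity": 4,
    "confidence": 0.62,
    # Because confidence falls below a chosen threshold (say, 0.80),
    # alternatives are attached for the solution engineer to review.
    "alternatives": [
        {"product": "XYZ-100", "confidence": 0.21},
    ],
}

THRESHOLD = 0.80
needs_review = recommendation["confidence"] < THRESHOLD
```

A full deliverable would simply be a list of such records, one per predicted product-configuration.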
Getting started: Building structure from prose text
Any unstructured text document can be decomposed into a list of words by popping off substrings at each whitespace (“ “) character. Each word is called a “token”. Each word in a sentence can receive a part-of-speech tag (e.g. verb, noun, etc). Certain groupings of tokens can also be recognized as “known entities” by neural nets trained for a given context. E.g. Take the below recipe for chocolate chip cookies…
Whisk together 2 ¼ cups all-purpose flour, 1 tsp baking soda and 1 tsp salt in a large bowl. Add to a large bowl 1 ½ sticks (12 tbsp) room temperature butter, ¾ cups packed light brown sugar, and 2/3 cups granulated sugar. Beat on medium-high speed until light and fluffy, about 4 minutes. Add 2 large eggs, one at a time, beating after each addition to incorporate. Beat in 1 tsp pure vanilla extract. Stir in 12oz chocolate chips.
Below is the same paragraph again, but this time highlighting different categories of entities in a baking context (items, quantities, units, and actions). Note: entity-recognition typically occurs after part-of-speech tagging, using POS tags as inputs.
Whisk together 2 ¼ cups all-purpose flour, 1 tsp baking soda and 1 tsp salt in a large bowl. Add 1 ½ sticks (12 tbsp) room temperature butter, ¾ cups packed light brown sugar, and 2/3 cups granulated sugar. Beat on medium-high speed until light and fluffy, about 4 minutes. Add 2 large eggs, one at a time, beating after each addition to incorporate. Beat in 1 tsp pure vanilla extract. Stir in 12 oz chocolate chips.
Here’s one way we could structure the recognized entities into a table, where hierarchical groupings are encoded in a Group_id.
| Id | Text | Entity_type | Group_id |
|----|------|-------------|----------|
| 1 | Whisk together…in a large bowl | Action | 10010 |
The bigger goal is to construct a table of complete conceptual units from those recognized entities. E.g. “2 ¼ cups all-purpose flour” is a conceptual unit composed of 3 recognized entities: Quantity, Unit, and Item. This conceptual unit is a member of a list of such conceptual units, grouped together with an action.
Here’s an example representation of the conceptual units in the paragraph, with some concepts interchangeable with others, using the alt_concept_id field.
| Id | Group_id | Concept_id | Text | Alt_concept_id |
|----|----------|------------|------|----------------|
| 1 | 10010 | 10010 | Whisk together…in a large bowl | |
| 2 | 10010 | 10011 | 2 ¼ cups all-purpose flour | |
| 3 | 10010 | 10012 | 1 tsp baking soda | |
| 4 | 10010 | 10013 | 1 tsp salt | |
| 6 | 10020 | 10021 | 1 ½ sticks room temperature butter | 5 |
| 7 | 10020 | 10022 | 12 tbsp room temperature butter | 4 |
| 8 | 10020 | 10023 | ¾ cups packed light brown sugar | |
| 9 | 10020 | 10024 | 2/3 cups granulated sugar | |
Most rows in the table represent an ingredient (or a tool) required for the recipe. This serves as a reasonable analog to the requirements embedded in our client’s Requirements Documentation. The next step is to build a relationship between a structured set of requirements and a corresponding set of products comprising a quote.
Summarizing the steps:
- Tokenize the raw text string
- Cleanse the tokens (remove stopwords, lemmatize, etc.)
- Tag tokens with part-of-speech
- Run a Named Entity Recognition (NER) neural net, custom-trained on a large, contextually appropriate corpus (in this case, baking recipes)
- Group the recognized entities into conceptual units
- Insert recognized entities and groupings (concepts) into a relational database schema for manipulation.
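The steps above can be sketched in plain Python. The handful of hand-written rules below stands in for a trained NER model (a real system would learn these labels from a labeled corpus, e.g. with spaCy or a custom neural net), and the word lists are invented for illustration:

```python
import re

# Toy stand-in for a trained NER model: tiny hand-written vocabularies.
# A production system would learn these labels from an annotated corpus.
UNITS = {"cup", "cups", "tsp", "tbsp", "sticks", "oz"}
ACTIONS = {"whisk", "add", "beat", "stir"}

def tokenize(text):
    """Split on whitespace and strip edge punctuation from each token."""
    return [t.strip(",.()") for t in text.split() if t.strip(",.()")]

def tag_entities(tokens):
    """Label each token as QUANTITY, UNIT, ACTION, or ITEM (the default)."""
    tagged = []
    for tok in tokens:
        low = tok.lower()
        if re.fullmatch(r"[\d/¼½¾]+", tok):      # digits and common fractions
            tagged.append((tok, "QUANTITY"))
        elif low in UNITS:
            tagged.append((tok, "UNIT"))
        elif low in ACTIONS:
            tagged.append((tok, "ACTION"))
        else:
            tagged.append((tok, "ITEM"))
    return tagged

sentence = "Stir in 12 oz chocolate chips"
tags = tag_entities(tokenize(sentence))
print(tags)
```

Grouping tagged runs like QUANTITY + UNIT + ITEM into conceptual units, and inserting them into the relational schema, would follow as the next steps.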
Preprocessing Data Sources
With any data source, there’s always a pre-processing pipeline. We run stats to see which fields are underpopulated, then cross-correlate to find overlaps of missing data, then determine whether to drop or impute. Numerical data is typically normalized. Outliers are identified and removed. Categorical data is standardized and encoded, either numerically (if it’s a scale) or 1-hot. E.g. good, better, best could be encoded as a numeric value (1, 2, 3) because there’s a quantitative meaning embedded in the categories, while red, blue, green should be encoded as red = 1 or 0, blue = 1 or 0, green = 1 or 0.
Unstructured text can generate tens of thousands of 1-hot-encoded features: one for each instance of a conceptual unit found amongst thousands of documents. This volume of features can easily overwhelm the relatively small contribution of features by the structured data sources. Thus, features generated from the unstructured data sources must be aggressively trimmed. We can use Term Frequency-Inverse Document Frequency (TF-IDF), among other techniques, to identify which features are most valuable to keep.
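As a sketch of the idea behind TF-IDF, here is one common formulation in plain Python (the tiny corpus is invented; real implementations such as scikit-learn’s `TfidfVectorizer` differ in smoothing details):

```python
import math

def tf_idf(term, doc, corpus):
    """Score a term: frequent in this doc, rare across the corpus => high."""
    tf = doc.count(term) / len(doc)
    n_containing = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + n_containing))  # +1 avoids div-by-zero
    return tf * idf

# Toy corpus: each document is a list of extracted tokens.
corpus = [
    ["flour", "sugar", "butter"],
    ["flour", "eggs"],
    ["vanilla", "flour"],
]

# "flour" appears in every document, so it carries little signal;
# "vanilla" appears in only one, so it scores higher and is worth keeping.
score_flour = tf_idf("flour", corpus[2], corpus)
score_vanilla = tf_idf("vanilla", corpus[2], corpus)
```

Features whose best TF-IDF score across the corpus stays near zero are candidates for trimming.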
Assessing the Feature Set
Once each set of Requirements Documentation is encoded, trimmed, and joined into a single vector, we run stats to see which features correlate most significantly to the predicted output. Using our a priori knowledge, we may determine that some features are strongly correlated to the outcome only because they leak information about the future, added by the solution architect after the solution was generated. Or we may find that documentation templates changed between 2017 and 2018, which inverted the encoding of a strongly correlated feature. For troublesome features that are worth keeping, we will adjust our preprocessing code. Noisy or less valuable features we’ll trim.
Whenever we retrain the model with more recent data, we should reevaluate the utility of the raw feature set (after initial trimming using TF-IDF), as some features may rise in prominence as the data set grows. Our goal must always be to maximize the signal and minimize the noise. Whatever features we feed to the model, those are the patterns it will learn, right or wrong.
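As a sketch of that screening step, a plain Pearson correlation between one feature column and the YES/NO product label will flag the strongest candidates (the toy vectors here are invented for illustration):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# One 1-hot feature column across four historical deals,
# and whether product XYZ appeared in each quoted solution.
feature = [0, 1, 0, 1]
label   = [0, 1, 0, 1]

r = pearson(feature, label)   # perfectly correlated here
```

A suspiciously perfect correlation like this one is exactly the kind of feature worth auditing for future-information leakage before trusting it.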
How to Train your Model
Netflix’s “Recommended for you” section assigns probabilities to the movies/shows in its library based on movies/shows you watched previously. Your history is used to custom-tune a standard recommendation engine trained on thousands of such histories belonging to people with a similar movie-watching profile. We do something similar when “recommending” products. The features extracted from requirements documentation are like the viewing history, from which we can assign probabilities to each product that it will be present in the quoted solution. And we learn this pattern from thousands of prior examples of requirement feature sets and their associated solutions. And by learn, I mean we take one example at a time and iteratively adjust the probability weights that a given observed requirement feature should yield a given product recommendation.
Simple Example to Illustrate Model-Fitting
Take the below table of inputs, x, and outputs, y.
The linear function y = 2x represents the data in this table pretty accurately, and if you were asked to predict a value for y given some x, that function would come in handy. But y = 2.07x + 0.888 better minimizes the total error between the line and the data and will probably give you a prediction closer to the correct value.
That function is the result of an iterative algorithm that repeatedly tries ever more optimal values for the slope and y-intercept parameters until the error stops changing (i.e., it converges on a minimum value).
Naturally, the relationship between a list of Requirements and a Quoted products list is more complex, so we need a more robust function than y=mx+b. However, we’ll use the same iterative approach to minimizing the error in that function.
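Since the original data table isn’t reproduced here, the points below are made-up values that roughly follow y ≈ 2x; the loop is a minimal gradient-descent sketch of that iterative fitting:

```python
# Illustrative data roughly following y = 2x (not the table from the text).
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.3]

m, b = 0.0, 0.0          # slope and y-intercept, starting from zero
lr = 0.01                # learning rate

for _ in range(5000):
    # Gradient of the mean squared error with respect to m and b.
    errs = [m * x + b - y for x, y in zip(xs, ys)]
    grad_m = 2 * sum(e * x for e, x in zip(errs, xs)) / len(xs)
    grad_b = 2 * sum(errs) / len(xs)
    m -= lr * grad_m     # step each parameter downhill
    b -= lr * grad_b

# m converges near 2, the slope that minimizes total error for these points.
```

Each pass nudges the slope and intercept in the direction that reduces the error, stopping (in practice) when the error no longer improves.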
Representing the relationship between X (Requirements) and Y (Solutions)
The best representation of the relationship between X and Y can vary in complexity from a simple linear function (y = mx + b) to a polynomial function (y = ax² + bx + c) to a mixture of Gaussian probability functions to a broad and deep network of weighted activation functions (imitating the neural networks in biological brains).
The greater the complexity of the representation, the more data you need to train the model. Increased complexity means more parameters, which can yield increased precision, but only if we can accurately tune those parameters. With only a few data points, we must rely heavily on approximations and averages; and because edge cases are rarely seen, it’s impossible to draw statistically significant correlations between inputs and outputs in those cases. With millions of data points, on the other hand, even the rarest edge cases can be represented with some confidence.
In our case, we have hundreds of data points and thousands of features per data point, so a neural net is an appropriate representation — with n features as inputs and m outputs as the probabilities that each product is present. Even better, it’s a representation the corpus can grow into over time, as more Requirement-Solution pairs are generated.
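As a sketch of that representation, here is a tiny forward pass through one hidden layer, with untrained random weights (biases omitted for brevity; the layer sizes are toy values, not the real n and m):

```python
import math
import random

def sigmoid(z):
    """Squash any real value into (0, 1), read here as a probability."""
    return 1 / (1 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """One hidden layer; each sigmoid output is the probability that
    one product belongs in the quoted solution."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    return [sigmoid(sum(w * hi for w, hi in zip(row, h))) for row in w_out]

random.seed(0)
n_features, n_hidden, n_products = 6, 4, 3   # toy sizes for illustration
w_hidden = [[random.uniform(-1, 1) for _ in range(n_features)]
            for _ in range(n_hidden)]
w_out = [[random.uniform(-1, 1) for _ in range(n_hidden)]
         for _ in range(n_products)]

requirements_vector = [1, 0, 1, 1, 0, 0]     # an encoded, trimmed feature vector
probs = forward(requirements_vector, w_hidden, w_out)
```

Training would iteratively adjust those weights against historical Requirement-Solution pairs, exactly the error-minimizing loop described above, just with far more parameters.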
Considerations for PROD
As soon as a data pipeline is built and deployed, it’s obsolete. Requirements gathering processes, standard documents, products, etc. will all change. And often, the changes are already underway when the model is finally deployed to PROD. The existing pipeline may continue to yield a productive, useful model for some time, but there must be a plan for continuously “re-skilling” AI models, just like employees. The good news is that well-designed AI models can be re-trained and re-deployed MUCH faster than their biological counterparts.
A Couple Exhortations for CTOs/CIOs
- All historical data has the potential to be a treasure hoard, even unstructured text documents. We now have the tools to extract structure from natural language, audio, images, and even video. And we can use that data to train knowledge representations that automate intelligent behavior and augment our human capabilities.
- The companies of the future are putting the work in today to design their IT infrastructure and data lakes and streams for building these intelligent automations. Other companies will be left behind.
In part 3, I will discuss the stages of data science implementation, from preparing your data infrastructure to productionizing AI apps based on ML services or custom-trained models.
About the Author: Eric Nelson
Eric Nelson is a Senior Software Engineer with MS³, Inc., living in Minneapolis, Minnesota with his wife Alisa, 7-year-old daughter Freyda, and 5-year-old son Arthur. Eric received his Bachelor of Science in Electrical Engineering from University of Minnesota and worked in photovoltaics and thin films. In 2015, he founded his own cloud software consulting firm where he trained dozens of student interns in software design and development skills. He is now focused on building large-scale, secure, futureproof production AI applications and software systems for smart brands as a member of the MS³ family.