AI 101, Part II: How to Deal with Data Preparation
May 9, 2017
This article was originally published on MarTech Series by Sean Zinsmeister, Vice President of Product Marketing at Infer.
My first post in this series covered what marketers and sales leaders need to know about the four main phases of building predictive models. The second of these steps – data preparation – tends to be the least understood part of AI and predictive analytics in marketing. In this post, I’ll dig deeper into two key considerations in this process: data volume and data quality. When my company introduces our predictive platform to companies, two of the biggest concerns we hear are: (1) Do I have enough data? and (2) Is my data “clean” enough?
HOW MUCH DATA IS NEEDED FOR MACHINE LEARNING?
There’s a rule of thumb for how much data you need in order to be successful with a predictive model, and the most important number is the count of positive signals, or “good” examples, in your data set. In the case of historical customer data for lead or account scoring, this would be the total number of opportunities or closed/won deals in your CRM database.
Of course, these positive signals exist among other negatives. Make sure your positive is defined as a relatively significant achievement in the pipeline. For example, the creation of an opportunity is a meaningless milestone if it happens for every single free trial that comes in. Instead, consider going further down the funnel to find a tougher hurdle that really points to lead quality.
Predictions will be most accurate when you have around 400 to 500 of these positive results. In that range, they can be randomized and split into two portions (60% and 40%) for model comparison, typically a larger portion to build the model and a smaller one to test it. If you have fewer than a hundred examples to test your model over, your results won’t be quite as precise as you might want (until you add more data over time and refresh the model).
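To make the split concrete, here is a minimal Python sketch of a randomized 60/40 split. It assumes the 60% portion is used to build the model and the 40% portion is held out to test it; the function name `split_records`, the fixed seed, and the sample records are illustrative assumptions, not part of any particular platform.

```python
import random


def split_records(records, train_fraction=0.6, seed=0):
    """Shuffle records and split them into a training set and a holdout set."""
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: 450 positive outcomes, such as closed/won deals.
positives = [f"deal-{i}" for i in range(450)]
train, holdout = split_records(positives)
print(len(train), len(holdout))
```

With 450 positives, the holdout portion has 180 examples, which is comfortably above the hundred-example floor mentioned above. In practice the negative examples would be split the same way.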
HOW AI SOLVES THE DATA HYGIENE PROBLEM
The truth is that no business has perfectly clean, complete data, but that’s okay. Modern data preparation techniques are built to work around that very problem, so there’s no need to delay AI initiatives while you wade through cumbersome data clean-up projects. If you do, you’ll just leave revenue opportunities on the table. By matching whatever limited lead data you have with hundreds of external signals from the web, predictive platforms like Infer can build a complete picture of each prospect or customer. In fact, our algorithms can produce lead scores with nothing more than a company name or an email address. That’s thanks to advanced data science approaches like Natural Language Processing (NLP), which can bridge gaps in your data by looking for patterns in web crawls, performing title normalization and doing spam analysis on form input.
Anyone who has sold into IT or the sales and marketing industry knows that job titles are all over the place (or sometimes not included in the data at all). Title normalization techniques tend to be especially important for lead fit models because you need to know that “Marketing Director” might be equivalent to “demand gen lead,” or that “IBM” and “International Business Machines” are the same company. NLP essentially splits out each word that exists across all of your records and uses an algorithm to assess related patterns and find the words that show up most often in positive outcomes for a particular data set.
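The word-level idea can be sketched in a few lines of Python. This is a simplified illustration, not Infer's actual algorithm: it splits each job title into lowercase words and computes, for every word seen often enough, the share of records containing it that ended in a positive outcome. The sample leads, the `min_count` threshold, and the function names are all hypothetical.

```python
import re
from collections import Counter


def tokenize(title):
    """Split a job title into lowercase words, dropping punctuation."""
    return re.findall(r"[a-z]+", title.lower())


def positive_rate_by_word(records, min_count=2):
    """Map each word to the fraction of records containing it that converted."""
    seen = Counter()
    won = Counter()
    for title, converted in records:
        for word in set(tokenize(title)):  # count each word once per record
            seen[word] += 1
            if converted:
                won[word] += 1
    # Ignore words that appear too rarely to say anything about.
    return {word: won[word] / n for word, n in seen.items() if n >= min_count}


# Hypothetical leads: (job title, whether the lead became a positive outcome).
leads = [
    ("Marketing Director", True),
    ("Director of Marketing", True),
    ("demand gen lead", True),
    ("Marketing Intern", False),
    ("Student", False),
    ("VP, Demand Generation", True),
    ("student intern", False),
]

rates = positive_rate_by_word(leads)
for word, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{word}: {rate:.2f}")
```

In this toy data, words like "director" and "demand" appear only in converted leads, while "intern" and "student" never do. A real model would also have to learn that differently worded titles are equivalent, which simple word counts cannot do on their own.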
Another sophisticated feature to look for is spam analysis – something that’s often used in consumer search algorithms like Google. By analyzing the number of capitalized characters and the keystroke pattern behind a name, company, title or email, you can assess the likelihood that each data point is a legitimate input. For example, the way a person’s fingers traveled across the keyboard (such as the number of row switches) often indicates whether their entry is legitimate. An email address typed as a run of neighboring keys on a single row doesn’t travel very far and is probably not a real address. Machine learning can perform these checks on every single record, regardless of whether or not it matches a known website domain.
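As a rough illustration of these two signals, the Python sketch below counts keyboard row switches and measures the share of capitalized letters in a form entry. It assumes a standard QWERTY layout and only looks at letters; the function names and sample inputs are hypothetical, and a production system would feed signals like these into a trained model rather than apply fixed rules.

```python
# Letter rows of a standard QWERTY keyboard (an assumption about the layout).
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
KEY_ROW = {ch: index for index, row in enumerate(ROWS) for ch in row}


def row_switches(text):
    """Count how often consecutive letters sit on different keyboard rows."""
    rows = [KEY_ROW[ch] for ch in text.lower() if ch in KEY_ROW]
    return sum(1 for a, b in zip(rows, rows[1:]) if a != b)


def capital_ratio(text):
    """Return the fraction of letters that are uppercase (0.0 if no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)


def input_signals(text):
    """Bundle both signals for one form field value."""
    return {
        "row_switches": row_switches(text),
        "capital_ratio": capital_ratio(text),
    }


# Hypothetical form entries: a keyboard mash, a plausible name, an all-caps name.
for entry in ["asdfasdf", "Sean", "JOHN SMITH"]:
    print(entry, input_signals(entry))
```

An entry with zero row switches or an unusually high share of capitals is not proof of spam, but either one is a cheap signal that can be computed for every record.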
As you can imagine, NLP alone can help you immediately improve your data hygiene. That’s why, instead of spending months on data cleansing in the hope of getting better intelligence later on, it’s smarter to get your predictive and AI initiatives started now, with the data you have. There’s no sense in spending time and money augmenting fields and cleaning up data that isn’t helpful for your models anyway. Rather, use machine learning to figure out what your most important data points actually are, and then focus your data cleanup efforts there as needed.
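One crude way to see which fields deserve cleanup effort, short of training a full model, is to compare conversion rates between records where a field is filled in and records where it is empty. The Python sketch below does exactly that and nothing more; the field names, the `won` flag, and the sample records are hypothetical, and a trained model's own feature rankings would be a better guide when available.

```python
def field_lift(records, fields, outcome="won"):
    """For each field, return the conversion rate when the field is filled in
    minus the conversion rate when it is empty."""
    lifts = {}
    for field in fields:
        present = [r[outcome] for r in records if r.get(field)]
        missing = [r[outcome] for r in records if not r.get(field)]
        if not present or not missing:
            continue  # no comparison possible if one group is empty
        lifts[field] = sum(present) / len(present) - sum(missing) / len(missing)
    return lifts


# Hypothetical CRM records with some empty fields.
records = [
    {"title": "Director", "phone": "", "won": True},
    {"title": "VP", "phone": "555-0100", "won": True},
    {"title": "", "phone": "555-0101", "won": False},
    {"title": "", "phone": "", "won": False},
    {"title": "Manager", "phone": "555-0102", "won": True},
    {"title": "", "phone": "555-0103", "won": True},
]

lifts = field_lift(records, ["title", "phone"])
for field, lift in sorted(lifts.items(), key=lambda item: -item[1]):
    print(f"{field}: {lift:+.2f}")
```

In this toy data, having a job title is associated with a much larger jump in conversion rate than having a phone number, which would suggest prioritizing title cleanup over phone cleanup.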
It’s so important to understand common data science methodologies like these as you move forward, even if you never intend to work with the algorithms yourself. This knowledge will help you spot any flaws, unrealistic expectations, assumptions, and missing pieces in predictive and AI solutions so that you can thoughtfully evaluate them. In my next post, I’ll expand further on basic model types and more problems sales and marketing teams can solve with data.