How Can Machine Learning Be Used to Dedupe Your Salesforce Environment

For a human being, comparing two records to see if they are duplicates is a fairly simple and straightforward process. However, if you have millions of records in your Salesforce environment, you can have thousands of duplicates that are causing all kinds of havoc in your sales and marketing processes. Since a human user could not manually compare and merge that many records, there are many solutions on the AppExchange that do this for you. However, as we will see later on, many of these apps are rule-based, which brings flaws of its own and presents the user with even more difficulties.

This is why using machine learning to dedupe your Salesforce is a much better alternative. Let’s take a deep dive into machine learning by first learning about how machines compare names and surnames to identify potential duplicates.

How Do Machines Determine if Two Names are Similar?

Machine learning researchers rely on various string metrics to compare different fields and determine to what extent they are similar. One of the most well-known string metrics is the Hamming distance, which simply counts the number of substitutions that must be made to turn one string into another of the same length. For example, if you have “Smith” and “smith”, the Hamming distance would only be 1 because only one substitution is necessary to turn one string into the other. Closely related edit-distance metrics, such as the Levenshtein distance, also count insertions and deletions, so they can compare names of different lengths. This kind of metric works great if you have names like Stephen/Steven/Steve because the sequences are very similar, but it would run into trouble with something like Steve/Stove. Even though those strings are also similar, there is absolutely no semantic relationship between the two.
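To make the idea concrete, here is a minimal Python sketch of two character-level distances. The function names are illustrative and not taken from any particular dedupe product. `hamming` only applies to strings of equal length, while `levenshtein` (a related edit distance, included here as an assumption because names such as Stephen and Steven differ in length) also counts insertions and deletions.

```python
def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(hamming("Smith", "smith"))          # 1
print(levenshtein("Stephen", "Steven"))   # 2
print(levenshtein("Steve", "Stove"))      # 1
```

Note that Steve/Stove scores as just as close as a genuine spelling variant, which is exactly the weakness described above: the distance measures characters, not meaning.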

Another interesting string metric is set similarity. This is the idea that if two names contain many of the same words, then they are likely to refer to the same thing. For example, if we take the names “James Fenimore Cooper” and “Jim Cooper”, they are fairly far apart in terms of sequences of characters, but since the last name is the same, they may be referring to the same person. There are many other string metrics used to train machine learning algorithms, but the point is that by using all of them together, the system can pick up on many of the same similarities that the human mind does.
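One common way to measure set similarity is the Jaccard index: the number of shared words divided by the number of distinct words across both names. The sketch below assumes that choice (the article does not name a specific formula) and uses a hypothetical helper called `jaccard`.

```python
def jaccard(name_a: str, name_b: str) -> float:
    """Share of distinct words that two names have in common (0.0 to 1.0)."""
    tokens_a = set(name_a.lower().split())
    tokens_b = set(name_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty names are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# One shared word ("cooper") out of four distinct words.
print(jaccard("James Fenimore Cooper", "Jim Cooper"))  # 0.25
```

The score is modest but not zero, which is the kind of weak signal that becomes useful once it is combined with other metrics and other fields.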

Giving One Field Greater Emphasis Than Another

When deduping your Salesforce environment, the system will need to compare lots of different fields besides “First Name” and “Last Name”. When doing so, it needs to treat certain fields as more important than others when checking for duplicates. For example, you may decide that the “Email” field should be given greater importance, or weight, than both the “Last Name” and “First Name” fields. However, exactly how much more weight should it be given? Is it 3 times more or 2.7? A human user could never work this out by hand across so much data, which is why machine learning is so useful.
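As a rough illustration of field weighting, the sketch below combines per-field similarities into a single score. The weights are hypothetical values picked by hand for the example (a learning system would estimate them from labeled pairs instead), and `difflib.SequenceMatcher` from the Python standard library stands in for whatever string metrics a real tool would use.

```python
from difflib import SequenceMatcher

# Hypothetical weights, chosen by hand purely for illustration.
FIELD_WEIGHTS = {"Email": 3.0, "Last Name": 1.0, "First Name": 1.0}


def field_similarity(a: str, b: str) -> float:
    """Similarity between two field values in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def duplicate_score(rec_a: dict, rec_b: dict, weights: dict = FIELD_WEIGHTS) -> float:
    """Weighted average of the per-field similarities of two records."""
    total_weight = sum(weights.values())
    weighted_sum = sum(
        weight * field_similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, weight in weights.items()
    )
    return weighted_sum / total_weight


original = {"First Name": "Robert", "Last Name": "Smith", "Email": "rsmith@example.com"}
same_email = {"First Name": "Bob", "Last Name": "Smith", "Email": "rsmith@example.com"}
same_name = {"First Name": "Robert", "Last Name": "Smith", "Email": "jane.doe@other.org"}

# With Email weighted most heavily, the record sharing the email scores higher.
print(duplicate_score(original, same_email) > duplicate_score(original, same_name))
```

Changing the numbers in `FIELD_WEIGHTS` changes which pairs look like duplicates, which is why picking them by hand is so hard to get right.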

When you label two records as duplicates, the system “learns” from your choices and applies the same logic to subsequent records. This is called Active Learning, and it saves you the hassle of having to input the necessary weights for each field for every comparison. It should also be noted that this approach is scalable. Basically, the system does not compare every record with every other record. It is much too smart for that. Instead, it blocks together records that have some “thing” in common. For example, let’s say the records contain the following names:

  • Robert
  • Rob
  • Roberto
  • Roberta

All of these names have the same first three letters, so the system would block them together since they have similar attributes. Even though there may be many blocks, the number of comparisons that need to be made is a lot smaller than comparing every record you already have with the new ones you would like to import.
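The blocking idea can be sketched in a few lines of Python. This is a simplified illustration that assumes the blocking key is just the first three letters of a name, as in the example above; the function names are made up for this sketch, and real systems typically use richer keys.

```python
from collections import defaultdict
from itertools import combinations


def block_by_prefix(names, prefix_len=3):
    """Group names that share the same first few letters (case-insensitive)."""
    blocks = defaultdict(list)
    for name in names:
        blocks[name[:prefix_len].lower()].append(name)
    return dict(blocks)


def candidate_pairs(blocks):
    """Yield only the pairs of names that fall inside the same block."""
    for members in blocks.values():
        yield from combinations(members, 2)


names = ["Robert", "Rob", "Roberto", "Roberta", "Alice", "Alicia"]
blocks = block_by_prefix(names)
pairs = list(candidate_pairs(blocks))

print(blocks["rob"])   # ['Robert', 'Rob', 'Roberto', 'Roberta']
print(len(pairs))      # 7 pairs inside blocks, versus 15 for all six names
```

Even with six names the number of comparisons is roughly halved, and the saving grows quickly as the number of records increases.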

Why is Machine Learning Better Than the Traditional Rule-Based Approach?

Imagine you identify a duplicate record in your Salesforce and let your Salesforce admin know about the problem. They will examine the issue and create a rule to prevent the duplicate from occurring in the future. You will have to repeat this process over and over again if you are using a rule-based tool. This is not a good use of your admins’ time and is simply unsustainable. Think about how many different types of “fuzzy” duplicates there are. It would simply be impossible to create a rule that accounts for every single scenario.

With the machine learning approach, most of this work is done for you. Since the system learns to identify duplicates from the records you label, there are no complicated rules to set up. You would simply be able to install the product and start using it right away. It is a much more comprehensive approach towards improving the health of your data than the rule-based tools, and one that will give all stakeholders greater confidence in your data.

Try Using Machine Learning to Dedupe Your Salesforce Environment

Even though the machine learning approach is a lot different from the traditional rule-based tools you may be used to, it is a lot simpler than you think. The deduping algorithm is completely customizable to fit your individual needs, and it can give a much greater ROI than the tools you may be currently using. Therefore, I highly recommend switching over to the machine learning approach since it will save you a lot of time, resources, and headaches.
