Merchant Identification, often called Merchant ID lookup, enables you to understand who the merchant involved in a financial transaction is.
For hundreds of app developers out there, this is an important problem to solve as they work with personal-finance management apps, or similar use-cases. It’s the kind of problem that can turn a simple app into a big win, if solved correctly, or vice versa.
Tracking a Merchant ID is a laborious process that starts with data collection: you must have access to the transaction details. Banks allow consumers to share this data with some companies, like Pentadata, which can in turn make it available to consumer-facing apps. So, the first piece of the puzzle is access to the raw data.
From the raw data, you need to find the merchant transaction identifier. This usually appears as a confusing string of characters and digits that supposedly gives information about the merchant.
From there, your next and final objective is to find the merchant name and, if possible, their location.
Clearly, this last part is the most complicated of the entire process. There are different strategies that are worth trying, and we are going to review them next.
The most straightforward way to solve this problem would be to get a long list of Merchant ID numbers. Even though you can’t have all of them at once, you could keep an initial list on file and grow it with time.
If this were possible (and it isn’t), the Merchant ID problem would be greatly simplified (although still not simple).
Every time you needed to find a merchant name, you would look for the best match between your list of Merchant IDs and the transaction identifier you are currently analyzing. What “best” means in this case is yet another open question. At its core, this is a substring matching problem which, as is well known, is not an easy one to solve.
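To make the "best match" idea concrete, here is a minimal sketch that scores each entry of a known list by the longest contiguous run of characters it shares with the transaction identifier. The list `known_ids`, the function names, and the scoring rule are assumptions made for illustration only; this is one possible definition of "best", not an established one.

```python
from difflib import SequenceMatcher


def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous run of characters shared by a and b."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a : match.a + match.size]


def best_match(identifier: str, known_ids: list[str]) -> tuple[str, float]:
    """Pick the known entry with the largest share found contiguously in identifier."""
    text = identifier.upper()
    best_id, best_score = "", 0.0
    for known in known_ids:
        common = longest_common_substring(text, known.upper())
        # Score = fraction of the known entry covered by the shared run.
        score = len(common) / len(known) if known else 0.0
        if score > best_score:
            best_id, best_score = known, score
    return best_id, best_score


# Hypothetical list: real Merchant ID lists are not publicly available.
known_ids = ["WALMART", "WALGREENS", "TARGET"]
match, score = best_match("POS DEB 1437 07/08/20 WAL- CHAM Pa Card #9264", known_ids)
print(match, round(score, 2))
```

Even in this toy case the winning entry shares only three characters with the input, which shows how fragile plain substring matching is on truncated identifiers.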
Unfortunately, obtaining such a list is not possible. The reason is that a Merchant ID is considered sensitive information that merchants obtain from their payment processor, or bank, and should not be disclosed lightly. This Quora thread summarizes the topic well: getting a list of Merchant IDs is not possible.
With that option not available, the next best thing you can do is a reverse merchant ID lookup: starting from the raw data, that is, the merchant transaction identifier, you want to find the Merchant Name, and ideally also business details such as the address.
Let’s discuss that next.
A reverse lookup in this context means changing the input data: instead of starting from the Merchant ID, you start from the raw data that you have obtained and want to find the merchant name and, if possible, address.
Here are a few examples of what the raw data may look like.
In the past year I have looked at perhaps a hundred thousand of these text strings, and they all share one property: they are messy.
In some cases, if it’s a human eye looking at it, the merchant name can be understood. It’s not so easy for a machine though.
In many other cases, characters and digits appear to be randomly placed in the textstring, to the point where it really doesn’t make sense. Like this one: “Online Payment 99374585676 To Sou”.
Payment to “Sou…” what? Oh, yes, I forgot to mention that the strings are often cut abruptly.
Real-world data is messy and more often than not the algorithms to make some sense out of the data are complex. So, we are back to the string matching problem because, even though it isn’t easy, it seems like the best bet in this case.
Because of how diverse the data is, and because of how diverse each application is, I can’t give you one single strategy that works for every case.
You will have to try different strategies and see what works best. In all cases, one thing seems clear: this is not a case where you can use an exact-match algorithm, because the input string is never clear enough. I know I’ve shown you eleven samples only, but believe me, it’s never clear enough.
So you have to rely on fuzzy-match algorithms, or, another good option, string-distance algorithms.
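As an illustration of the string-distance idea, the sketch below computes the Levenshtein (edit) distance between each alphabetic token of a raw text string and each name in a candidate list, then ranks the candidates by their best normalized similarity. The candidate names and helper functions are hypothetical, and this is only one of many possible distance measures.

```python
import re


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Map the edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def rank_candidates(textstring: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Rank candidate names by their best similarity to any alphabetic token."""
    tokens = re.findall(r"[A-Z]+", textstring.upper())
    ranked = []
    for name in candidates:
        best = max((similarity(tok, name.upper()) for tok in tokens), default=0.0)
        ranked.append((name, best))
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidate names, for illustration only.
ranking = rank_candidates("POS DEB 1437 07/08/20 WAL- CHAM Pa Card #9264",
                          ["TARGET", "WALMART"])
print(ranking)
```

A low top score is a signal that no candidate is a reliable match, so in practice you would pair a ranking like this with a minimum-similarity threshold tuned on your own data.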
Here are a few options worth trying, though the list isn’t meant to be comprehensive. Looking into these will also help you understand the next sections of this article.
I have tested these and more algorithms on a lot of transactions’ raw data. My conclusion is that with these approaches, or a combination of them, it is possible in a decent number of cases to find the merchant name. The accuracy is not very high (expect it to be 70%, with a 95% CI), but it is something, and more than you get otherwise.
There are of course services that provide it off-the-shelf. For instance, Mastercard has a web API that uses string-distance algorithms against their database to find and return the best match.
However, these approaches practically never work to identify a merchant, that is, finding the merchant’s name and their address.
We started working on this problem in January 2019. It took us one week to understand that the problem was in fact much more difficult than we thought. It then took us almost two years to get where we are, at 97% accuracy (95% CI) for identification of merchant name AND address. And we are not done yet.
The key was to look much further than string-matching. Since inception, the Pentadata system has processed more than 1 billion data points related to consumer-permissioned financial transactions (the whole system, not just MerchantSight). Data is permissioned by the consumer and anonymized and, if you’re wondering, we are compliant with the highest security standards (check our website!).
We built a data pipeline to ingest them all, and to ingest real-time data as well. At the far end of the pipeline there’s a statistical model that does inference (an AI model, if you like trendy names). The model was initially quite large. We compressed it, then compressed it a bit more, so it can now serve requests in real time, with a 95th-percentile response time between 500 ms and 1 s.
Over time we also started accumulating data that was undeniably correct, which now forms a database of more than 80 million correctly-identified merchants that we use to preprocess every input as well as to benchmark new releases of the AI model.
We use this service internally a lot, for a number of reasons. Then, we developed a small stateless HTTP API around it, so now it’s very easy to use it externally too. It’s just like this:
POST /merchants/sight
{ "textstring": "POS DEB 1437 07/08/20 WAL- CHAM Pa Card #9264" }
Output:
{ "name": "WALMART", "address": "1730 Lincoln Way E, Chambersburg, PA 17202"}
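For readers who prefer code to raw HTTP, the request and response above could be handled from Python roughly as follows. This is a sketch under stated assumptions: the base URL `https://api.example.com` is a placeholder, the helper names are made up for illustration, and authentication is omitted because the article does not describe it. The request is built but deliberately not sent.

```python
import json
from urllib.request import Request

BASE_URL = "https://api.example.com"  # placeholder host, not the real endpoint


def build_sight_request(textstring: str, base_url: str = BASE_URL) -> Request:
    """Build (but do not send) a POST /merchants/sight request with a JSON body."""
    payload = json.dumps({"textstring": textstring}).encode("utf-8")
    return Request(
        url=f"{base_url}/merchants/sight",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_sight_response(body: str) -> tuple[str, str]:
    """Extract the merchant name and address from a JSON response body."""
    data = json.loads(body)
    return data["name"], data["address"]


request = build_sight_request("POS DEB 1437 07/08/20 WAL- CHAM Pa Card #9264")
# To actually call the service you would pass `request` to urllib.request.urlopen,
# after adding whatever credentials the provider requires.

sample_body = '{ "name": "WALMART", "address": "1730 Lincoln Way E, Chambersburg, PA 17202"}'
name, address = parse_sight_response(sample_body)
print(name, "|", address)
```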
The input data can come from any data source, not necessarily from our Transactionz product.
Lots of companies are using our MerchantSight to the point that last month we (happily) had to increase the service capacity to make sure every customer enjoyed a good experience.
Furthermore, the service is available on a pay-as-you-go model, so for a few dollars you can test a sample large enough to draw conclusions. I don’t recommend drawing any conclusions without testing at least 1000 independent samples.
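To see why sample size matters, here is a small sketch that computes a 95% confidence interval for a measured accuracy using the Wilson score interval. The counts are made-up numbers for illustration, and the Wilson method is my choice here, not necessarily how the accuracy figures in this article were computed.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% confidence."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Hypothetical test runs with the same observed accuracy (97%) but different sizes.
small = wilson_interval(97, 100)
large = wilson_interval(970, 1000)
print("n=100: ", small)
print("n=1000:", large)
```

With the same observed accuracy, the interval from 1000 samples is much narrower than the one from 100, which is the practical reason to avoid judging any merchant-identification approach on a handful of strings.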
If you are interested in trying MerchantSight, the easiest way to do so is to sign up to the sandbox environment where you can test 5 text strings for free and see how it works. If you want to read more about how that works with open-banking, read our post “How to Start Card-Linked Offers with Open Banking.”
By the time you do that, the model accuracy will be even higher.