Data Tokenization

What tokenization is and how to choose a tokenization service provider

ketansoni
7 min read · Jul 1, 2021

Imagine all the information your bank has on you. Along with your credit card details, it holds a lot of your personal data: data that, in the hands of a third party, can be used to single you out. Such information is often referred to as personally identifiable information (PII), and it includes your name, address, cell phone number, Social Security number, and so on.

Tomorrow, a rogue employee could download all of that data and sell it to a marketing company. Worse, a hacker could sell it to someone who takes out a large loan in your name and vanishes.

With such sensitive data on your platform, how can you protect it beyond hiding it behind passwords and access controls?

One way to keep PII safe is tokenization.

So, What is Tokenization?

Tokenization is a process where a piece of PII is mapped to fake non-sensitive data, referred to as a token, that has no exploitable meaning or value.

For example, John Doe can be replaced by Jane Lee. Anyone inspecting the data will never know that the name Jane Lee actually means John Doe.

Tokens can be generated through reversible algorithms or through static tables that map values to randomly generated tokens.
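As a toy illustration (all names and values below are made up), tokenization can be as simple as swapping real values for pre-generated fakes and keeping the mapping somewhere safe:

```python
# Hypothetical static lookup table: real PII on the left, randomly generated,
# realistic-looking tokens on the right. Only whoever holds this table can reverse it.
TOKEN_TABLE = {
    "John Doe": "Jane Lee",
    "555-12-3456": "831-44-9072",
}

def tokenize(value: str) -> str:
    # Hand out the non-sensitive stand-in; it has no exploitable meaning on its own.
    return TOKEN_TABLE[value]

def detokenize(token: str) -> str:
    # Reverse lookup: the original is recoverable only with access to the mapping table.
    reverse = {tok: real for real, tok in TOKEN_TABLE.items()}
    return reverse[token]

print(tokenize("John Doe"))      # 'Jane Lee'
print(detokenize("Jane Lee"))    # 'John Doe'
```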


Ok, but isn’t tokenization just a fancy name for encryption?

In theory, both are forms of cryptography, but in practice they are different approaches to data security. Let me try to highlight the differences.

Encryption:

  • The function (algorithm) applied to the data is PUBLIC, but the key is private.
  • The format and length of the given data are not preserved unless a format-preserving encryption scheme is used, and such schemes are relatively slow. The ciphertext looks like garbage: its appearance and length differ from the original, so if the consuming application has length or format validation in place, the encrypted data may be unusable there.
  • The entire value gets encrypted, e.g. encryption(John Doe) => Gq+I+8t4I/GgHMylzbNU=
  • Encryption makes sense for unstructured data such as images, audio files, etc.
  • With deterministic encryption, the result (the encrypted value) remains constant: if the function f is applied multiple times to a given value x, the ciphertext will always be y. This is not necessarily the case with non-deterministic encryption.
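To make the contrast concrete, here is a small sketch using the Fernet recipe from Python's cryptography package (any symmetric cipher would do). Note how the ciphertext looks nothing like a name, is much longer than the input, and changes on every call because Fernet is non-deterministic:

```python
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()   # the algorithm is public; only this key is private
cipher = Fernet(key)

c1 = cipher.encrypt(b"John Doe")
c2 = cipher.encrypt(b"John Doe")

print(c1)                     # e.g. b'gAAAAAB...': looks like garbage, far longer than the input
print(c1 == c2)               # False: Fernet is non-deterministic (fresh IV per message)
print(cipher.decrypt(c1))     # b'John Doe'
```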

Tokenization:

  • The function (a proprietary algorithm) applied to the data is PRIVATE, along with its key.
  • The result of tokenization (the tokenized value) is not necessarily constant; it depends on the type of tokenization. If the function f is applied multiple times to a given value x, the tokenized value may be y one time and z another, e.g. tokenization(John Doe) => Jane Lee on the first attempt and tokenization(John Doe) => Ford Ray on the second.
  • The format and length of the given data are preserved, so the token looks like real data. Since the appearance and size of the original do not change, the consuming application is not impacted.
  • Partial tokenization is possible, e.g. tokenization(John Doe) => Ford Doe (the last three characters are the real value).
  • Tokenization is useful for structured data such as CSV, JSON, etc.
  • A big advantage of tokenization is that production data can be tokenized and safely used in lower environments for testing.
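Here is a toy sketch of two of the properties above, format preservation and partial tokenization. A real product would typically substitute from dictionaries of plausible names rather than random letters, but the shape of the result is the point:

```python
import random
import string

def tokenize_name(name: str) -> str:
    # Replace each letter with a random letter of the same case, keeping spaces and
    # punctuation, so the token has the same length and "shape" as the original name.
    out = []
    for ch in name:
        if ch.isupper():
            out.append(random.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(random.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)

def tokenize_card(pan: str, keep_last: int = 4) -> str:
    # Partial tokenization: fake digits for everything except the last `keep_last` digits.
    fake = "".join(random.choice(string.digits) for _ in pan[:-keep_last])
    return fake + pan[-keep_last:]

print(tokenize_name("John Doe"))          # e.g. 'Xqft Mze': same length, same shape
print(tokenize_card("4111111111111111"))  # e.g. '5843019922761111': last 4 digits are real
```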

Quick summary:

[Image: encryption vs. tokenization compared on various parameters]

Enough! Tell me more about types of tokenization

There are two types of tokenization:

  1. Vault Based
  2. VaultLess

Vault Based

Tokenization:

Tokens are created in two steps:

  1. Apply the proprietary algorithm to the actual value
  2. Persist the mapping between the actual value and the tokenized value

Let’s try to understand this with an example.

Assume the tokenization server has the simple tokenization mapping rules shown in the picture. If the algorithm is applied to actual value x, y, or z, the outcome can be anything from the token set, i.e. a to f. Say the algorithm is applied to actual value x several times: the outcome could be a in the first run and b in the second. Once a token is generated, the server tries to persist the mapping of actual value and token in a database lookup table. If the actual value, the token, or the combination of the two is already present in the lookup table, it may generate the token again until it finds a unique one before persisting. This ensures the lookup table always holds unique combinations, which is what makes the reverse lookup possible.

Note: This data set is just for demonstration purposes; in the real world it would be far larger and more complex.

  • The lookup table is stored inside a highly secured vault server.
  • Time complexity: token generation must check the lookup table for collisions, so even with indexes in place the cost grows with the number of records in the database, on the order of O(n) where n is the number of records.
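A minimal sketch of the vault-based flow described above, with a Python dict standing in for the secured lookup table (a real vault would be a hardened, encrypted data store):

```python
import secrets
import string

class VaultTokenizer:
    """Toy vault-based tokenizer: the real-to-token mapping itself is the secret."""

    def __init__(self) -> None:
        # Lookup table that would live inside the secured vault server.
        self._token_to_real: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        # Generate a random candidate of the same length and retry on collision,
        # so every persisted token maps back to exactly one actual value.
        while True:
            candidate = "".join(secrets.choice(string.ascii_letters) for _ in value)
            if candidate not in self._token_to_real:
                self._token_to_real[candidate] = value
                return candidate

    def detokenize(self, token: str) -> str:
        # De-tokenization is a plain reverse lookup; no inverse function is applied.
        return self._token_to_real[token]
```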

De-Tokenization:

  • To retrieve the actual value, the server simply does a reverse lookup of the token in the database table; it does not apply an inverse tokenization function. For example, for token 'a', the actual value would be 'x'.
  • Time complexity: assuming the indexes are in place, the lookup is effectively O(1).
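Continuing the sketch above, de-tokenization is just a lookup against the stored mapping:

```python
vault = VaultTokenizer()

t1 = vault.tokenize("John Doe")   # e.g. 'QwErTyUi'
t2 = vault.tokenize("John Doe")   # most likely a different token on the second run

print(vault.detokenize(t1))       # 'John Doe': a single lookup against the vault
print(vault.detokenize(t2))       # 'John Doe'
```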

Concerns?

  • Performance: lookups become more expensive as the list of mapped values grows over time. The more mappings there are, the slower the lookup.
  • If the vault key is compromised, all the actual values are exposed.
  • If the vault-based tokenization server is hosted in multiple regions, it requires expensive synchronization between the vaults to keep token generation free of collisions.

VaultLess

Tokenization:

Let’s try to understand this with an example.

  • In contrast to the vault-based approach, here each actual value has its own unique token set to which it is mapped (as described in the picture). The token sets cannot intersect each other. For example, if the algorithm is applied to actual value x, the outcome can be a or b; if applied to y, it can be c or d. But the outcome for x can never be c, d, e, or f, as shown in the table.
  • It does not persist the mapping of actual and token values in a lookup table, which gives it a performance edge over the vault-based approach (a toy sketch of the idea follows the de-tokenization bullets below).
  • Time complexity: O(1); it is always constant.

De-tokenization:

  • Since vaultless tokenization has no lookup table, it applies the inverse algorithm to the token to recover the actual value. Now you can see why each actual value needs its own unique token set: if the token sets intersected, the inverse mapping would be ambiguous and this approach would not work.
  • The recovered actual value is always the same, no matter how many times the inverse algorithm is applied.
  • Time complexity: O(1); it is always constant.
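Here is a deliberately tiny sketch of the vaultless idea: a keyed digit shift stands in for a real format-preserving algorithm (products typically use NIST-style FPE such as FF1/FF3), and unlike the description above it is deterministic for simplicity. The token is computed from the value and a secret key, and the inverse function recovers the value with no lookup table. Do not use this for real data.

```python
import hashlib
import hmac

def _keystream(key: bytes, length: int) -> list[int]:
    # Derive a repeatable sequence of digits from the secret key (toy construction, not secure).
    digest = hmac.new(key, b"token", hashlib.sha256).digest()
    while len(digest) < length:
        digest += hmac.new(key, digest, hashlib.sha256).digest()
    return [b % 10 for b in digest[:length]]

def vaultless_tokenize(digits: str, key: bytes) -> str:
    # Shift each digit by a key-derived amount; length and the all-digit format are preserved.
    ks = _keystream(key, len(digits))
    return "".join(str((int(d) + k) % 10) for d, k in zip(digits, ks))

def vaultless_detokenize(token: str, key: bytes) -> str:
    # Apply the inverse shift; no lookup table is consulted, just the algorithm and the key.
    ks = _keystream(key, len(token))
    return "".join(str((int(d) - k) % 10) for d, k in zip(token, ks))

key = b"hypothetical-secret-key"
token = vaultless_tokenize("4111111111111111", key)
print(token)                              # still 16 digits, same format as a card number
print(vaultless_detokenize(token, key))   # '4111111111111111'
```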

Concerns?

  • The main risk is the proprietary algorithm itself: if the algorithm and its key are compromised, every token can be reversed. Short of that, it is considered safer than the vault-based approach.

Quick Summary

[Image: vault-based vs. vaultless tokenization compared on various parameters]

Points to consider while choosing the tokenization provider

Developers/Quality Analysts should consider the following points while choosing the right tokenization solution provider.

Token Generation capabilities:

  • Should support bulk token generation for data bundled as CSV or in batches, so that round trips between the calling service and the tokenization server are reduced.
  • Should support token generation for JSON-based requests.
  • Should be able to generate tokens for different data types, e.g. alphanumeric, numeric, date, and boolean.
  • Should be able to generate tokens for multibyte and accented characters (e.g. umlauts) while preserving appearance, length, and delimiters.
  • Some PII fields, such as surnames, can be only two or three letters (e.g. Yu or Lee); the provider should be able to tokenize these via padding or some other mechanism, because very short tokens are easier to brute-force back to the original value.
  • Should support a prefix or suffix on tokens so that an application can identify which region data was tokenized in and route it to the appropriate region for de-tokenization.
  • Should be able to perform partial tokenization, e.g. for a 16-digit credit card number, tokenize the first 12 digits and keep the last 4 real.
  • Should be able to provide a vaultless tokenization service.

Performance:

  • Should be able to generate thousands of tokens quickly and securely; otherwise tokenization can become the biggest bottleneck.
  • Should be able to handle requests that are megabytes in size and still respond quickly.

Scalability:

  • Should scale as the number of connections performing tokenization grows, and not just by throwing more hardware at the existing tokenization server farm.

PCI and CCPA/GDPR compliant:

  • Ensure the provider understands PCI and CCPA/GDPR compliance, security, and the associated risks.
