Prep your customer demo

Intro
Recently I had to prepare a demo for B2B customers. The requirement was clear: generate demo data. "No problem, I'll just generate some data, I guess…", I thought. But is it really that simple?
The data
No matter what your product is, the demo comes at the very end of the entire product development process. By then the idea has been born and approved at management level, the designs have been implemented, and many busy coding hours have passed. All these steps were about one thing: we want to show particular data in the format we think fits best.
At this point, assuming that generating some random data will do the trick is foolish and would squander the hard work put into all the previous steps.
The considerations
Always think about your realm first. If you need to demo some sales data, think about what you intend to sell. Christmas trees? Then peak sales in June would probably not be a brilliant idea for the demo dataset. Or does your demo showcase the network traffic of your sports streaming service? Then think about when the peak traffic typically happens. Probably not on a Tuesday at noon.
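Whichever way you end up generating the data, it pays to verify these properties at the end. A minimal sanity check, assuming the generated rows live in a `sales_demo` table with an `ordered_at` timestamp column (both names are placeholders for your own schema):

SELECT
  date_trunc('month', ordered_at) AS month,
  COUNT(*) AS orders
FROM sales_demo
GROUP BY 1
ORDER BY 1;

If the monthly counts peak in June for your Christmas tree shop, the dataset needs another pass.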
In my particular case I needed to demo some B2C and B2B sales data for touristic regions. The data I needed to generate covered regions, products, hotels, customers, bookings, refunds, resellers, stores, service providers and all the numbers (unit price, order price, ordered quantity, tax, discount, commissions, etc.). As you can imagine, the data generation is not a one-liner. There are some solid tools out there that can generate data, like faker.js, mockaroo or fabricate, which can even generate an entire database schema with inserted data in seconds. Yet I see these tools as a fit for development and testing environments, not for customer demos. Because the complexity of the data I needed to generate was too high, and mocking who is selling which product to which customer at which price would be a nightmare, I decided to generate the data manually for my use case. (and would probably do it the same way again. no regerts.)
The data generation
Thankfully, I had a massive amount of already existing production data. The approach I took was the following: take the existing data and change the values consistently, anonymizing everything so that no backtracing is possible! If customer X had 10 orders, then he will have 10 orders in the demo data as well, but he will be customer Y, and all the product names and timestamps will be different, yet realistic. This approach guarantees that a certain shop will always sell the same anonymized products and, most importantly, that the timestamps stay realistic.
Here are the steps I took to generate the demo data:
1. Copy the schema of the existing database tables and suffix them with `_demo`. For the sake of simplicity I will use a single `sales` table as an example.

CREATE TABLE sales_demo (LIKE sales INCLUDING ALL);
2. Copy the data from the existing tables into the new demo tables with a probability of 30% for each row.

INSERT INTO sales_demo
SELECT * FROM sales
WHERE random() < 0.3
  AND is_order = 1; -- only orders for now
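As a side note, `random() < 0.3` keeps roughly, not exactly, 30% of the rows. On PostgreSQL 9.5+ the same intent can also be expressed with the built-in `TABLESAMPLE` clause; a sketch of the equivalent statement:

INSERT INTO sales_demo
SELECT * FROM sales TABLESAMPLE BERNOULLI (30) -- each row kept with 30% probability
WHERE is_order = 1; -- only orders for now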
3. Now, for all the orders, I needed to check whether there are any related refunds. If there are, I needed to copy that data as well.

INSERT INTO sales_demo
SELECT * FROM sales
WHERE is_order = 0 -- only refunds
  AND order_id IN (
    SELECT order_id FROM sales_demo
  );
Now we have a mix of orders and refunds from different shops.
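Before moving on, a quick integrity check (a sketch) confirms that no refund was copied without its parent order. It should return zero rows:

SELECT r.order_id
FROM sales_demo r
WHERE r.is_order = 0
  AND NOT EXISTS (
    SELECT 1
    FROM sales_demo o
    WHERE o.is_order = 1
      AND o.order_id = r.order_id
  );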
4. Anonymize the data and keep the semantics for the relevant columns. For some columns keeping the semantics is not important: typically any id columns, foreign keys, etc. As long as the data can be referenced, it doesn't matter whether it's an INT or a UUID. Keeping the original values, however, would be unacceptable.
Other columns, on the other hand, are important for the demo, for example resellers, products, stores, service providers, etc. Here I needed to keep the semantics or introduce my own naming while anonymizing the data.
4.1. Anonymize id columns
I chose to always update the id columns together with their corresponding speaking name. So if the original id was `1234567890` and the speaking name was `Big Mountain Shop 42`, then the anonymized data will be `Demo Shop 1` with a hash of `1234567890` as the id.
Keyword: hash function
UPDATE sales_demo
SET
  reseller_id = CASE
    WHEN reseller_id IS NOT NULL THEN encode(sha256(reseller_id::text::bytea), 'hex')
    ELSE NULL
  END,
  reseller_name = CASE
    WHEN reseller_name IS NOT NULL THEN 'DEMO-RESELLER-' || encode(sha256(reseller_id::text::bytea), 'hex')
    ELSE NULL
  END
[...]
This makes sure that the same reseller always ends up with the same anonymized id and name. The `sha256` function hashes the data, the `::text::bytea` cast converts the id into the byte string the function expects, and `encode(..., 'hex')` turns the resulting hash back into readable text.
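Since the hash is deterministic, the same original id always maps to the same anonymized value, across tables and across runs, which is exactly what keeps the references intact. You can convince yourself with a one-liner (`sha256()` is built in since PostgreSQL 11; older versions can use `digest()` from the pgcrypto extension):

SELECT encode(sha256('1234567890'::bytea), 'hex'); -- running this twice returns the identical hex string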
There is another approach I chose for some cases when converting an id and its corresponding string: when `MAX(id)` is a low number. For example, in the case of shops, there will only be a very limited number of them across the whole system. Then I would use the following approach.
Keyword: window function
UPDATE sales_demo r
SET
  store_id = d.new_store_id,
  store_name = d.new_store_name
FROM (
  SELECT
    store_id AS old_store_id,
    ROW_NUMBER() OVER (ORDER BY store_id) + 100 AS new_store_id,
    'Demo Store ' || (ROW_NUMBER() OVER (ORDER BY store_id) + 100) AS new_store_name
  FROM (
    SELECT DISTINCT store_id
    FROM sales_demo
  ) s
) d
WHERE r.store_id = d.old_store_id;
This way I ended up with a limited number of shops in the input and output data, with consistent naming like `Demo Store 101`, `Demo Store 102`, etc., and matching ids 101, 102, and so on. That's OK for this use case, because such ids will repeat across different datasets anyway, so there is no need to strive for uniqueness.
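After the update, a quick look at the distinct stores (a sketch) shows whether the mapping came out consistent:

SELECT DISTINCT store_id, store_name
FROM sales_demo
ORDER BY store_id;
-- expected: 101 | Demo Store 101, 102 | Demo Store 102, ...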
4.2. Anonymize data by keeping the semantics
This one is a bit more time consuming. You need to fire some SELECT queries with GROUP BY clauses to get a feeling for the data. One such case is the product name. What I did was the following:
Step 1: Select the product id, name and SKU, plus the count of orders for each product.
SELECT
  product_id,
  product_name,
  COUNT(*) AS order_count
FROM sales_demo
GROUP BY product_id, product_name;
Step 2: Decide whether the product makes the cut as a separate demo product or gets merged into some generic product. For this, a series of CASE WHEN … ELSE statements was used.
UPDATE sales_demo r
SET
  product_name = d.new_product_name
  ...
FROM (
  SELECT
    product_name AS old_product_name,
    CASE
      WHEN LOWER(product_name) LIKE '%drink%' THEN 'Demo Drink'
      WHEN LOWER(product_name) LIKE '%breakfast%early%' OR LOWER(product_name) LIKE '%breakfast%late%' THEN 'Demo Breakfast'
      -- but also more specific cases with more semantic meaning
      WHEN LOWER(product_name) LIKE '%skipass%1%adult%' THEN 'Demo Skipass 1 Day - Adult'
      ELSE 'Demo Product'
    END AS new_product_name
    ...
  FROM (
    SELECT DISTINCT product_name
    FROM sales_demo
  ) s
) d
WHERE r.product_name = d.old_product_name;
This might not seem wild, but in places I ended up with over 100 WHENs to get a good mix of data.
5. Search for data-level anomalies, e.g. a column that was expected to contain one specific kind of value but, due to some system or user input error, contains another. Such cases had to be harmonized with the rest of the data. They are good candidates for Postgres regular expressions.

UPDATE sales_demo
SET zip_code = '1234'
WHERE zip_code !~ '^[0-9]+$';

I also changed the timestamps of every entry, so that the original data points cannot be identified even by their timestamps.
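For the timestamps, a bounded random jitter per row does the trick: it hides the original points while keeping the overall seasonal shape intact. A minimal sketch, assuming an `ordered_at` column (the column name is a placeholder):

UPDATE sales_demo
SET ordered_at = ordered_at
  + make_interval(mins => (random() * 2880)::int - 1440); -- shift each row by up to ±1 day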
6. The data is ready for the demo. There were quite a few more statements that I won't cover here, but they all centered around the considerations above.
The results
In the end there was a total of around 1M rows in the demo dataset, nicely distributed over a period of 3 years and showing some seasonalities and peaks in sales that you could argue are realistic, the result of a marketing campaign or just the normal product life cycle. As we developers often do, I somewhat underestimated the time to implement this; in the end I spent probably a day in total over multiple iterations, tweaking the data here and there after seeing it represented in the application.
Takeaways
- Always think about your realm and needed data properties first
- Try to denormalize the data as much as possible
- Pick a main identifier for the data. E.g. for Spotify it could be an artist, for Strava a user, and in my case it was a shop
- Create demo data based on the data you already have, to save time and to mirror real data as closely as possible
- Use hash or window functions to anonymize id columns
- Use CASE WHEN…ELSE statements to anonymize data by keeping the semantics
- Search for data level anomalies and harmonize them
- Change the timestamps to hide the original data points