Finding realistic data for testing is often a headache, and a good strategy is often to fabricate it. But what if your randomly generated data turns out to belong to a real person? What if they complain and you get fined 4% of global turnover?!
I am of course referring to the new GDPR regulations which will come into force next year. The aim of them is to prevent corporations from misusing our personal data, which generally seems like a good thing. Many companies use copies of production databases for testing new versions of their software. Sometimes that means testers and developers get access to a lot of data that they probably shouldn’t be able to see. Sometimes it means you’re testing with old and out of date versions of people’s information, or even testing with customers who cancelled their contracts with you long ago.
Generally I’m in favour of tightening up the rules around this, since I think it will better protect people’s privacy. I think it will also force companies to improve their testing practices. In my experience it’s easier to get reliable, repeatable automated tests if each test case is responsible for creating all the test data it uses.
If you instead write a lot of tests assuming what data is in the database from the start, maintaining that data can get really expensive. I’ve seen this happen first hand - at one place I worked, maintaining test data became such a big job that a whole test suite had to be thrown away along with the data!
At my last job, part of the test strategy involved data-driven testing. In my test code I’d access an internal API to create fictitious companies in the system. Then, simply by varying the exact configuration of these companies, I could exercise many different features of the system under test. Every case test created different, unique companies. This meant you could run each case as many times as you liked. You could also run lots of tests together in parallel, since they all worked with different data. Each time you ran the tests, you had the opportunity to start over with a fresh, almost empty database - a small memory footprint that meant a fast system and speedy test results.
When I came to my current client I was planning to use a similar approach. The difference here is that instead of fictitious companies, I needed to create fictitious people. So, I got out my random number generator and started creating fictitious names, addresses, and, of course, Swedish Personal Numbers. That’s when things started getting tricky… Let me introduce those of you who aren’t Swedish to the Personal Number.
It’s a bit like a social security number, unique for each person, and the government and other agencies use it as a kind of primary key when storing your data. If you know someone’s number, (and you know where to look), you can find out their official residence, how much tax they paid, credit rating, that sort of thing.
You’ll understand why, then, we started getting worried about GDPR at this point. A Swedish personal number is definitely a ‘personally identifiable’ piece of data, and GDPR has a lot to say about those.
A primary key containing your age
The other thing about the personal number is that it’s not just a number. When they first came up with the idea, in the 1940s, they thought it would be a great idea to encode some information in this number. (I’m not sure database indexing theory and primary keys were very well understood then!) So, they decided that the first part should be your birthdate, then three digits specifying where in Sweden you were born, and the last figure is both a checksum, so numbers can be validated, and also specifies your gender.
These days, lots of Swedish people are born outside Sweden, and not everyone has a gender that matches the one they had at birth, so they had to relax some of these rules. Sometimes too many children are born on one day and they run out of numbers for it, in which case you might get assigned the day before or after. However, the number still always includes your birth year, and hence your age.
Never ask a woman her age, (just get her personal number)
These days it’s quite common for retail chains to ask you to join a ‘customer club’ so you can accrue points on your purchases. When you buy something the assistant will often ask for your club membership number. Almost always the membership number is the same as your personal number, so you end up telling them (and everyone in the queue behind you) exactly how old you are. It’s disconcerting, if not rude!
Fictitious Test People get real and sue you
Back to my test data problem. I’m generating fictitious people, but the system under test both validates the checksum, and uses the personal number to determine their age. I have to generate realistic numbers from the last hundred years or so. This is where things start to get tricky. Someone in the legal department objected to this. He told me that if I was to generate a number belonging to a real person we could really get into trouble because of GDPR. If that person discovered we were using their personal number, and had changed their name to “Test Testsson” living at “Test gata 1, 234 56 Storstad”, that might not be too bad. Unfortunately, if we were using them for a tricky scenario, they might also discover our test database linked their personal number to a dreadful credit rating and a history of insolvency. At that point they might start to get insulted. In the worst case, they could sue us and GDPR would mean we’d have to pay them an awful lot of money!
Keeping fictitious test people imaginary
However unlikely that scenario might be my client is understandably unwilling to take the risk, and I really need to restrict the test system to genuinely fictitious personal numbers. The authorities do provide a list of these exactly for testing purposes, but the trouble is there are only about 1000 numbers on the official list. With the kind of automated testing I’m planning to do that is not enough. Not nearly enough! Assuming my system has about 500 features, I want to write 5 scenarios per feature, I have 25 developers running all the tests 25 times a day… I need something like half a million unique numbers. Less if I can wipe the db more often than once a day, but still. A thousand doesn’t go very far.
So, here’s a startup idea for you - I’m pretty sure my client isn’t the only company out there who would pay good money for a list of a million personal numbers, for people aged less than 100, that are guaranteed not to exist in reality. If you have such a list, and would be willing to maintain it for me such that it remains guaranteed fictitious, please get in touch!
Playing with time
Right, so I’m back on generating personal numbers again, not having found that list yet. I have a couple of ideas about how to keep them fictitious though. The first idea is to use really old numbers. While I’m testing the system, I can set the clock back to 1900, and only use generated numbers from 1800-1890. Since everyone over 125 can be assumed to have passed on, they are unable to sue us. For some scenarios I need under 18 year olds, and by changing the date that the system clock thinks is ‘today’, I can generate children born in the 1870s. Fictitious children complete with bank accounts, mobile phone contracts, student debt, acne… and everything will work just fine!
Hack the checksum
The other idea I had was to use personal numbers with invalid checksums. After the date of birth and the three random numbers, the last figure in the personal number is calculated from the others, using the Luhn algorithm. While I’m testing the system, I can modify the ‘PersonalNumber.validateChecksum()’ function to instead accept all invalid numbers, and reject all valid ones. Then all my tests can generate perfectly invalid personal numbers, and the test database will get filled with people that definitely can’t exist in reality.
As with any well-designed system, there is only one place in the code where I calculate the checksum, so it’s actually a really small change just for testing. It also comes with insurance - if I ever accidently deploy this code change in production, absolutely everything will stop working, since all those valid numbers already in the production database will suddenly get rejected. I think we’d notice pretty quickly if that happened!
Is there a better way?
I haven’t actually implemented any of these strategies yet - we’re still at the planning stage. I’d be very interested to hear if anyone else has solved this problem in a better way. In the meantime, I’ve written some code that can generate personal numbers with and without valid checksums, and also lets you specify an age range - it’s available on my github page.
We can’t be the only people worried about GDPR and personal numbers. That threat of a fine of 4% of global turnover is a pretty mighty incentive to focus some effort on cleaning up test data.