Fictitious people that might get real - and sue you!
Finding realistic data for testing is often a headache, and a good strategy is often to fabricate it. But what if your randomly generated data turns out to belong to a real person? What if they complain and you get fined 4% of global turnover?!
I am of course referring to the new GDPR regulations which will come into force next year. The aim of them is to prevent corporations from misusing our personal data, which generally seems like a good thing. Many companies use copies of production databases for testing new versions of their software. Sometimes that means testers and developers get access to a lot of data that they probably shouldn’t be able to see. Sometimes it means you’re testing with old and out of date versions of people’s information, or even testing with customers who cancelled their contracts with you long ago.
Generally I’m in favour of tightening up the rules around this, since I think it will better protect people’s privacy. I think it will also force companies to improve their testing practices. In my experience it’s easier to get reliable, repeatable automated tests if each test case is responsible for creating all the test data it uses.
If you instead write a lot of tests assuming what data is in the database from the start, maintaining that data can get really expensive. I’ve seen this happen first hand - at one place I worked, maintaining test data became such a big job that a whole test suite had to be thrown away along with the data!
At my last job, part of the test strategy involved data-driven testing. In my test code I’d access an internal API to create fictitious companies in the system. Then, simply by varying the exact configuration of these companies, I could exercise many different features of the system under test. Every case test created different, unique companies. This meant you could run each case as many times as you liked. You could also run lots of tests together in parallel, since they all worked with different data. Each time you ran the tests, you had the opportunity to start over with a fresh, almost empty database - a small memory footprint that meant a fast system and speedy test results.
When I came to my current client I was planning to use a similar approach. The difference here is that instead of fictitious companies, I needed to create fictitious people. So, I got out my random number generator and started creating fictitious names, addresses, and, of course, Swedish Personal Numbers. That’s when things started getting tricky… Let me introduce those of you who aren’t Swedish to the Personal Number.
It’s a bit like a social security number, unique for each person, and the government and other agencies use it as a kind of primary key when storing your data. If you know someone’s number, (and you know where to look), you can find out their official residence, how much tax they paid, credit rating, that sort of thing.
You’ll understand why, then, we started getting worried about GDPR at this point. A Swedish personal number is definitely a ‘personally identifiable’ piece of data, and GDPR has a lot to say about those.
The other thing about the personal number is that it’s not just a number. When they first came up with the idea, in the 1940s, they thought it would be a great idea to encode some information in this number. (I’m not sure database indexing theory and primary keys were very well understood then!) So, they decided that the first part should be your birthdate, then three digits specifying where in Sweden you were born, and the last figure is both a checksum, so numbers can be validated, and also specifies your gender.
These days, lots of Swedish people are born outside Sweden, and not everyone has a gender that matches the one they had at birth, so they had to relax some of these rules. Sometimes too many children are born on one day and they run out of numbers for it, in which case you might get assigned the day before or after. However, the number still always includes your birth year, and hence your age.
These days it’s quite common for retail chains to ask you to join a ‘customer club’ so you can accrue points on your purchases. When you buy something the assistant will often ask for your club membership number. Almost always the membership number is the same as your personal number, so you end up telling them (and everyone in the queue behind you) exactly how old you are. It’s disconcerting, if not rude!
Back to my test data problem. I’m generating fictitious people, but the system under test both validates the checksum, and uses the personal number to determine their age. I have to generate realistic numbers from the last hundred years or so. This is where things start to get tricky. Someone in the legal department objected to this. He told me that if I was to generate a number belonging to a real person we could really get into trouble because of GDPR. If that person discovered we were using their personal number, and had changed their name to “Test Testsson” living at “Test gata 1, 234 56 Storstad”, that might not be too bad. Unfortunately, if we were using them for a tricky scenario, they might also discover our test database linked their personal number to a dreadful credit rating and a history of insolvency. At that point they might start to get insulted. In the worst case, they could sue us and GDPR would mean we’d have to pay them an awful lot of money!
However unlikely that scenario might be my client is understandably unwilling to take the risk, and I really need to restrict the test system to genuinely fictitious personal numbers. The authorities do provide a list of these exactly for testing purposes, but the trouble is there are only about 1000 numbers on the official list. With the kind of automated testing I’m planning to do that is not enough. Not nearly enough! Assuming my system has about 500 features, I want to write 5 scenarios per feature, I have 25 developers running all the tests 25 times a day… I need something like half a million unique numbers. Less if I can wipe the db more often than once a day, but still. A thousand doesn’t go very far.
So, here’s a startup idea for you - I’m pretty sure my client isn’t the only company out there who would pay good money for a list of a million personal numbers, for people aged less than 100, that are guaranteed not to exist in reality. If you have such a list, and would be willing to maintain it for me such that it remains guaranteed fictitious, please get in touch!
Right, so I’m back on generating personal numbers again, not having found that list yet. I have a couple of ideas about how to keep them fictitious though. The first idea is to use really old numbers. While I’m testing the system, I can set the clock back to 1900, and only use generated numbers from 1800-1890. Since everyone over 125 can be assumed to have passed on, they are unable to sue us. For some scenarios I need under 18 year olds, and by changing the date that the system clock thinks is ‘today’, I can generate children born in the 1870s. Fictitious children complete with bank accounts, mobile phone contracts, student debt, acne… and everything will work just fine!
The other idea I had was to use personal numbers with invalid checksums. After the date of birth and the three random numbers, the last figure in the personal number is calculated from the others, using the Luhn algorithm. While I’m testing the system, I can modify the ‘PersonalNumber.validateChecksum()’ function to instead accept all invalid numbers, and reject all valid ones. Then all my tests can generate perfectly invalid personal numbers, and the test database will get filled with people that definitely can’t exist in reality.
As with any well-designed system, there is only one place in the code where I calculate the checksum, so it’s actually a really small change just for testing. It also comes with insurance - if I ever accidently deploy this code change in production, absolutely everything will stop working, since all those valid numbers already in the production database will suddenly get rejected. I think we’d notice pretty quickly if that happened!
I haven’t actually implemented any of these strategies yet - we’re still at the planning stage. I’d be very interested to hear if anyone else has solved this problem in a better way. In the meantime, I’ve written some code that can generate personal numbers with and without valid checksums, and also lets you specify an age range - it’s available on my github page.
We can’t be the only people worried about GDPR and personal numbers. That threat of a fine of 4% of global turnover is a pretty mighty incentive to focus some effort on cleaning up test data.
Do you have a tendency to use the backlog as an eternal placeholder? If so, you probably have a lot of clutter that’s creating a lot of frustrations for your end-users. In this post we’ll show you how to clean up your Jira issues and reduce the backlog with some basic JQL queries.
Tips to improve project management in the Atlassian suite
How to test Kubernetes artifacts like Helm charts and YAML manifests in your CI pipelines with a low-overhead, on-demand Kubernetes cluster deployed with KIND - Kubernetes in Docker.
Low overhead, on-demand Kubernetes clusters deployed on CI Workers Nodes with KIND
Had enough of sluggish polling? With instant Artifactory event triggers you can give responsiveness in Jenkins a real boost. Here’s an easy way to set it up.
A super easy configuration guide
With the arrival of microservices code is becoming disposable. Does this mean that we no longer need maintainable code? Is it the end of refactoring?
Still relevant or increasingly redundant?
In software development tight coupling is one of our biggest enemies. On the function level it makes our application hard to change and fragile. Unfortunately, tight coupling is like the entropy of software development, so we have always have to be working to reduce it.
How to safely introduce modular architecture to legacy software
I am an Atlassian certified trainer and over the years I have been spending much time with clients and their Jiras. In this blogpost, I have collected some small tips and tricks that will make your Jira usage better.
Jira Software is a powerful tool deployed in so many organizations, yet in day to day usage people are missing out on improvements, big and small.
In this post, I’ll take a closer look at the version of Jenkins X using Tekton, to give you an idea of how the general development, build, test, deploy flow looks like with Jenkins X. How does it feel to ship your code to production using a product coming from the Jenkins community that has very little Jenkins in it?
A crash course in Jenkins X and how to test it out on a local Kubernetes cluster
In this blog I will show you how to create snapshots of Persistent volumes in Kubernetes clusters and restore them again by only talking to the api server. This can be useful for either backups or when scaling stateful applications that need “startup data”.
Sneak peak at CSI Volume snapshotting Alpha feature
When I read Fowler’s new ‘Refactoring’ book I felt sure the example from the first chapter would make a good Code Kata. However, he didn’t include the code for the test cases. I can fix that!
Writing tests for ‘Theatrical Players’
Nicole Forsgren and the Accelerate DORA team has just released the newest iteration of the State of DevOps report. The report investigates what practices make us better at delivering valuable software to our users as measured by business outcomes. Read on for our analysis of the report, and how it can be best put to use.
The latest drivers of software delivery performance
Hear about upcoming events in Scandinavia, latest tech blogs, and training in the field of Continuous Delivery and DevOps