Set up your data with integrity in tact
Have you ever wanted to share a spreadsheet or dashboard with someone, but you’re afraid to share the real data because of privacy issues? When I create spreadsheets and dashboards, I’m often asked to share my data with other teachers, however, I can’t because of the real data. I need to create a sample dataset that still maintains the integrity of the actual data. Keep reading below or check out the walkthrough video here.
The first thing I do is look for the identifiers. Those are the parts in the data that must be substituted with something else.
Identifiers can be data such as student ID numbers, real names, school names, phone numbers, email addresses, date of birth, citizenship, photos, and so on. I then make a duplicate of the data set in another tab. If the file is imported from a CSV, I will just edit that tab because it is static data.
Adjust those identifiers
If you want to change something such as the Student ID number, one very simple way is to use the SEQUENCE function to create a column of unique values. You can decide the starting number and then the formula will fill in the rest for the number of rows you designate.
The next thing is to adjust the names. In this dataset we have student names and teacher names. We need to attack each a bit differently. If I don’t care about having actual names, I can simply type in a combination of the word “student” and then a number. For example, “Student001.” In the next row, I would type “Student002” and then drag and fill. Google Sheets will notice the pattern and continue it down the entire dataset.
But what if you want real names? No problem. I go to the website Random Name Generator. It will generate up to 100 male and/or female names for you. All names are based on the database from the U S census. I use this list to copy and paste into my student names. It’s pretty straightforward.
Matching up data that repeats
Teacher names are going to be a little bit different. We have teachers that are attached to students. The teacher name repeats several times in the data set. We want to make sure our data keeps integrity, so I need to get the teacher names by using the UNIQUE formula in a helper tab. I will then get names from the Random Name Generator website and put those in a column to the right of the real teacher names. From there, I will use a VLOOKUP formula to match the real teacher names with those I want to use for my sample data set. You can then drag the formula down through your entire data set and you should be good to go.
For school names, you can repeat the same method as we did with teacher names. In the column to the right of the unique values of school names, I usually just manually assign new school names then use the VLOOKUP formula to replace the school names in my data set. I then either use an ARRAYFORMULA or just drag the formula down to fill in the rest of the data set.
Change emails, but make them still look like emails
For email addresses, I use the CONCATENATE formula. This seems to be the simplest method. In my formula, I combine the student name and a string of text, such as “@studentemail.com.” I then do the same thing for teachers and use their name and a string such as “@teacheremail.com.”
What about dates that keep their integrity?
For dates and test scores, I end up making use of the RANDBETWEEN formula to come up with numbers that make sense. For dates, it is pretty simple. In my formula, I refer to the original date and then add on a random number between one and ten. This will add somewhere between one and ten days to the original date. That way, the dates are usually close enough together to still be accurate, but not exact.
Making adjustments to test scores
Let’s take a look at how we can create sample data for some test scores. You can use the RANDBETWEEN formula to determine your high and low scores. In my example, I have four quarters. I will assume that scores will be lower for the first and second quarter, so my range would be between 20-70 for a possible score. For quarters three and four, I’ll adjust my range to be from 50-100. That will allow for some growth. If you want to make the scores into percentages, you’ll just need to make a simple modification to the formula. At the end of the formula, just divide by 100 (/100) and when you format the cells for percentages, it will adjust to a more familiar number.
Show growth with external data
Now, what if I want it to show some growth in my sample dataset from external assessments? I can do this by copying the data and using another RANDBETWEEN formula nested within a RANDBETWEEN formula. If you are curious how I do this, you can check out the video to see more. (Jump right to the video walkthrough for this.)
I’m sure there are other ways, perhaps easier ways. However, I have found this workflow to be the most efficient for what I need to do to get set up for a sample data set. If you have a better way, or a different way to set up sample data, I’d love to hear your ideas. Please let me know in the comments.