Through Tom swanson, Head of Engagement at Heinz Marketing
Go directly to formulas
Novice analysts are often surprised at the time spent working with the data to clean it up so that the analysis itself can be performed smoothly. While some of us are happy with data cleansing, others may find it rather tedious. Either way, a well-formatted dataset makes analysis much easier and allows you, as a budding analyst, to quickly find the best information and show your expertise. Depending on the size of your dataset, this can be a daunting task, but luckily there are many simple and flexible features that speed up the process.
In this article, we’ll talk about qualitative data cleansing. Qualitative data is great, it gives context to otherwise cold, digital datasets. However, this comes at a high price because it is difficult to analyze on a large scale. In many survey questions, the responses aren’t numeric, but you want to see the frequency of responses, rotate the data, or be able to segment by responses. Well there are a few options to do that, the right one depends on how your raw data is formatted. Let’s dive in!
Answers are in the same cell, separated by commas
For simplicity, we’ll use a « Select all that apply » question. This is not exactly qualitative data, but it serves to demonstrate the functions, which can be applied in various ways. So, ask a question like this:
What animals would you like as pets? (Select all that relates to it)
Your data may look like this image:
Google Sheets, for example, defines « select all that apply » questions like this, with comma separated values in the same cell. We could manually count it, filter it, or use Google’s analytics interface, but large-scale segmentation will be tedious with data like this.
We want it to be easier to parse, which would benefit from converting the text responses to simple 1s and 0s. Like this:
It would be easy to do manually for a small dataset like these 10. But what if you had 100? 5000? Fortunately, there is a formula that can help us do this. It sounds intimidating, but it’s pretty straightforward and I’m going to break it down piece by piece.
= IF (ISNUM (SEARCH (value, target)), 1, 0)
Note: For Google Sheets, the formatting is almost the same and it always opens with an IF statement. It uses a concept called « regular expression » and searches for a match, but the result is the same. I will come back to this another time. See the Google formula above for different vocabulary and syntax.
The formula has 3 functions. Work from the inside to the outside:
TO LOOK FOR
The SEARCH function is useful for many things, but what matters here is that it returns a number if a string appears in a cell. If the value does not appear in the cell, you will get a #VALUE! Fault. If you want to know more, here is a little post.
The ISNUMBER will tell you whether a value is a number or not. If it is a number, it will indicate « True » and otherwise « False ».
#VALUE! is not a number. Hope you are starting to get a feel for how the rest of the formula works.
Ahh the IF function, the building block of glorious Excel logic. Its operation is quite simple:
= IF (parameter, return value if True, return value if False). Basically if it looks like this = IF (parameter, A, B) you will get A if the parameter is satisfied, otherwise you get B.
So back to our formula, hope you can make sense of it now. Here it is again:
= IF (ISNUM (SEARCH (value, target)), 1, 0)
The function uses SEARCH to get a number if the target response appears in a cell, then ISNUMBER converts it to True / False, and IF will return a 1 if ISNUMBER True was returned and 0 if False.
For column C in the image above, the formula for cell C2 looks like this:
= IF (ISNUM (SEARCH (C $ 1, $ B2)), 1,0).
Bonus Excel / Google Sheets tip: the $ keeps the column / row values static, so you can click and drag the formula over the other columns and at the bottom of the dataset, and all the necessary cells will be automatically put up to date.
The answers are in different cells
Another common format is that data is returned with as many columns as there are possible responses, but the responses are not in a specific column related to that specific response. Instead, the machine simply puts the first recognized value in the first column and goes there.
It’s hard to imagine, so here’s what it looks like:
It is difficult to analyze. You can’t filter it, so you have to clean it. Again, for a 10-person dataset, it would be faster to do this manually, but as before, it crumbles on a large scale. It’s a simple solution.
Here is the function: = COUNTIF (range, value)
IMPORTANT NOTE: COUNTIF adds 1 for each instance of a response that appears in a range. Therefore, only use this formula if your values are unique to each participant and in their own cells.
In this case, our range for line 2 would be B2: E2. We want to look for Cat in any of these cells. So we make Cat a column and that becomes our value, then we put the formula in the following cell (F2):
You can see the formula in the bar above. Again, the $ keeps these column / row values constant so that the formula can be moved.
Lovely, now we have a beautiful and clean dataset that can show us how people answered our important question.
These formulas are useful in more ways than the simple « Select all that apply » questions! They can be creatively applied to many different qualitative data sets. You can use them to find specific text strings in larger qualitative responses, generate frequency maps and word clouds, segment responses by including these items, and more.
As marketers and analysts, it can be too easy to ignore qualitative data because it is difficult to convert into metrics. However, qualitative data includes huge amounts of context that can really improve your strategy game. Having the tools to analyze qualitative data quickly and efficiently will give you much more useful information and allow you to sniff out the real story among all the words.
I’m always on the lookout for more formulas and ways to have fun with data, feel free to contact me with any ideas or comments!
To separate comma separated strings in the same cell:
Excel: = IF (ISNUMBER (LOOKUP (value, target)), 1, 0)
Google = IF (REGEXMATCH (target, value), 1, 0).
For strings isolated over a range of cells:
Excel and Google: = COUNTIF (range, value, 1, 0) – note that this will count twice if the values are not unique to each respondent.
If you want to know how it works and how to use them in a spreadsheet, I break them down with an example above.