Monday, October 27, 2008

Rolling your own free, customized, free, multiplatform, and free qualitative data analysis tool. For free.

Qualitative data analysis tools are expensive. When I came to UT in 2001, I had the university spend $500 on one popular QDA tool, NVivo. It was going to change the way I did research. So I installed it, played around with it, was not impressed, and abandoned it.

Much more recently, I decided to try HyperResearch on the advice of a grad students from Education. Again, UT sprang for the $400 needed to buy it. I used it for two studies and again, I was not impressed: in some ways it was very limiting, particularly in terms of relating various types of data and coding. The interface was clunky.

And look: $900 spent for nothing.

But between those two times, I managed to analyze 89 sets of observations, 84 interviews, and assorted artifacts. This work followed me across three platforms (Linux, MacOSX, OpenZaurus), and it didn't involve an off-the-shelf qualitative research tool. I'm coming back to this solution for managing the data in my latest study, a study of collaboration and project management at high tech organizations. It offers better print formatting, more flexible data analysis, and multiple interfaces that can be chosen for the specific type of analysis or data entry. It's multiplatform. Fast. And it didn't cost me a dime.

So how do you save $400, $500, or even $900 on your next qualitative research project? It takes a little setup, but you can do it.

When you're analyzing qualitative data, you might have several different kinds of data. Here's the data types I regularly use:
  • Interviews (audio recordings and transcriptions)
  • Observations (transcribed field notes)
  • Artifacts (usually digital photos or paper that can be scanned; I have also recorded ambient noise at sites.)
You might also use other data, such as system logging.

In addition, you typically have administrative data such as information on participants (I include first and last name, pseudonym, and title at minimum).

For each of these data, qualitative analysis includes coding. You can code in several different ways, but let's keep it simple and think of coding as free-form tagging.

So how do you make sense of all this? Let's start with some don'ts:

Don't use Excel or other spreadsheets. Spreadsheets only offer two dimensions, and that means you're very limited in how you analyze the data. You'll end up doing one of the following:
  • Creating a spreadsheet for each datatype. So you'll have spreadsheets for observations, for interviews, etc. Since spreadsheets don't provide an intuitive or robust way to link data between spreadsheets, you'd have to do that connective work by hand.
  • Creating a single spreadsheet into which all data go. This will involve tremendous redundancy, with several fields going empty in every entry -- and lots of redundant data, since you'll have to tag name and date for every entry.
Don't try to manage all this outside of a table. Sure, you could dump your data in a big Word file and use comments for tagging, and sure, you could search text and comments. But you lose a lot of granularity that way, as well as the ability to gain a top-level view (e.g., how many times did I use this code vs. that code?).

Don't store your data online. Several free web-based services offer great solutions. But your data will not be secure. In many cases, you simply won't be allowed to store your data on an unencrypted server that isn't administered by the university.

So that's what you don't do. Now here's what I do.

Overview of My System
I use a MySQL database to store the data, with a different database table for each kind of data. The first table to set up is the Participant table, with each participant receiving a key index number. Other tables are all indexed by that participant number, so I can join tables based on participant.

Each table has a CODES field where I can insert codes from a list. I keep the list of codes in a text editor and surround each one with asterisks like this:
The asterisks allow me to search across a table and pick up just the codes -- searching for "**COMPANY" picks up codes that start with that string, while searching for "COMPANY" might pick up uses of the actual word in interview or observational notes.

To analyze the data, I use several MySQL front ends, including YourSQL, CocoaMySQL, and phpMySQL. These front ends are all free, they afford different views of the data, but they all work on the same underlying data. The result is far more flexibility than I would get from an off-the-shelf QDA tool.

Obviously, this solution isn't for everyone:
  • You don't have to learn SQL, but learning just a little bit will make your life a lot easier.
  • You may have a hard time storing files in your SQL database, depending on your front end. I typically store them on the hard drive and store filenames and metadata in the database.
  • This method allows you to code by line, not by line portions or longer blocks.
How to Set it Up
The setup is not hard, but you'll need to be comfortable with uncertainty. Or get your system administrator to do it.

1. Download and install MySQL.
Go to (or and download the free software. It has versions for several operating systems. The site also has a ton of documentation; keep a window open for installation.

2. Download one or more MySQL front ends.
Cruise on over to and search for SQL. You should get a large list of SQL utilities and clients, some of which will be applicable, many of which won't be. I am using OSX, so I downloaded the following front ends:
  • YourSQL
  • CocoaMySQL
  • phpMySQL (this one runs on your internal web server, so it works across platforms, just like MySQL. It will take some additional setup.)
3. Create a database.
Follow your MySQL installation instructions to set a root user and password. (You can set different user and permission levels, but if you're the only one using the database, why bother?)

Once you do this, run your front end (or one of them, if you downloaded several) and follow instructions to connect to MySQL. Then create a database. I suggest naming it something descriptive -- not "research". For instance, I named the database for my current project "research-pm" -- the same name I used for my tags in GMail, GDocs, and Remember the Milk for the same project.

4. Create a table for participants. Create rows.
Now you create tables within the database. MySQL is a relational database, which means that you can relate the tables in different ways once you have them set up. I typically make the participants table the "handle" for most of the rest of the database, since most of my analysis focuses on what individuals do and say. So we create that one first.

So what do you need to know about your participants? I usually put in the following information:
  • pkey: a participants key. It's a unique integer that identifies the participant. When you refer to participants in other tables -- such as observational notes -- you can use that same number to designate the same participant in these other tables.
  • lname: Participant's last name.
  • fname: Participant's first name.
  • fname_p: Pseudonym.
  • position: text field for their job title (or similar information that might be relevant, such as profession).
  • site: If the study includes multiple sites, use either a text string or a number to indicate each.
  • Observation and interview dates: Depending on the data collected, you might or might not include these dates. Usually you can get these from querying the appropriate data tables.
Once you have roughed out the participant table, fill it out with information about each participant.

5. Create tables and rows for each kind of data you collect.
Each will be indexed to the participants table. For my current study, I created:
  • observations
  • interviews
  • interviewfiles
  • artifacts
  • site notes
For each of these, create at least the following:
  • key: The unique key for this piece of data. If it's from an observation, you might call this "okey," etc.
  • pkey: This field links the individual to her or his data. If a given observation was of participant 1, you'd put a 1 here.
  • date: The date you collected the data. If it's an observation, you might call it "obsdate," etc.
  • text: The data itself. For instance, if you're filling in observational notes, "obstext" would contain perhaps a paragraph from your notes. If it's an interview, "inttext" would contain an answer or paragraph from the transcribed interview.
  • codes: The codes you assign to this piece of data.
  • notes: Any additional information you might want to insert that doesn't fit into the fields above. Sometimes I use this to make notes about further investigation, artifacts I should collect, or methodological issues.
6. For each table, fill out rows.
You can do this manually via one of the front ends. You'll find that each front end has advantages and disadvantages in terms of data entry.

If you don't mind learning a little SQL, you can take your raw data (say, observational notes or transcribed interview notes) and insert the appropriate SQL around them with some search and replace commands. Once you do that, you can plug the whole mess in as a single query and it'll update the table with that data. That's what I do. It's much faster as long as you're willing to spend half an hour learning the appropriate SQL command (INSERT).

7. Code the table.
Now that the data are in the tables, code each table. In this scheme, that means filling the "codes" field for each row of each table. Codes can come from your starter codes, open coding, axial coding, or all three. I typically put them all in the same field; you could differentiate them or place them in different fields if you think you need that level of complexity.

Note: If you code thousands of lines of data with a code (say, **WORKPLACE**) and then decide you really need to rename this code (say, to **WORK**), you can do a search-and-replace with the "update" command. See documentation for details.

Similarly, you can do autocoding with an "update" command. For instance, suppose you want to make sure that each mention of "msword" in the field notes is coded with **SOFTWARE_OFFICE**. You can use "update" to search for those incidents and code them appropriately. Brute-force coding can be tricky -- you risk false positives and broad-brush characterization of the data -- but depending on your data, it can also be very useful and gain a lot of traction quickly.

How to Search
Now that you've entered and coded the data, you can do simple and complex searches.

1. Simple searches within tables
These are searches within one table. For instance, suppose you want to find a mention of msword in your observational notes just so you can look up the context. Or you want to see how many interview notes are coded with **SOFTWARE_OFFICE**. I usually use these two tools:

Search-as-you-type (YourSQL)
I love search-as-you-type. The idea is that as you start typing the string, the results reduce. Eventually you have zeroed in on the data you want, even before you're done typing.

The advantage is that you get the results quickly. The disadvantage is that this method searches across all fields, so you might get false positives. Suppose you're looking for "software" in the observational notes, but you catch all instances of **SOFTWARE_OFFICE** in the codes.

Search by string (CocoaMySQL)
This method allows you to specify the field and the relationship before you search. So you might set "obskey=1" to catch all observations of participant 1, or "codes like "%**SOFTWARE_%" to catch all observations where the codes field includes a code starting with "**SOFTWARE_".

The advantage is that the search is fine-grained and focuses on just one field. The disadvantages are that (a) it's not as fast as search-as-you-type and (b) you can't set up searches that look in more than one field.

But if you want to set up more complex searches that go across tables, you'll have to learn a little more SQL.

2. Complex searches joining tables
Since MySQL is relational, you can link these tables you've set up, and the result is a much more powerful set of queries.

Here's an example from the Telecorp study that became my second book. I had the following tables:
  • "workers"
  • "interviews"
Now suppose I want to grab all interview notes for Customer Service workers that are coded ***JOB_DESCRIPTION***", then append the workers' first names and pseudonyms to them so I can remember who they are. I ran this query so that I could see how the many different CS workers understood their jobs, especially so that I could zero in on differences in those understandings.

That's too complex for the simple queries earlier. So I ran the simple SQL query. The names after dots (ex: workers.fname) are field names in the given table.
select workers.wkey, workers.fname, workers.fname_p, interviews.notes, from workers, interviews where ((workers.area='Customer Service') and (workers.wkey=interviews.wkey) and ( like '%**JOB_DESCRIPTION**%'))
So we can get really specific searches that join the different tables and allow us to slice the data in different ways. I could have added further codes beyond job description, searched across additional areas, specified a date, etc. In fact, I did do all of these, and I occasionally joined three tables to yield really interesting connections among the different types of data.

Formulating these can be a pain, so I formulate them once, make sure they work, then save them. If I want to run it again with a different code, I copy and paste.

How to Print
One big problem with HyperResearch is that it does an appalling job printing data. In the system I've described, you could print in a number of ways. The best two are:
  • Use phpMySQL to generate the table you want, then print from the browser.
  • Use MySQL from the command line to dump the query into an HTML file.
As always, see the documentation.

So that's a lot to absorb, and I would have to write an entire tutorial to give you a more detailed idea of how to implement this system. Since I'm sort of busy with research, I won't do that. But don't hesitate to comment with specific questions.