Evaluation

Versions of the interactive prototype were evaluated with six biologists, ranging from master's students in biology to research associates and postdocs. Participants were shown how Underbelly works and given the opportunity to explore freely. As they explored, they were asked to use the think-aloud technique (Van Someren et al., 1994) to describe their thoughts. The data collected was largely qualitative. Users were asked to speak aloud their immediate reactions, to describe what the algorithm was doing, and to speculate about what they might use the software for. When users had difficulty understanding a concept, their thoughts were documented first and the area of confusion was then explained to them. This allowed us to continue observing their explorations, which would otherwise have been thwarted by the misunderstanding. Each user spent 20-30 minutes with the software.

The main findings of the study were threefold. First, roughly half of the participants were very curious about Underbelly and how the sequence alignment was working, and these participants were able to attain a decent understanding of the algorithm in the 20-30 minutes they spent with it. Second, the extent of participants' curiosity seemed to depend on the depth of their previous experience with sequence alignment tools. Third, in two cases the tool supported spontaneous critical thinking about the function of the software and whether it was the appropriate tool for specific research tasks.

Although none of the participants knew what to expect before seeing Underbelly, half of them became highly curious about the algorithm when they realized that Underbelly made it accessible to them. Two master's students in particular had done a significant amount of sequence alignment and genomic search and seemed to be genuinely fascinated by Underbelly. They had become somewhat familiar with the idea of the sequence alignment process, but they had resigned themselves to the idea that only computer scientists and bioinformaticists would be allowed to understand how it works. Realizing that they could actually peer into a box that had been permanently labelled "off-limits" seemed to excite them. By contrast, two older postdocs who had done little sequence alignment, or only very simple alignments, found Underbelly to be largely irrelevant to their goals as biologists.

There were two clear examples of participants engaging in fairly deep critical thinking about the alignment algorithm without prompting. One research associate became concerned when he started to understand how simple the sequence alignment algorithm actually is. He said he trusted it less now that he knew how simple it was. He then started pointing at the sequences and said he would like to be able to change them and see what would happen. He asked what would happen if the algorithm were run on one sequence and a second sequence that was exactly the same except that its first half was swapped with its second half. He felt that this kind of swapping was biologically feasible, but he seemed to have an intuition that the algorithm would not handle the situation gracefully. In fact, he was right. The Smith-Waterman algorithm would give that pair of sequences a rather low score: an alignment preserves the order of residues, so only one of the two swapped halves can be matched at a time.
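The research associate's thought experiment is easy to reproduce. The following is a minimal sketch and not Underbelly's implementation: it assumes a linear gap penalty and illustrative scoring values (match +2, mismatch -1, gap -2), and the function names `smith_waterman_score` and `swap_halves`, as well as the example sequence, are hypothetical and chosen only for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b.

    Assumed scoring scheme (not taken from Underbelly): linear gap
    penalty, +2 per match, -1 per mismatch, -2 per gap position.
    Only the score is computed; no traceback is performed.
    """
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a fresh local alignment
                prev[j - 1] + step, # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


def swap_halves(seq):
    """Return seq with its first and second halves exchanged."""
    mid = len(seq) // 2
    return seq[mid:] + seq[:mid]


if __name__ == "__main__":
    seq = "ACGTTGCAAGCTTAGC"  # hypothetical example sequence
    swapped = swap_halves(seq)
    print("self score:   ", smith_waterman_score(seq, seq))
    print("swapped score:", smith_waterman_score(seq, swapped))
```

Under these assumed scores, aligning the sequence with itself yields the maximum possible score of twice the sequence length. The swapped pair scores at least as much as one perfectly matched half, but strictly less than the maximum, since both halves cannot be matched in order within a single alignment.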

A second participant spent about ten minutes exploring the algorithm and then started describing a new kind of tool that would allow him to choose between different alignments for different parts of a sequence: “I wonder: why there isn't a program that could have a window, and you could click.... if you don't like the way it's lined up, it would give you several options, and then score values for that region.” This kind of critical thinking about what the software is doing and what is possible is exactly the kind of thinking described in the project’s design goals.

What remains to be seen is whether biologists would actually invest the time in working with Underbelly. In our user tests, participants were asked to explore the software and were given support along the way. In some cases, users developed misunderstandings that they were unable to overcome. For example, one user developed the conviction that the Maximum Match score was an indicator of similarity between two residues. She could see that this was not the case, but she was unable to figure out why. It is not clear how Underbelly could be improved in this situation.