Evidence-Based Information, Training and Tools
for Optimizing the Usability of Computer Systems
Evaluating the ‘Evaluator Effect’
September, 2003

Jacobsen, Hertzum and John (1998) had four evaluators individually analyze four videotaped usability test sessions, and found that only 20% of the 93 detected problems were detected by all four evaluators. In addition, they had each evaluator select the ten problems he or she considered most severe, and reported that none of the selected severe problems appeared on all four evaluators’ top-10 lists. The researchers concluded that both the detection of usability issues and the selection of the most severe issues were subject to considerable individual variability. In other words, the evaluators did not agree with each other, which calls the reliability of these usability methods into question.

More recently, Hertzum and Jacobsen (2001) reviewed several studies of evaluators using one of the three most widely used usability evaluation methods: heuristic evaluations, cognitive walkthroughs, and think-aloud evaluations. They found that the ‘evaluator effect’ was about the same for all three methods, and that the lack of agreement among evaluators was considerable in each case. No method showed a smaller ‘evaluator effect’ than the others; they were all “equally poor.”

The available research shows that the ‘evaluator effect’ persists across differences in
Hertzum and Jacobsen found that, for all three evaluation methods, a single evaluator was unlikely to detect the majority of the severe problems that were detected collectively. Almost 10 years ago, Nielsen (1994) warned that “the reliability of the severity ratings from single evaluators was so low that it would be advisable not to base any major investment of development time and effort on such ratings.”

The more evaluators that are used, the greater the ‘evaluator effect’ will be. Using fewer evaluators may reduce the ‘evaluator effect,’ but it increases the likelihood of missing important usability issues. Using a larger number of evaluators may help reduce the number of misses, but it will definitely add to the number of false positives, i.e., reported problems that are not actually problems.

Hertzum and Jacobsen concluded that current evaluation methods are not as reliable as most people believe. This is most likely because the methods rely too heavily on the knowledge base, personal judgment, and work routines of individual evaluators.

Keep in mind that the research is clear: most individual usability evaluators do not agree with each other! Actually, it is worse than that. Not only do individual evaluators tend not to agree with each other, their observations also tend not to agree with performance test results. There is no question that considerably more research is needed in this area.
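For readers who want to check agreement among their own evaluators, the following is a minimal sketch of two simple measures: the share of all detected problems that every evaluator reported (the kind of figure behind the "20% of 93 problems" result), and the average overlap between pairs of evaluators. The function names, the pairwise overlap formula (intersection divided by union), and the sample data are assumptions made for illustration; they are not taken from the studies cited here.

```python
from itertools import combinations


def detected_by_all(problem_sets):
    """Fraction of all detected problems that every evaluator reported."""
    if not problem_sets:
        return 0.0
    union = set().union(*problem_sets)
    common = set.intersection(*map(set, problem_sets))
    return len(common) / len(union) if union else 0.0


def average_pairwise_overlap(problem_sets):
    """Mean of |A & B| / |A | B| over every pair of evaluators."""
    scores = [
        len(a & b) / len(a | b)
        for a, b in combinations(problem_sets, 2)
        if a | b  # skip pairs where neither evaluator found anything
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical data: the problem IDs reported by each of four evaluators.
evaluators = [
    {"P1", "P2", "P3", "P5"},
    {"P1", "P3", "P4"},
    {"P1", "P2", "P6"},
    {"P1", "P4", "P5", "P7"},
]

print(f"Detected by all evaluators: {detected_by_all(evaluators):.0%}")
print(f"Average pairwise overlap:   {average_pairwise_overlap(evaluators):.0%}")
```

With this made-up data, only one of the seven distinct problems is reported by all four evaluators, even though every pair of evaluators shares at least one problem. Low values on either measure suggest that findings from a single evaluator should be treated with caution.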
Jacobsen, N.E., Hertzum, M. and John, B.E. (1998), The evaluator effect in usability studies:

Nielsen, J. (1994), Enhancing the explanatory power of usability heuristics, CHI'94 Conference Proceedings, 152-158.
Contact Dr. Bob Bailey at (801) 201-2002 or bob@webusability.com
Copyright 2002 - 2005