Evidence-Based Information, Training and Tools for Optimizing the Usability of Computer Systems

Evaluating the ‘Evaluator Effect’

by Dr. Bob Bailey

September, 2003


A few years ago, Niels Ebbe Jacobsen, Morten Hertzum and Bonnie John (1998) observed that the ‘evaluator effect’ had received little study. The ‘evaluator effect’ occurs when different evaluators evaluating the same system detect substantially different sets of usability issues. Their research showed how insidious and potentially destructive the ‘evaluator effect’ could be in usability testing.

The researchers had four evaluators individually analyze four videotaped usability test sessions, and found that only 20% of the 93 detected problems were detected by all evaluators. In addition, they had the evaluators individually select the ten problems they considered most severe, and reported that none of the selected severe problems appeared on all four evaluators’ top-10 lists.
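To make overlap figures of this kind concrete, the following is a minimal sketch of how agreement among evaluators could be summarized. The problem IDs and the four evaluator sets are invented for illustration and are not the study's data. The function name `overlap_summary` is hypothetical, and the pairwise measure (mean overlap between pairs of evaluators) is an assumption, since the article does not say how agreement was computed.

```python
from itertools import combinations


def overlap_summary(problem_sets):
    """Summarize agreement among evaluators.

    problem_sets: list of sets, one per evaluator, each holding problem IDs.
    """
    all_problems = set().union(*problem_sets)
    found_by_all = set.intersection(*problem_sets)
    # Count how many evaluators reported each problem.
    counts = {p: sum(p in s for s in problem_sets) for p in all_problems}
    found_by_one = {p for p, c in counts.items() if c == 1}
    # Pairwise agreement: mean of |A & B| / |A | B| over all evaluator pairs.
    pairs = list(combinations(problem_sets, 2))
    pairwise = sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
    return {
        "total": len(all_problems),
        "share_found_by_all": len(found_by_all) / len(all_problems),
        "share_found_by_one": len(found_by_one) / len(all_problems),
        "pairwise_agreement": pairwise,
    }


# Hypothetical data: four evaluators, ten problems labeled 1..10.
evaluators = [
    {1, 2, 3, 4, 5},
    {1, 2, 6, 7},
    {1, 2, 3, 8},
    {1, 2, 5, 9, 10},
]
summary = overlap_summary(evaluators)
print(summary)
```

In this made-up example, two of the ten problems are found by all four evaluators and six are found by only one, which is the kind of pattern the ‘evaluator effect’ describes.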

These researchers concluded that both the detection of usability issues and the selection of the most severe issues were subject to considerable individual variability. In other words, the evaluators were not agreeing with each other. The reliability of these usability methods had to be questioned.

More recently, Hertzum and Jacobsen (2001) reviewed several studies that had reported on evaluators who were using one of the three most widely used usability evaluation methods -- heuristic evaluations, cognitive walkthroughs, and think-aloud evaluations.

They found that the ‘evaluator effect’ for these three evaluation methods was about the same. The lack of agreement among evaluators was considerable for all three major evaluation methods. None of the evaluation methods produced a smaller ‘evaluator effect’ -- they were all “equally poor.”

This was somewhat understandable with heuristic evaluations, which were usually very informal. However, it was less understandable with cognitive walkthroughs, which used much more rigorous procedures. The ‘evaluator effect’ was even less understandable with think-aloud evaluations because users were all working with the same scenarios, and the only issues that should have been reported were those where users were actually observed having problems.

The available research shows that the ‘evaluator effect’ persists across differences in

  • Evaluator experience,
  • Evaluation methodology,
  • System domain,
  • System complexity,
  • Prototype fidelity, and
  • Problem severity.

Hertzum and Jacobsen found that for all three evaluation methods, a single evaluator was unlikely to detect the majority of the severe problems that were detected collectively. Almost 10 years ago, Nielsen (1994) warned that “the reliability of the severity ratings from single evaluators was so low that it would be advisable not to base any major investment of development time and effort on such ratings.”

The more evaluators that are used, the greater the ‘evaluator effect’ will be. Using fewer evaluators may reduce the ‘evaluator effect’ but increases the likelihood of missing important usability issues. Using a larger number of evaluators may help reduce the number of misses, but will definitely add to the number of false positives, i.e., reported problems that are not actually problems.
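The trade-off between misses and false positives can be illustrated with a small sketch. Everything here is hypothetical: the labels, the four evaluator reports, the function name `pooled_outcomes`, and above all the assumption that the set of real problems is known in advance, which is rarely true in practice.

```python
def pooled_outcomes(reports, real_problems):
    """Return (misses, false_positives) after pooling 1, 2, ... n evaluators."""
    pooled = set()
    outcomes = []
    for report in reports:
        pooled |= report  # add this evaluator's reported problems to the pool
        misses = len(real_problems - pooled)           # real problems nobody reported yet
        false_positives = len(pooled - real_problems)  # reported items that are not real problems
        outcomes.append((misses, false_positives))
    return outcomes


# Hypothetical ground truth: R* are real problems, F* are false positives.
real = {"R1", "R2", "R3", "R4", "R5"}
reports = [
    {"R1", "R2", "F1"},
    {"R2", "R3", "F2"},
    {"R1", "R4", "F1", "F3"},
    {"R5", "F4"},
]
outcomes = pooled_outcomes(reports, real)
for n, (misses, false_positives) in enumerate(outcomes, start=1):
    print(f"{n} evaluator(s): {misses} missed, {false_positives} false positives")
```

With these invented reports, each added evaluator lowers the number of missed problems by one and raises the number of false positives by one. Real data will not be this tidy, but the direction of the trade-off is the one described above.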

Hertzum and Jacobsen concluded that current evaluation methods are not as reliable as most believe. This is most likely because the methods rely too heavily on the knowledge base, personal judgment, and work routines of individual evaluators.

Keep in mind that the research is clear: most individual usability evaluators do not agree with each other! Actually, it is worse than that. Not only do individual evaluators tend not to agree with each other, but their observations also tend not to agree with performance test results. There is no question that considerably more research is needed in this area.


References


Hertzum, M. and Jacobsen, N.E. (2001), The evaluator effect: A chilling fact about usability evaluation methods, International Journal of Human-Computer Interaction, 13(4), 421-443.

Jacobsen, N.E., Hertzum, M. and John, B.E. (1998), The evaluator effect in usability studies: Problem detection and severity judgments, Proceedings of the Human Factors and Ergonomics Society 42nd Annual Meeting, 1336-1340.

Nielsen, J. (1994), Enhancing the explanatory power of usability heuristics, CHI'94 Conference Proceedings, 152-158.


Contact Dr. Bob Bailey at (801) 201-2002 or bob@webusability.com
Copyright 2002 - 2005