Evidence-Based Information, Training and Tools for Optimizing the Usability of Computer Systems

How Reliable is Usability Performance Testing?

by Dr. Bob Bailey

October, 2001



Rolf Molich of DialogDesign in Denmark has published two articles (Molich et al., 1998; Molich et al., 1999) over the past three years that help us better understand the limitations of even our best usability testing method: performance testing.

He and his colleagues did a comparative evaluation of usability tests by having four commercial usability labs carry out tests on the same commercially available calendar program. The purpose of the comparative evaluation was to observe the different ways in which independent laboratories conducted usability tests. The testers independently performed usability tests that each involved about five typical users, and then prepared a test report. Their results showed that some labs found few usability problems (4), while others found many (98).

Usability Laboratories     A     B     C     D
Usability Specialists      2     2     1     3
Number of Tests           18     5     4     4
Problems Found             4    98    25    35

Only one problem was found by all four teams, and over 90% of the problems found by each team were found only by that team.

Molich and his colleagues conducted a follow-up to the first test to determine if the results were unique or could be replicated. In the second study, seven different professional usability labs and two university student teams independently carried out usability tests of a well-known Web site – hotmail.com. They each prepared and submitted their standard test report. Again, their results showed that some labs found few problems (10), while others found many (150).

Usability Laboratories     A     B     C     D     E     F     G     H     I
Usability Specialists      2     2     1     3     3     1     1     3     7
Number of Tests           18     5     4     4     9     5    11     4     6
Problems Found             4    98    25    35    68    75    30    18    20

The results from the first study were, indeed, replicated. Again, there seemed to be little consistency across testing organizations. Over half (55%) of the problems found by each team were found only by that team.

More recently, Martin Kessner (Kessner, 2000; Kessner et al., 2001) from Carleton University in Ottawa had six usability testing teams conduct usability tests on a prototype of a system.

He attempted to improve the agreement of the testing teams by

  • testing a prototype that had not yet been used by actual users,
  • limiting the issues to be evaluated to five questions specified by designers,
  • focusing exclusively on usability issues (excluding all marketing and other issues),
  • having two evaluators group similar observations into categories of problems that were essentially the same, and
  • using only professional usability teams (no student teams).

From the original total of 117 potential "usability problems" reported by all the testing teams, the evaluators excluded 31 as non-usability problems. They then combined similar problems and ended up with a final number of 36 unique usability problems. Consistent with the first two studies, none of the problems was found by every team, and a large proportion of the problems (44%) were found by one team only.
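The arithmetic behind these figures is simple: 117 reports minus 31 exclusions leaves 86 usability reports, which the evaluators merged into 36 unique problems, and 44% of 36 is about 16 problems reported by a single team. The following is a minimal sketch of how such agreement figures could be computed once similar observations have been grouped. The team names, problem IDs, and the function `overlap_summary` are hypothetical and are not taken from Kessner's data or method.

```python
from collections import Counter


def overlap_summary(findings):
    """Summarize agreement among testing teams.

    `findings` maps a team name to the set of problem IDs that team
    reported, after similar observations have been merged into one ID.
    """
    # Count how many teams reported each unique problem.
    counts = Counter(pid for problems in findings.values() for pid in problems)
    total = len(counts)
    n_teams = len(findings)
    found_by_one = sum(1 for c in counts.values() if c == 1)
    found_by_all = sum(1 for c in counts.values() if c == n_teams)
    return {
        "unique_problems": total,
        "found_by_one_team": found_by_one,
        "found_by_all_teams": found_by_all,
        "share_found_by_one_team": found_by_one / total if total else 0.0,
    }


# Hypothetical example data: three teams, six unique problems.
findings = {
    "team_1": {"P1", "P2", "P3"},
    "team_2": {"P2", "P4"},
    "team_3": {"P2", "P5", "P6"},
}

summary = overlap_summary(findings)
print(summary)
```

Note that this computes the share of unique problems found by exactly one team, which is the kind of figure reported for the Kessner study. The figures for the Molich studies are stated per team, which is a different calculation.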

When considering the five specific questions that designers wanted answered, there was moderate agreement among the teams on two questions, and low agreement on the other three.

Taken together, the findings of these three studies show that there is considerable need for improvement in the usability testing process. Contrary to what some would like us to believe, usability testing is extremely difficult to do well. As a discipline, we need fewer "discount" methods, and more research-based, truly valid methods for finding true usability problems.

These findings show that even experienced usability professionals have difficulty in identifying usability problems. Should designers trust all observations made by usability professionals? With this much variability in performance testing results, should Web site designers trust any observations made by usability professionals?

Usability professionals do not let clients drop off a prototype Web site with the request to find as many problems as possible; and professional designers do not take seriously the never-ending list of "problems" identified by someone who has a usability lab with fancy video equipment. Any amateur with a conference room and a couple of subjects can use a performance test to find all kinds of so-called "usability problems." Some do not even need the test subjects – they can find a multitude of "problems" just by staring at a website and fiddling with the links.

I agree with Kessner et al. (2001) that the one thing most likely to reduce the large-scale disagreements among usability testers is to have designers specify precisely the usability questions they have. Ideally, these questions will include the maximum allowable time for task completion, and a clear definition of success for each task. The true usability professional can then effectively use a performance test to identify those usability problems that most need finding and fixing.
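One way to make such designer-specified questions concrete is to record, for each task, the maximum allowable time and the definition of success, and then score each test session against them. The sketch below is only an illustration under assumed names: `TaskCriterion`, `Observation`, `pass_rate`, the task name, and the timing values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskCriterion:
    task: str
    max_seconds: float       # maximum allowable time for task completion
    success_definition: str  # what the designers count as success


@dataclass(frozen=True)
class Observation:
    task: str
    seconds: float
    succeeded: bool  # tester's judgment against success_definition


def meets_criterion(criterion, obs):
    """A session passes only if it succeeded within the time limit."""
    return obs.succeeded and obs.seconds <= criterion.max_seconds


def pass_rate(criterion, observations):
    """Fraction of sessions for this task that met the criterion."""
    relevant = [o for o in observations if o.task == criterion.task]
    if not relevant:
        return None  # no data for this task
    passed = sum(1 for o in relevant if meets_criterion(criterion, o))
    return passed / len(relevant)


# Hypothetical criterion and test sessions.
criterion = TaskCriterion(
    task="create_appointment",
    max_seconds=120,
    success_definition="Appointment appears on the correct date and time",
)
observations = [
    Observation("create_appointment", 90, True),
    Observation("create_appointment", 150, True),   # succeeded, but too slow
    Observation("create_appointment", 60, False),   # fast, but failed
    Observation("create_appointment", 110, True),
]

rate = pass_rate(criterion, observations)
print(rate)
```

With the criteria fixed in advance, different testing teams are scoring the same thing, which is the point of having designers state their questions precisely.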


References

Kessner, M. (2000), On the reliability of usability testing, Carleton University Masters Thesis, Ottawa, Ontario, December.

Kessner, M., Wood, J., Dillon, R.F. and West, R.L. (2001), On the reliability of usability testing, CHI 2001 Poster.

Molich, R., Thomsen, A.D., Karyukina, B., Schmidt, L., Ede, M., Oel, W.V. and Arcuri, M. (1999), Comparative evaluation of usability tests, CHI'99 Extended Abstract, 83-84 (Summary available.)

Molich, R., Bevan, N., Curson, I., Butler, S., Kindlund, E., Miller, D., Kirakowski, J. (1998), Comparative evaluation of usability tests, Proceedings of the Usability Professionals Association. (Summary available.)


Contact Dr. Bob Bailey at (801) 201-2002 or bob@webusability.com
Copyright 2002 - 2005