Evidence-Based Information, Training and Tools
for Optimizing the Usability of Computer Systems
Calculating the Number of Test Subjects
December 2000
Many people have asked how to use the binomial probability formula to calculate the number of test subjects needed. The following explanation should help clarify the major issues.

The original reference for the formula, as it relates to usability testing, is an article by Bob Virzi at GTE in 1990. Virzi's article was followed by one from Jim Lewis at IBM in 1993, and another by Lewis in 1994. Many statistics books contain the formula for calculating a binomial probability, but the Virzi and Lewis articles include usability-related examples. I have taken their original write-ups and added new information and examples in the third edition of my Human Performance Engineering textbook (pp. 210-215).

The formula is 1 - (1 - p)^n, where p is the probability of the usability problem occurring for any one participant and n is the number of test participants. The result is the probability that the problem occurs at least once among the n participants. Based on the Palm Beach County voting returns, we know that p is 0.01, and we want to find n. In other words, we are looking for a problem that is a difficulty for only one out of 100 people (p), and we want to estimate the number of subjects needed to feel confident that we will find it.

Generally, we apply this formula to determine the minimum number of test subjects needed to find a certain percentage of the usability problems in a system or Web site. Unfortunately, we never know how many usability problems actually exist in a new system, and we do not know what percentage of the actual problems each test subject (or heuristic evaluator) will help us find. Virzi originally proposed .40 (Virzi, 1990), and Nielsen has been advocating .31 (Alertbox: March 19, 2000). The major problem with .40, .31, or any similar number is that it represents the proportion of usability problems found by one evaluator (or one test subject) relative to the total found by all evaluators (or all test subjects).
The number of usability problems found by all evaluators is not the actual number of usability problems in a system (see Bailey et al., 1992). Evaluators will miss some problems, and they will judge a relatively large number of issues to be usability problems when they are not. We usually refer to the latter as "false alarms" (Catani and Biers, 1998; Stanton and Stevenage, 1998; Rooden et al., 1999). Based on these studies, there can be as many as two false alarms for every true problem.

Lewis (1994) reported on a study at IBM in which the participants were test subjects. They used a system in which he had deliberately created numerous usability problems ("salted the mine"), and they experienced a combined total of 145 problems. He calculated that the average likelihood of any one subject experiencing a problem was .16, far less than .40 or .31.

If a system truly contained 145 usability problems, each person experienced only about 16% of them, and we had five participants, we could use the formula to calculate what percentage of the problems the five subjects together would be expected to uncover:

1 - (1 - .16)^5 = 1 - (.84)^5 = 1 - .42 = .58

With five test subjects, we would expect to find about 58% of the problems. If we used ten test subjects, what percentage of the 145 problems would we expect them to uncover?

1 - (1 - .16)^10 = 1 - (.84)^10 = 1 - .17 = .83

With ten test subjects, we would expect to find about 83% of the usability problems. To put it another way, we would expect to find and (hopefully) fix those problems that could pose some difficulty to about four out of five users. The major assumption here is that each subject will, on average, experience about 16% of the problems.

Usability professionals try to use the number of subjects that will enable them to accomplish the goals of a usability test as efficiently as possible.
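Returning to the two worked calculations for five and ten subjects, the arithmetic can be checked with a few lines of Python. This is only a sketch: the function name `proportion_found` and the constant names are made up for this example, and the code assumes, as the formula does, that each subject experiences problems independently of the others.

```python
def proportion_found(p: float, n: int) -> float:
    """Expected proportion of problems found by n subjects when each
    subject independently experiences a proportion p of the problems."""
    return 1.0 - (1.0 - p) ** n


P_PER_SUBJECT = 0.16   # average likelihood reported by Lewis (1994)
TOTAL_PROBLEMS = 145   # combined total of problems in that study

for n in (5, 10):
    found = proportion_found(P_PER_SUBJECT, n)
    # Convert the proportion into an approximate count of the 145 problems.
    print(f"n = {n:2d}: {found:.2f} of the problems, "
          f"about {found * TOTAL_PROBLEMS:.0f} of {TOTAL_PROBLEMS}")
```

The two printed proportions are 0.58 and 0.83, which correspond to roughly 84 and 120 of the 145 problems.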
If we use too many subjects, we can increase the cost and development time of a system. If we use too few, we may fail to detect some serious problems and could reduce the overall usability of the product. When designing Palm Beach County's ballot, Ms. LePore used far too few.

The Buchanan (butterfly ballot) problem provided a unique experience for usability professionals. It gave us one of the numbers we usually do not have: the actual (true) proportion of people who had difficulty voting because of one or more usability problems related to the ballot. There were 2,800 "erroneous" votes made by 272,532 actual and potential Gore voters. This is about one out of 100, or 0.01. In other words, 99% of the users (voters) dealt effectively with the ballot's usability-related problems, but 1% did not.

The question then becomes: how many test subjects would have been needed to find the usability problems that posed difficulties to this relatively small proportion (1%) of users? Most usability testers never worry about such problems because the cost of finding them (the time needed to conduct the tests and the large number of test subjects needed) is too great for most systems. Obviously, if the penalty for errors is serious injury, loss of life, huge "support" costs, millions of dollars in lost sales, or a lost presidential election, then it may be worth the money to find and fix the problems.

I applied the binomial probability formula to estimate the number of usability test subjects Ms. LePore would have needed. In this case, p is .01, the probability of the usability problem occurring. Without building a special program to solve for n, we simply increased n in the formula until we found the number of subjects needed to find either 95% or 99% of the ballot problems using a usability test.
Another way of thinking about the problem: if any one participant has a low probability of having difficulty with the ballot, as the actual numbers show, the total number of participants needed to find difficulties like the Buchanan (butterfly ballot) problem can become very high. In this case, 289 subjects would be needed to find 95% of the problems that are difficulties for only a very small proportion (1%) of voters, and 423 would be needed to find 99%.
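The search for n (increase n until the formula reaches the target) can be sketched in Python as follows. This is an illustration, not the procedure actually used for the figures above, and the function names are hypothetical. It also includes the equivalent closed form, n = ln(1 - target) / ln(1 - p), rounded up. Note that with p fixed at exactly .01 and a strict cutoff (a result of at least .95 or .99), the sketch returns 299 and 459 rather than 289 and 423. At n = 289 the formula gives about .945, and at n = 423 about .986, so the reported figures may reflect results that round to the targets; that reading is an assumption.

```python
import math


def proportion_found(p: float, n: int) -> float:
    """Probability that a problem affecting a proportion p of users
    occurs at least once among n independent test subjects."""
    return 1.0 - (1.0 - p) ** n


def subjects_needed_by_search(p: float, target: float) -> int:
    """Increase n one subject at a time until the formula reaches target."""
    n = 1
    while proportion_found(p, n) < target:
        n += 1
    return n


def subjects_needed_closed_form(p: float, target: float) -> int:
    """Solve 1 - (1 - p)^n >= target for n directly, rounding up."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


P_BALLOT = 0.01  # about one voter in 100 had difficulty with the ballot

for target in (0.95, 0.99):
    print(target,
          subjects_needed_by_search(P_BALLOT, target),
          subjects_needed_closed_form(P_BALLOT, target))
# 0.95 299 299
# 0.99 459 459

# Coverage reached at the subject counts quoted in the text.
for n in (289, 423):
    print(n, round(proportion_found(P_BALLOT, n), 3))
```

Either way, the conclusion is the same: finding a problem that affects only 1% of users takes hundreds of test subjects.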
References

Bailey, R.W. (1996), Human Performance Engineering: Designing High Quality, Professional User Interfaces for Computer Products, Applications and Systems, Prentice Hall: Englewood Cliffs, NJ.

Bailey, R.W., Allen, R.W. and Raiello, P. (1992), Usability testing vs. heuristic evaluation: A head-to-head comparison, Proceedings of the Human Factors Society 36th Annual Meeting, 409-413.

Catani, M.B. and Biers, D.W. (1998), Usability evaluation and prototype fidelity: Users and usability professionals, Proceedings of the Human Factors and Ergonomics Society 42nd Annual Meeting, 1331-1335.

Lewis, J.R. (1994), Sample sizes for usability studies: Additional considerations, Human Factors, 36(2), 368-378.

Lewis, J.R. (1993), Problem discovery in usability studies: A model based on the binomial probability formula, Proceedings of the 5th International Conference on Human-Computer Interaction, 666-671.

Rooden, M.J., Green, W.S. and Kanis, H. (1999), Difficulties in usage of a coffeemaker predicted on the basis of design models, Proceedings of the Human Factors and Ergonomics Society - 1999, 476-480.

Stanton, N.A. and Stevenage, S.V. (1998), Learning to predict human error: Issues of acceptability, reliability and validity, Ergonomics, 41(11), 1737-1747.

Virzi, R.A. (1990), Streamlining the design process: Running fewer subjects, Proceedings of the Human Factors Society 34th Annual Meeting, 291-294.
Contact Dr. Bob Bailey at (801) 201-2002 or bob@webusability.com. Copyright 2002 - 2005