Statistically Analyzing Success Rates in Web Usability Testing: The Cochran's Q Test
The Cochran's Q Test can be used to statistically analyze success rate data, even when only a small number of users are tested. This provides some indication that a vast amount of usability data can, and should be, statistically analyzed.
In a recent article, "Success Rate: The Simplest Usability Metric", Jakob Nielsen recommends success rates as a useful descriptive statistic for web usability data. Essentially, the success rate is the proportion of users who have successfully completed a particular task. More generally, it can be defined as the number of successes divided by the number of task attempts. One way to represent success data is as zeros and ones. For example, "1" could represent success and "0" could represent failure in a database. Nielsen indicates that other frameworks for coding success are available. For example, the analyst may want to record partial successes for users who complete only part of a task.
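As a minimal sketch of the coding scheme just described, the Python snippet below computes an overall success rate and per-task success rates from a table of zeros and ones. The data and the function names (`success_rate`, `task_success_rates`) are hypothetical and chosen only for illustration; they are not Nielsen's data.

```python
# Hypothetical results, NOT Nielsen's data: rows are users, columns are tasks.
# 1 = success, 0 = failure.
outcomes = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
]


def success_rate(table):
    """Number of successes divided by the number of task attempts."""
    attempts = sum(len(row) for row in table)
    successes = sum(sum(row) for row in table)
    return successes / attempts


def task_success_rates(table):
    """Proportion of users who succeeded on each task (one value per column)."""
    n_users = len(table)
    n_tasks = len(table[0])
    return [sum(row[j] for row in table) / n_users for j in range(n_tasks)]


overall = success_rate(outcomes)         # 9 successes out of 15 attempts
per_task = task_success_rates(outcomes)  # one rate per task
```

Here 9 of the 15 attempts succeed, so the overall rate is .6, while the per-task rates show how unevenly those successes are spread across tasks.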
At WebWord, we certainly agree that data and quantitative information are a useful way to report usability test results. However, the more technically inclined consumer of usability data may wonder how success rate data might be subjected to formal statistical analyses.
Statistical analyses are important for assessing the reliability of a result. However, since usability studies often include only a small number of users, formal statistical analyses are often not powerful enough to be useful. On the other hand, repeated measures designs, where each user generates data in several conditions, are statistically powerful and efficient.
Understanding the Cochran's Q
One statistical technique is the Cochran's Q test, which is designed to assess differences across conditions in which there is a dichotomous outcome. For usability data, users might generate success data in multiple conditions or across multiple tasks. Cochran's Q can test whether responses across the conditions differ significantly from each other. It can be effective for data sets with a small number of users. In fact, for designs with dichotomous responses (0's and 1's), a sample size of 16 or greater allows the analyst to safely conduct traditional analysis of variance (ANOVA).
ANOVA is a technique common in many social, behavioral and biological sciences. Many readers are undoubtedly familiar with ANOVA and its flexibility (e.g., the ability to conduct omnibus and focused tests).
Cochran's Q is effective with smaller samples, especially when the probability of a response (success for example) is approximately .5. For an example of Cochran's Q, consider the following data culled from Nielsen's recent article:
Note that each user generates a dichotomous response, success or failure, for each task. It also should be noted that in Nielsen's data, he had coded some tasks as partial successes, which have been recoded here as failures for purposes of exposition. This example speaks to the value of a formal statistical test in that the pattern of data may not be obvious from simple success rates, especially with only four users. The analyst may want to determine if the six tasks differ amongst themselves.
A formal statistical test can be conducted by positing the following null hypothesis:
$$H_0: \pi_1 = \pi_2 = \cdots = \pi_k$$
where $\pi_k$ is the probability of success on the kth task. Note that in this case, there are k=6 tasks.
In the Cochran's Q test, the null hypothesis is that the probability of the target response is equal across all groups. For those unfamiliar with formal statistical tests, the question being asked is the following: "What is the probability I would have obtained my result assuming the null hypothesis is true?" If the obtained results are likely under the null hypothesis, the analyst has no grounds to conclude that the groups differ; if they are sufficiently unlikely, the analyst rejects the null hypothesis and concludes that at least one group differs.
Computation of the Cochran's Q
The next section of this article describes the computation of the Cochran's Q. In addition, WebWord has developed an Excel spreadsheet implementation of the Cochran's Q test, available for download. The spreadsheet contains an example (Nielsen's data) of the Cochran's Q test, as well as all the relevant information needed to interpret the test. Hence, the reader unfamiliar with formal statistical testing is encouraged to examine the accompanying spreadsheet to aid in understanding the computations.
If S represents the users, each completing a different tasks (a being the number of tasks), with each task denoted as a level of the factor A, the Q statistic is defined as:
$$Q = \frac{SS_A}{MS_{A/S}}$$
$SS_A$ (the sum of squares for factor A) is computed by taking the mean at the jth level of A, subtracting the grand mean from it, and then squaring that quantity. Note that "j" is an index for the a levels of the factor A. For Nielsen's data, note that a=6 because there are six tasks. This squared quantity is summed across the a levels and across subjects; because it is the same for every subject, this amounts to multiplying the sum over levels by the number of users. See Appendix A for the equations that go into the Q statistic.
$MS_{A/S}$ (the mean square for A within S) is simply the variance of the scores within a subject across the levels of A, averaged across subjects. In other words, it is the variance of each user's a scores, averaged across the n users. Recall that a refers to the number of levels of the factor A (the different tasks) and n refers to the number of users; for Nielsen's data, n=4. The n users compose the S source of variability.
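The verbal description above can be sketched in Python as follows. This is an illustration under stated assumptions, not the spreadsheet's implementation: the function name `cochran_q` and the small data set are hypothetical, and the within-user variance is assumed to use a denominator of a-1, which is the choice that makes the ratio agree with the usual computational formula for Cochran's Q.

```python
def cochran_q(scores):
    """Return (Q, df) for a users-by-tasks table of 0/1 scores.

    Q is computed as SS_A / MS_A/S, following the description in the text.
    """
    n = len(scores)      # number of users (S)
    a = len(scores[0])   # number of tasks (levels of factor A)

    grand_mean = sum(sum(row) for row in scores) / (n * a)
    task_means = [sum(row[j] for row in scores) / n for j in range(a)]

    # SS_A: squared deviation of each task mean from the grand mean,
    # summed over tasks and counted once per user.
    ss_a = n * sum((m - grand_mean) ** 2 for m in task_means)

    def within_user_variance(row):
        # Assumption: sample variance with an (a - 1) denominator.
        user_mean = sum(row) / a
        return sum((y - user_mean) ** 2 for y in row) / (a - 1)

    # MS_A/S: within-user variance averaged over the n users.
    ms_a_s = sum(within_user_variance(row) for row in scores) / n
    if ms_a_s == 0:
        raise ValueError("no user varies across tasks; Q is undefined")

    return ss_a / ms_a_s, a - 1


# Hypothetical data, NOT Nielsen's: 5 users (rows) by 3 tasks (columns).
hypothetical_scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
]
q, df = cochran_q(hypothetical_scores)
```

For this hypothetical table, $SS_A$ is 1.6 and $MS_{A/S}$ is about .267, giving Q = 6 on 2 degrees of freedom.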
The Q statistic is approximately distributed as a $\chi^2$ statistic with degrees of freedom equal to the number of levels (e.g., tasks) in the experiment minus one (i.e., a-1). It should be noted that the spreadsheet contains the critical values of the $\chi^2$ distribution for degrees of freedom ranging from 1 to 20 (most usability tests will not have more than 20 tasks). The critical values cut off the 95th and 90th percentiles of the $\chi^2$ distribution so that the Type I error rate will be at the conventional .05 and also the more liberal .1. The more liberal criterion of .1 might be indicated for usability testing with few users. The critical values were generated using Mathcad ® 2000 for Windows.
The purpose of this article was to raise awareness of the Cochran's Q statistic for usability data. Although usability data are often not formally analyzed because of the small number of users typically included in usability studies, it is still important to be aware of analytic tools. The astute reader might raise questions about statistical power, recommended sample sizes, effect sizes, contrasts, etc. For the moment these issues will be deferred to future articles, although it is worth pointing out that repeated measures designs are more powerful than between-subjects designs.
For example, although Nielsen's data include only four users, they yield the following computation of the Cochran's Q: Q = 3.33 ÷ .25 = 13.333 on 6 - 1 = 5 degrees of freedom, which is statistically significant at the .05 level. This indicates that if the probability of success were equal for all tasks, there would be less than a 5% chance of obtaining results at least this extreme. The analyst would then conclude that there is a difference among the six tasks.
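The tail probability behind a statement like this can be checked without a table of critical values. The sketch below uses closed-form expressions for the chi-square upper tail with integer degrees of freedom, relying only on Python's standard `math` module; the function name `chi2_sf` is an assumption made for illustration. It treats Q as approximately chi-square distributed, an approximation that may be rough with very few users.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variable.

    Works for positive integer degrees of freedom only.
    """
    if df < 1 or x < 0:
        raise ValueError("df must be a positive integer and x non-negative")
    y = x / 2.0
    if df % 2 == 0:
        # Even df: finite sum of Poisson-like terms.
        terms = sum(y ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-y) * terms
    # Odd df: complementary error function plus a finite correction sum.
    m = (df - 1) // 2
    tail = sum(y ** (i - 0.5) / math.gamma(i + 0.5) for i in range(1, m + 1))
    return math.erfc(math.sqrt(y)) + math.exp(-y) * tail


# Q and df as reported in the text for Nielsen's data.
p_value = chi2_sf(13.333, 5)
significant_at_05 = p_value < 0.05
```

Applied to the reported Q of 13.333 on 5 degrees of freedom, the function returns a probability of roughly .02, which is below the .05 criterion.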
This analysis was also conducted in SPSS ® version 8.0 for Windows, which yielded an exact p-value of .02. This indicates that if the null hypothesis were true, there would be only a 2% chance that Nielsen would have obtained results at least this extreme. This suggests there is a difference among the six tasks (refer to the table above). More importantly, this difference was detected with only four users.
This test is useful to clarify differences among user tasks. It is an omnibus test that only tests if there is at least one difference among the groups, as opposed to where the difference is located. One liberal strategy for focused tests (or contrasts) would be to conduct the omnibus test with the Q statistic. If the omnibus test detects a difference and rejects the null hypothesis, the analyst could then conduct pairwise Q tests to see which means differ from each other. Although this is a liberal approach, it is no more liberal than doing no statistical test at all. In addition, it is analogous to Fisher's least significant difference test for pairwise contrasts familiar to readers versed in ANOVA.
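One way to sketch this follow-up strategy is to run Q on every pair of tasks. The code below is an illustration under assumptions: the function names and the data are hypothetical, and it uses the common row-total and column-total computational form of Cochran's Q rather than the sums-of-squares notation used in this article (the two are algebraically equivalent). Each pairwise Q has one degree of freedom, so it is compared against the chi-square critical value of 3.841 at the .05 level.

```python
from itertools import combinations


def cochran_q_totals(scores):
    """Cochran's Q from row and column totals; None if Q is undefined."""
    k = len(scores[0])
    col_totals = [sum(row[j] for row in scores) for j in range(k)]
    row_totals = [sum(row) for row in scores]
    total = sum(row_totals)
    denom = k * total - sum(r * r for r in row_totals)
    if denom == 0:
        return None  # no user differs across these tasks
    numer = (k - 1) * (k * sum(c * c for c in col_totals) - total ** 2)
    return numer / denom


def pairwise_q(scores):
    """Q for every pair of tasks, keyed by (task index, task index)."""
    k = len(scores[0])
    results = {}
    for i, j in combinations(range(k), 2):
        pair = [[row[i], row[j]] for row in scores]
        results[(i, j)] = cochran_q_totals(pair)
    return results


# Hypothetical data, NOT Nielsen's: 5 users (rows) by 3 tasks (columns).
hypothetical_scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [1, 0, 0],
]
omnibus_q = cochran_q_totals(hypothetical_scores)
pair_qs = pairwise_q(hypothetical_scores)

CRITICAL_05_DF1 = 3.841  # chi-square critical value, 1 df, .05 level
differing_pairs = [pair for pair, q in pair_qs.items()
                   if q is not None and q > CRITICAL_05_DF1]
```

With these hypothetical data, only the comparison of the first and third tasks (Q = 4.0) exceeds 3.841. A pair on which no user differs between the two tasks has no defined Q, so the sketch returns `None` in that case.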
The Cochran's Q is one test used to analyze dichotomous responses. Other tests may also be appropriate for identifying factors that affect user success on web tasks (e.g., logistic regression). We hope this brief presentation of Cochran's Q will help usability analysts better understand success and failure data. The equations for the components of the Cochran's Q test are listed below for the interested reader. In addition to this article, the reader can download a Microsoft Excel ® spreadsheet implementation of the Cochran's Q, which can be easily modified for their own purposes. A thorough exposition of the Cochran's Q can be found in Myers & Well (1995), Research Design and Statistical Analysis. The present article uses the notation and equations found in the Myers & Well text; our intent was not simply to present the Cochran's Q, which is done in many texts, but to describe its applicability to web usability analyses.
Appendix A -- Computations for the Cochran's Q
A refers to factor A (the different tasks in the Nielsen example). The number
of levels of A (e.g., tasks) is a.
S refers to the users. n refers to the number of users.
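To Compute $SS_A$
$$SS_A = n \sum_{j=1}^{a} (\bar{Y}_{.j} - \bar{Y}_{..})^2$$
$\bar{Y}_{.j}$ is the mean for the jth task where "j" is an index for the levels of the factor A.
$\bar{Y}_{..}$ is the grand mean, or the average score across users and levels of the factor A. The grand mean for Nielsen's data is .375.
To Compute $MS_{A/S}$
$$MS_{A/S} = \frac{1}{n} \sum_{i=1}^{n} s_i^2$$
$s_i^2$ is the variance of the a scores for the ith user where "i" is an index for the n users:
$$s_i^2 = \frac{1}{a-1} \sum_{j=1}^{a} (Y_{ij} - \bar{Y}_{i.})^2$$
where $Y_{ij}$ is the score (i.e., 1 for success, 0 for failure) for the ith user on the jth task.
$\bar{Y}_{i.}$ is the mean for the ith user across the a tasks.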
To Compute Degrees of Freedom (df)
df=a-1 where a is the number of levels of the factor A (e.g., tasks)