When users select one categorical variable with three or more groups and one continuous or discrete variable, Statwing runs a one-way ANOVA (Welch’s F test) and a series of pairwise “post hoc” tests (Games-Howell tests). The one-way ANOVA tests for an overall relationship between the two variables, and the pairwise tests compare each possible pair of groups to see whether one group tends to have higher values than the other.

Statwing recommends an unranked Welch’s F test if several assumptions about the data hold:

- The sample size is greater than 10 times the number of groups in the calculation (groups with only one value are excluded), and therefore the Central Limit Theorem satisfies the requirement for normally distributed data.[1]
- There are few or no outliers in the continuous/discrete data.[2]
- The data are in fact continuous or discrete and not ordinal.[3]
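The three checks above can be sketched as a small decision function. This is only an illustration, not Statwing's actual implementation: the function names, the `max_outliers` threshold (the text says only "few or no outliers"), the choice to look for outliers in the pooled data, and the quartile interpolation method are all assumptions. The outlier rule itself follows footnote 2.

```python
from statistics import quantiles

def count_outliers(values):
    """Count points more than 3 * IQR above Q3 or below Q1 (see footnote 2)."""
    # The "inclusive" quartile method is an assumption; the text does not name one.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return sum(1 for v in values if v < q1 - 3 * iqr or v > q3 + 3 * iqr)

def recommend_unranked(groups, is_ordinal, max_outliers=0):
    """Return True if the unranked test looks appropriate, False for the ranked one."""
    # Groups with only one value are excluded from the calculation.
    usable = [g for g in groups if len(g) > 1]
    n_total = sum(len(g) for g in usable)
    if n_total <= 10 * len(usable):
        return False  # sample too small to lean on the Central Limit Theorem
    pooled = [v for g in usable for v in g]
    if count_outliers(pooled) > max_outliers:  # hypothetical threshold
        return False
    return not is_ordinal
```

For example, three groups of 11 well-behaved values pass the size check (33 > 30), while three groups of 10 do not.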

Unlike the slightly more common F test for *equal* variances, Welch’s F test does not assume that the variances of the groups being compared are equal. Assuming equal variances leads to less accurate results when variances are not in fact equal, whereas Welch’s F test gives results very similar to those of the equal-variance test when variances are actually equal (Tomarken and Serlin, 1986).
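To make the role of the variances concrete, here is a minimal sketch of the textbook Welch F statistic, written with the Python standard library only. It is not Statwing's code, and the function name `welch_f` is made up for this example. Each group is weighted by its size divided by its own variance, so no shared variance is ever estimated.

```python
from statistics import mean, variance

def welch_f(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA.

    Each group needs at least two values and a nonzero variance.
    """
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    # Weight each group by n / s^2, so noisier groups count for less.
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_w = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_w

    # Weighted between-group variation.
    between = sum(
        w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)
    # Correction term that also drives the denominator degrees of freedom.
    lam = sum((1 - w / total_w) ** 2 / (n - 1) for w, n in zip(weights, sizes))

    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df1 = k - 1
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, df1, df2
```

The p-value is the upper tail of an F distribution with `df1` and `df2` degrees of freedom; with SciPy installed, `scipy.stats.f.sf(f_stat, df1, df2)` would give it.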

When assumptions are violated, the unranked ANOVA may no longer be valid. In that case, Statwing recommends the *ranked* ANOVA (also called “ANOVA on ranks”); Statwing rank-transforms the data (replaces values with their rank ordering) and then runs the same ANOVA on that transformed data.

The ranked ANOVA is robust to outliers and non-normally distributed data. Rank transformation is a well-established method for protecting against assumption violation (a “nonparametric” method), and is most commonly seen in the difference between Pearson and Spearman correlation. Rank transformation followed by Welch’s F test is similar in effect to the Kruskal-Wallis Test (Zimmerman, 2012).
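The rank transformation described above can be sketched as follows. This is an illustrative version, not Statwing's code: the function name is hypothetical, and giving tied values the average of their ranks is an assumption (it is the usual convention, but the text does not say how ties are handled).

```python
def rank_transform(groups):
    """Replace every value with its rank in the pooled data.

    Tied values share the average of the ranks they occupy (assumed convention).
    """
    pooled = sorted(v for g in groups for v in g)
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i..j-1 (0-based) hold the same value; ranks are 1-based,
        # so the tied block occupies ranks i+1..j and averages to (i + 1 + j) / 2.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    return [[rank_of[v] for v in g] for g in groups]
```

Note how an extreme value such as 1000 in `[[10, 20], [20, 1000]]` simply becomes the top rank, 4.0, which is why the ranked test is insensitive to outliers. The same ANOVA is then run on the returned groups.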

Note that Statwing’s ranked and unranked ANOVA effect sizes (Cohen’s f) are calculated using the F value from the F test for equal variances.
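As a rough illustration of that calculation, the sketch below computes the classic equal-variance F value and converts it to Cohen's f. The conversion used, f = sqrt(F × df_between / df_within), is the standard identity (it equals the square root of the between-group over within-group sum of squares), but the text does not spell out Statwing's exact formula, so treat it as an assumption. The function names are hypothetical.

```python
from math import sqrt
from statistics import mean

def classic_f(groups):
    """Return (F, df_between, df_within) for the equal-variance one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    group_means = [mean(g) for g in groups]

    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

def cohens_f(groups):
    """Cohen's f derived from the equal-variance F value (assumed conversion)."""
    f_stat, df_between, df_within = classic_f(groups)
    return sqrt(f_stat * df_between / df_within)
```

For the ranked ANOVA, the same function would be applied to the rank-transformed values rather than the raw ones.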

Statwing runs Games-Howell tests regardless of the outcome of the ANOVA test (as per Zimmerman, 2010). Statwing shows unranked or ranked Games-Howell pairwise tests based on the same criteria as those used for ranked vs. unranked ANOVA; so if you see “Ranked ANOVA” in the advanced output, the pairwise tests will also be ranked.

The Games-Howell test is essentially a t-test for unequal variances that accounts for the heightened likelihood of finding statistically significant results by chance when running many pairwise tests. Unlike the slightly more common Tukey’s b test, the Games-Howell test does not assume that the variances of the groups being compared are equal. Assuming equal variances leads to less accurate results when variances are not in fact equal, whereas the Games-Howell test gives very similar results when variances are actually equal (Howell, 2012).
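A minimal sketch of the textbook Games-Howell statistics may help here. It is not Statwing's implementation, and the function name is made up for the example. For each pair it computes the unequal-variance (Welch) standard error and degrees of freedom, then rescales the t statistic to the studentized range scale, which is where the correction for multiple comparisons comes from.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def games_howell_stats(groups):
    """Return a list of (i, j, mean_diff, q, df), one entry per pair of groups."""
    results = []
    for i, j in combinations(range(len(groups)), 2):
        a, b = groups[i], groups[j]
        # Squared standard error of each group mean, estimated separately.
        se2_a = variance(a) / len(a)
        se2_b = variance(b) / len(b)
        diff = mean(a) - mean(b)
        # Welch-Satterthwaite degrees of freedom for this pair.
        df = (se2_a + se2_b) ** 2 / (
            se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
        )
        # |t| * sqrt(2) puts the statistic on the studentized range scale.
        q = abs(diff) / sqrt(se2_a + se2_b) * sqrt(2)
        results.append((i, j, diff, q, df))
    return results
```

The p-value for a pair is the upper tail of the studentized range distribution with `len(groups)` groups and `df` degrees of freedom; with SciPy 1.7 or later, `scipy.stats.studentized_range.sf(q, len(groups), df)` would compute it.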

Note that while the unranked pairwise test tests for the equality of the *means* of the two groups, the ranked pairwise test does not explicitly test for differences between the groups’ means or medians. Rather, it tests for a general tendency of one group to have larger values than the other.

Additionally, while Statwing does not show results of pairwise tests for any group with fewer than 4 values, those groups are included in calculating the degrees of freedom for the other pairwise tests.

Please contact us if you have questions or feedback about ANOVA or pairwise tests in Statwing or the explanations above.

1. With smaller sample sizes, data can still be visually inspected to determine whether they are in fact normally distributed; if they are, unranked ANOVA results are still valid even for small samples. In practice this assessment can be difficult to make, so Statwing recommends the ranked ANOVA by default for small samples.

2. With larger sample sizes, outliers are less likely to negatively affect results. Statwing uses Tukey’s “outer fence” to define outliers as points more than 3 times the interquartile range above the 75th percentile or below the 25th percentile.

3. Data like *Highest level of education completed* or *Finishing order in marathon* are unambiguously ordinal. Though Likert scales (like a 1 to 7 scale where 1 is *Very dissatisfied* and 7 is *Very satisfied*) are technically ordinal, it is common practice in social sciences to treat them as though they are continuous (i.e., with an unranked ANOVA).