Chapter 17 – Testing for differences between two samples
This chapter introduces statistical tests for assessing the significance of differences between two samples and also introduces calculations of effect size and power.
Exercises
Here are the data sets, in SPSS and in MS Excel, for the results that are calculated by hand in this chapter of the book. The related t, unrelated t and single sample t data sets are all contained in the Excel file t test data sheets.xls. The files in SPSS are unrelated t sleep data.sav, related t imagery data.sav and single sample t test data.sav.
Exercise 17.1
The data files for the non-parametric tests are linked below. The Excel file nonparametric test data.xls contains the data for the Mann-Whitney, Wilcoxon and Sign test calculations. The SPSS files are, respectively, mannwhitney stereotype data.sav, wilcoxon module ratings data.sav and sign test therapy data.sav.
t tests on further data sets
Data sets are provided here that correspond with the three research designs described below. Your first task is to identify which type of t test should be performed on the data for each design. Choose from:
Unrelated t test; Related t test; Single sample t test
Scenario 1: Participants are asked to solve one set of anagrams in a noisy room and then solve an equivalent set in a quiet room. The prediction is that participants will perform worse in the noisy room. Data are given in seconds.
Show answer
Related t test
Reason: Repeated measures design. Each participant performed in both conditions.
Scenario 2: A sample of children is selected from a ‘free’ school where the educational policy is radically different from the norm and where students are allowed to attend classes when they like and are also involved in deciding what lessons will be provided by staff. It is suspected their IQ scores may be lower than the average.
Show answer
Single sample t test
Reason: We have only one group, but we can compare its mean with the standardized mean for IQ of 100.
Scenario 3: One group of children is asked to complete a scale concerning attitudes to people with disabilities. A second group of children is shown a film about the experiences of people with disabilities and then asked to complete the attitude scale a week later. The research is trying to show that changes in attitude last beyond the limits of the typical short-term laboratory experiment.
Show answer
Unrelated t test
Reason: Two separate groups, each with a mean on the same DV. Independent samples design.
Now conduct the appropriate test on each data set and give a full report of the result including: t value, df, p value (either exact or in the ‘p less than …’ format), 95% confidence limits for the mean difference and effect size.
Show answer
Scenario 1 (related t)
The mean time to solve anagrams in the noisy room (M = 193.45 secs, SD = 43.16) was higher than the mean time for the quiet room (M = 178.25, SD = 24.52) resulting in a mean difference of 15.2 seconds. This difference was not significant, t (19) = 1.558, p = .136. The mean difference (95% CI: -5.22 to 35.62) was small (Cohen’s d = 0.35).
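The confidence interval and effect size reported above can be recovered from the summary figures alone. The following Python sketch is illustrative and is not part of the book's materials. It assumes a critical t value of 2.093 (two-tailed, p = .05, df = 19, from standard tables) and assumes that Cohen's d for a related design is the mean difference divided by the standard deviation of the differences, which is equivalent to t / √n. All variable names are hypothetical.

```python
import math

# Summary figures reported for Scenario 1 (related t test)
mean_diff = 15.2      # noisy mean minus quiet mean, in seconds
t_value = 1.558       # reported t with df = 19
n = 20                # number of participants (df + 1)
t_crit = 2.093        # ASSUMPTION: two-tailed .05 critical t for df = 19, from tables

# Standard error of the mean difference, recovered from t = mean_diff / se
se_diff = mean_diff / t_value

# 95% confidence limits for the mean difference
ci_low = mean_diff - t_crit * se_diff
ci_high = mean_diff + t_crit * se_diff

# ASSUMPTION: d = mean_diff / SD of differences, which equals t / sqrt(n)
cohens_d = t_value / math.sqrt(n)

print(f"95% CI: {ci_low:.2f} to {ci_high:.2f}, d = {cohens_d:.2f}")
```

Working back from t in this way is only a consistency check on a written report; with the raw data you would compute the standard deviation of the difference scores directly.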
Scenario 2 (single sample t)
The ‘free’ school children had a lower mean IQ (M = 97.7, SD = 9.5) than the standard average IQ of 100. This difference, however, was not significant, t (24) = 1.22, p = .236. The difference between the sample mean and the population mean was small (mean difference = -2.32, 95% CI: -6.26 to 1.62, Cohen’s d = 0.243).
Note that SPSS uses the sample SD to calculate Cohen’s d.
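As a rough check on the single sample result, the calculation can be sketched in Python from the summary figures. This is an illustration under stated assumptions, not the SPSS procedure: the sample mean of 97.68 is inferred from the reported difference of 2.32, the critical t of 2.064 (two-tailed, p = .05, df = 24) is taken from standard tables, and the SD is the rounded value 9.5, so the confidence limits agree with the reported ones only approximately. Variable names are hypothetical.

```python
import math

# Summary figures for Scenario 2 (single sample t test)
sample_mean = 97.68   # ASSUMPTION: inferred as 100 - 2.32 from the reported difference
sample_sd = 9.5       # reported SD (rounded)
n = 25                # df + 1
population_mean = 100
t_crit = 2.064        # ASSUMPTION: two-tailed .05 critical t for df = 24, from tables

mean_diff = sample_mean - population_mean      # negative: sample is below 100
se_mean = sample_sd / math.sqrt(n)             # standard error of the mean
t_value = mean_diff / se_mean

# 95% confidence limits for the difference from the population mean
ci_low = mean_diff - t_crit * se_mean
ci_high = mean_diff + t_crit * se_mean

# Cohen's d using the sample SD, as the note above says SPSS does
cohens_d = abs(mean_diff) / sample_sd

print(f"t({n - 1}) = {t_value:.2f}, 95% CI: {ci_low:.2f} to {ci_high:.2f}, d = {cohens_d:.2f}")
```

The sign of t here is negative because the sample mean is below 100; reports often quote the absolute value.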
Scenario 3 (unrelated t)
The film group produced a higher mean attitude score (M = 25.65, SD = 5.25) than the control group (M = 22.2, SD = 4.76). The difference between means was significant, t (38) = 2.18, p = .036. The difference between means (difference = 3.45, 95% CI: 0.24 to 6.66) was moderate (Cohen’s d = 0.69).
Note: Effect size is calculated using d = (M1 - M2) / s, where M1 and M2 are the two group means and s is the mean of the standard deviations for the two groups (sample sizes are equal).
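The unrelated t result can be checked from the group means and standard deviations. The Python sketch below is an illustration only: it assumes 20 participants per group (df = 38 with equal sample sizes), a pooled-variance t test, and a critical t of 2.024 (two-tailed, p = .05, df = 38) taken from standard tables. Cohen's d divides the difference between means by the mean of the two standard deviations, as described in the note above. Variable names are hypothetical.

```python
import math

# Summary figures for Scenario 3 (unrelated t test)
mean_film, sd_film = 25.65, 5.25
mean_control, sd_control = 22.2, 4.76
n_film = n_control = 20   # ASSUMPTION: equal groups, so df = 38 implies 20 each
t_crit = 2.024            # ASSUMPTION: two-tailed .05 critical t for df = 38, from tables

mean_diff = mean_film - mean_control
df = n_film + n_control - 2

# Pooled variance and standard error of the difference between means
pooled_var = ((n_film - 1) * sd_film**2 + (n_control - 1) * sd_control**2) / df
se_diff = math.sqrt(pooled_var * (1 / n_film + 1 / n_control))
t_value = mean_diff / se_diff

# 95% confidence limits for the difference between means
ci_low = mean_diff - t_crit * se_diff
ci_high = mean_diff + t_crit * se_diff

# Cohen's d with s taken as the mean of the two standard deviations
mean_sd = (sd_film + sd_control) / 2
cohens_d = mean_diff / mean_sd

print(f"t({df}) = {t_value:.2f}, 95% CI: {ci_low:.2f} to {ci_high:.2f}, d = {cohens_d:.2f}")
```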
Exercise 17.2
Non-parametric tests on the scenario data sets
Select below the appropriate non-parametric tests that can be used on the Scenario 1 and 3 data from the t test exercises. In one scenario more than one appropriate test can be selected.
Scenario 1
(Anagrams in noisy and quiet rooms)
Wilcoxon; Mann-Whitney; Sign test
Show answer
Wilcoxon and Sign test are both appropriate, but Wilcoxon is a lot more powerful because it uses the size of each difference as well as its direction.
Scenario 3
(Control and film groups’ attitudes towards disabled people)
Wilcoxon; Mann-Whitney; Sign test
Show answer
Mann-Whitney
Now conduct the appropriate test on each data set and give a full report of the result including: T or U, appropriate N values, p value (either exact or in the ‘p less than …’ format) and effect size.
Show answer
Scenario 1: (Wilcoxon)
The differences between time taken to solve anagrams in the noisy room and time taken in the quiet room were ranked according to size for each participant. A Wilcoxon T analysis on the difference ranks showed a rank total of 139 where noisy room times were higher than quiet room times and a rank total of 71 where quiet room times were higher. Hence, quiet room times were generally lower than noisy room times but this difference was not significant, T (N = 20) = 71, p = .204. The estimated effect size was small to moderate, r = 0.2.
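The reported p value and effect size for the Wilcoxon test can be approximated from T and N. The Python sketch below is a hedged illustration: it assumes the large-sample normal approximation for T with no correction for ties or continuity, and it assumes the effect size is r = |z| / √N with N taken as the total number of observations (40 scores from 20 participants), which is the choice that reproduces the reported r = 0.2. Names are hypothetical.

```python
import math

# Reported values for Scenario 1 (Wilcoxon)
n_pairs = 20
rank_total_noisy_higher = 139
rank_total_quiet_higher = 71

# The two rank totals must sum to n(n + 1) / 2
assert rank_total_noisy_higher + rank_total_quiet_higher == n_pairs * (n_pairs + 1) // 2

T = min(rank_total_noisy_higher, rank_total_quiet_higher)   # Wilcoxon T is the smaller total

# ASSUMPTION: normal approximation, no tie or continuity correction
mean_T = n_pairs * (n_pairs + 1) / 4
sd_T = math.sqrt(n_pairs * (n_pairs + 1) * (2 * n_pairs + 1) / 24)
z = (T - mean_T) / sd_T

# Two-tailed p from the standard normal distribution
p_value = math.erfc(abs(z) / math.sqrt(2))

# ASSUMPTION: r = |z| / sqrt(total number of observations)
n_observations = 2 * n_pairs
effect_r = abs(z) / math.sqrt(n_observations)

print(f"T = {T}, z = {z:.3f}, p = {p_value:.3f}, r = {effect_r:.2f}")
```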
Scenario 1: (Sign test)
For each participant the difference between noisy room and quiet room time was found and the sign of this difference recorded. The 13 cases where quiet room score was less than noisy room score were contrasted with the 7 cases where the difference was in the opposite direction using a sign test analysis. The difference was found not to be significant with S (N = 20) = 7, p = .263.
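The sign test probability follows directly from the binomial distribution. As a minimal sketch, assuming a two-tailed exact test with no tied (zero) differences, the Python below sums the probability of obtaining 7 or fewer of the less frequent sign out of 20 when each sign is equally likely, then doubles it. Variable names are hypothetical.

```python
import math

# Reported values for Scenario 1 (Sign test)
n = 20   # participants with a non-zero difference
S = 7    # number of cases with the less frequent sign

# Probability of S or fewer cases under the null hypothesis (p = 0.5 per sign)
one_tailed = sum(math.comb(n, k) for k in range(S + 1)) / 2**n

# ASSUMPTION: two-tailed exact test, so the one-tailed probability is doubled
p_value = min(1.0, 2 * one_tailed)

print(f"S (N = {n}) = {S}, p = {p_value:.3f}")
```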
Scenario 3: Mann-Whitney
The children’s disability attitude scores were ranked as one group. The rank total for the control group was 339.5 whereas the total for the film-trained group was 480.5. Using a Mann-Whitney analysis, significance was very nearly achieved with U (N = 40) = 129.5, p = .056. The effect size was moderate, r = 0.3.
NOTE: If you use later versions of SPSS rather than the ‘Legacy dialogs’, you will find that the higher value of U is reported: SPSS gives U = 270.5. As a check, the two possible U values always sum to N1 × N2, which here is 400. Subtracting 270.5 from 400 gives 129.5, which is the answer given here. It is conventional to report the lower value and to check it against tables, which usually use lower values.
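The relationship between the rank totals and the two U values can be sketched in Python. This is an illustration under stated assumptions: 20 children per group (N = 40 split equally), the normal approximation for U with no correction for ties or continuity (so p differs slightly from the reported .056), and r = |z| / √N with N = 40. Names are hypothetical.

```python
import math

# Reported values for Scenario 3 (Mann-Whitney)
n_control = n_film = 20          # ASSUMPTION: N = 40 split equally
rank_total_control = 339.5
rank_total_film = 480.5

# U for each group: rank total minus the smallest possible rank total for that group
u_control = rank_total_control - n_control * (n_control + 1) / 2
u_film = rank_total_film - n_film * (n_film + 1) / 2

# The two U values always sum to N1 x N2
assert u_control + u_film == n_control * n_film

U = min(u_control, u_film)       # conventionally the lower value is reported

# ASSUMPTION: normal approximation, no tie or continuity correction
n_total = n_control + n_film
mean_U = n_control * n_film / 2
sd_U = math.sqrt(n_control * n_film * (n_total + 1) / 12)
z = (U - mean_U) / sd_U
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-tailed

effect_r = abs(z) / math.sqrt(n_total)

print(f"U = {U}, larger U = {max(u_control, u_film)}, z = {z:.3f}, p = {p_value:.3f}, r = {effect_r:.2f}")
```

Because the data contain tied scores (the rank totals end in .5), software that corrects for ties will give a slightly different z and p from this uncorrected version.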
Exercise 17.3
Have a go at this short quiz to test your understanding and identify any gaps in your knowledge.
Weblinks
Testing for differences between two samples
Don’t forget this site for the link between effect size and power (as given in Chapter 16 too).
https://rpsychologist.com/d3/nhst
The link to G*Power.
www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/download-and-register
A comprehensive statistics site for calculating t tests and way beyond:
http://vassarstats.net/index.html
An interactive page for seeing the implications of different sizes of Cohen’s d: