Chapter 16 – Significance testing – was it a real effect?
This chapter is about the principles of significance testing.
Exercises
Exercise 16.1
One- or two-tailed tests
In each case below decide whether the research prediction permits a one-tailed test or whether a two-tailed test is obligatory.
1. There will be a difference between imagery and rehearsal recall scores.
Show answer
Two-tailed
Explanation: Direction of the difference was not predicted.
2. Self-confidence will correlate with self-esteem.
Show answer
Two-tailed
Explanation: Direction of the correlation was not predicted.
3. Extroverts will have higher comfort scores than introverts.
Show answer
One-tailed
Explanation: Direction of difference was predicted, but please note that in most research a two-tailed test is preferred.
4. Children on the anti-bullying programme will improve their attitude to bullying compared with the control group.
Show answer
One-tailed
Explanation: Direction of difference was predicted, but please note that in most research a two-tailed test is preferred.
5. Children on the anti-bullying programme will differ from the control group children on empathy.
Show answer
Two-tailed
Explanation: Direction of the difference was not predicted.
6. Anxiety will correlate negatively with self-esteem.
Show answer
One-tailed
Explanation: Direction of correlation was predicted, but please note that in most research a two-tailed test is preferred.
7. Participants before an audience will make more errors than participants alone.
Show answer
One-tailed
Explanation: Direction of difference was predicted, but please note that in most research a two-tailed test is preferred.
8. Increased caffeine will produce a difference in reaction times.
Show answer
Two-tailed
Explanation: Direction of the difference was not predicted.
Exercise 16.2
Have a go at this short quiz to test your understanding of significance testing and identify any gaps in your knowledge.
Exercise 16.3
z values and significance
In the chapter we looked at a value of z and found the probability that a z that high or higher would be produced at random under the null hypothesis. We do that by taking the probability remaining to the right of the z value on the normal distribution in Appendix table 2 (if the z is negative we look at the other tail, as in a mirror). Following this process, enter in the table below the exact value of p that you find from Appendix table 2. Don't forget that with a two-tailed test we use the probabilities at both ends of the distribution; that is, we simply double the value found for one end. Enter your value with a decimal point and four decimal places, exactly as in the table. Then decide whether a z of this value would be declared significant with p ≤ .05.
| | z value | One- or two-tailed | p = | Significant? |
|---|---------|--------------------|-----|--------------|
| a | 0.78 | One | | |
| b | 1.97 | Two | | |
| c | 2.56 | Two | | |
| d | -2.24 | Two | | |
| e | 1.56 | One | | |
| f | -1.82 | Two | | |
Show answer
| | z value | One- or two-tailed | p = | Significant? | Feedback |
|---|---------|--------------------|-------|--------------|----------|
| a | 0.78 | One | .2177 | No | .2177 is not less than .05, as it needs to be for significance |
| b | 1.97 | Two | .0488 | Yes | .0488 just gets under .05 |
| c | 2.56 | Two | .0104 | Yes | .0104 is lower than .05 |
| d | -2.24 | Two | .0250 | Yes | .0250 is lower than .05 |
| e | 1.56 | One | .0594 | No | .0594 is higher than .05 |
| f | -1.82 | Two | .0688 | No | .0688 is higher than .05 |
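If you have Python available you can check these table look-ups directly. Below is a minimal sketch using scipy's normal distribution functions (scipy is assumed to be installed; note that its more precise values can disagree with a four-decimal-place table in the final digit, as happens for rows c and d here).

```python
# Sketch: converting z values to one- and two-tailed p values with scipy.
# Assumes Python with scipy installed; values may differ from a 4 d.p. table
# in the last digit (rows c and d here).
from scipy.stats import norm

cases = [
    ("a", 0.78, "one"),
    ("b", 1.97, "two"),
    ("c", 2.56, "two"),
    ("d", -2.24, "two"),
    ("e", 1.56, "one"),
    ("f", -1.82, "two"),
]

for label, z, tails in cases:
    tail_area = norm.sf(abs(z))          # P(Z > |z|): area beyond z in one tail
    p = tail_area if tails == "one" else 2 * tail_area  # two-tailed: double it
    verdict = "Yes" if p <= 0.05 else "No"
    print(f"{label}: z = {z:+.2f}, {tails}-tailed, p = {p:.4f}, significant: {verdict}")
```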
Weblinks
Significance testing – was it a real effect?
A YouTube video with a simple explanation of p values, Type I error, effect size and statistical vs. practical significance.
What Statistical Significance Means – Part 1 (8-11) – YouTube
See the relationship between power, sample size, and Type I and Type II error by sliding these values on an interactive normal distribution.
https://rpsychologist.com/d3/nhst
An article by Masicampo and Lalande (2012) showing that a larger than expected number of findings are reported as significant at just under the 0.05 level.
Further Information
Sod’s law – or Murphy’s law as the Americans more delicately put it
A discussion of Sod's law, including a BBC spoof documentary that tested the notion that toast always falls butter side down, among other issues.
Do you ever get the feeling that fate has it in for you? At the supermarket, for instance, do you always pick the wrong queue, the one that looks shorter but contains someone with five unpriced items and several redemption coupons, or with a checkout clerk about to take a tea break? Do you take the outside lane only to find there's a hidden right-turner? Sod's law (known as Murphy's law in the US), in its simplest form, states that whatever can go wrong, will. Have you ever returned an item to a shop, or taken a car to the garage with a problem, only to find it working perfectly for the assistant? This is Sod's law working in reverse but still against you. A colleague of mine holds the extension of Sod's law that things will go wrong even if they can't. An amusing QED (BBC) TV programme (Murphy's Law, 1991[1]) tested this perspective on subjective probability. The particular hypothesis, following from the law, concerned that celebrated kitchen occurrence where toast always falls butter side down – doesn't it? First attempts engaged a university physics professor to develop machines for tossing the toast without bias, including modified toasters and an electric typewriter. The results were not encouraging: the null hypothesis was doggedly retained, buttered sides not making significantly more contact with the floor than unbuttered sides. It was decided that the human element was missing – Sod's law might only work for human toast droppers.
The attempt at a more naturalistic simulation used students and a stately home now belonging to the University of Newcastle. Benches and tables were laid out in the grounds and dozens of students were asked to butter one side of a slice of bread and then throw it in a specially trained fashion to avoid toss bias. In a cunning variation of the experiment, a new independent variable was introduced: students were asked to pull out their slice of bread and, just before they were about to butter a side, to change their decision and butter the other side instead. This should produce a bias away from butter on grass if the side to hit the floor is decided by fate early in the buttering process. Sadly, neither this nor the first experiment produced verification of Sod's law. In both cases 148 slices fell one way and 152 the other – first in favour of Murphy's law, then against it. Now the scientists had one of those flashes of creative insight. A corollary of Sod's law is that when things go wrong (as they surely will – general rule) they will go wrong in the worst possible manner. The researchers therefore placed expensive carpet over the lawn. Surely this would tempt fate into a reaction? Do things fall butter side down more often on the living-room carpet (I'm sure they do!)? I'm afraid this was the extent of the research. Frequencies were yet again at chance level: 146 buttered side down, 154 up.
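As an aside, 'chance level' can be put to a formal test: a binomial test asks how likely a split at least as uneven as 146/154 would be if each slice really had a 50/50 chance of landing butter side down. A minimal sketch, assuming Python with the scipy package:

```python
# Sketch: are 146 butter-side-down landings out of 300 drops different from chance?
# Assumes scipy is installed; 146/300 are the carpet-experiment figures above.
from scipy.stats import binomtest

result = binomtest(k=146, n=300, p=0.5, alternative="two-sided")
print(f"p = {result.pvalue:.3f}")  # comfortably above .05: retain the null hypothesis
```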
Murphy, it turned out, was a United States services officer testing for space flight by sending servicemen on a horizontally jet-propelled chair across a mid-Western desert to produce many Gs of gravitational force. I'm still not convinced about his law. Psychologists suggest the explanation might lie in selective memory – we tend to remember the annoying incidents and ignore all the unremarkable dry-side-down landings and swift passages through the supermarket tills. But I still see the looks on customers' faces as they wait patiently – they seem to know something about my queue …
The sociologist’s chip shop
An attempt to exemplify the concepts of the null hypothesis and significance in an everyday homely tale of chips.
Imagine one lunchtime you visit the local fish and chip emporium near the college and get into conversation with the chippy. At one point she asks you: 'You're from the college then? What do you study?'. Upon your reply she makes a rasping sound in her throat and snaps back: 'Psychology?!!! Yeughhh!!! All that individualist, positivistic crap, unethical manipulation of human beings, nonsensical reductionism rendering continuous human action into pseudo-scientific behavioural elements. What a load of old cobblers! Give me sociology any day. Post-Marxist-Leninist socialism, symbolic interactionism, real-life qualitative participative research and a good dollop of post-modern deconstructionism.' You begin to suspect she may not be entirely fond of psychology as an academic subject. You meekly take your bag of chips and proceed outside, only to find that your bag contains far too many short chips, whilst your sociology friends all have healthy long ones.
We must at this point stretch fantasy a little further by assuming that this story is set in an age where, post-salmonella, BSE and genetically modified food, short chips are the latest health scare; long chips are seen as far healthier since they contain less fat overall (thanks to my students for this idea).
Being a well-trained, empirically based psychology student, you decide to design a test of the general theory that the chippy is biased in serving chips to psychology and sociology students. You engage the help of a pair of identical twins and send them simultaneously, identically clothed, into the chip shop to purchase a single bag of chips. One twin wears a large badge saying ‘I like psychology’ whilst the other twin wears an identical badge, apart from the replacement of ‘psychology’ with ‘sociology’. (OK! OK! I spotted the problem too! Which twin should go first? Those bothered about this can devise some sort of counterbalanced design – see Chapter 3 – but for now I really don’t want to distract from the point of this example). Just as you had suspected, without a word being spoken by the twins beyond their simple request, the sociology twin has far longer chips in her bag than does the psychology twin!
Now, we only have the two samples of chips to work with. We cannot see what goes on behind the chippy’s stainless steel counter. We have to entertain two possibilities. Either the chippy drew the two samples (fairly) from one big chip bin (H0) or the bags were filled from two separate chip bins, one with smaller chips overall and therefore with a smaller mean chip length than the other bin (H1). You now need to do some calculations to estimate the probability of getting such a large difference between samples if the bags were filled from the same bin (i.e., if the null hypothesis is true). If the probability is very low you might march back into the shop and demand redress (hence you have rejected H0!). If the probability is quite high – two bags from the same bin are often this different – you do not have a case. You must retain the null hypothesis.
In this example, our research prediction would be that the sociology student will receive longer chips than the psychology student. Our alternative hypothesis is that the psychology and sociology chip population means differ; the null hypothesis is that the population means are the same (i.e., the samples were drawn from the same population).
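For the curious, the 'calculations' mentioned above could be, for instance, an unrelated (two-sample) t test on the chip lengths in the two bags. Here is a minimal sketch with invented measurements; both the data and the choice of test are illustrative assumptions, not part of the story.

```python
# Sketch: could the two bags plausibly have been filled from the same chip bin?
# Chip lengths (cm) are invented for illustration; assumes scipy is installed.
from scipy.stats import ttest_ind

psychology_bag = [4.1, 3.8, 4.5, 3.9, 4.2, 4.0, 3.7, 4.3]
sociology_bag = [5.6, 6.1, 5.8, 6.4, 5.9, 6.2, 5.7, 6.0]

t_stat, p_value = ttest_ind(psychology_bag, sociology_bag)
if p_value <= 0.05:
    print(f"p = {p_value:.4f}: reject H0 - very unlikely both bags came from one bin")
else:
    print(f"p = {p_value:.4f}: retain H0 - the bags could well be from the same bin")
```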
Please, sir, may we use a one-tailed test, sir?
A discussion of the arguments for and against the use of one-tailed tests in statistical analysis in psychology.
It is hard to imagine statisticians having a heated and passionate debate about their subject matter. However, they’re scientists and of course they do. Odd, though, are the sorts of things they fall out over. Whether it is legitimate to do one-tailed tests in psychology on directional hypotheses is, believe it or not, one of these issues. Here are some views against the use of one-tailed tests on two-group psychological data.
A directional test requires that no rationale at all should exist for any systematic difference in the opposite direction, so there are very few situations indeed where a directional test is appropriate with psychological data consisting of two sets of scores.
MacRae, 1995
I recommend using a non-directional test to compare any two groups of scores … Questions about directional tests should never be asked in A level examinations.
MacRae, 1995
I say always do two-tailed tests and if you are worried about β, jack the sample size up a bit to offset the loss in power.
Bradley, 1983 (Cited in Howell, 1992)
And some arguments for the use of one-tailed tests are as follows:
To generate a theory about how the world works that implies an expected direction of an effect, but then to hedge one’s bet by putting some (up to 1⁄2) of the rejection region in the tail other than that predicted by the theory, strikes me as both scientifically dumb and slightly unethical … Theory generation and theory testing are much closer to the proper goal of science than truth searching, and running one-tailed tests is quite consistent with those goals.
Rodgers, 1986 (cited in Howell, 1992)
… it has been argued that there are few, if any, instances where the direction [of differences] is not of interest. At any rate, it is the opinion of this writer that directional tests should be used more frequently.
Ferguson and Takane, 1989
MacRae is saying that when we conduct a one-tailed test, any result in the non-predicted direction would have to be seen as a chance outcome since the null hypothesis for directional tests covers all that the alternative hypothesis does not. If the alternative hypothesis says the population mean is larger than 40 (say) then the null hypothesis is that the population mean is 40 or less. To justify use of a one-tailed test, you must, in a sense, be honestly and entirely uninterested in an effect in the opposite direction. A textbook example (one taken from a pure statistics book, not a statistics-for-social-science textbook) would be where a government agency is checking on a company to see that it meets its claim to include a minimum amount of (costly) vitamin X in its product. It predicts and tests for variations below the minimum. Variations above are not of interest and almost certainly are relatively small and rare, given the industry’s economic interests. A possibly equivalent psychological example could be where a therapist deals with severely depressed patients who score very much up the top end of a depression scale. As a result of therapy a decline in depression is predicted. Variations towards greater depression are almost meaningless since, after a measurement of serious depression, the idea of becoming even more depressed is unmeasurable and perhaps unobservable.
Rodgers, however, says what most people feel when they conduct psychological projects. Why on earth should I check the other way when the theory and past research so clearly point in this one direction? In a sense, all MacRae and Bradley are asking is that we operate with greater surety and always use the 2.5% level rather than the 5% level. If we've predicted a result, from closely argued theory, that goes in one direction, then used two-tailed values and found significance in the opposite direction, we're hardly likely to jump about saying 'Eureka! It's not what I wanted but it's significant!' More probably we will walk away glumly, as for a failure to reach significance, saying 'What went wrong then?' It will still feel like 'failure'. If we had a point to make we haven't made it, so we're hardly likely to rush off to publish now. Our theoretical argument, producing our hypothesis, would look silly (though it may be possible to attempt an explanation of the unexpected result).
During this argument it always strikes me as bizarre that textbooks talk as if researchers really do stick rigidly to a hypothesis testing order: think through theory, make a specific prediction, set alpha, decide on one- or two-tailed test, find out what the probability is, make significance decision. The real order of events is a whole lot more disjointed than that. During research, many results are inspected and jiggled with. Participants are added to increase N. Some results are simply discarded.
Researchers usually know what all the probability values are, however, before they come to tackle the niggling problem of whether it would be advisable to offer a one- or two-tailed analysis in their proposed research article. When the one-tailed test decision is made is a rather arbitrary matter. In some circles and at some times it depends on the received view of what is correct. In others it depends on the actual theory (as it should) and in others it will depend on who, specifically, is on the panel reviewing submitted articles.
So what would happen, realistically speaking, if a researcher or research team obtained an opposite but highly ‘significant’ result, having made a directional prediction? In reality I’m sure that if such a reversal did in fact occur, the research team would sit back and say ‘Hmm! That’s interesting!’ They’re not likely to walk away from such an apparently strong effect, even though it initially contradicts their theorising. The early research on social facilitation was littered with results that went first one way (audiences make you perform better) then the other (no, they don’t; they make performance worse). Theories and research findings rarely follow the pure and simple ideal. It is rare in psychology for a researcher to find one contrary result and say ‘Oh well. That blows my whole theory apart. Back to the drawing board. What shall I turn my hand to today then?’ The result would slot into a whole range of findings and a research team with this dilemma might start to re-assess their method, look at possible confounding variables in their design and even consider some re-organisation of their theory in order to incorporate the effect.
It is important to recognise the usefulness of this kind of result. Far from leaving the opposite direction result as a ‘chance event’, the greater likelihood is that this finding will be investigated further. A replication of the effect, using a large enough sample to get p ≤ .01, would be of enormous interest if it clearly contradicts theoretical predictions – see what the book says about the 1% level.
So should you do one-tailed tests? This is clearly not a question I'm going to answer, since it really does depend upon so many things and is clearly an issue over which the experts can lose friends. I can only ever recall one research article that used a one-tailed test, and the reality is that you would be unlikely to get published if you used one, or at least you would be asked to make corrections. Personally though, in project work, I can see no great tragedy lying in wait for those who do use one-tailed tests so long as they are conscientious, honest and professional in their overall approach to research, science and publishing. As a student, however, you should just pay attention to the following things:
- follow the universally accepted ‘rules’ given in the main text;
- be aware that this is a debate, and be prepared for varying opinions around it;
- try to test enough participants (as Bradley advises; see the power sketch after this list), pilot your design and tighten it, so that you are likely to obtain significance at p ≤ .01, let alone .05!
- the issue of one- versus two-tailed tests mostly disappears once we leave simple two-condition tests behind; in ANOVA designs there is no such choice to make.
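To put a rough number on Bradley's advice about jacking the sample size up, here is a sketch of a power calculation using the statsmodels package; the effect size (d = 0.5) and target power (.80) are assumed values for illustration only. It estimates how many participants per group a two-tailed independent t test needs compared with a one-tailed test of the same power (roughly 64 versus 50 under these assumptions).

```python
# Sketch: sample size per group for a one- vs two-tailed independent t test.
# Effect size d = 0.5 and power = .80 are assumed values; requires statsmodels.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alternative in ("larger", "two-sided"):   # one-tailed vs two-tailed
    n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80,
                             alternative=alternative)
    print(f"{alternative:>9}: about {n:.0f} participants per group")
```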
For references, please see the textbook.
[1] Sadly no longer available except via The British Film Institute