Liam Healy & Associates
Chartered occupational psychologists
Reliability of Selection and Assessment Tools: A Discussion
In the context of assessment, Reliability has a specific technical definition and meaning which is quite different from the meaning it has in everyday use.
Any observed assessment score consists of a true score (which we can never accurately know) and some measurement error. Different assessment tools will measure true scores with different degrees of accuracy; the term we use to refer to this is Reliability.
We can define Reliability as the accuracy with which a tool can measure true scores.
If there were no measurement error, then the scores we obtained would be perfectly reliable and would always represent true scores, i.e. they would measure whatever it was they were measuring with total accuracy.
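The idea of an observed score as a true score plus measurement error can be illustrated with a small simulation. This sketch is not part of the original discussion: it assumes normally distributed true scores and errors, and it uses the classical test theory ratio of true-score variance to observed-score variance as one way of putting a number on "accuracy". The function name and all figures are hypothetical.

```python
import random
import statistics


def simulate_reliability(n_people=20000, true_sd=10.0, error_sd=5.0, seed=1):
    """Simulate observed = true + error and return var(true) / var(observed)."""
    rng = random.Random(seed)
    # True scores are unknowable in practice; here we invent them (mean 50 is arbitrary).
    true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_people)]
    # Each observed score is the true score plus independent random error.
    observed = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)


if __name__ == "__main__":
    # Theoretical value for these settings is 10**2 / (10**2 + 5**2) = 0.8;
    # the simulated figure should land close to it.
    print(round(simulate_reliability(), 2))
    # With no measurement error, observed scores equal true scores.
    print(simulate_reliability(error_sd=0.0))
```

Increasing error_sd in this sketch pushes the ratio towards zero, which is the sense in which a noisier tool is a less reliable one.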
There are a number of ways of estimating reliability, and they are all based on correlation. Because a reliability coefficient is essentially a correlation coefficient, it can only have a value between zero and plus one, where zero means no reliability and plus one means perfect reliability, e.g. "The reliability of this measure is r = 0.60". Note that where a negative figure is obtained, the measurement scale can be reversed.
There are three main methods of estimating reliability: test-retest measures, alternate form measures and internal consistency measures. Each of these methods is sensitive to a different source of error and so each will produce slightly different reliability estimates.
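As a rough illustration of how these estimates are computed, the sketch below uses a Pearson correlation for the test-retest case (the same calculation would apply to two alternate forms) and Cronbach's alpha for internal consistency. Cronbach's alpha is not named in this discussion; it is assumed here as one commonly used internal consistency coefficient. The function names and the data are hypothetical.

```python
import statistics


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores holds one list of item scores per person."""
    k = len(item_scores[0])  # number of items
    # Variance of each item across people.
    item_vars = [
        statistics.variance([person[i] for person in item_scores])
        for i in range(k)
    ]
    # Variance of each person's total score.
    total_var = statistics.variance([sum(person) for person in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: five people tested on two occasions (test-retest),
# and the same five people's answers to four items (internal consistency).
time_1 = [12, 15, 9, 20, 17]
time_2 = [13, 14, 10, 19, 18]
items = [[3, 4, 3, 2], [4, 4, 5, 2], [2, 2, 3, 2], [5, 5, 5, 5], [4, 5, 4, 4]]

print("test-retest r:", round(pearson_r(time_1, time_2), 2))
print("alpha:", round(cronbach_alpha(items), 2))
```

The two figures need not agree: the retest correlation reflects instability over time, while alpha reflects how consistently the items hang together on a single occasion.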
What Values for Reliability are Acceptable?
Some argue that an acceptable reliability coefficient is 0.70 or more. There is no straightforward definition of what it should be, but the following may give you a guide to the sort of values to expect.
For stable traits, short-term equivalence (two-week test-retest) between alternate forms of the same measure should be in the range 0.65 to 0.80; the same-form retest should be in the range 0.75 to 0.90. These values may be lower in the case of state measures, which assess states as opposed to stable traits, e.g. anxiety (although anxiety can be a trait as well).
Trait measures (mainly personality instruments) are less stable than abilities: for a two-year retest coefficient we would expect a value of 0.40 to 0.50 for a trait measure, whereas we would expect a value of 0.60 or better for an ability measure.
Internal consistency reliability values depend not only on the breadth of the characteristic being measured, but also on how well the measurement tool's items sample it. The higher the value obtained, the narrower we would expect the construct being measured to be. We should treat very high values (greater than 0.95) and very low values (0.70 or less) with caution.
The reliability estimate is also affected by the number of people in the sample on which it is based. The larger the sample, the smaller the error surrounding the reliability estimate will be.
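One way to see the effect of sample size is to put a confidence interval around the coefficient. The sketch below assumes the reliability estimate is a Pearson correlation (as in a test-retest or alternate form study) and uses the standard Fisher z approximation; it is not a method given in this discussion, and it would not apply directly to an internal consistency coefficient. The function name and figures are hypothetical.

```python
import math


def correlation_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a correlation r based on n people."""
    z = math.atanh(r)            # Fisher z-transform of the coefficient
    se = 1.0 / math.sqrt(n - 3)  # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# The same hypothetical coefficient (0.70) estimated from larger and larger samples.
for n in (30, 100, 500):
    low, high = correlation_ci(0.70, n)
    print(n, round(low, 2), round(high, 2))
```

The interval narrows as n grows, which is the "smaller error" referred to above.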
The longer a measure is, i.e. the more items it has, the more accurately it will sample the domain being measured. As the length increases so should the reliability, but only if the additional questions are still sampling the domain in question.
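The relationship between length and reliability is often approximated with the Spearman-Brown formula. That formula is not named in this discussion, so treat the sketch below as an assumption brought in for illustration; it also assumes the added items are equivalent to the existing ones, which is the "still sampling the domain" condition. The function name and figures are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a measure is lengthened by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical measure with reliability 0.60, lengthened by various factors.
for factor in (1, 2, 3, 4):
    print(factor, round(spearman_brown(0.60, factor), 2))
```

The predicted gains get smaller with each further increase in length, so doubling a measure never doubles its reliability.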
The Correlations between Scale Scores and Reliability
As the correlation between two scales changes, so does the reliability of their sum and their difference. As the correlation between two scales (predictor score and success criterion score) increases:
For sum scores (and for difference scores when the correlation decreases):
· Variance and reliability increase
· A decreased amount of the variance is accounted for by error
For difference scores (and for sum scores when the correlation decreases):
· Variance and reliability decrease
· An increased amount of the variance is accounted for by error
Remember that a strong correlation between two measures means that they overlap to some degree. With a difference score, the more overlap there is, the less difference there will be between the two scores. With sum scores, we are in effect adding the two measures together to produce one long test. If the two measures are similar then the combined test takes on the psychometric characteristics that give tests their reliability, and so as the correlation increases so will sum score reliability.
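The pattern described here can be checked numerically. The sketch below uses the standard classical test theory expressions for the reliability of a sum and of a difference, under the simplifying assumptions that both scales have equal variance and that their measurement errors are uncorrelated. These formulas are not quoted in this discussion; the function names and the reliabilities of 0.80 are hypothetical.

```python
def sum_score_reliability(r_xx, r_yy, r_xy):
    """Reliability of X + Y, assuming equal variances and uncorrelated errors."""
    return ((r_xx + r_yy) / 2 + r_xy) / (1 + r_xy)


def difference_score_reliability(r_xx, r_yy, r_xy):
    """Reliability of X - Y, assuming equal variances and uncorrelated errors."""
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


# Two hypothetical scales, each with reliability 0.80, at rising inter-scale correlations.
for r_xy in (0.1, 0.3, 0.5, 0.7):
    total = sum_score_reliability(0.80, 0.80, r_xy)
    diff = difference_score_reliability(0.80, 0.80, r_xy)
    print(r_xy, round(total, 2), round(diff, 2))
```

As the correlation between the scales rises, the sum score reliability climbs while the difference score reliability falls away, because the overlap leaves less true difference relative to the error.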
We also need to consider Range Restriction. If two samples of people took the same measure and one sample produced a set of scores that covered the whole range of possible scores (low through to high), while the other produced a set of scores which covered a narrower range, then the variance of the first set would be greater than that of the second. The term we would use to describe what has happened to the second set of scores is Range Restriction.
Range restriction can happen by chance or because of some bias which has been introduced to the process. The commonest source of range restriction is something that happens all the time in personnel selection: the practice of basing selection on the top 10% (or some other proportion) of scorers. In this case we will find that the variance of scores is reduced because the top 10% of scores would all be quite high.
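A quick simulation shows how much selecting the top 10% shrinks the spread of scores. This is only a sketch with invented data: it assumes a normally distributed applicant pool with a mean of 50 and an SD of 10, neither of which comes from this discussion.

```python
import random
import statistics

rng = random.Random(7)
# Hypothetical applicant pool: 1,000 scores with mean 50 and SD 10.
scores = [rng.gauss(50.0, 10.0) for _ in range(1000)]

# Keep only the top 10% of scorers, as a selection process might.
n_selected = len(scores) // 10
selected = sorted(scores, reverse=True)[:n_selected]

sd_all = statistics.stdev(scores)
sd_selected = statistics.stdev(selected)
print("SD of all applicants:", round(sd_all, 1))
print("SD of top 10%:", round(sd_selected, 1))
```

A reliability estimate computed only on the selected group would be based on this narrower spread of scores.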
For the statisticians out there, we can use the
following formula to correct for range restriction:
\[ R_1 = 1 - \left[ \frac{SD_2^2}{SD_1^2} \times (1 - R_2) \right] \]
Where
- R_1 = Reliability corrected for range restriction
- SD_2 = SD of the restricted sample
- SD_1 = SD of the unrestricted sample
- R_2 = Reliability of the restricted sample
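The correction above translates directly into a few lines of code. The sketch below follows the formula as given; the function name, parameter names and example figures are hypothetical.

```python
def correct_for_range_restriction(r_restricted, sd_restricted, sd_unrestricted):
    """Apply R1 = 1 - [(SD2^2 / SD1^2) x (1 - R2)].

    r_restricted    -- R2, reliability observed in the restricted sample
    sd_restricted   -- SD2, standard deviation of the restricted sample
    sd_unrestricted -- SD1, standard deviation of the unrestricted sample
    """
    variance_ratio = (sd_restricted ** 2) / (sd_unrestricted ** 2)
    return 1 - variance_ratio * (1 - r_restricted)


# Hypothetical case: reliability of 0.60 in a selected group whose SD (5)
# is half that of the full applicant pool (10).
print(round(correct_for_range_restriction(0.60, 5.0, 10.0), 2))
```

When the two SDs are equal the variance ratio is 1 and the function returns the restricted reliability unchanged, which is a useful sanity check.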