Liam Healy & Associates
chartered occupational psychologists
Bespoke Psychometric Test Construction
We are qualified psychometric test developers and
have developed a range of bespoke interest, ability
and personality assessments for clients. Test
development takes months, not days. It represents a
significant time and financial investment for an
organisation.
Commissioning us to develop your own in-house test
has a number of advantages:
- It allows you to use a psychometric test
which focuses only on those traits which are
relevant to you.
- The administration, interpretation and
reporting procedures can be customised to streamline
with your wider HR processes and functions.
- You own the copyright.
A psychometric test looks deceptively simple. After
all, it's just a list of questions and answers, isn't
it? Before contacting us to enquire about bespoke
psychometric test development, you should consider
the following:
Designing a Psychometric Test
It is a simple enough task to write a 'test' that looks
like a test - to the untrained eye it may look plausible
enough. However, the quality of a test is determined by
its psychometric and scaling properties, and not what
the test items look like. Our tests are developed
according to the guidelines laid down by Kline (1995)
and other experts in the test development
field.
Is it Easy to Write a Psychometric Test?
It is easy to write a poor quality one. If you
have ever seen an aptitude test you may well have
thought ‘This looks simple enough, I could write one of
these!’ On the first point you would be correct: they
do indeed look simple to produce, but the illusion of
simplicity ends there. It can take months or years to develop a
single basic-level test, and the process can involve
hundreds or even thousands of trial subjects.
Why is There So Much Involved in Developing a Test?
A properly designed and constructed test must have
certain technical properties. A test item will only be
included once it has passed through a number of
stringent quality control processes. The major test
publishers have worked closely with the various
professional associations, the main one being the
British Psychological Society, to produce standards to
which test development should ideally adhere. Some
tests produced by unqualified developers do not reach
these standards.
Here are some considerations to take into account:
1. Degree of Difficulty.
If all of the questions in a test were so easy that
everyone could answer them correctly, or so difficult
that no-one could answer them correctly, then that test
would tell us nothing about the differences between
people on the particular ability or aptitude the test
was measuring.
We need to avoid this since, in selection and
recruitment particularly, it is those very differences
in which we are interested. So initially a very large
number of test items are written, often five or ten
times as many as will eventually be used. This is
because most individual test items will fail this
crucial test: ideally, about as many people should get
a test question correct as get it incorrect. Several
hundred people need to complete each test question
before we can determine if this is the case.
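The difficulty check described above can be sketched in a few lines of Python. This is an illustration only, not a description of any publisher's actual procedure: the function names, the trial data and the 0.2 to 0.8 retention band are all assumptions made for the example, and real cut-offs should be taken from a test construction textbook.

```python
def item_facility(responses):
    """Proportion of people answering each item correctly.

    responses: one row per person, each row a list of 0/1 marks
    (1 = correct, 0 = incorrect), one mark per item.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    return [sum(row[i] for row in responses) / n_people
            for i in range(n_items)]


def keep_items(facilities, low=0.2, high=0.8):
    """Indices of items inside the retention band (band is an assumption)."""
    return [i for i, p in enumerate(facilities) if low <= p <= high]


# Hypothetical trial: 6 people x 4 items. A real trial needs hundreds.
trial = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]

facilities = item_facility(trial)   # [1.0, 0.5, 0.0, 0.5]
kept = keep_items(facilities)       # [1, 3]
print(facilities, kept)
```

In this made-up data the first item is answered correctly by everyone and the third by no-one, so neither tells us anything about differences between people, and both are dropped.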
2. Degree of Accuracy (Reliability)
Test developers agree that tests and test use are prone to error. This
error can come from the test itself (e.g. poorly written
or easily misunderstood items) or from the test
administration process (e.g. instructions not being
adhered to properly, or time limits not being followed).
This error affects how consistently
the test will measure the characteristic it is designed
to measure. This consistency is known as Reliability. The more
reliable the test, the more stable and accurate it
is. Think of reliability like a ruler. If the ruler is
made from wood, then one would not expect the
measurements of length it provided to vary much. If we
measured something such as a person’s height one day,
and then measured it again the next day with the same ruler we
would expect a high degree of correlation between the
two different measurements. If on the
other hand the ruler was made of rubber, then we would
see a large variation from one day to the next. The
ruler might measure the same person's height (which we know has
not had time to change) on two separate occasions and produce two very different
values. The exact same principle applies to tests: they
have to be stable and accurate before they can be
used.
The easiest way to establish whether a test
possesses reliability is to administer it to a
group of people, and then a few weeks later administer
it to them again. If the test is stable and reliable we
would expect to see a high positive correlation between
the scores obtained on the two different occasions (for
the mathematically inclined, the accepted criterion is
r = 0.7 or higher). This notion of
reliability again boils down to the quality of the test
questions.
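The test-retest check described above amounts to computing a Pearson correlation between the two sets of scores. The following is a minimal sketch under stated assumptions: the scores are invented, the function name is hypothetical, and the 0.7 threshold is simply the criterion quoted in the text.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same six people, tested a few weeks apart.
first_sitting = [12, 18, 25, 30, 22, 15]
second_sitting = [14, 17, 27, 29, 20, 16]

r = pearson_r(first_sitting, second_sitting)
meets_criterion = r >= 0.7   # criterion quoted in the text
print(round(r, 2), meets_criterion)
```

Six people is far too few for a real reliability study; the small sample is used here only to keep the arithmetic easy to follow.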
3. Degree of Relevance to the Job (Validity)
The fundamental question here is this: ‘Does the test
measure what it claims to measure?’ This may seem like
a strange thing to ask. Surely a test which contains
numerical problems which a person is required to solve
is measuring numerical ability? This is not necessarily
the case.
Many tests rely on a person having good verbal
comprehension in order to complete them successfully.
This is true even of numerical or abstract reasoning
tests if the instructions are complex. In these cases
the test claims to measure numerical ability, and the
employer may well interpret the test scores in that
light, overlooking the fact that the test is to some
unknown extent a measure of verbal ability. This is
quite a challenge to overcome during the test
development process and can be a common source of
indirect discrimination.
This concept of whether
a test measures what it claims to measure is known as the
‘Validity’ of a test, and it is most often established
by examining statistically the degree of correlation
between one test, and another established test of
the same characteristic or ability. In the case of
validity the accepted degree of correlation is r=0.3 or
higher.
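As a rough sketch of the validity check described above, a new test can be correlated with an established test of the same ability and, to probe the verbal comprehension problem, with a verbal test as well. Everything here is assumed for illustration: the scores, the variable names and the idea of comparing the two correlations side by side. Only the r = 0.3 figure comes from the text.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for the same six people on three tests.
new_numerical = [10, 14, 18, 22, 26, 30]
established_numerical = [11, 13, 19, 21, 27, 29]
verbal = [20, 25, 18, 27, 22, 24]

# Agreement with an established test of the same ability.
validity_r = pearson_r(new_numerical, established_numerical)
# Overlap with verbal ability, which the test does not claim to measure.
verbal_r = pearson_r(new_numerical, verbal)

print(round(validity_r, 2), round(verbal_r, 2))
print("meets r = 0.3 criterion:", validity_r >= 0.3)
```

A sizeable correlation with the verbal test would be a prompt to look again at the wording of the items and instructions, since it suggests the scores partly reflect verbal ability.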
4. Establishing a Benchmark Against Which to Interpret Test Scores (Norms)
If you scored 25 on a test, what would you think? Would
you think that was a good score, a poor score, or
average? In truth, a standalone 'raw score' like that
means nothing because you have no context in which to
interpret it. If you knew your score was 25 out of 100,
what would you think then? You might think that was not
such a good score. But why? If the test was
particularly difficult, your score might be amongst the
best.
If you then discovered that your
score was 25 out of 100, and that this was an average
score, then you might think you now knew how good or bad
it was - but not
quite. There is still one piece of information missing.
You now know your score is average, but average
compared to whom? 16-year-old school leavers?
Graduates? You still do not have a definite idea of
what your score means.
If you finally discovered that your
test score was 25 out of 100, and that this was average
compared to the scores of graduates on
the same test, you would at last know what your score
meant.
This is what happens in ability testing - a test
score is interpreted in relation to some comparison
group. The aim is to produce a set of these ‘norm
groups’ to enable the employer to make comparisons of a
candidate’s test score with the performance of a known
group of people. Norm groups take a long time to
produce, and much of the work is done by test publishers
prior to releasing the test, although it is a never
ending task and norm groups are constantly being
updated.
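The effect of the comparison group can be shown with a short sketch that converts the same raw score of 25 into a percentile rank against two different norm groups. The norm group scores and the function name are invented for illustration, and real norm tables are built from far larger samples.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring below `score`.

    Tied scores count as half below, a common convention.
    """
    below = sum(1 for s in norm_scores if s < score)
    ties = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * ties) / len(norm_scores)


# Hypothetical norm groups (raw scores out of 100 on the same test).
graduates = [18, 20, 22, 24, 25, 25, 26, 28, 30, 32]
school_leavers = [10, 12, 14, 15, 16, 18, 19, 20, 22, 25]

raw_score = 25
print(percentile_rank(raw_score, graduates))       # 50.0
print(percentile_rank(raw_score, school_leavers))  # 95.0
```

The same raw score is average against the graduate group but near the top of the school leaver group, which is why a score cannot be interpreted until the norm group is named.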
5. Fairness and Discrimination in Test Use
As well as a formal legal requirement that a test
should not unfairly discriminate against particular
ethnic or gender groups, there are ethical and practical
reasons why employers should use tests that are fair.
We know that males and females, and different ethnic
groups, all have the same overall level of intellectual
ability. This means that if a test systematically
suggested that men scored lower than women on an ability
or aptitude test, and those test results were used to
select candidates for a job, a disproportionate number
of women would be selected. This would be fine if
what the test scores suggested about men (i.e. that they
had less intellectual ability than women) were true, but
it is not. The higher levels of work success that the
test predicted for women would not be found in
practice.
The purpose of a test is to discriminate,
but only between people who have differing levels of
ability on the characteristic in question, and not on
the basis of irrelevant characteristics such as gender.
Consequently, ability and aptitude tests need to be
carefully constructed and statistically analysed to make sure that they
do not discriminate between people on anything other
than the actual ability or aptitude in question.
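One very simple first check of the kind of statistical analysis mentioned above is to compare the average scores of two groups in standard deviation units (a standardised mean difference, often called Cohen's d). This sketch is an assumption about how such a check might look, not the analysis any particular developer uses. The scores and the function name are invented, and a full fairness analysis goes well beyond a single summary figure.

```python
from math import sqrt
from statistics import mean, variance


def standardised_mean_difference(group_a, group_b):
    """Difference between group means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical ability test scores for two groups of candidates.
women = [22, 25, 27, 30, 31]
men = [21, 26, 27, 29, 32]

d = standardised_mean_difference(women, men)
print(round(d, 2))  # a value near 0 means no average group difference
```

A value well away from zero on an ability test would be a signal to investigate the items further, since the text's premise is that the groups do not differ in overall ability.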
With
personality tests, this is much less of an issue because
we know that differences in personality test results may
well reflect real differences between males and females. For
instance, females tend to be reported as being more
sensitive to other people’s feelings and more socially
oriented than males. Remember that personality is not the
same thing as ability, so the issue is much less
contentious.
A Standard Development Process Involves
- Defining the trait or characteristic to be
measured in psychological
terms.
- Item writing - writing good test items is
exceptionally difficult, and the guidelines used are
based on published research. A lot of items need to
be written initially. Perhaps only 10% of the
initial items will pass the various quantitative and
qualitative quality control processes and make it into
the final test.
- Response format - this needs to be chosen on the
basis of the test function, and must avoid range
restriction and allow the analysis of data at the
Scalar level - in practice we deal with Interval data
rather than Ratio data, as there is no absolute zero
value in ability/personality assessment.
- Trial group choice - this generally needs to be
screened and stratified.
- First trial - Standard Item Analysis
is carried out (see any good textbook on the subject
for P = * value cut-offs, mean and SD parameters) to
reduce the number of items.
- Second Trial - item analysis repeated, and a
First Order Factor Analysis
carried out. Oblique or
Orthogonal Exploratory Factor Analysis
is used depending on the characteristics being
measured. Second Order Factor Analysis
is also carried out at this stage to establish
the macro structure of the characteristic being
measured and to ensure internal scale coherence.
- Next, Standardisation and Reliability Analysis
are carried out. The Alpha Coefficient,
or Cronbach's Alpha, is the most widely
used reliability analysis method. The
standardisation and normative data production is
straightforward, and done according to accepted
methods. In calculating reliability, further items
may be moved, or removed from certain scales so that
all of the items in a particular scale contribute
to that scale's reliability - in other words, the
degree to which it is free of error.
- Finally, the administration, scoring and
reporting functions are finalised, and
organisational users are trained in its use.
- One more important activity also occurs at this
stage - we decide upon success criteria, or the
standards against which the predictive value
(validity) of the test will be subsequently
measured, and the analysis method to be used.
- Now your test is ready to use - but the
development of the test will be an ongoing task with
norms being added and updated as data from test
takers is amassed.
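The reliability step in the process above can be illustrated with a short sketch of Cronbach's Alpha, together with the 'alpha if item deleted' figure that helps decide whether an item should be removed from a scale. The item scores and function names are hypothetical, and this is a sketch of the standard formula rather than a full analysis.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's Alpha for one scale.

    item_scores: one row per person, each row holding that person's
    scores on every item in the scale.
    """
    k = len(item_scores[0])
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def alpha_if_item_deleted(item_scores):
    """Alpha recomputed with each item left out in turn."""
    k = len(item_scores[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(row) if j != i]
                        for row in item_scores])
        for i in range(k)
    ]


# Hypothetical 1-5 ratings: 5 people x 4 items. The last item runs
# against the other three.
scale = [
    [4, 5, 4, 2],
    [3, 3, 3, 5],
    [5, 5, 4, 1],
    [2, 2, 3, 4],
    [1, 2, 1, 3],
]

alpha = cronbach_alpha(scale)
without_each = alpha_if_item_deleted(scale)
print(round(alpha, 2), [round(a, 2) for a in without_each])
```

In this made-up data the scale's alpha is low with all four items and rises sharply once the fourth item is left out, which is the kind of evidence that leads to an item being moved or removed.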