"The Making of a Scientist"
by Dr. Anna Roe
Dodd, Mead & Company
New York
1952, 1953
A Book Review
by Robert N. Seitz, Ph. D.
July 15, 2001
Introduction
Dr. Roe's "The Making of a Scientist" is a widely cited
study of 60 eminent U. S. scientists whose careers, as reviewed in this
book, fell within the first half of the twentieth century. These would
have been men who were born primarily in the interval between 1890 and
1920. What's most widely quoted from her study are the scores she
obtained on "IQ tests" that she administered to them. She gave
them three types of tests: verbal, mathematical, and spatial. She then
defined three different kinds of "IQ" for the three tests: a
"verbal IQ", a "mathematical IQ", and a "spatial
IQ". I've put "IQ" in quotation marks because the
measurement of adult IQ's above, perhaps, a ratio IQ of 160 has been a
Holy Grail of psychometrics. I wondered what kind of IQ test Dr. Roe found
that would take these men's measures at levels at or above a "verbal
IQ" of 177, and up to a "mathematical IQ" of 194. To make
matters more extraordinary, she gave her 39-question mathematical test
"only to the biologists and the social scientists". She
continues, "I tried it on a few of the physicists just to see. It
bothered one of them, but the others sailed right through, making an
occasional careless mistake. The test was obviously not difficult enough
for them and a waste of their time."
The median "verbal IQ" for them was 166, with
a range from a "verbal IQ" of 121 to a "verbal IQ" of
177. She observes that the 177 high score on this test is probably less
than the high scorers could have gotten on a test with a higher ceiling.
The 177 "verbal IQ" represents a raw score of 75 out of 79 test
items, so the highest scorers were bumping their heads on the ceiling of
this test, which would probably have been in the low 180's.
Their highest score on the spatial test was a
"spatial IQ" of 164, with a median IQ score of 137. The score of
164 corresponds to a raw score of 22 out of the 24 questions on the test,
so again, the test takers were bumping their heads on the ceiling of the
test.
Their "mathematical IQ's" ranged from 128 to
194, with a median "mathematical IQ" of 154. The top
"mathematical IQ" of 194 equated to a raw score of only 27 out of 39
questions on the test, so this test had plenty of headroom.
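To put the three ceilings side by side, here is a small Python sketch that turns the top raw scores quoted above into the share of items answered correctly. The dictionary and function names are my own; the numbers are the ones quoted from the book.

```python
# Top raw scores and test lengths as quoted in this review.
top_scores = {
    "verbal": (75, 79),
    "spatial": (22, 24),
    "mathematical": (27, 39),
}


def percent_correct(raw, items):
    """Return the share of test items answered correctly, in percent."""
    return 100.0 * raw / items


for test, (raw, items) in top_scores.items():
    share = percent_correct(raw, items)
    print(f"{test:>12}: {raw}/{items} = {share:.1f}% of items")
```

The top verbal and spatial scores use up more than 90% of the available items, while the top mathematical score uses only about 69% of them.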
So let's see now. It has been argued that although
children's IQ's fail to fit a Gaussian distribution, this is because
children have unequal mental growth rates, but adults fit a Gaussian
distribution curve. Now an IQ of 194 (standard deviation of 16) occurs
with an expected frequency of 1 in 500,000,000 people (the
99.9999998th-%tile), or if her tests had a standard deviation of 15, then
with an expected frequency of 1 in 5,000,000,000 (the
99.999999998th-%tile). At least two of Dr. Roe's biologists and
psychologists scored at the 194 level on this test, with a raw score of 27
out of 39. Given a standard deviation of 16, that would have made them
arguably the two brightest mathematical minds in Western civilization in
1952--or whenever they took Dr. Roe's test. But wait! The biologists and
the social scientists are the second string! The physicists "sailed
right through, making an occasional careless mistake." So what does
that make their mathematical IQ's? 220? 240? Reason reels! So you can
see why I was intrigued with the details of these widely quoted IQ values.
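The rarity figures above can be checked with a few lines of Python. This is only a sketch: it assumes that IQ is exactly normally distributed with a mean of 100, and the function names are mine, not anything from Dr. Roe's book.

```python
import math


def upper_tail(iq, mean=100.0, sd=16.0):
    """Probability that a normally distributed score is at or above `iq`."""
    z = (iq - mean) / sd
    # erfc stays accurate far out in the tail, where 1 - cdf would lose digits.
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def rarity(iq, sd):
    """Return N such that the score is expected about once in N people."""
    return 1.0 / upper_tail(iq, sd=sd)


for sd in (16.0, 15.0):
    print(f"IQ 194, SD {sd:.0f}: about 1 in {rarity(194, sd):,.0f}")
```

With a standard deviation of 16 the result lands close to 1 in 500 million, and with 15 it lands close to 1 in 5 billion, in line with the round numbers quoted above.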
Dr. Roe might have argued that these scores were only
one of three factor-test results whose scores would have to be combined to
yield something like a general IQ score. But even so, the idea that 40% to
50% of the world's reservoir of mathematical aptitude was concentrated in
the U. S. in 1950 or 1951 is a little hard to swallow. Then when you crank
in the fact that these were just the runners-up---that the first team was
well ahead of this group---you realize that either the test didn't measure
what it was purported to measure---namely, a deviation IQ of 194---or the
194 was a ratio IQ, or both. (The 194 "mathematical IQ" score
was achieved by solving only 69% of the problems on the test so there was
still plenty of ceiling left for the physicists who "sailed right
through, making an occasional careless mistake". How could she have
presented this story with a straight face?) What I don't understand is why
neither she, nor apparently anyone else, ever challenged or investigated
these claims. To me, this is what science is about, and how science
advances. Nor are Dr. Roe's incongruous numbers unusual. Over the next few
weeks, I'll be highlighting several other areas that also seem to me to
raise significant questions, and to stake out needed studies (and perhaps,
opportunities for discovery, or at least, for clarification) vis-a-vis
what's going on in this field of intelligence testing.
To give a brief example, the Flynn Effect is impossible
if IQ is primarily determined by heredity.
To compound the issue, the Flynn Effect has occurred
only, or virtually only, in fluid g and not in vocabulary, arithmetic, or
general information. Scores on the Raven Progressive Matrices (RPM) have
risen in England by 47 points over a 90-year interval, or by 63.5 points
(121/74) as measured by the IQ tests of 1900, had they existed, or
by 39 points viewed from 1990. In other words, a sample of average (IQ =
100) test takers born in 1967 and taking the RPM in 1990 would have
gotten a 163.5 IQ on the RPM by the norms of 1900. Conversely, someone
born in 1877 taking the
RPM today would have scored at the same level as one of today's
borderline retarded with an IQ of 61. But this would only occur in tests
assessing pattern recognition and the eduction of relationships. The
Britisher born in 1877 would have had a vocabulary, an arithmetic
capability, and a fund of general information approximately equal to
those of today's Britisher born in 1967, who scores so dramatically higher
on inferential
and pattern discernment tests than would his equivalent born in 1877. But
in a homogeneous cultural milieu such as Britain, there should be no major
difference between fluid intelligence (g or Gf) and crystallized
intelligence scores. Given a common culture, the person with the higher g
will learn more and will use it more proficiently than the person with a
lower g score. Crystallized intelligence will closely track fluid
intelligence. Dr. Arthur Jensen, in "The g Factor", pg. 124,
says of this,
"Gf and Gc typically emerge as higher order
(usually second-order) factors in any large collection of tests given to a
highly heterogeneous subject sample in terms of educational or cultural
background. In factor analyses based on groups that are quite homogeneous
in these respects, such as schoolchildren of the same age and
socio-cultural background, Gf and Gc often are not clearly
differentiated and amalgamate into a single general factor. But in the
general population, Gf and Gc are clearly discerned, and the distinctions
that Cattell makes between them are valid. The major exception is
Cattell's prediction that the heritability of Gf is greater than that of
Gc. Although this may be true in linguistically or culturally [different] samples for
which some of the Gc-loaded tests may be inappropriate or culturally
biased measures, the usual finding is that Gf and Gc have about the same
heritability. In fact, the heritability of scores on scholastic
achievement tests is about the same as that on the best tests of Gf. In
terms of Cattell's investment theory, one could say that persons'
standings on tests of Gc quite closely reflects that amount of Gf they had
to invest in the kinds of content that typically compose highly Gc-loaded
tests."
In other words, Gf shouldn't be able to rise without a
corresponding rise in Gc---and certainly not by 63 points. But it has.
Something must be wrong somewhere in this chain of logic.
In fairness to Dr. Roe, it should be mentioned that she
says (pg. 159),
"I was not particularly concerned at the outset
over the fact that I had no norms for this test. That is, I had no idea
what any other population would do on the test. I just assumed that
eminent scientists were extremely bright people, and I did not
particularly care just how bright they were. What I wanted to know was
whether there was a pattern in the relative standings on these tests for
any group, and if so, how these patterns compared. That is, I wanted to
know if one group of scientists tended to be relatively high on one test
and relatively low on another. The tests, though of different factors and
different numbers of items, could be compared directly for any person or
group, by converting the raw score (in this case the number of items
answered correctly) into what is known as a standard score. The name
refers to the standard deviation, a statistical measure which is useful in
computing the score. This score gives you the position of the subject with
respect to the average and distribution of all the scores in his group. If
his standard score is 0, it means that he scores exactly at the average of
his group; if his standard score is -.5, it means that he is at such a
position below the average of the group that only one-third of the group
got a lower score; if his standard score is +1.0, it means that only
one-sixth of the group scored higher than he*." (The standard score is
the score measured in standard deviations... e. g., 2 standard
deviations.)
"* This assumes a standard distribution of scores on the test, which
will be near enough the case on this type of test with a large
group."
What? These are precisely the very-high IQ's which
deviate dramatically from a normal distribution. This assumption seems to
me to beg the question.
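Dr. Roe's description of a standard score can be made concrete with a short Python sketch. It assumes, as her footnote does, that the scores are normally distributed. The sample raw scores and the helper names are made up for illustration and are not data from the book.

```python
from statistics import NormalDist, mean, pstdev


def standard_scores(raw_scores):
    """Convert raw scores to standard scores relative to their own group."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)
    return [(x - mu) / sigma for x in raw_scores]


def share_below(z):
    """Share of a normally distributed group scoring below standard score z."""
    return NormalDist().cdf(z)


# Hypothetical raw scores, not data from the book.
group = [46, 52, 57, 60, 64, 68, 71, 75]
print([round(z, 2) for z in standard_scores(group)])

print(f"below z = -0.5: {share_below(-0.5):.3f}")     # a bit under one-third
print(f"above z = +1.0: {1 - share_below(1.0):.3f}")  # roughly one-sixth
```

The one-third and one-sixth figures only hold if the group's scores really follow a normal curve, which is the assumption at issue here.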
She also writes (pg. 156),
"It was assumed for some time that a person's IQ
was a fixed part of him, like complexion or eye-color, and that, except in
extraordinary instances, it did not change. This was what was called the
constancy of the IQ. The term was first used by Terman and although he
pointed out at that time that about half of the children he examined had
shown changes in IQ (from 1 to over 20 points) this tended to be
overlooked.
"Our ideas on the nature of intelligence and on
the constancy of the IQ have changed. There is, however, more agreement on
the latter point than the former. We know now that the IQ is not constant
in the sense we used to think of it, but that there are many things that
may affect it, and that particularly in the very early years, we cannot
effectively predict what any individual's IQ will be 10 years later. On
the other hand, by the age of 7 or 8, we can get about as good an estimate
as we are ever likely to, but we cannot be sure that environmental or
emotional influences won't alter it to a greater or lesser extent. Shifts
after that time, however, are under most circumstances sufficiently small
that the measurement of intelligence is a very useful technique."
Seeking a test that had sufficient ceiling to test her
60 eminent scientists, Dr. Roe says,
"I could find none that seemed to me to be
difficult enough for the group I proposed to test. In psychological
jargon, they did not have enough ceiling."
It's interesting to note that she must have ruled out
the CMT-A and the CMT-T, as well as the Wechsler-Bellevue test. (The
latter is only recommended for IQ's up to two standard deviations above
the mean, although its official ceiling is 3 2/3rds standard deviations
above the mean, and psychometrists sometimes extrapolate its results to
scores even higher than that.) She continues,
"I took my problem to the Educational Testing
Service... After some consultation, they pulled out a lot of difficult
items from their files and made up the verbal test. The spatial test is
part of another test, and the mathematical test is an abbreviation of a
special test they constructed for one of the military services during the
war. All were given with arbitrarily set time limits."
The Educational Testing Service tends to score its
tests upon a percentile basis. Converting percentile scores to IQ scores
by reading them off a Gaussian normal distribution makes the implicit
assumption that IQ's are normally distributed. But they're not, nor are
childhood and adult heights.
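The percentile-to-IQ conversion described above can be sketched in a few lines of Python. The mean of 100 and the choice of a standard deviation of 15 or 16 are assumptions, and `percentile_to_iq` is a name of my own. Nothing here is taken from the Educational Testing Service.

```python
from statistics import NormalDist


def percentile_to_iq(percentile, sd=16.0, mean=100.0):
    """Read a deviation IQ off a normal curve for a percentile in (0, 100)."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(percentile / 100.0)


for pct in (50.0, 98.0, 99.9):
    iq16 = percentile_to_iq(pct)
    iq15 = percentile_to_iq(pct, sd=15.0)
    print(f"{pct}th percentile -> IQ {iq16:.0f} (SD 16), {iq15:.0f} (SD 15)")
```

The arithmetic is trivial. The weak link is the assumption that the normal curve still describes the population at the extreme percentiles being converted.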
Childhood IQ's and Non-Gaussian Distributions
Both childhood and adult heights are approximately
normally distributed near the average adult U. S. male height of
5' 9", but extreme heights occur at a far higher rate than a normal
curve would predict. A normal curve predicts that an adult male height of
5 feet or less should be expected to occur about once in every thousand
men. It predicts that an adult male height below 4 feet should occur at a
rate of only one per billion, and an adult male height below 3 feet would
be absurdly impossible.
A similar situation exists with respect to children's
IQ's (and perhaps with adult IQ's as well). Under a normal curve, only
about one IQ of 200 or above should occur on this planet. In practice,
IQ's of 200 occur about once
among every 500,000 children. For example, one child with an IQ of 200 was
identified in the 1921-1922 Terman Study during the screening of 250,000+
California schoolchildren. More to the point, a normal curve would have led
them to expect, perhaps, one child with an IQ of 170 or above. Instead,
they turned up 77 of them!
An IQ of 180 or above has an expected frequency of occurrence of 1 in
3,500,000, so they wouldn't have expected to find any IQ's of 180+ during
the Terman screening. In fact, they unearthed 26 of them, or about 300
times the number they would have expected. Among her 12 children with IQ's
above 180, Leta Hollingworth found one with an IQ of 199 (Child L),
one with an IQ of 200 (Child K), and one with an IQ of 200+ (Child F).
Four children with IQ's of 200+ were found in the 1940's among the Quiz
Kids who lived in the greater Chicago area: Richard Williams, Joel
Kupperman, Lonnie Lunde, and Ruth (Duskin) Feldman. Miraca Gross found
four in her study of the severely gifted in Australia, including one,
Adrian Seng, with an IQ of 220. Marilyn vos Savant's 10-year-old IQ of 228
and other IQ's that are significantly above 200 would
be totally impossible if IQ's were distributed in strict accordance
with a Gaussian normal curve.
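Here is a rough Python check of the Terman numbers. It assumes a normal curve with a mean of 100 and a standard deviation of 16, and it takes the screening size as exactly 250,000. Both are my simplifications, and the function name is my own.

```python
import math


def upper_tail(iq, mean=100.0, sd=16.0):
    """Probability of a score at or above `iq` under a normal curve."""
    z = (iq - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2.0))


SCREENED = 250_000  # children screened in the Terman study, per the text

# IQ threshold -> number of children actually found, as quoted above.
found = {170: 77, 180: 26}

for threshold, observed in found.items():
    expected = SCREENED * upper_tail(threshold)
    ratio = observed / expected
    print(f"IQ {threshold}+: expected {expected:.2f}, found {observed}, "
          f"about {ratio:.0f} times as many")
```

Under these assumptions the curve predicts between one and two children at 170 or above and well under one child at 180 or above, so the 26 found at 180+ exceed the prediction by a factor of a few hundred.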
Bottom Line:
Extreme heights, and extreme IQ's, occur much more
frequently than a normal bell curve would predict.
Getting back to our book, Dr. Roe continues,
"I made an attempt to get some graduate students
to take the same test, just as a matter of general interest, but succeeded
in getting only 10, and under circumstances which made it impossible to
judge how they had been selected. I then dropped the idea of getting any
comparison group... ... ... I had the great good fortune then to meet an
old acquaintance, Dr. Irving Lorge, who came to my rescue and arranged to
give the test to all students matriculating at Teachers College, Columbia
for a Ph. D. that February. All of their Ph. D. students have to take a
battery of tests. This test would be included in the battery. Since the
other tests had been well standardized it would then be possible to draw
up tables of equivalents by which scores on the VSM could be converted
(within certain limits of assurance) to scores on these other tests. This,
incidentally, upset my budget considerably."
For me, three questions arise.
(1) Wouldn't the other tests in Columbia's battery of
Ph.D. exams be subject-matter preliminary exams? Would they have included
an IQ test? If so, what IQ test would have had sufficient headroom to
properly encompass those Ph. D. students?
(2) How could she have drawn up tables of equivalents
that would have allowed her to convert scores on her VSM to scores on
other tests? What tests could have gone high enough?
(3) Could these other tests have been
school-administered IQ tests? If so, what about regression to the mean?
And above all, we're given no real information about this validation of
her VSM. How many Columbia Ph. D. students took the VSM?
The Verbal Test
She continues, explaining that the verbal test
consisted of 80 items, but one was dropped because it didn't discriminate
at all well, leaving 50 questions in the first section, and 29 questions
in the second section. In the first section, out of four words, you were
to pick out the two that were most nearly opposite in meaning. Here is one
of the questions:
1. Predictable 2. Precarious 3. Stable 4. Laborious
"In the second section, the task was the same, but
it was presented a little differently. This time, one of the opposites was
given and the task was to pick one of five other words which was most
nearly opposite to the first one. Here is an example of that group."
ABSOLUTE: 1-forget 2-usurp 3-absolve 4-utilize 5-limit
The lowest scores were made by the experimental
physicists, with a range of 8 to 71, and an average score of 46.6, the
lowest of any of the groups. The highest scores on this test were made by
the theoretical physicists, with a range of 52 to 75, and an average of
64.
IQ's of Various Collegiate Groups
"Now let us look at IQ's of college populations of
today. Embree found that 1,200 high school graduates who went to college
had been found during childhood to have a median IQ of 118; those who
graduated with a B. A., an IQ of 123. Honor graduates had a median IQ of
133 and those elected to Phi Beta Kappa of 137. The range of IQ's for all
of those who received degrees was from 95 to 180. For persons who went on
to take a Ph. D., Wrenn found a median IQ of 141.
"It is clear, then, that so far as verbal ability is
concerned these eminent scientists are on the average higher than the
general run of those who get Ph. D's, but, and this is very important,
some of them are not as high as the average Ph. D. It is, then, not
essential to have this ability at the highest level in order to become an
eminent scientist. That it is doubtless a great help is another
matter, but it should be remembered that it is less helpful in some fields
than in others."
The Spatial Test
The spatial test consisted of 24 items, with 20 minutes
to solve them. Depicted below are three practice items for this test.
Once again the theoretical physicists led the pack, although this time,
the experimental physicists were right behind them. As mentioned above,
scores ranged from "spatial IQ's" of 123 to 164, with a median
score of 137. One interesting and unexpected discovery was that the
"spatial IQ's" of the 60 eminent scientists declined
significantly as a function of increasing age. The correlation coefficient
was -0.40. Reviewing it today, we might say that these declines in
"spatial IQ's" reflected declines in fluid g, or the Flynn
Effect, or both.
The Mathematical Test
Dr. Roe explains that the mathematical test "was
taken from one which the Educational Testing Service had developed for a
special project. The original was too long for my purposes, so we selected
portions of it, omitting some of the easiest items, and then deleting
other items of varying levels of difficulty. There were 39 items in the
final form, and 30 minutes was allowed for work on it. The items were
generally of the type known as mathematical reasoning, and an example is
given below.
"Select the correct answer:
"If x + 3y = 7x + 5y, x/y = ?
"(A) -3 (B) -1/3 (C) -1/9 (D) 1 (E) 3.
"If you did it properly, you underlined A."
Whoops, Dr. Anna! The correct answer is B (-1/3), as we can see by solving
the problem. Subtracting x + 3y from both sides of the equation gives us
0 = 6x + 2y.
Dividing both sides by 2 yields 3x + y = 0, or 3x = -y, or x = -y/3, or,
dividing both sides by y, x/y = -1/3.
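For anyone who would rather check the answer choices mechanically, here is a short Python sketch using exact fractions. Because the equation only constrains the ratio x/y, the sketch sets y = 1 and tries each offered ratio as x. The function and variable names are mine.

```python
from fractions import Fraction


def satisfies(ratio):
    """Check whether x/y = ratio satisfies x + 3y = 7x + 5y, taking y = 1."""
    x, y = Fraction(ratio), Fraction(1)
    return x + 3 * y == 7 * x + 5 * y


# The five answer choices offered on the test item.
choices = {
    "A": Fraction(-3),
    "B": Fraction(-1, 3),
    "C": Fraction(-1, 9),
    "D": Fraction(1),
    "E": Fraction(3),
}

for label, ratio in choices.items():
    print(label, ratio, satisfies(ratio))
```

Only choice B passes the check.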
Dr. Roe says,
"Let us look at the equivalents on this test. It
is not correlated with age (the correlation coefficient is .00). The
lowest score on this test is about equivalent to an IQ of 123, the median
score is an IQ of 154 and the highest to an IQ of 194. That is very high
indeed. Mathematical ability is certainly important for work in physics,
but it seems it can also be important in some other sciences, particularly
biology and psychology. The two highest scores attained by biologists were
made by geneticists."
"If we examine the correlations between the tests,
we see that it is true in this instance that ability to do the
mathematical test is not related to ability to do either of the other
tests. The correlations are .14 and .21 which are not significant with
these groups. There is, however, some correlation between the spatial and
verbal tests. The coefficient is +.33, and that is high enough to indicate
that some relationship exists. It is not close, but it points up one of
the difficulties with the spatial test. That is, it can be done in
different ways. Those who do it extremely well do it for the most part
without much conscious reasoning about it. They can tell the answer 'just
by looking' at the figures and imagining them turned around in various
ways. Some of the others, however, are able to do it fairly well by
talking to themselves about it, and it is through such circumstances, I
think, that the relation with the verbal test comes in."
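One way to see why .33 counts as a relationship while .14 and .21 do not is the usual t test for a correlation coefficient. The sketch below is mine, not Dr. Roe's. The passage does not give the group sizes, so the n of 60 is purely an assumption, and the mathematical test in particular went to fewer people than that. The cutoff of 2.0 is the approximate two-tailed 5% value for about 58 degrees of freedom.

```python
import math


def t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation r differs from 0."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


N_ASSUMED = 60    # hypothetical: all 60 scientists; the real n may be smaller
T_CRITICAL = 2.0  # approximate two-tailed 5% cutoff for about 58 degrees of freedom

for r in (0.14, 0.21, 0.33):
    t = t_statistic(r, N_ASSUMED)
    verdict = "significant" if abs(t) > T_CRITICAL else "not significant"
    print(f"r = {r:.2f}: t = {t:.2f} -> {verdict} at the 5% level")
```

A smaller group would push the two weaker correlations even further from significance, so the assumed n does not change which side of the line they fall on.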
It seems to me that Dr. Roe's "verbal IQ's", "spatial IQ's", and
"mathematical IQ's" are being quoted and bandied about as though
they were gospel, whereas in reality, there's fine print associated with
them that, I think, has gotten lost over the years and in the translation.
Also, as I read the tea leaves, something doesn't add up in the original
IQ numbers.
