
Re: RECENT MANUAL MEASUREMENTS OF OA AND OAA



On Wed, 11 Jan 2006, David Goodman wrote (in liblicense-l):

> Within the last few months, Stevan Harnad and his group, and we in our 
> group, have carried out together several manual measurements of OA (and 
> sometimes OAA, Open Access Advantage). The intent has been to 
> independently evaluate the accuracy of Chawki Hajjem's robot program, 
> which has been widely used by Harnad's group to carry out similar measurements 
> by computer.
>
> The results from these measurements were first reported in a joint 
> posting on Amsci,* referring for specifics to a simultaneously posted 
> detailed technical report,** in which the results of each of several 
> manual analyses were separately reported.
>
> * http://listserver.sigmaxi.org/sc/wa.exe?A2=ind05&L=american-scientist-open-access-forum&D=1&O=D&F=l&P=96445
>
> ** "Evaluation of Algorithm Performance on Identifying OA" by Kristin
> Antelman, Nisa Bakkalbasi, David Goodman, Chawki Hajjem, Stevan Harnad (in
> alphabetical order) posted on ECS as http://eprints.ecs.soton.ac.uk/11689/
>
> From these data, both groups agreed that "In conclusion, the robot is not
> yet performing at a desirable level and future work may be needed to
> determine the causes, and improve the algorithm."

I am happy that David and his co-workers did an independent test of how 
accurately Chawki's robot detects OA. The robot over-estimates OA (i.e., 
it miscodes many non-OA articles as OA: false positives, or false OA).

Since our primary interest was and is in demonstrating the OA citation 
impact advantage, we had reasoned that any tendency to mix up OA and 
non-OA would go against us, because we were comparing the relative number 
of citations for OA and non-OA articles: the OA/non-OA citation ratio. So 
mixing up OA and non-OA would simply dilute that ratio, and hence reduce the 
detectability of any underlying OA advantage. (But more on this below.)
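
Just to make the dilution argument concrete, here is a minimal sketch (in 
Python, with made-up illustrative numbers, not our actual data) of how 
misclassification in either direction pulls the observed OA/non-OA citation 
ratio toward 1 rather than inflating it:

    # Illustrative sketch: a true 50% OA advantage (mean 15 vs 10 citations),
    # 15% true OA, with 20% of OA articles missed and 20% of non-OA articles
    # falsely coded OA.  The observed ratio falls well below the true 1.5.

    def observed_ratio(oa_mean, noa_mean, p_oa, miss_rate, false_oa_rate):
        """Observed mean-citation ratio of the 'OA-labelled' vs the
        'non-OA-labelled' group when the labels are noisy."""
        oa_hits = p_oa * (1 - miss_rate)          # true OA, correctly labelled OA
        false_oa = (1 - p_oa) * false_oa_rate     # non-OA, falsely labelled OA
        labelled_oa = (oa_hits * oa_mean + false_oa * noa_mean) / (oa_hits + false_oa)

        oa_missed = p_oa * miss_rate              # true OA, labelled non-OA
        true_negs = (1 - p_oa) * (1 - false_oa_rate)
        labelled_noa = (oa_missed * oa_mean + true_negs * noa_mean) / (oa_missed + true_negs)
        return labelled_oa / labelled_noa

    print(observed_ratio(15.0, 10.0, p_oa=0.15, miss_rate=0.2, false_oa_rate=0.2))
    # ~1.18: the underlying 1.5 ratio is diluted toward 1, not inflated.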

We were not particularly touting the robot's accuracy in and of itself, 
nor its absolute estimates of the percentage of OA articles. There are 
other estimates of %OA, and they all agree that it is roughly between 5% 
and 25%, depending on field and year. We definitely do not think that 
pinning down that absolute percentage accurately is the high priority 
research goal at this time.

In contrast, confirming the OA impact advantage (as first reported in 2001 
by Lawrence for computer science) across other disciplines *is* a high 
priority research goal today (because of its importance for motivating 
OA). And we have already confirmed that OA advantage in a number of areas 
of physics and mathematics *without the use of a robot.*

     Brody, T. and Harnad, S. (2004) Comparing the Impact of Open Access
     (OA) vs. Non-OA Articles in the Same Journals. D-Lib Magazine 10(6).
     http://eprints.ecs.soton.ac.uk/10207/

     Harnad, S., Brody, T., Vallieres, F., Carr, L., Hitchcock, S.,
     Gingras, Y., Oppenheim, C., Stamerjohanns, H. and Hilf, E. (2004)
     The Access/Impact Problem and the Green and Gold Roads to Open
     Access. Serials Review 30(4).
     http://eprints.ecs.soton.ac.uk/10209/

     Brody, T., Harnad, S. and Carr, L. (2005) Earlier Web Usage
     Statistics as Predictors of Later Citation Impact. Journal of the
     American Society for Information Science and Technology (JASIST).
     http://eprints.ecs.soton.ac.uk/10713/

For the OA advantage too, it is its virtually exception-free positive 
polarity that is most important today -- less so its absolute value or 
variation by year and field.

The summary of the Goodman et al. independent signal-detection analysis of 
the robot's accuracy is the following:

     This is a second signal-detection analysis of the accuracy of a
     robot in detecting open access (OA) articles (by checking by hand
     how many of the articles the robot tagged OA were really OA, and
     vice versa). A first analysis, on a smaller sample (Biology: 100
     OA, 100 non-OA), had found a detectability (d') of 2.45 and bias of
     0.52 (hits 93%, false positives 16%; Biology %OA: 14%; OA citation
     advantage: 50%). The present analysis on a larger sample (Biology:
     272 OA, 272 non-OA) found a detectability of 0.98 and bias of 0.78
     (hits 77%, false positives, 41%; Biology %OA: 16%; OA citation
     advantage: 64%). An analysis in Sociology (177 OA, 177 non-OA)
     found near-chance detectability (d' = 0.11) and an OA bias of 0.99
     (hits, 9%, false alarms, -2%; prior robot estimate Sociology %OA:
     23%; present estimate 15%). It was not possible from these data to
     estimate the Sociology OA citation advantage. CONCLUSIONS: The robot
     significantly overcodes for OA. In Biology 2002, 40% of identified
     OA was in fact OA. In Sociology 2000, only 18% of identified OA
     was in fact OA. Missed OA was lower: 12% in Biology 2002 and 14% in
     Sociology 2000. The sources of the error are impossible to determine
     from the present data, since the algorithm did not capture URLs for
     documents identified as OA. In conclusion, the robot is not yet
     performing at a desirable level and future work may be needed to
     determine the causes, and improve the algorithm.
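
For readers who want to check the signal-detection arithmetic, the d' values 
quoted above follow directly from the stated hit and false-alarm rates under 
the standard equal-variance Gaussian model; the reported "bias" figures appear 
to correspond to the likelihood-ratio measure beta (that is my inference from 
the Biology numbers, not something stated in the report). A minimal sketch:

    from math import exp
    from statistics import NormalDist

    def d_prime_and_beta(hit_rate, fa_rate):
        """Detectability d' and likelihood-ratio bias beta from hit and
        false-alarm rates (equal-variance Gaussian assumption)."""
        z = NormalDist().inv_cdf
        z_h, z_f = z(hit_rate), z(fa_rate)
        d_prime = z_h - z_f
        beta = exp(-(z_h ** 2 - z_f ** 2) / 2)
        return d_prime, beta

    print(d_prime_and_beta(0.93, 0.16))  # first Biology sample: d' ~ 2.5, beta ~ 0.55
    print(d_prime_and_beta(0.77, 0.41))  # larger Biology sample: d' ~ 0.97, beta ~ 0.78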

In other words, the second test, based on the better, larger sample, finds 
a lower accuracy and a higher false-OA bias. In Biology, the robot had 
estimated 14% OA overall; the estimate based on the Goodman et al. sample 
was instead 16% OA. (So the robot's *over*coding of the OA had actually 
resulted in a slight *under*estimate of %OA -- largely because the 
population proportion of OA is so low: somewhere between 5% and 25%.) The 
robot had found an average OA advantage of 50% in Biology; the Goodman et 
al. sample found an OA advantage of 64%. (Again, there was not much 
change, because the overall proportion of OA is still so low.)

Our robot's accuracy for Sociology (which we had not tested, so Goodman et 
al's was the first test) turned out to be much worse, and we are 
investigating this further. It will be important to find out why the 
robot's accuracy in detecting OA would vary from field to field.

> Our group has now prepared an overall meta-analysis of the manual 
> results from both groups. *** We are able to combine the results, as we 
> all were careful to examine the same sample base using identical 
> protocols for both the counting and the analysis. Upon testing, we found 
> a within-group inter-rater agreement of 93% and a between-groups 
> agreement of 92%.
>
> *** "Meta-analysis of OA and OAA manual determinations." David Goodman,
> Kristin Antelman, and Nisa Bakkalbasi,
> <http://eprints.rclis.org/archive/00005327/>

I am not sure about the informativeness of a "meta-analysis" based on two 
samples, from two different fields, whose main feature is that there seems 
to be a substantial difference in robot accuracy between the two fields! 
Until we determine why the robot's accuracy would differ by field, 
combining these two divergent results is like averaging over apples and 
oranges. It is trying to squeeze too much out of limited data.

Our own group is currently focusing on testing the robot's accuracy in 
Biology and Sociology (see end of this message), using a still larger 
sample of each, and looking at other correlates, such as the number of 
search-matches for each item. This is incomparably more important than 
simply increasing the robot's accuracy for its own sake, or for trying to 
get more accurate absolute estimates of the percentage of OA articles, 
because if the robot's false-OA bias were to be large enough *and* were 
correlated with the number of search-match items (i.e., if articles that 
have more non-OA matches on the Web are more likely to be falsely coded as 
OA) then this would compromise the robot-based OA-advantage estimates.
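
Here is a rough sketch of the kind of check I have in mind (the field names 
are illustrative placeholders; the real input would be the robot's logs plus 
the hand-verified codings): is the false-OA rate higher among articles with 
many search-engine matches than among articles with few?

    def false_oa_rate_by_matches(records, cutoff=10):
        """records: dicts with 'n_matches' (int), 'robot_oa' (bool),
        'human_oa' (bool).  Compares the false-OA rate (robot says OA,
        human says non-OA) for low- vs high-match articles."""
        def false_oa_rate(subset):
            non_oa = [r for r in subset if not r["human_oa"]]
            if not non_oa:
                return float("nan")
            return sum(r["robot_oa"] for r in non_oa) / len(non_oa)

        low = [r for r in records if r["n_matches"] < cutoff]
        high = [r for r in records if r["n_matches"] >= cutoff]
        return false_oa_rate(low), false_oa_rate(high)

    # If the second rate is clearly higher than the first, the false-OA bias is
    # correlated with match frequency -- exactly the artifact we need to rule out.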

> Between us, we analyzed a combined sample of 1198 articles in biology and
> sociology, 559 of which the robot had identified as OA, and 559 of which
> the robot had reported as non-OA.
>
> Of the 559 robot-identified OA articles, only 224 actually were OA (37%).
> Of the 559 robot-identified non-OA articles, 533 were truly non-OA (89%).
> The discriminability index, a commonly used figure of merit, was only 0.97.

It is not at all clear what these figures imply, if anything. What would
be of interest would be to calculate the OA citation advantage for each
field (separately, and then, if you wish, combined) based on the citation
counts for articles now correctly coded by humans as OA and non-OA in
this sample, and to compare that with the robot-based estimate.

More calculations of the robot's overall inaccuracy, averaged across these 
two fields, do not in and of themselves provide any useful information.

> (We wish to emphasize that our group's results find true OAA in biology at
> a substantial level, and we all consider OAA one of the many reasons that
> authors should publish OA.)

It would be useful to look at the OAA (OA citation advantage) for the 
Sociology sample too, but note that the right way to compare OA and non-OA 
citations is within the same journal/year. Here only one year is involved, 
and perhaps even the raw OA/non-OA citation ratio will tell us something, 
but not a lot, given that there can be journal-bias, with the OA articles 
coming from some journals and the non-OA ones coming from different 
journals: Journals do not all have the same average citation counts.

> In the many separate postings and papers from the SH group, such as ****
> and ***** done without our group's involvement, their authors refer only
> to the SH part of the small manual inter-rater reliability test. As it was
> a small and nonrandom sample, it yields an anomalous discriminability
> index of 2.45, unlike the values found for larger individual tests or for
> the combined sample. They then use that partial result by itself to prove
> the robot's accuracy.
>
> **** such as "Open Access to Research Increases Citation Impact"  by
> Chawki Hajjem, Yves Gingras, Tim Brody, Les Carr, and Stevan Harnad
> http://eprints.ecs.soton.ac.uk/11687
>
> *****: "Ten-Year Cross-Disciplinary Comparison of the Growth of Open
> Access and How it Increases Research Citation Impact" by C. Hajjem, S.
> Harnad, and Y. Gingras in IEEE Data Engineering Bulletin, 2005,
> http://eprints.ecs.soton.ac.uk/11688/

No one is "proving" (or interested in proving) robot accuracy! In our 
publications to date, we cite our results to date. The Goodman et al. test 
results came out too late to be mentioned in the ***** published article, 
but they will be mentioned in the **** updated preprint (along with the 
further results from our ongoing tests).

> None of the SH group's postings or publications refer to the joint 
> report from the two groups, of which they could not have been ignorant, 
> as the report was concurrently being evaluated and reviewed by SH.

Are Goodman et al. suggesting that there has been some suppression of 
information here -- information from reports that we have co-signed and 
co-posted publicly? Or are Goodman et al. concerned that they are not 
getting sufficient credit for something?

> Considering that the joint ECS technical report** and the separate SH 
> group report***** were both posted on Dec. 16, 2005, we have here perhaps 
> the first known instance of an author posting findings on the same 
> subject, on the same day, as adjacent postings on the same list, but with 
> opposite conclusions.

One of the postings being a published postprint and the other an 
unpublished preprint! Again, what exactly is Goodman et al.'s point?

> In view of these joint results, there is good reason to consider all 
> current and earlier automated results performed using the CH algorithm 
> to be of doubtful validity. The reader may judge: merely examine the 
> graphs in the original joint Technical Report.** They speak for 
> themselves.

No, the robot accuracy tests do not speak for themselves. Nor does the 
conclusion of Goodman et al's preprint (***) (which I am now rather 
beginning to regret having obligingly "co-signed"!):

     "In conclusion, the robot is not yet performing at a desirable level
     and future work may be needed to determine the causes, and improve
     the algorithm."

What *I* meant in agreeing with that conclusion was that we needed to find 
out why there were the big differences in the robot accuracy estimates 
(between our two samples and between the two fields). The robot's 
detection accuracy can and will be tightened, if and when it becomes clear 
that it needs to be, for our primary purpose (measuring and comparing the 
OA citation advantage across fields) or even our secondary purpose 
(estimating the relative percentage of OA by field and year), but not as 
an end in itself (i.e., just for the sake of increasing or "proving" robot 
accuracy).

The reason we are doing our analyses with a robot rather than by hand is 
to be able to cover far more fields, years and articles, more quickly, 
than it is possible to do by hand. The hand-samples are a good check on 
the accuracy of the robot's estimates, but they are not necessarily a 
level of accuracy we need to reach or even approach with the robot!

On the other hand, potential artifacts -- tending in opposite directions 
-- do need to be tested, and, if necessary, controlled for (including 
tightening the robot's accuracy):

     (1) to what extent is the OA citation "advantage" just a non-causal
     self-selection quality bias, with authors selectively self-archiving
     their higher-quality, hence higher citation-probability articles?

     (2) to what extent is the OA citation "advantage" just an artifact
     of false positives by the robot? (because there will be more false
     positives when there are more matches with the reference search from
     articles *other* than the article itself, hence more false positives
     with articles that are more cited on the web, which would make the
     robot-based outcome not an OA effect, and circular)

A third question (not about a potential artifact, but about a genuine
causal component of the OA advantage) is:

    (3) to what extent is the OA advantage an Early (preprint) Advantage
    (EA)?

For those who are interested in our ongoing analyses, I append some
further information below.

Stevan Harnad

Chawki: Here are the tests and controls that need to be done
to determine both the robot's accuracy in detecting and estimating
%OA and the causality of the observed citation advantage:

(1) When you re-do the searches in Biology and Sociology (to begin with: 
other disciplines can come later), make sure to (1a) store the number as 
well as the URLs of all retrieved sites that match the reference-query and 
(1b) make the robot check the whole list (up to at least the pre-specified 
N-item limit you used before) rather than stopping as soon as it thinks it 
has found that the item is "OA," as it did in your prior searches.

That way you will have, for each of your Biology and Sociology ISI 
reference articles, not only their citation counts, but also their 
query-match counts (from the search-engines) and also the number and 
ordinal position for every time the robot calls them "OA." (One item might 
have, say, k query-matches, with the 3rd, 9th and kth one judged "OA" by 
the robot, and the other k-3 judged non-OA.)
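
For concreteness, a minimal sketch of the per-article record that (1a) and 
(1b) would yield (the field names are mine, for illustration; they are not 
the robot's actual schema):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class ArticleRecord:
        reference_id: str                 # the ISI reference article
        citation_count: int               # citations to the reference article
        match_urls: List[str] = field(default_factory=list)    # every query-match URL (1a)
        oa_positions: List[int] = field(default_factory=list)  # ordinal positions the robot tags "OA" (1b)

        @property
        def n_matches(self) -> int:
            return len(self.match_urls)

        @property
        def first_oa_position(self) -> Optional[int]:
            return self.oa_positions[0] if self.oa_positions else None

        @property
        def oa_proportion(self) -> float:
            return len(self.oa_positions) / self.n_matches if self.n_matches else 0.0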

Both the number (and URLs) of query-matches and the ordinal position of 
the first "OA"-call and the total number and proportion of OA-calls will 
be important test data to make sure that our robot-based OA citation 
advantage estimate is *not* just a query-match-frequency and/or 
query-match frequency plus false alarm artifact. (The potential artifact 
is that the robot-based OA advantage is not an OA advantage at all, but 
merely a reflection of the fact that more highly cited articles are more 
likely to have online items that *cite* them, and that these online items 
are the ones the robot is *mistaking* for OA full-texts of the *cited* 
article itself.)

(2) As a further check on robot accuracy, please use a 
subset of URLs for articles that we *know* to be OA (e.g., from PubMed 
Central, Google Scholar, Arxiv, CogPrints) and try both the search-engines 
(for % query-matches) and the robot (for "%OA") on them. That will give 
another estimate of the *miss* rate of the search-engines as well as of 
the robot's algorithm for OA.
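
A rough sketch of the control in (2); the search_matches and robot_says_oa 
calls below are placeholders standing in for the actual search-engine and 
robot interfaces, which I am not specifying here:

    def miss_rates(known_oa_items, search_matches, robot_says_oa):
        """known_oa_items: reference records we *know* are OA.  Returns the
        search-engine miss rate and the robot miss rate among retrieved items."""
        not_retrieved = retrieved = robot_missed = 0
        for item in known_oa_items:
            matches = search_matches(item)        # URLs matching the reference query
            if not matches:
                not_retrieved += 1
                continue
            retrieved += 1
            if not any(robot_says_oa(url) for url in matches):
                robot_missed += 1                 # retrieved, but never tagged "OA"
        total = not_retrieved + retrieved
        search_miss = not_retrieved / total if total else float("nan")
        robot_miss = robot_missed / retrieved if retrieved else float("nan")
        return search_miss, robot_miss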

(3) While you are doing this, in addition to the parameters that are 
stored with the reference (the citation count, the URLs for every 
query-match by the search, the number, proportion, and ordinal position of 
those of the matches that the robot tags as "OA"), please also store the 
citation impact factor of the *journal* in which the reference article was 
published. (We will use this to do sub-analyses to see whether the pattern 
is the same for high and low impact journals, and across disciplines; we 
will also look separately at %OA among articles at different citation 
levels (1, 2-3, 4-7, 8-15, 16-31, 32-63, 64+), again within and across 
years and disciplines.)
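
A small sketch of the citation-level binning just mentioned (assuming the 
doubling bins 1, 2-3, 4-7, 8-15, 16-31, 32-63, 64+; zero-citation articles 
would need their own bin or to be folded into the first):

    def citation_bin(citations: int) -> str:
        """Assign an article to a citation-level bin (doubling bins assumed)."""
        for lo, hi in [(1, 1), (2, 3), (4, 7), (8, 15), (16, 31), (32, 63)]:
            if lo <= citations <= hi:
                return str(lo) if lo == hi else f"{lo}-{hi}"
        return "64+" if citations >= 64 else "0"

    # %OA per bin is then simply the OA proportion within each bin, computed
    # separately within (and then across) each year and discipline.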

(4) The sampling for Biology and Sociology should of course be based on 
*pairs* within the same journal/year/issue-number: Assuming that you will 
be sampling 500 pairs (i.e., 1000 items) in each discipline (1000 Biology, 
1000 Sociology), please first pick a *random* sample of 50 pairs for each 
year, and then, within each pair, pick, at *random*, one OA and one non-OA 
article per same issue. Use only the robot's *first* ordinal OA as your 
criterion for "OA" (so that you are duplicating the methodology the robot 
had used); the criterion for non-OA is, as before: none found among all of 
the search matches). If you feel you have the time, it would also be 
informative to check the 2nd or 3rd "OA" item if the robot found more than 
one. That too would be a good control datum, for evaluating the robot's 
accuracy under different conditions (number of matches; number/proportion 
of them judged "OA").
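
One way the pair-sampling in (4) could be implemented is sketched below (the 
grouping and field names are illustrative, and it draws at most one pair per 
sampled issue, which is one reasonable reading of the protocol):

    import random
    from collections import defaultdict

    def sample_pairs(articles, pairs_per_year=50, seed=0):
        """articles: dicts with 'year', 'issue_id', 'robot_oa' (bool).
        Returns up to pairs_per_year (OA, non-OA) pairs per year, each drawn
        at random from within the same journal issue."""
        rng = random.Random(seed)
        issues_by_year = defaultdict(lambda: defaultdict(list))
        for a in articles:
            issues_by_year[a["year"]][a["issue_id"]].append(a)

        pairs = []
        for year, issues in issues_by_year.items():
            # only issues containing both a robot-OA and a robot-non-OA article qualify
            eligible = [arts for arts in issues.values()
                        if any(a["robot_oa"] for a in arts)
                        and any(not a["robot_oa"] for a in arts)]
            rng.shuffle(eligible)
            for arts in eligible[:pairs_per_year]:
                oa = rng.choice([a for a in arts if a["robot_oa"]])
                non_oa = rng.choice([a for a in arts if not a["robot_oa"]])
                pairs.append((year, oa, non_oa))
        return pairs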

     http://eprints.ecs.soton.ac.uk/11687/
     http://eprints.ecs.soton.ac.uk/11688/
     http://eprints.ecs.soton.ac.uk/11689/

(5) Count also the number of *journals* for which the robot judges that it 
is at or near 100% OA (for those are almost certainly OA journals and not 
self-archived articles). Include them in your %OA counts, but of course 
not in your OA/NOA ratios. (It would be a good idea to check all the ISI 
journal names against the DOAJ OA journals list -- about 2000 journals -- 
to make sure you catch all the OA journals.) Keep a count also of how many 
individual journal *issues* have either 100% OA or 0% OA (and are hence 
eliminated from the OA/NOA citation ratio). Those numbers will also be 
useful for later analyses and estimates.
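
A rough sketch of the journal- and issue-level counts in (5); matching 
against DOAJ by normalised journal title is my own simplification (ISSNs 
would be more reliable, if available):

    from collections import defaultdict

    def journal_and_issue_counts(articles, doaj_titles, near_full=0.95):
        """articles: dicts with 'journal', 'issue_id', 'robot_oa' (bool).
        Returns journals at/near 100% robot-OA, journals also listed in DOAJ,
        and the number of issues that are 100% or 0% OA (and hence excluded
        from the OA/NOA citation ratio)."""
        doaj_norm = {t.strip().lower() for t in doaj_titles}
        oa_flags_by_journal = defaultdict(list)
        oa_flags_by_issue = defaultdict(list)
        for a in articles:
            oa_flags_by_journal[a["journal"]].append(a["robot_oa"])
            oa_flags_by_issue[(a["journal"], a["issue_id"])].append(a["robot_oa"])

        near_all_oa = [j for j, flags in oa_flags_by_journal.items()
                       if sum(flags) / len(flags) >= near_full]
        doaj_matches = [j for j in oa_flags_by_journal
                        if j.strip().lower() in doaj_norm]
        all_or_none_issues = sum(1 for flags in oa_flags_by_issue.values()
                                 if all(flags) or not any(flags))
        return near_all_oa, doaj_matches, all_or_none_issues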

With these data we will be in a much better position to estimate the 
robot's accuracy and some of the factors contributing to the OA citation 
advantage.