
Re: Whether Self-Selected or Mandated, Open Access Increases Citation Impact for Higher Quality Research



      ** APOLOGIES FOR CROSS-POSTING **

What follows is -- I think readers will agree -- a 
conscientious and attentive series of responses to questions 
raised by Phil Davis about our paper testing whether the OA 
citation Advantage is just a side-effect of author self-selection 
(Gargouri et al., currently under refereeing) -- responses for 
which we did further analyses of our data (not included in the 
draft under refereeing).

Gargouri, Y., Hajjem, C., Lariviere, V., Gingras, Y., Brody, T., 
Carr, L. and Harnad, S. (2010) Self-Selected or Mandated, Open 
Access Increases Citation Impact for Higher Quality 
Research. (Submitted) http://eprints.ecs.soton.ac.uk/18346/

We are happy to have performed these further analyses, and we are 
very much in favor of this sort of open discussion and feedback 
on pre-refereeing preprints of papers that have been submitted 
and are undergoing peer review. Such feedback can only improve 
the quality of the eventual published version of articles.

However, having carefully responded to Phil's welcome questions, 
below, we will, at the end of this posting, ask Phil to respond 
in kind to a question that we raised about his own paper (Davis 
et al 2008) a year and a half ago...

RESPONSES TO DAVIS'S QUESTIONS ABOUT OUR PAPER:

On 8-Jan-10, at 10:06 AM, Philip Davis wrote:

> PD:
> Stevan,
> Granted, you may be more interested in what the referees of the 
> paper have to say than my comments; I'm interested in whether 
> this paper is good science, whether the methodology is sound 
> and whether you interpret your results properly.

We are very appreciative of your concern and hope you will agree 
that we have not been interested only in what the referees might 
have to say. (We also hope you will now in turn be equally 
responsive to a longstanding question about your own paper on 
this same topic.)

> PD:
> For instance, it is not clear whether your Odds Ratios are 
> interpreted correctly.  Based on Figure 4, OA article are MORE 
> LIKELY to receive zero citations than 1-5 citations (or 
> conversely, LESS LIKELY to receive 1-5 citations than zero 
> citations). You write: "For example, we can say for the first 
> model that for a one unit increase in OA, the odds of receiving 
> 1-5 citations (versus zero citations) increased by a factor of 
> 0.957. Figure 4.. (p.9)

You are interpreting the figure incorrectly. It is the higher 
citation count that is in each case more likely, as co-author 
Yassine Gargouri pointed out to you in a subsequent response, to 
which you replied:

> PD:
> Yassine, Thank you for your response.  I find your odds ratio 
> methodology unnecessarily complex and unintuitive but now 
> understand your explanation, thank you.

Our article supports its conclusions with several different, 
convergent analyses. The logistic regression analysis with the 
odds ratios is one of them, and its results are fully 
corroborated by the other, simpler analyses we also reported, as 
well as by the supplementary analyses we append here now.
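To illustrate the direction of interpretation with invented 
numbers (ours, purely illustrative, not the paper's data): an 
odds ratio above 1 for "1-5 citations versus zero citations" 
means it is the OA articles that are more likely to fall in the 
higher citation bracket. A minimal sketch in Python:

    # Invented counts (not our data), showing how an odds ratio
    # for "1-5 citations vs. zero citations" is read.
    oa_zero, oa_low = 300, 480    # hypothetical OA articles: 0 vs. 1-5 cites
    non_zero, non_low = 400, 400  # hypothetical non-OA articles, same brackets

    odds_oa = oa_low / oa_zero    # odds of 1-5 (vs. zero) citations, OA
    odds_non = non_low / non_zero # the same odds for non-OA
    print(odds_oa / odds_non)     # 1.6: OA more likely to be in the 1-5 bracket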

> PD:
> Similarly in Figure 4 (if I understand the axes correctly), 
> CERN articles are more than twice as likely to be in the 20+ 
> citation category than in the 1-5 citation category, a fact 
> that may distort further interpretation of your data as it may 
> be that institutional effects may explain your Mandated OA 
> effect.  See comments by Patrick Gaule and Ludo Waltman on the 
> review http://j.mp/8LK57u

Here is the analysis underlying Figure 4, redone without CERN, 
and then redone again without either CERN or Southampton. As will 
be seen, the outcome pattern and its statistical significance are 
the same whether or not we exclude these institutions.

SUPPLEMENTARY FIGURE S1: 
http://eprints.ecs.soton.ac.uk/18346/7/Supp1_CERN%2DSOTON.pdf

On 11-Jan-10, at 12:37 PM, Philip Davis wrote:

> PD:
> Changing how you report your citation ratios, from the ratio of 
> log citations to the log of citation ratios is a very 
> substantial change to your paper and I am surprised that you 
> point out this reporting error at this point.

As noted in Yassine's reply to Phil, that formula was stated 
incorrectly in our text, once; in all the actual computations, 
results, figures and tables, however, the correct formula was 
used.
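(To spell out the difference: what the text mistakenly described 
was the ratio of log citations, log(OA cites) / log(control 
cites); what was actually computed throughout was the log of the 
citation ratio, log(OA cites / control cites).)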

> PD:
> While it normalizes the distribution of the ratios, it is not 
> without problems, such as:
>
> 1. Small citation differences have very large leverage in your 
> calculations.  Example, A=2 and B=1, log (A/B)=0.3

The log of the citation ratio was used only in displaying the 
means (Figure 2), presented for visual inspection. The 
paired-sample t-tests of significance (Table 2) were based on the 
raw citation counts, not on log ratios, so small citation 
differences had no such leverage on our calculations or their 
interpretation. (The paired-sample t-tests were also based only 
on 2004-2006, because for 2002-2003 not all the institutional 
mandates were yet in effect.)

Moreover, both the paired-sample t-test results (2004-2006) and 
the pattern of means (2002-2006) converged with the results of 
the (more complicated) logistic regression analyses and 
subdivisions into citation ranges.
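For concreteness, here is a minimal sketch of the kind of 
paired-sample test involved, with invented citation counts (the 
variable names and numbers are ours, for illustration only):

    import numpy as np
    from scipy.stats import ttest_rel

    # Invented raw citation counts: each OA article paired with the
    # mean citation count of its same-journal, same-year non-OA controls.
    oa_cites = np.array([3, 0, 12, 5, 7, 1, 9, 4])
    control_means = np.array([2.1, 0.4, 8.3, 4.9, 5.0, 1.2, 6.8, 3.5])

    # Paired-sample t-test on the raw counts (no log transformation).
    t_stat, p_value = ttest_rel(oa_cites, control_means)
    print(t_stat, p_value)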

> PD:
> 2. Similarly, any ratio with zero in the denominator must be 
> thrown out of your dataset.  The paper does not inform the 
> reader on how much data was ignored in your ratio analysis 
> and we have no information on the potential bias this may 
> have on your results.

As noted, the log ratios were only used in presenting the means, 
not in the significance testing, nor in the logistic regressions.

However, we are happy to provide the additional information Phil 
requests, in order to help readers eyeball the means. Here are 
the means from Figure 2, recalculated by adding 1 to all citation 
counts. This restores all log ratios with zeroes in the numerator 
(sic); the probability of a zero in the denominator is 
vanishingly small, as it would require that all 10 same-issue 
control articles have no citations!

The pattern is again much the same. (And, as noted, the 
significance tests are based on the raw citation counts, which 
are unaffected by the log transformation's exclusion of 
zero-citation numerators.)

SUPPLEMENTARY FIGURE S2: 
http://eprints.ecs.soton.ac.uk/18346/12/Supp2_Cites%2B1.pdf
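In code, the adjustment amounts to something like the following 
(a sketch with invented counts; base-10 logs, as in Phil's 
example above):

    import numpy as np

    # Invented counts: OA citation counts and the mean citation counts
    # of their ~10 same-issue non-OA controls. Adding 1 to every count
    # keeps zero-citation OA articles in the log-ratio means.
    oa_cites = np.array([0, 3, 12, 5])
    control_means = np.array([1.5, 2.0, 8.0, 4.0])

    log_ratios = np.log10((oa_cites + 1) / (control_means + 1))
    print(log_ratios.mean())  # the kind of mean displayed in Figure S2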

This exercise prompted a further heuristic analysis that we had 
not thought of doing in the paper, even though the results had 
clearly suggested that the OA advantage is not evenly distributed 
across the full range of article quality and citeability: the 
higher quality, more citeable articles gain more of the citation 
advantage from OA.

In the following supplementary figure (S3), for exploratory and 
illustrative purposes only, we re-calculate the means in the 
paper's Figure 2 separately for OA articles in the citation range 
0-4 and for OA articles in the citation range 5+.

SUPPLEMENTARY FIGURE S3: 
http://eprints.ecs.soton.ac.uk/18346/17/Supp3_CiteRanges.pdf

The overall OA advantage is clearly concentrated on articles in 
the higher citation range. There is even what looks like an OA 
DISadvantage for articles in the lower citation range. This may 
be mostly an artifact (from restricting the OA articles to 0-4 
citations while leaving the non-OA articles unrestricted), 
although it may also be partly because, when unciteable articles 
are made OA, only one direction of outcome is possible in the 
comparison with the citation means of non-OA articles in the same 
journal and year: OA/non-OA citation ratios will always be 
unflattering for zero-citation OA articles. (This can be 
statistically controlled for if we go on to investigate the 
distribution of the OA effect across citation brackets directly.)
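To illustrate with invented numbers: an OA article with zero 
citations whose same-journal controls average 2 citations 
contributes log10((0+1)/(2+1)), about -0.48, to the mean; no 
zero-citation OA article can ever contribute a positive log 
ratio, so the lower bracket is pulled downward by construction.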

> PD:
> Have you attempted to analyze your citation data as continuous 
> variables rather than ratios or categories?

We will be doing this in our next study, which extends the time 
base to 2002-2008. Meanwhile, a preview is possible from plotting 
the mean number of OA and non-OA articles for each citation 
count. Note that zero citations is the biggest category for both 
OA and non-OA articles, and that the proportion of articles at 
each citation level decreases faster for non-OA articles than for 
OA articles; this is another way of visualizing the OA advantage. 
At citation counts of 30 or more, the difference is quite 
striking, although of course there are few articles with so many 
citations:

SUPPLEMENTARY FIGURE S4: 
http://eprints.ecs.soton.ac.uk/18346/22/Supp4_IndivCites.pdf
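A sketch of how such a plot can be generated (synthetic 
negative-binomial citation counts standing in for the actual 
data; the distributional choice and parameters are ours, for 
illustration only):

    import numpy as np
    import matplotlib.pyplot as plt

    # Synthetic citation-count distributions: zero is the modal count
    # for both groups, with the OA group given a slightly fatter tail.
    rng = np.random.default_rng(0)
    oa_cites = rng.negative_binomial(1, 0.20, 5000)
    non_oa_cites = rng.negative_binomial(1, 0.25, 5000)

    bins = np.arange(0, 41)
    plt.hist([oa_cites, non_oa_cites], bins=bins, density=True,
             histtype="step", label=["OA", "non-OA"])
    plt.xlabel("Citation count")
    plt.ylabel("Proportion of articles")
    plt.legend()
    plt.show()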

--------

REQUEST FOR RESPONSE TO QUESTION ABOUT DAVIS ET AL'S (2008) PAPER:

Davis, P.M., Lewenstein, B.V., Simon, D.H., Booth, J.G. and 
Connolly, M.J.L. (2008) Open access publishing, article 
downloads, and citations: randomised controlled trial. BMJ 337: 
a568. http://www.bmj.com/cgi/content/full/337/jul31_1/a568

Critique of Davis et al's paper:
"Davis et al's 1-year Study of Self-Selection Bias: No 
Self-Archiving Control, No OA Effect, No Conclusion" 
http://www.bmj.com/cgi/eletters/337/jul31_1/a568#199775

Davis et al had taken a 1-year sample of biological journal 
articles and randomly made a subset of them OA, to control for 
author self-selection. (This is comparable to our mandated 
control for author self-selection.) They reported that after a 
year they found no significant citation advantage for the 
randomized OA articles (although they did find an OA Advantage 
for downloads) and concluded that this showed that the OA 
citation Advantage is just an artifact of author self-selection, 
now eliminated by the randomization.

What Davis et al failed to do, however, was to demonstrate, in 
the same sample and time-span, that author self-selection 
generates the OA citation Advantage. Without doing that, all they 
have shown is that in their sample and time-span they found no 
significant OA citation Advantage. This is no great surprise, 
because their sample was small and their time-span was short, 
whereas many of the other studies that have reported finding an 
OA Advantage were based on much larger samples and much longer 
time-spans.

The question raised was about controlling for self-selected OA. 
If one tests for the OA Advantage, whether self-selected or 
randomized, there is a great deal of variability, across articles 
and disciplines, especially for the first year or so after 
publication. In order to have a statistically reliable measure of 
OA effects, the sample has to be big enough, both in number of 
articles and in the time allowed for any citation advantage to 
build up to become detectable and statistically reliable.
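As a rough back-of-the-envelope illustration (the "modest" effect 
size below is our assumption, not a figure from either paper), 
the sample needed per group to detect a small standardized 
difference reliably runs to hundreds of articles:

    from statsmodels.stats.power import TTestIndPower

    # Sample size per group needed to detect a small effect
    # (Cohen's d = 0.2) with 80% power at alpha = .05.
    n = TTestIndPower().solve_power(effect_size=0.2, alpha=0.05,
                                    power=0.8)
    print(n)  # ~393, i.e. roughly 400 articles per group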

Davis et al need to do with their randomization methodology what 
we have done with our mandating methodology, namely, to 
demonstrate the presence of a self-selected OA Advantage in the 
same journals and years. Then they can compare that with 
randomized OA in those same journals and years; if there is a 
significant OA Advantage for self-selected OA and no OA Advantage 
for randomized OA, then they will have evidence that some or all 
of the OA Advantage is just a side-effect of self-selection. 
Otherwise, all they have shown is that with their journals, 
sample size and time-span, there is no detectable OA Advantage at 
all.

What Davis et al replied in their Authors' Response was instead 
this:

http://www.bmj.com/cgi/eletters/337/jul31_1/a568#200109

> PD:
> "Professor Harnad comments that we should have implemented a 
> self-selection control in our study. Although this is an 
> excellent idea, it was not possible for us to do so because, at 
> the time of our randomization, the publisher did not permit 
> author-sponsored open access publishing in our experimental 
> journals. Nonetheless, self-archiving, the type of open access 
> Prof. Harnad often refers to, is accounted for in our 
> regression model (see Tables 2 and 3)... Table 2 Linear 
> regression output reporting independent variable effects on PDF 
> downloads for six months after publication Self-archived: 6% of 
> variance p = .361 (i.e., not statistically significant)... 
> Table 3 Negative binomial regression output reporting 
> independent variable effects on citations to articles aged 9 to 
> 12 months Self-archived: Incidence Rate 0.9 p = .716 (i.e., not 
> statistically significant)...

This is not an adequate response. If a control condition was 
needed in order to make an outcome meaningful, it is not 
sufficient to reply that "the publisher and sample allowed us to 
do the experimental condition but not the control condition."

Nor is it an adequate response to reiterate that there was no 
significant self-selected self-archiving effect in the sample (as 
the regression analysis showed). That is in fact bad news for the 
hypothesis being tested.

Nor is it an adequate response to say, as Phil did in a later 
posting, that even after another half year or more had gone by, 
there was still no significant OA Advantage. (That is just the 
sound of one hand clapping again, this time louder.)

The only way to draw meaningful conclusions from Davis et al's 
methodology is to demonstrate the self-selected self-archiving 
citation advantage, for the same journals and time-span, and then 
to show that randomization wipes it out.

Until then, our own results, which do demonstrate the 
self-selected self-archiving citation advantage for the same 
journals and time-span, show that mandating the self-archiving 
does not wipe it out.

Meanwhile, Davis et al's finding that their randomized OA 
generated a download increase, even though it did not generate a 
citation increase, suggests that with a larger sample and 
time-span there may well be scope for a citation advantage as 
well: our own prior work, and that of others, has shown that 
higher earlier download counts tend to lead to higher later 
citation counts.

Bollen, J., Van de Sompel, H., Hagberg, A. and Chute, R. (2009) A 
principal component analysis of 39 scientific impact measures. 
PLoS ONE 4(6): e6022. arXiv:0902.2183 [cs.CY]. 
http://dx.doi.org/10.1371/journal.pone.0006022

Brody, T., Harnad, S. and Carr, L. (2006) Earlier Web Usage 
Statistics as Predictors of Later Citation Impact. Journal of the 
American Society for Information Science and Technology (JASIST) 
57(8): 1060-1072. http://eprints.ecs.soton.ac.uk/10713/

Lokker, C., McKibbon, K.A., McKinlay, R.J., Wilczynski, N.L. and 
Haynes, R.B. (2008) Prediction of citation counts for clinical 
articles at two years using data available within three weeks of 
publication: retrospective cohort study. BMJ 336: 655-657. 
http://www.bmj.com/cgi/content/abstract/336/7645/655

Moed, H.F. (2005) Statistical Relationships Between Downloads and 
Citations at the Level of Individual Documents Within a Single 
Journal (abstract only). Journal of the American Society for 
Information Science and Technology 56(10): 1088-1097.

O'Leary, D.E. (2008) The relationship between citations and 
number of downloads. Decision Support Systems 45(4): 972-980. 
http://dx.doi.org/10.1016/j.dss.2008.03.008

Watson, A.B. (2009) Comparing citations and downloads for 
individual articles. Journal of Vision 9(4): 1-4. 
http://journalofvision.org/9/4/i/
