On metrics

To: liblicense-l@lists.yale.edu
Subject: On metrics
From: Aaron Edlin <aedlin@bepress.com>
Date: Fri, 19 Oct 2007 19:07:02 EDT
Reply-to: liblicense-l@lists.yale.edu
Sender: owner-liblicense-l@lists.yale.edu
Joe Esposito's recent post alerted the list to bepress's (The 
Berkeley Electronic Press's) download efforts.  We recently spent 
substantial resources overhauling our methodology for counting 
full-text downloads of articles, and applied the new methodology 
to all our logfiles in order to provide our users with our most 
accurate estimate of legitimate downloads. See 
http://www.bepress.com/download_counts.html.

We consider this effort to be part of our effort to act true to 
our motto as "The New Standard in Scholarly Publishing Since 
1999."  We discovered that we had a lot of room for improvement 
in the way our downloads were counted, and so we did our best to 
fix this for bepress customers.

A lot of people on liblicense have been asking very good 
questions about what we did, why we did it, and whether it 
matters.  Here are some answers.

WHY DOWNLOADS MATTER.

For sound reasons, Liz Lorbeer questions the value of download 
counts.  We appreciate her concerns.  We were simply responding 
to what we heard from users.

Put simply, downloads matter to bepress because they matter to our 
users: authors, libraries, and Digital Commons subscribers. Many of 
my colleagues monitor the downloads of their papers like my wife 
monitors the fever of our baby.  Of course, they know that 
downloads are a highly imperfect metric of a good paper and of 
successfully reaching an audience.  Are the people actually 
reading the paper?  Do they like the paper or cite it?  Or do 
people simply click on a catchy title? Who knows?  For some time, 
"Fuck" was the most downloaded paper at bepress; for a while, it 
surpassed a Paul Krugman article in downloads per day since 
publication.  Could this have more to do with the title of the 
paper than its content?  I shouldn't judge without reading it, 
but I wonder.

For better or for worse, downloads are being used as a sign of 
prestige and productivity for merit reviews, journal acceptance 
and the success of repositories. Therefore accurate measurement 
is important.

Accuracy requires, at the least, that double clicks by people 
should be counted only once and that downloads by automated 
processes, as opposed to downloads by people, should not be 
counted at all.  However, although double clicks are easy to 
identify, automated processes are not.
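
To make the easy half concrete, here is a minimal sketch in Python 
of collapsing double clicks.  It is an illustration only, not the 
filter bepress runs: the event layout (time, client, article), the 
function name count_downloads, and the 30-second window are all 
assumptions chosen for the example.

```python
from datetime import datetime, timedelta

def count_downloads(events, window=timedelta(seconds=30)):
    """Count downloads per article, collapsing repeat requests.

    `events` is an iterable of (when, client, article) tuples.
    A request is ignored when the same client asked for the same
    article no more than `window` before it (a "double click").
    The 30-second default is an assumed value, not a standard.
    """
    last_seen = {}  # (client, article) -> time of most recent request
    counts = {}     # article -> number of counted downloads
    for when, client, article in sorted(events):
        key = (client, article)
        previous = last_seen.get(key)
        if previous is None or when - previous > window:
            counts[article] = counts.get(article, 0) + 1
        # Always remember the latest request, so a rapid burst of
        # clicks keeps collapsing into the first counted download.
        last_seen[key] = when
    return counts
```

Nothing in this sketch helps with the hard half: a crawler that 
spaces out its requests passes straight through it.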

Peter Shepherd, Director of COUNTER, asks whether "inflation is 
proportionately the same across vendors."  I'll go out on a limb 
and say that it seems awfully unlikely.  It takes expense, 
research, and cleverness to catch automated processes. 
Publishers and repository owners may have little or no incentive 
to limit download counts, as things stand.  After all, everyone 
wants more downloads.

HOW DOWNLOAD INFLATION VARIES

Since we only studied bepress data, we can't tell you how count 
inflation varies across publishers, but we can say something 
about how it varies between open access content and restricted 
access content.

Download inflation is highly variable, just as we suspected. 
Open access papers have dramatically higher inflation than 
restricted access, but restricted access inflation remains 
substantial. Even within these two classes download inflation 
varies a great deal.  One paper for example had over 8500 
downloads even with our old filters.  With our newer, more 
accurate filters, it had only 6.  Happily, that is not typical, 
but significant download inflation is typical in our sample.

Some of our findings surprised me a lot.  I, for example, shared 
the skepticism that was recently expressed by Phil Davis, a PhD 
student at Cornell, and by Peter Shepherd, the Director of 
COUNTER, a valiant project that tries to make sure that when 
humans or machines double click there is only one count.  All 
three of us figured that for subscription-based academic 
journals, where permitted access is limited to those from IPs at 
subscribing institutions, double clicks by humans could be 
significant, but downloads by automatic processes would be 
negligible.  My tech group hurt my feelings by calling me naive. 
And, I guess they were right.

It is true that the problem of download inflation is much worse 
in open access than in restricted.  But the problem is 
substantial in restricted access journals too, assuming that 
bepress's experience is representative. In fact, we "catch" 
automated processes coming from subscriber IPs downloading our 
restricted access journals in roughly equal number to double 
counts that COUNTER compliance eliminates.  So, if COUNTER 
matters for restricted access journals, what we have done matters 
too.

How can automated processes come from a closed community?  I asked 
our tech team.  First, they disabused me of my professorial idea 
that all members of the academic community are benign.  They 
reminded me that computer viruses may be written by college kids or 
perhaps by professors like me and that denial of service attacks 
can come from them too.  In addition, people outside the academy 
may hijack machines within the closed community and use them. 
Computer science researchers interested in building newfangled 
search engines might download thousands of papers not to read but 
to serve as a database for their research.  Moreover, LOCKSS 
crawlers turn out to download a lot of restricted access content. 
Are the other publishers excluding those counts?  I hope so, but 
do not know.  If other publishers are on this list, please do 
tell.  Our restricted access journals are probably subject to 
more automated processes than other publishers' journals because 
we have a liberal guest access policy intended for humans, but 
imperfectly restricted to them. However, we isolated that effect 
and still find lots of downloads that we identify as coming from 
automated processes arising (at least most directly) from the IP 
addresses of the closed communities of our subscribers.  Again, we 
reject downloads from automated processes in roughly equal number 
to COUNTER rejections. So my tech team wins again.  I was naive.

On which subject, bepress excludes all downloads coming from 
within bepress. Do other publishers?  Should we?  Some of our 
downloads reflect human interest no different from any other.  These 
should be counted I think. Other downloads are connected with our 
business, testing response time and the like.  To be 
conservative, we exclude them all. Do other publishers? Again, I 
don't know.
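
For anyone wondering what "exclude them all" looks like in 
practice, a rough sketch in Python follows.  The network range is 
a placeholder from the documentation address space, not a real 
bepress network, and the names INTERNAL_NETWORKS, is_internal and 
external_only are made up for the illustration.

```python
import ipaddress

# Placeholder range (reserved for documentation); a publisher would
# list its own office and server networks here.
INTERNAL_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]

def is_internal(ip):
    """Return True if the address falls inside any internal network."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in INTERNAL_NETWORKS)

def external_only(requests):
    """Keep only the requests that did not originate in-house.

    `requests` is a list of dicts, each with an "ip" key (assumed).
    """
    return [r for r in requests if not is_internal(r["ip"])]
```

This is the conservative choice described above: it throws away 
the in-house downloads that reflect genuine interest along with 
the monitoring traffic.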

WHY DID WE INVEST IN REDUCING OUR DOWNLOADS?

We gathered together a few big bepress meetings last winter and 
spring. We discussed several things.  First, we were hearing more 
and more about anomalies:  papers with far more downloads than 
was plausible.  Second, it was clear that new madness happens on 
the internet all the time.  Once upon a time, we spent a lot of 
time making as sure as we could that we only reported human 
downloads.  Should we open this can of worms again? I had two 
hesitations.

One hesitation was technical.  Distinguishing the download of an 
automated process from a human interested in reading an article 
seems difficult. Some automated processes call themselves out and 
declare "I am a crawler," but if they don't, then in the 
moment, all downloads look alike.  One must look at data 
signature patterns to distinguish.  This seems like a job for the 
NSA or for Steve Levitt, author of Freakonomics, and founder of 
forensic economics. Only problem was this:  the NSA is busy with 
terrorists and Steve Levitt isn't on our staff.  Luckily our 
biggest baddest programmer was interested in the challenge, so 
this difficulty was solved.
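
To give a flavor of the two cases, here is a deliberately crude 
sketch in Python.  It is not our programmer's method.  The token 
list, the event layout (time, client, user agent), the names 
declares_itself and suspected_automated, and the threshold of 100 
requests per hour are all assumptions made up for the example; 
real signature analysis is far subtler than a rate limit.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Substrings that self-declared crawlers commonly put in their
# user agent string (an illustrative, incomplete list).
DECLARED_TOKENS = ("bot", "crawler", "spider")

def declares_itself(user_agent):
    """True if the user agent announces itself as a crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in DECLARED_TOKENS)

def suspected_automated(events, max_per_hour=100):
    """Return the set of clients that look automated.

    `events` is an iterable of (when, client, user_agent) tuples.
    A client is flagged if it declares itself, or if it makes more
    than `max_per_hour` requests inside any one-hour span.
    """
    flagged = set()
    by_client = defaultdict(list)
    for when, client, user_agent in events:
        if declares_itself(user_agent):
            flagged.add(client)
        by_client[client].append(when)

    hour = timedelta(hours=1)
    for client, times in by_client.items():
        times.sort()
        start = 0
        for end, t in enumerate(times):
            # Slide the window start forward until it spans <= 1 hour.
            while t - times[start] > hour:
                start += 1
            if end - start + 1 > max_per_hour:
                flagged.add(client)
                break
    return flagged
```

A crawler that lies about its user agent and paces itself slips 
past both checks, which is exactly why the job is hard.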

The other hesitation was with the business logic.  In both our 
open access services and our restricted access journals, we, like 
everyone else on the internet, are in a competition for eyeballs. 
Could it possibly make sense, when everyone is competing for more 
and more downloads, to compete by investing a lot of money to be 
able to lower our downloads by 10, 20, 50% or who knows how much? 
At first blush, this seemed simply insane.  Could we possibly, a 
small player in the scheme of things, come out and say: "Your 
downloads are down 20% and this is a good thing?"  Many on the 
staff thought we could.  I was skeptical.  I remain skeptical 
from a business perspective. This time, I think they are naive. 
But if they are naive, it is a wonderful kind of naive.  And, if 
what I wanted out of life was to make a zillion dollars and own 
the world, I would not be spending this kind of time working on 
scholarly communication. Hopefully, the decision to do this was 
not naive.  But, regardless, I am sure that the effort was the 
right thing to do. We hope it starts a conversation.

Best, Aaron

Aaron Edlin
Chairman, The Berkeley Electronic Press
Richard Jennings Professor of Economics and Law, UC Berkeley
Homepage: http://works.bepress.com/aaron_edlin/

Co-Editor, The Economists' Voice, http://www.bepress.com/ev
Editor, The B.E. Journals of Theoretical Economics,
http://www.bepress.com/bejte