On metrics
- To: liblicense-l@lists.yale.edu
- Subject: On metrics
- From: Aaron Edlin <aedlin@bepress.com>
- Date: Fri, 19 Oct 2007 19:07:02 EDT
- Reply-to: liblicense-l@lists.yale.edu
- Sender: owner-liblicense-l@lists.yale.edu
Joe Esposito's recent post alerted the list to bepress's (The Berkeley Electronic Press's) download efforts. We recently spent substantial resources overhauling our methodology for counting full-text downloads of articles, and we applied the new methodology to all of our logfiles in order to give our users our most accurate estimate of legitimate downloads. See http://www.bepress.com/download_counts.html. We consider this part of our effort to live up to our motto, "The New Standard in Scholarly Publishing Since 1999." We discovered that we had a lot of room for improvement in the way our downloads were counted, and so we did our best to fix this for bepress customers. A lot of people on liblicense have been asking very good questions about what we did, why we did it, and whether it matters. Here are some answers.

WHY DOWNLOADS MATTER

For sound reasons, Liz Lorbeer questions the value of download counts. We appreciate her concerns. We were simply responding to what we heard from users. Put simply, downloads matter to bepress because they matter to our users: authors, libraries, and Digital Commons subscribers. Many of my colleagues monitor the downloads of their papers the way my wife monitors the fever of our baby. Of course, they know that downloads are a highly imperfect metric of a good paper and of successfully reaching an audience. Are people actually reading the paper? Do they like the paper or cite it? Or do people simply click on a catchy title? Who knows? For some time, "Fuck" was the most downloaded paper at bepress; for a while, it surpassed a Paul Krugman article in downloads per day since publication. Could this have more to do with the title of the paper than its content? I shouldn't judge without reading it, but I wonder. For better or for worse, downloads are being used as a sign of prestige and productivity in merit reviews, journal acceptance, and the success of repositories. Accurate measurement therefore matters.
Accuracy requires, at the least, that double clicks by people be counted only once and that downloads by automated processes, as opposed to downloads by people, not be counted at all. Double clicks are easy to identify, however; automated processes are not. Peter Shepherd, Director of COUNTER, asks whether "inflation is proportionately the same across vendors." I'll go out on a limb and say that it seems awfully unlikely. It takes expense, research, and cleverness to catch automated processes, and as things stand, publishers and repository owners may have little or no incentive to limit download counts. After all, everyone wants more downloads.

HOW DOWNLOAD INFLATION VARIES

Since we studied only bepress data, we can't tell you how count inflation varies across publishers, but we can say something about how it varies between open access and restricted access content. Download inflation is highly variable, just as we suspected. Open access papers have dramatically higher inflation than restricted access papers, but restricted access inflation remains substantial. Even within these two classes, download inflation varies a great deal. One paper, for example, had over 8500 downloads under our old filters; with our newer, more accurate filters, it had only 6. Happily, that is not typical, but significant download inflation is typical in our sample. Some of our findings surprised me a lot. I, for example, shared the skepticism recently expressed by Phil Davis, a PhD student at Cornell, and by Peter Shepherd, the Director of COUNTER, a valiant project that tries to make sure that when humans or machines double click there is only one count. All three of us figured that for subscription-based academic journals, where access is limited to IPs at subscribing institutions, double clicks by humans could be significant but downloads by automated processes would be negligible.
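To make the double-click point concrete, here is a minimal sketch, in Python, of the kind of deduplication COUNTER has in mind: repeated requests for the same article from the same address within a short window count once. The 30-second window, the record layout, and the function name are my own illustration, not bepress's or COUNTER's actual implementation.

```python
from datetime import datetime, timedelta

# Assumed window: repeat requests within 30 seconds are treated as
# one download. This figure is illustrative, not an official rule.
DOUBLE_CLICK_WINDOW = timedelta(seconds=30)

def deduplicate(events):
    """events: iterable of (timestamp, client_ip, article_id).
    Returns the events that would be counted as downloads, with
    rapid repeats of the same (ip, article) collapsed into one."""
    counted = []
    last_seen = {}  # (ip, article) -> timestamp of most recent request
    for ts, ip, article in sorted(events, key=lambda e: e[0]):
        key = (ip, article)
        prev = last_seen.get(key)
        if prev is None or ts - prev > DOUBLE_CLICK_WINDOW:
            counted.append((ts, ip, article))
        last_seen[key] = ts
    return counted
```

A double click five seconds apart thus yields one counted download, while a return visit a minute later counts separately.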
My tech group hurt my feelings by calling me naive. And I guess they were right. It is true that the problem of download inflation is much worse in open access than in restricted access. But the problem is substantial in restricted access journals too, assuming that bepress's experience is representative. In fact, we "catch" automated processes coming from subscriber IPs and downloading our restricted access journals in roughly the same numbers as the double counts that COUNTER compliance eliminates. So, if COUNTER matters for restricted access journals, what we have done matters too.

How can automated processes come from a closed community, I asked our tech team? First, they disabused me of my professorial idea that all members of the academic community are benign. They reminded me that computer viruses may be written by college kids, or perhaps by professors like me, and that denial-of-service attacks can come from them too. In addition, people outside the academy may hijack machines within the closed community and use them. Computer science researchers interested in building newfangled search engines might download thousands of papers not to read but to serve as a database for their research. Moreover, LOCKSS crawlers turn out to download a lot of restricted access content. Are other publishers excluding those counts? I hope so, but do not know. If other publishers are on this list, please do tell. Our restricted access journals are probably subject to more automated processes than other publishers' because we have a liberal guest access policy intended for humans but imperfectly restricted to them. However, we isolated that effect and still find lots of downloads that we identify as coming from automated processes arising (at least most directly) from the IP addresses of the closed communities of our subscribers; again, we reject downloads from automated processes in roughly the same numbers as COUNTER rejections. So my tech team wins again. I was naive.
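For the curious, the easy part of excluding automated processes, such as crawlers that declare themselves, looks roughly like the sketch below; the hard part, processes that do not declare themselves, requires the richer pattern analysis described above. The user-agent fragments and the per-IP daily threshold here are illustrative assumptions of mine, not bepress's actual filter.

```python
from collections import defaultdict

# Illustrative fragments only; real exclusion lists are far longer.
DECLARED_BOTS = ("bot", "crawler", "spider", "lockss")
# Assumed threshold: more downloads per IP per day than any human
# reader would plausibly make.
RATE_LIMIT = 100

def filter_automated(events):
    """events: list of (client_ip, user_agent) download records for
    one day. Drops self-declared crawlers, then drops every record
    from any IP whose daily volume exceeds RATE_LIMIT."""
    humanish = [(ip, ua) for ip, ua in events
                if not any(b in ua.lower() for b in DECLARED_BOTS)]
    per_ip = defaultdict(int)
    for ip, _ in humanish:
        per_ip[ip] += 1
    return [(ip, ua) for ip, ua in humanish if per_ip[ip] <= RATE_LIMIT]
```

Of course, a rate threshold catches only clumsy automation; a careful crawler spreading requests across many hijacked machines inside a subscriber's network is exactly the case that demands the deeper signature work.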
On which subject: bepress excludes all downloads coming from within bepress. Do other publishers? Should we? Some of our downloads reflect ordinary human interest, no different from any other reader's, and those, I think, should be counted. Other downloads are connected with our business, such as testing response time. To be conservative, we exclude them all. Do other publishers? Again, I don't know.

WHY DID WE INVEST IN REDUCING OUR DOWNLOADS?

We gathered together for a few big bepress meetings last winter and spring and discussed several things. First, we were hearing more and more about anomalies: papers with far more downloads than was plausible. Second, it was clear that new madness happens on the internet all the time. Once upon a time, we had spent a lot of time making as sure as we could that we reported only human downloads. Should we open this can of worms again? I had two hesitations. One was technical. Distinguishing the download of an automated process from that of a human interested in reading an article seems difficult. Some automated processes call themselves out and declare "I am a crawler," but if they don't, then at first glance all downloads look alike. One must look at patterns in the data signatures to distinguish them. This seems like a job for the NSA, or for Steve Levitt, author of Freakonomics and founder of forensic economics. The only problem was this: the NSA is busy with terrorists, and Steve Levitt isn't on our staff. Luckily, our biggest, baddest programmer was interested in the challenge, so this difficulty was solved. The other hesitation was with the business logic. In both our open access services and our restricted access journals, we, like everyone else on the internet, are in a competition for eyeballs. Could it possibly make sense, when everyone is competing for more and more downloads, to invest a lot of money in order to lower our downloads by 10, 20, or 50 percent, or who knows how much? At first blush, this seemed simply insane.
Could we, a small player in the scheme of things, possibly come out and say, "Your downloads are down 20%, and this is a good thing"? Many on the staff thought we could. I was skeptical. I remain skeptical from a business perspective. This time, I think they are naive. But if they are naive, it is a wonderful kind of naive. And if what I wanted out of life was to make a zillion dollars and own the world, I would not be spending this kind of time working on scholarly communication. Hopefully, the decision to do this was not naive. But regardless, I am sure that the effort was the right thing to do. We hope it starts a conversation.

Best,
Aaron

Aaron Edlin
Chairman, The Berkeley Electronic Press
Richard Jennings Professor of Economics and Law, UC Berkeley
Homepage: http://works.bepress.com/aaron_edlin/
Co-Editor, The Economists' Voice, http://www.bepress.com/ev
Editor, The B.E. Journals of Theoretical Economics, http://www.bepress.com/bejte