
Fwd: Re: Role of arXiv


Begin forwarded message:

On 08/10/2010 12:56, "Stevan Harnad" <harnad@ecs.soton.ac.uk> wrote:
> On Fri, 8 Oct 2010, Monica Duke wrote:
>>> SH:
>>> Harvesting is cheap. And each university's
>>> IR will be a standard part of
>>> its online infrastructure.
>> MD:
>> So far do we have enough (or any) evidence
>> that harvesting is cheap? What
>> sense of cheap did you mean?
> A harvester does not have to manage the
> deposit or host the content, as
> arXiv does. It need only harvest and
> host the metadata. There are countless
> such OA harvesters sprouting up all over
> (not to mention Google
> Scholar!) -- and that's on the sparse
> OA content that exists today (c.
> 5-25%). Harvesters will abound once the
> OA content rises toward 100%,
> thanks to OA self-archiving mandates by
> universities and funders.
> History will confirm that we are simply
> spinning our wheels as we keep
> banging on about publishing costs,
> repository costs, harvesting costs --
> while our annual research usage and
> impact burns, because we have not
> got round to mandating deposit...
> Stevan Harnad

From: Hugh Glaser hg  -- ecs.soton.ac.uk
Date: October 10, 2010 6:06:16 PM EDT
Subject: Re: Role of arXiv

Spot on Stevan.

It is the work of a day or two to write an OAI-PMH harvester 
from scratch (I know, I did it), although there are now pretty 
standard libraries for it. I know others who have done the same. 
I also wanted to translate the metadata into RDF, which added 
some effort.
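To give a sense of how small the job is, here is a minimal sketch, in Python, of the OAI-PMH ListRecords loop that such a harvester runs. The HTTP fetch is stubbed out with a canned response so the example is self-contained; the base URL, identifier, and record contents are invented for illustration. A real harvester would fetch `<base-url>?verb=ListRecords&metadataPrefix=oai_dc` and keep requesting with the resumptionToken until none is returned.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 protocol and Dublin Core.
OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

# Canned ListRecords response standing in for a live repository.
SAMPLE_RESPONSE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample record</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

def fetch(base_url, token=None):
    # Stub: a real harvester would issue an HTTP GET here
    # (e.g. with urllib.request) and pass the resumptionToken.
    return SAMPLE_RESPONSE if token is None else None

def harvest(base_url):
    """Yield (identifier, title) pairs, following resumptionTokens."""
    token = None
    while True:
        body = fetch(base_url, token)
        if body is None:
            return
        root = ET.fromstring(body)
        for rec in root.iter(OAI + "record"):
            ident = rec.find(OAI + "header/" + OAI + "identifier").text
            title = rec.find(".//" + DC + "title")
            yield ident, title.text if title is not None else None
        tok = root.find(".//" + OAI + "resumptionToken")
        # An absent or empty resumptionToken ends the harvest.
        token = tok.text if tok is not None and tok.text else None
        if token is None:
            return

records = list(harvest("http://example.org/oai"))
print(records)  # [('oai:example.org:1', 'A sample record')]
```

The whole protocol is just this request/parse/resume loop over a handful of verbs, which is why a working harvester fits in a day or two.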

It is then a case of letting it run and funding the maintenance 
and service. We have not bothered much to keep ours up to date, 
but we use the metadata all the time for our applications, and 
keeping it current is not a significant delta on top of all the 
other metadata we manage.

The biggest cost is repository software that does not conform to 
the accepted view of OAI-PMH. Hopefully this will improve as more 
people harvest.

To be concrete, we harvested over 1000 repositories, 
automatically finding their details from the ROAR site, which 
seems to have resulted in 15G of source data, translated into 
about 24M triples and 21G of RDF. Twenty times that, to use 
Stevan's lowest estimate, would still be less than 1 TByte, which 
is not really a lot of cost -- right now I could serve that, and 
probably run the whole system including harvesting, for around 
$100/year on my ISP.

So after the initial costs (a month or two to do a great job?), 
it is a day a month plus $100.
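The scaling arithmetic above is simple enough to spell out; this sketch just restates the figures from the message (21G of RDF from today's roughly 5% OA content, times 20 to approximate 100% OA):

```python
# Back-of-envelope check of the storage estimate.
rdf_gb_today = 21    # RDF produced from ~1000 harvested repositories
scale_factor = 20    # Stevan's lowest estimate: ~5% OA today -> 100%
projected_gb = rdf_gb_today * scale_factor
print(projected_gb)  # 420 -- comfortably under 1 TByte
```

Even at the pessimistic end, the projected metadata store stays within what a cheap hosting plan can serve.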

The crucial thing here is, as Stevan says, that we are only 
talking metadata. The idea of the web is to avoid copying stuff, 
with attendant storage costs and synchronisation problems, and so 
the texts should be left where they lie.

Hugh Glaser