
Re: FW: Crawling publishers' sites



Hi Scott,


I don't know about the legality of crawling publishers' sites, but doing
so may not work out very well, because they likely prohibit crawlers from
indexing their sites (using the robots.txt method, for example), or have
their "goods" behind some kind of password. While some crawlers such as
ht://Dig can be configured to use passwords to access restricted
directories, I'm not sure how these crawlers handle large numbers of
passwords. Also, the different kinds of passwords used (basic web server
passwords, passwords entered via a form in a web page, etc.) might
complicate things.
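
Just to illustrate the first obstacle: a publisher that wants to keep all
well-behaved crawlers out only needs a two-line robots.txt at the root of
its site, e.g.

    User-agent: *
    Disallow: /

and a compliant crawler will skip the whole site. On the password side, if
I remember correctly htdig can send basic-authentication credentials
(something like "htdig -u username:password"), but that only covers plain
web-server passwords, not logins entered through a form, and it still
means juggling one set of credentials per publisher.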

On the other hand, ht://Dig can index Adobe PDF files as well as standard
HTML files, so it might be a good tool for doing what you describe. Their
site is http://www.htdig.org/.
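
As a rough sketch of what that setup looks like, ht://Dig's behaviour is
driven by its configuration file; something along these lines in
htdig.conf hands PDF documents off to an external parser (the script path
here is just a placeholder, and the exact attribute names are in the
htdig.org documentation):

    # htdig.conf (sketch only -- check the htdig.org docs for exact attributes)
    start_url:        http://publisher.example.com/
    external_parsers: application/pdf /usr/local/bin/parse_pdf.pl

With that in place the same index covers both the HTML pages and the PDFs,
which sounds like what you want for an Intranet-only database.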

Mark

Mark Jordan
Librarian / Analyst, Systems Division
W.A.C. Bennett Library, Simon Fraser University
Burnaby, BC, V5A 1S6, Canada
Email mjordan@sfu.ca / Phone (604) 291 5753 / Fax (604) 291 3023 

________________________________________

On Thu, 15 Apr 1999, Mellon, Scott wrote:

> I would be interested in hearing about experiences or thoughts subscribers
> may have on the legality or ethics involved in sending a web crawler to
> visit and index the sites of publishers for whom we have site licences.
>
> The resulting database would be made available on our Intranet only;
> i.e. only for the use of those for whom we have licenced access to the
> publishers.
> Scott Mellon
> CISTI Advanced Services
> Ottawa, Canada K1A 0S2
> Tel: (613)993-0994, Fax / Docufax: (613) 952-8246
> mailto:scott.mellon@nrc.ca
> http://www.nrc.ca/cisti