
RE: internet archive (WAS: The Economist and e-Archiving)



If I understand this correctly, this means that any and all exclusions
would need to be put into the robots.txt file. Surely this can't be done
retroactively? In relation to this thread, the question is: how can such a
file be used to get the Wayback Machine or other archiving services to
retroactively delete or deny access to content?
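
(Presumably the mechanism would be a stanza like the one below, aimed at
the Archive's own crawler by its user-agent name, ia_archiver - this is
only a sketch, and the path is made up for illustration:

User-agent: ia_archiver
Disallow: /content

Whether the Archive then retroactively blocks the matching pages on its
next crawl is exactly what I am asking.)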

From the example you have sent, it looks as though the Economist disallows
indexing of what appear to be all the subdirectories on their site. If
that's the case, does that mean the whole of the Economist would not be
picked up by the Wayback Machine?


Chris Zielinski
STP, CSI/EGB/WHO
Avenue Appia, CH-1211
Geneva, Switzerland
Tel (Mobile): 0044797-10-45354

-----Original Message-----
From: owner-liblicense-l@lists.yale.edu
Sent: Monday, 30 June 2003 05:11
To: liblicense-l@lists.yale.edu
Subject: RE: internet archive (WAS: The Economist and e-Archiving)

You can only have one robots.txt file per internet host. Lack of a
robots.txt file is interpreted either as "it's OK to index this site" or
as "the idiots who manage this site don't know about robot exclusion",
which robots treat as equivalent assertions. For the Economist, which
seems to know what it's doing, the file looks like this:

#
# Economist.com robots.txt
#
# Created MS 29 May 2001 Full disallow
# Amended MS 27 Jul 2001 Allow directories
#
User-agent: *
Disallow: /about
Disallow: /admin
Disallow: /background
Disallow: /bookshop
Disallow: /briefings
Disallow: /campusoffer
Disallow: /cart
Disallow: /CFDOCS
Disallow: /CFIDE
Disallow: /checkout
Disallow: /classes
Disallow: /cm
Disallow: /community
Disallow: /Copy of markets
Disallow: /countries_old
Disallow: /deal
Disallow: /editorial
Disallow: /email
Disallow: /events
Disallow: /globalagenda
Disallow: /help
Disallow: /images
Disallow: /library
Disallow: /maporama
Disallow: /markets
Disallow: /mba-direct
Disallow: /me
Disallow: /members
Disallow: /mobile
Disallow: /newswires
Disallow: /partners
Disallow: /perl
Disallow: /printedition
Disallow: /search
Disallow: /shop
Disallow: /shop_old
Disallow: /specialdeal
Disallow: /specialdeal1
Disallow: /studentoffer
Disallow: /subcenter
Disallow: /subscriptions
Disallow: /surveys
Disallow: /test.txt
Disallow: /tfs
Disallow: /travel2.economist
Disallow: /voucher
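
If you want to check how a compliant robot reads these rules, Python's
standard-library robots.txt parser will do it. A minimal sketch - the
robots.txt URL is the real one, but the two test paths are made up for
illustration:

from urllib import robotparser

# Evaluate the Economist's robots.txt the way a compliant crawler would.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.economist.com/robots.txt")
rp.read()  # fetch and parse the live file

# A path under a Disallow'd directory is off-limits to every agent ("*"):
print(rp.can_fetch("*", "http://www.economist.com/printedition/index.html"))
# A path not matched by any Disallow line is still crawlable:
print(rp.can_fetch("*", "http://www.economist.com/some-article.html"))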


At 5:13 PM -0400 6/27/03, informania@supanet.com wrote:
>Eric Hellman wrote,
>
><In other words, if you put a robots.txt file on your server that excludes
>indexing of any files with path starting with "/content/", then they will
>remove from the archive any files from your server with path starting with
>"/content/".>
>
>So someone writing something that they think might get censored after
>publication should handily add a robots.txt file (= "Kick me") at the front
>of their work so that the censorship can be accomplished on archive.org? I
>don't think so!
>
>Granted, in the case of The Economist, the newspaper might decide to add
>such a file covering all of its articles (though even that seems very
>doubtful), but other than such mass-market publications, I can't see this
>happening. Consequently, in practice, retrospective deletions from the
>Wayback Machine remain difficult if not impossible.
>
>Chris Zielinski