
Re: Security Issues (was JSTOR) Pt. 2.




> The original question is this:  if we know that IP authentication has this
> particular problem, then when should we continue to use it, and when/how
> should we be either abandoning it or trying to improve it?

I think a key point is that IP access is not authentication.  It used to
be that you could use it in place of authentication because it was hard to
get around telling a computer your real IP address. I think it's still
mostly this way. The two problems I've seen come from open proxy servers
and from renegade networks.  By the former I mean proxy sites which do not
do any authentication of their own; by the latter I mean sites where the
people in charge of the network are acting in collusion with an abuser.

You can get around the first problem by having a proper authentication
system set up at the proxy end. As I wrote in a previous note, it can be
configured to be a minimal hassle for legitimate users, with just enough
authentication to deter abusers.  I don't think there is a solution for
the second problem, other than simply blocking those sites and perhaps
suing them if they are in a country which has copyright laws.

My personal opinion is that IP-based access is a good thing, and that
librarians who maintain proxies simply need to be sure they know exactly
how their software works.  They need to have expertise in maintaining it,
or have a responsive computer and networking staff which maintains it. If
that doesn't happen, and if rare abuse like this JSTOR incident becomes
common (or gets blown out of proportion), I'm worried that publishers
might react in an extreme manner with regard to access policy, and start
a downward spiral.

On the topic of people who might think "we don't care about *that* kind of
gap in security," I wonder whether people are thinking about all the costs
that go into this kind of service. I saw something in the archive from a
while ago, where someone asked a question along the lines of "why couldn't
a patron walk in and photocopy every article out of an issue?" I wasn't
sure how many responses there were, but the only ones I saw were those
dealing with copyright.  Please keep in mind that the publisher might have
more than content rights in mind when it comes to abuse. For example, many
places have to pay fees based on the amount of network traffic their sites
generate.

For example, using some numbers I just pulled off Google, a publisher
might pay a service provider to host a site for $250.00 a month, granting
up to 300 megabytes of traffic per month, with a charge of 50 cents per
megabyte applied to anything beyond that.  To put that in perspective when
it comes to spidering abuse, if for some reason HighWire allowed someone
to spider all the content hosted by us, we would be talking about a
terabyte of data, at a minimum. I believe we are already the largest user
of Stanford's networking resources (either that, or number two). And we're
tiny compared to some places.

Ok, I'm going to get all long-winded at this point. Those of you who don't
want to see me getting *really* preachy might want to hit the delete
button now.

Clearly both sides want the same end result: legitimate users with
unrestricted access to the material they are paying for.  The publisher
needs readers, and the readers obviously need the content.  As someone who
works for an aggregator, I often feel stuck in the middle.  I hear
publishers who worry about their content being stolen, and I also hear
enormous frustration from librarians who are sometimes confronted with
access policies which restrict IP access to only a handful of
well-monitored computers.

At my place of work, we see some rather bad abuse coming from sites. We
see essentially what JSTOR wrote about: a computer within a network which
has access to content will spider one or more of our hosted sites, in
their entirety, often trying to download multiple resources at the same
time.  I ended up having to implement software which frustrates some
users, but saves us from the extremes of abuse.  What our sites have now
is a simple counter which watches how many requests a single IP makes
within any given minute.  If that threshold gets exceeded, we cut off
access for a period of time.
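For the curious, here is a rough sketch of the idea in Python. This is
not our actual code; the threshold and block time are invented, and a
real version has to worry about cleanup and persistence:

    import time

    REQUESTS_PER_MINUTE = 60       # made-up threshold
    BLOCK_SECONDS = 600            # made-up penalty window

    counts = {}    # ip -> (minute bucket, request count)
    blocked = {}   # ip -> time the block expires

    def allow_request(ip):
        """Return True if the request should be served, False if throttled."""
        now = time.time()
        if blocked.get(ip, 0) > now:
            return False
        bucket = int(now // 60)
        last_bucket, count = counts.get(ip, (bucket, 0))
        if last_bucket != bucket:
            count = 0
        count += 1
        counts[ip] = (bucket, count)
        if count > REQUESTS_PER_MINUTE:
            blocked[ip] = now + BLOCK_SECONDS
            return False
        return True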

We're always trying to fine-tune the program to block abuse and allow
legitimate use. It's not easy, because the nature of HTTP does not lend
itself to identifying the unique client making a request.  Many times it
is the very purpose of a proxy server to strip any such identifying
characteristics out of a request before it reaches us.  So, while this
program saves us from massive abuse which can bring an entire machine
down, it is extremely annoying to some sites with proxy servers.  It does
not help that some companies have forever tainted the use of cookies in
the eyes of many internet users. That means we often cannot use a simple,
non-personally-identifying method to track whether multiple requests from
a single IP are emanating from a single user or from multiple users behind
a proxy.
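To illustrate what a cookie would buy us if clients accepted one (a
sketch with made-up names, not something we actually deploy):

    import uuid

    def new_session_cookie():
        # Issued once per browser: an opaque token, no personal information.
        return {"session": uuid.uuid4().hex}

    def client_key(ip, cookies):
        # With a cookie we can count requests per user behind a proxy;
        # without one, everyone sharing the proxy's IP is lumped together.
        return (ip, cookies.get("session"))

With a key like that, the counter above could throttle one runaway user
without punishing the thousands of others behind the same proxy.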

For example, there are some enormous networks out there which route all
HTTP access through a handful of proxy servers.  The administrators of
those sites express their frustration when our software kicks in, blocking
1/3 of their user base (i.e., we block one of three proxy servers, each of
which handles thousands of users).  We tell them of some options which
could reduce the problem, and they essentially tell us they are unwilling
to change anything. They have privacy concerns which supersede any issues
we are trying to resolve with regard to abuse.  It's a deadlock. We're not
willing to let performance for our other users come to a crawl, and they
are not willing to budge on privacy concerns.

Another source of frustration comes from a lack of response.  Sometimes we
will notice a pattern of abuse from a set of IP addresses. We will contact
that site's administrators, and receive no reply. Nothing even indicating
the message was received.  At that point, we have to block the site by
default.  In other words, we wait for someone to contact us.

The librarians naturally see the problem from the opposite end. If their
site isn't a source of problems, all they might end up seeing is a
publisher who decides "all of a sudden" that a site license now means
paying X thousand dollars to license three computers at a library, or
limiting access to ten users at any one time, and so forth.

So the problem is complicated, and I think there is a very large social
component to the issue.  I've only worked at two places in my life, both
of them college environments.  In both places the library had a lot of
clout when it came to computer and networking issues.  They were able to
talk to networking and security staff about problems.  If that's the case
in most places (and I have no idea whether or not it is), the solution
might simply be better communication.  If a publisher looks at access logs
and sees that an institution's usage has gone up by a factor of three
hundred, perhaps the site administrator and the publisher could work
together to determine whether this was abuse of access?
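Even something crude would be enough to start that conversation. A
sketch, where the factor and the shape of the data are whatever the
publisher and library agree on:

    def flag_spikes(monthly_counts, factor=300):
        """monthly_counts maps an institution to its request counts by month;
        flag any whose latest month is `factor` times its previous average."""
        flagged = []
        for inst, counts in monthly_counts.items():
            if len(counts) < 2:
                continue
            baseline = sum(counts[:-1]) / len(counts[:-1])
            if baseline > 0 and counts[-1] >= factor * baseline:
                flagged.append((inst, counts[-1], baseline))
        return flagged

The report itself isn't the point; the point is that someone picks up the
phone when an institution shows up on it.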

I guess my long-winded point is that it's better if both sides work
together.  A solution implemented with only the publisher's input is bound
to leave the library unhappy. An unhappy library might not renew
subscriptions, which is bound to leave the publisher unhappy. And hey,
that might end up putting the aggregator out of business, and we wouldn't
want that now, right? Right? Please?

Jim
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
James A. Robinson                       jim.robinson@stanford.edu
Stanford University HighWire Press      http://highwire.stanford.edu/
650-723-7294 (W) 650-725-9335 (F)