
Data- and text-mining licensing



I have been involved in a number of discussions concerning data and text mining recently, and I wonder if anyone has experience with these topics that they would like to share. The basic question is whether a license for an electronic resource in a form suitable to be read by humans also extends to machine reading.

The area of data and text mining for scholarly materials is a new one, at least to me. My understanding is that materials (research data, user data, published articles, books, etc.) can be gathered together in such a way as to enable robots to sift through them and identify patterns and themes. These new patterns--effectively robot-generated discoveries--may include things that are not present in any single document in the collection. Thus, the collection is greater than the sum of its parts, but that greater value is only perceptible by machines. This past week I heard an excellent presentation (it is not yet online, but when the link becomes available, I will post it) by a biostatistician, who commented that human access to such databases is "of low value," in contrast to the "higher value of robot access."

Data and text mining are sometimes discussed in the context of "Web 2.0," but I think this is a mistake. Web 2.0 is a term popularized by Tim O'Reilly to describe the emerging practices on the Internet today in the areas of community-building and user-generated content. Web 2.0 is a metaphor, not a technical specification--but a very valuable metaphor. O'Reilly, for example, distinguishes between the early Web (his 1.0) and the evolving Web by contrasting Encyclopaedia Britannica and Wikipedia. In both 1.0 and 2.0, however, the users are humans. Data mining is a game for machines. It would be inaccurate to call it "Web 3.0" because machines don't require a Web interface at all. Web 2.0 is post-modern, but data mining is post-human. Today's neologism: the Post-Human Internet, or PHUNET for short, pronounced either FOO-net or (my preference) PEE-YOU-net. See Charles Stross's novel Accelerando.

Whether or not database mining of this kind will yield the new insights some believe it will, I do not know, but it would be useful to clarify the rights situation early on to fend off litigation later. It seems likely to me that publishers will begin to separate human-readable and machine-readable rights, just as they distinguish between subscriptions for libraries and individuals. There is an interesting precedent put forward by some members of the library community, who argue that it is reasonable for publishers to charge for hard copy, but that electronic materials should be free. It is conceivable that over time the "low value" human-readable rights will become Open Access, leaving the higher-value PHUNET rights for aggressive economic exploitation. It boggles the mind to think what a large collection of science articles could be worth some day.

Joe Esposito