
RE: Libraries criticized for role in Google Book Search (long)



Bernie:

Here are my thoughts:

Overall, I think these comments either don't reflect the 
agreements and facts, or fail to accept that libraries operate 
with limited resources.

Respecting the comments that participating libraries "are just 
giving away access to one company that is cornering the market on 
on-line access," and have fostered the "centralizing and 
commercializing [of] knowledge under a single corporate 
umbrella," I disagree.  The participating libraries did not "give 
away" access to Google; they received what they perceived to be a 
valuable consideration, in the form of digital copies of those 
books (PDF plus OCR plus work-level and structural metadata), 
accompanied by what they perceived to be fair usage rights under 
the circumstances.  (See final paragraph respecting the 
circumstances of bargaining.)

Nor is Google "cornering the market on on-line access" to all of 
these titles.  Respecting the public domain works, in many 
instances digital copies are already available on the Internet 
from sources such as the participants in the Open Content 
Alliance; and under their individual contracts with Google 
(which, I believe, will continue to govern the digitized public 
domain titles after the settlement becomes effective), 
participating libraries may make their digital copies available 
to their own patrons and to nonpatrons, through such third 
parties as HathiTrust.

Respecting the in-copyright but out of print titles, vendors 
other than Google, such as netLibrary, ebrary, and many others, 
have digitized thousands of such titles, which presently compete 
with Google's digitized copies.  In addition, the Google Book 
settlement, a nonexclusive agreement, enables participating 
libraries to negotiate new digitization agreements with the 
copyright owners and vendors other than Google, and facilitates 
such new transactions by permitting the Book Rights Registry to 
be used in deals with vendors other than Google.

Although Google may have a temporary advantage respecting older 
in-copyright and out of print titles, the settlement lowers entry 
barriers to that market.

Finally, the notion that the Google Book endeavor either 
"centraliz[es]" or "commercializ[es]" knowledge merits some 
comment.  First, "knowledge" is not the 
subject of the original Google contracts or the settlement, 
because copyright and other property rights in information at 
issue here attach at the level of expression, not of knowledge. 
All the materials at issue here are readily accessible to 
academic users and the public in print or digital format by 
avenues unrelated to Google.  The reservoir of human knowledge is 
not diminished by one drop by virtue of these agreements. 
Second, by lowering barriers to entry to the older in-copyright 
and out of print market, the settlement arguably will foster 
competition in that market, which may increase the dissemination 
of those works.  If that dissemination leads to greater 
knowledge, then the settlement may nurture, rather than 
constrain, the growth of knowledge.

Respecting the claim that participating libraries acted "without 
concern for user confidentiality," I think the documents read 
otherwise.  Respecting the University of Michigan's (UM) and 
University of Texas's (UT) original contracts with Google, 
personally identifying information of patrons may be protected by 
the phrase "customer lists" in section 6.1, or, if not, then the 
parties may well have thought that no personally identifiable 
information of individual patrons would be disclosed to Google 
during the digitization process or downstream.  In the settlement 
agreement, the parties appear to promise to keep confidential 
personally identifiable information of patrons in the phrase 
"about any customers" in section 15.1 by means of the 
confidentiality agreements referred to in section 15.2, and the 
auditors will keep such information confidential under a 
nondisclosure agreement pursuant to section 8.2(c)(i).

Respecting the claim that participating libraries acted "without 
concern for ... preservation . . . or long-term sustainability," 
I think that's inaccurate.  Respecting preservation format, the 
Library of Congress deems PDF a preferred digital preservation 
format for "[t]ext with page-layout rendering," see 
http://www.digitalpreservation.gov/formats/content/text_preferences.shtml

The UM and UT original contracts with Google require Google to 
give the libraries OCR, page images, and metadata (work-level and 
structural); that is, PDF files with embedded text and structural 
metadata (connecting text and images).  I believe those PDF files 
are consistent with the Library of Congress digital preservation 
standard.  (Note that the LC standard appears to permit PDF 
without structural tags, but Google provided structural tags with 
the library digital copies; see, e.g., 
http://babel.hathitrust.org/cgi/pt?id=mdp.39015055053659.)

(The settlement does not appear to specify the digital formats 
that Google will give Fully Participating Libraries.)

Respecting preservation environment, the UM and UT original 
contracts with Google enabled those libraries to transfer their 
digital copies to third parties, and UM has transferred them to 
HathiTrust, for, among other purposes, preservation.  HathiTrust 
appears to be pursuing a preservation strategy that complies with 
present standards.  See http://www.hathitrust.org/objectives .

What's more, the settlement agreement permits each Fully 
Participating Library to "reproduce and make technical 
adaptations to ... its [library digital copies] as reasonably 
necessary to preserve, maintain, manage, and keep [them] 
technologically current." Section 7.2(b)(i).

Respecting the claim that participating libraries acted "without 
concern for ... image quality," I think the documents read 
otherwise. The UM and UT original contracts with Google expressly 
give the libraries the right to engage in quality control of the 
images by sampling them on a regular basis.

Respecting the claim that participating libraries acted "without 
concern for ... search prowess," that's not how the agreements 
read or the end-products appear.  To the extent that "search 
prowess" depends upon both OCR and structural metadata, the UM 
and UT original contracts with Google provided for both.  To the 
extent that "search prowess" depends upon the quality of the 
search engine applied to the copies that Google retained, I think 
little needs to be said about the quality of Google's current 
full text search service.  To the extent that "search prowess" 
depends upon the quality of the search engines applied to the 
library-retained copies, the UM and UT original contracts with 
Google permit access through those libraries' own search 
services, as well as through services of third parties.

For example, HathiTrust plans to develop advanced search tools 
for retrieval of Google library digital copies transferred to it, 
including "[r]obust discovery mechanisms like full-text 
cross-repository searching." See 
http://www.hathitrust.org/objectives.  The settlement permits 
each Fully Participating Library to "develop or obtain and . . . 
deploy finding tools that allow its users to identify pertinent 
Books within its [library digital copies] or generate information 
from" the same, section 7.2(b)(iv), including search tools to be 
used in data mining.  Section 7.2(b)(vi).

Respecting the claim that participating libraries acted "without 
concern for . . . metadata standards," again this seems 
inaccurate.  As noted above, the original UM and UT contracts 
with Google required Google to provide work-level and structural 
metadata with the library digital copies, and this metadata 
appears to conform to the Library of Congress's digital 
preservation standards.  As one can see by viewing the library 
digital copies in HathiTrust, those copies are linked to full 
MARC 21 bibliographic records (MARC 21 being an international 
metadata standard; see http://www.loc.gov/marc/annmarc21.html) 
and feature PDF structural metadata (both structural tags 
identifying document segments and metadata linking text and 
images), PDF being a national digital preservation standard (see 
http://www.digitalpreservation.gov/formats/content/text_preferences.shtml).

Respecting the assertion that the participating libraries "chose 
the expedient way rather than the best way to build and extend 
their collections," this seems too harsh a view of research 
libraries with limited cash resources.  Authorities seem to say 
that the "best" way to digitize text files, if cost is no issue, 
is to generate, for each document, both an XML version and a 
PDF/A version that contains embedded text with structural tags, 
because, among other reasons, between them they preserve both 
logical structure and original layout; see 
http://www.digitalpreservation.gov/formats/content/text_preferences.shtml.

But creating two separate files for each document is costly, and 
arguably beyond the means of many research institutions.  LC also 
appears to say that PDF/A or one of the other PDF subtypes alone, 
without XML, meets its digital preservation standards, even if 
the PDF file lacks structural tags.  Though I can't tell whether 
the Google participant library digital copies are in PDF/A or 
another PDF subtype, I can see that they are PDF and that they 
have structural tags, and so they appear to exceed LC's baseline 
digital preservation standard.  So if "best" is defined to mean 
meeting national standards given limited resources, the 
participating libraries arguably satisfied that definition 
respecting building their digital collections.

In terms of extending libraries' collections, if one has 
unlimited resources and can fund all digitization oneself, the 
best way to use digital resources to extend one's public domain 
collections may be to impose no access or distribution 
restrictions on the digital copies.  However, where research 
libraries' cash resources are limited, "best" should arguably be 
defined in terms of the most favorable bargain a library, acting 
in the interests of its parent institution and patrons, can 
strike with a capable digitization outsourcer willing to accept 
noncash consideration.  A deliverable conforming to standards but 
bearing some usage restrictions may well satisfy that definition. 
Respecting in-copyright materials, since rights holders will 
practically always insist on usage restrictions as a condition of 
digitization no matter what the library offers, there's no basis 
for faulting the Google library participants for accepting such 
restrictions on digital copies of copyrighted works.

-- Rob Richards

The preceding comments are not offered as legal advice and do not 
constitute legal advice.

--
Robert C. Richards, Jr., J.D.*, M.A., M.S.L.I.S.
Philadelphia, PA
E-mail: richards1000@comcast.net
* Member, New York Bar, Retired Status