Wednesday, August 26, 2009

Creating the Corpus 1 -- Pulling Texts

Overview

A lot of my recent work has focused on creating a corpus. The first couple of places I looked were the Monk Project and Google Books. The Monk database is great because it's now online, and it comes with interfaces for several different analytical tools. Plus, you can download the source code for the various analytic Java applets and install them on your own computer. The problem, from my perspective, is that the Monk collection is focused on fiction, and the fiction collection accessible to me is limited to American works and Shakespeare. If I can get access to the nineteenth-century fiction collection (I'm going to have to consult with my library about its subscriptions), that would give me around 230 British novels. What I'd really like is a range of nineteenth-century British works, including periodicals and scientific works.

Google Books gets me closer to that goal. Because Google is basically scanning whole shelves of materials, they've got a huge collection of non-fiction as well as fictional works. Searching for works published between 1800 and 1900, with keywords including the major publication centers London and Edinburgh, Google reports over 300,000 hits. Of course, you can't really search by site of publication, nor will Google return more than 1000 results for a single query. To get more, you have to add more restrictive searches and collate the results -- and if the searches are too similar, Google will suspect a bot and block those searches. Finally (and most crucially), as far as I know, there's no easy way to get the OCRed texts from Google. I've been in contact with Google Books to find out whether I can request those out-of-copyright OCRed files (or just raw text files), but the months of lag time between query and response have been discouraging.

Incidentally, I've also looked into the NINES primary works, which is a rich collection, but one that is limited to the ProQuest fiction collections, which are also in the Monk datastore.

For these reasons, I've gravitated to the Project Gutenberg collections. The advantages of the Gutenberg texts are their breadth and their accuracy; rather than being simply OCRed, the texts are individually verified by volunteers, one page at a time. While this doesn't make them error-free, they are much more accurate than OCR alone (according to Michael Hart, they're above 99.975% accuracy and shooting for 99.99%). Moreover, while the Gutenberg search interface isn't particularly sophisticated, they have a built-in portal to Google search, with all of Google's useful search options.

Googling Project Gutenberg

The direct Google interface for Gutenberg is near the bottom of their search page. What's nice about Google is that it can get around some of the problems of Gutenberg's metadata -- particularly the lack of publication date and location information (which is crucial for me). Because Google crawls the first 100K or so of each Gutenberg file, if that publication information is in the beginning of the Gutenberg file, it will be indexed by Google.

If you've registered for the Google Search API you can do all of this more easily, but for my purposes, I've found the best searches look something like:

"English 1800..1899 -browse -1900..1999 -bookshelf"

By changing my Google search default settings, I can get up to 10 pages with 100 entries per page. The 1800..1899 range matches files containing dates between 1800 and 1899 (inclusive), and the -1900..1999 term cuts out files with later dates (so that a book including an earlier date, but published later, will not be included). Of course, this risks cutting out later editions of earlier works, but that's the breaks.
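Rather than changing the default settings by hand, the ten result pages can be requested directly through Google's classic URL parameters (q for the query, num for results per page, start for the offset). This is just a sketch of building those URLs -- the site: restriction is my addition to keep results on gutenberg.org, and Google may still throttle too-rapid automated fetches, as noted above.

```python
import urllib.parse

# The date-range query from above, restricted (as an assumption) to gutenberg.org.
query = 'English 1800..1899 -browse -1900..1999 -bookshelf site:www.gutenberg.org'

# Ten pages of 100 results each covers the 1000-result ceiling.
urls = [
    "https://www.google.com/search?"
    + urllib.parse.urlencode({"q": query, "num": 100, "start": page * 100})
    for page in range(10)
]
```

Saving the HTML returned by each of these URLs into a single file reproduces the manual concatenation step described below.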

Finally, I remove pages containing "browse" or "bookshelf," because I want to return pages for unique Gutenberg book instances. (Bookshelf and browse pages are generally index pages.) This restriction ensures that most of the returned pages will include the unique Gutenberg document ID in the URL path. For instance, the first several entries returned by the above search are:

The Cid by Pierre Corneille - Project Gutenberg
Creator, Corneille, Pierre, 1606-1684. Title, The Cid. Language, English. EText-No. 14954. Release Date, 2005-02-07. Copyright Status, Not copyrighted in ...
www.gutenberg.org/etext/14954

Keats: Poems Published in 1820 by John Keats - Project Gutenberg
Dec 2, 2007 ... Title, Keats: Poems Published in 1820. Language, English. LoC Class, PR: Language and Literatures: English literature ...
www.gutenberg.org/etext/23684

Faust by Johann Wolfgang von Goethe - Project Gutenberg
Creator, Goethe, Johann Wolfgang von, 1749-1832. Translator, Taylor, Bayard, 1825-1878. Title, Faust. Language, English ...
www.gutenberg.org/etext/14591

The Great English Short-Story Writers, Volume 1 - Project Gutenberg
Contributor, Twain, Mark, 1835-1910. Title, The Great English Short-Story Writers, Volume 1. Contents, The Apparition of Mrs. Veal, by Daniel Defoe -- The ...
www.gutenberg.org/etext/10135

These results are valuable, because by concatenating and saving the HTML of all ten result pages (either manually or with some sort of quick automation), I can get a file which includes up to 1000 unique Project Gutenberg IDs corresponding to the search (in this case, 14954, 23684, 14591, and 10135).
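Pulling the IDs out of that saved file amounts to matching the /etext/ URLs. A minimal sketch of that extraction, assuming the results were saved with URLs in the form shown above:

```python
import re

def extract_etext_ids(html):
    """Pull unique Gutenberg e-text IDs from saved result-page HTML.

    Matches URLs of the form gutenberg.org/etext/<number> and returns
    the IDs in order of first appearance, skipping duplicates.
    """
    seen, ids = set(), []
    for match in re.finditer(r"gutenberg\.org/etext/(\d+)", html):
        etext_id = match.group(1)
        if etext_id not in seen:
            seen.add(etext_id)
            ids.append(etext_id)
    return ids

# A stand-in for the concatenated result pages described above.
sample = """
www.gutenberg.org/etext/14954 ...
www.gutenberg.org/etext/23684 ...
www.gutenberg.org/etext/14591 ...
www.gutenberg.org/etext/10135 ...
www.gutenberg.org/etext/14954 (duplicate entry)
"""
print(extract_etext_ids(sample))  # ['14954', '23684', '14591', '10135']
```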

I can repeat the search with additional parameters, e.g. "London," "Edinburgh," or "Romance," in an effort to catch results tied to more specific publication sites or genre information. I'll still have to go in and winnow those results further, but this gives a good first cut.
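Once each search has been reduced to an ID list, collating them is just set arithmetic. The IDs below are placeholders for illustration, not real search output: a union gathers everything matched by any query, while an intersection keeps only works matched by both the base query and a narrower one (e.g. the base search plus "London").

```python
# Hypothetical ID sets from two runs of the search.
base_ids = {"14954", "23684", "14591", "10135"}     # base date-range query
london_ids = {"23684", "10135", "30254"}            # same query + "London"

combined = base_ids | london_ids      # everything matched by either query
london_subset = base_ids & london_ids # base-query works that also mention London
```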

To Follow

In the next few posts, I'll describe how I go about stripping out the Gutenberg document numbers, automating the retrieval of the documents they correspond to, and then converting those documents into XML files (in TEI-Analytics format) that capture some of the metadata (including author, title, publication date, and publication place) from those files.
