Wednesday, August 26, 2009

Creating the Corpus 1 -- Pulling Texts

Overview

A lot of my recent work has focused on creating a corpus. The first couple of places I looked were the Monk Project and Google Books. The Monk database is great because it's now online, and it comes with interfaces for several different analytical tools. Plus, you can download the source code for various analytic Java applets and install them on your own computer. The problem, from my perspective, is that the Monk collection is focused upon fiction. And the fiction collection that is accessible to me is limited to American fiction and Shakespeare. If I can get access to the nineteenth-century fiction collection (I'm going to have to consult my library's subscriptions), what I'll have access to is around 230 British novels. What I really would like is a range of nineteenth-century British works, including periodicals and scientific works.

Google Books gets me closer to that goal. Because Google is basically scanning whole shelves of materials, they've got a huge collection of non-fiction as well as fictional works. Searching for works published between 1800 and 1900, with keywords including the major publication centers London and Edinburgh, Google suggests above 300K hits. Of course, you can't really search for the site of publication, nor will Google return more than 1000 results for a single query. To get more, you have to add more restrictive searches and collate them -- and if the searches are too similar, Google will suspect a bot and block those searches. Finally (and most crucially), as far as I know, there's no easy way to get the OCRed texts from Google. I've been in contact with Google Books in order to find out whether I can request those out-of-copyright OCRed files (or just raw text files), but the months of lag time between query and response have been discouraging.

Incidentally, I've also looked into the NINES primary works, which is a rich collection, but one that is limited to the ProQuest fiction collections which are also in the Monk datastore.

For these reasons, I've gravitated to the Project Gutenberg collections. The advantages of the Gutenberg texts are their breadth and their accuracy; rather than being simply OCRed, the texts are individually verified by volunteers, one page at a time. While this doesn't make them error-free, they are much more accurate than OCR alone (according to Michael Hart, they're above 99.975% accuracy and shooting for 99.99%). Moreover, while the Gutenberg search interface isn't particularly sophisticated, it has a built-in portal to Google search, with all of Google's useful search options.

Googling Project Gutenberg

The direct Google interface for Gutenberg is here near the bottom of the page. What's nice about Google is that it can get around some of the problems of Gutenberg's metadata -- particularly the lack of publication date and location information (which is crucial for me). Because Google crawls the first 100K or so of each Gutenberg file, if that publication information is in the beginning of the Gutenberg file, it will be indexed by Google.

If you've registered for the Google Search API you can do all of this more easily, but for my purposes, I've found that the best searches look something like:

"English 1800..1899 -browse -1900..1999 -bookshelf"

By changing my Google search default settings, I can get up to 10 pages with 100 entries per page. The 1800..1899 term includes files with dates between 1800 and 1899 (inclusive), and the -1900..1999 term cuts out files with later dates (so that a book including an earlier date, but published later, will not be included). Of course, this risks cutting out later editions of earlier works, but that's the breaks.

Finally, I remove sites with "browse" or "bookshelf," because I want to return pages that are for unique Gutenberg book instances. (Bookshelf and browse pages are more generally index pages.) This restriction ensures that most of the returned websites will include the unique Gutenberg document ID in the URL path. For instance, the first several entries returned by the above search are:


The Cid by Pierre Corneille - Project Gutenberg
Creator, Corneille, Pierre, 1606-1684. Title, The Cid. Language, English. EText-No. 14954. Release Date, 2005-02-07. Copyright Status, Not copyrighted in ...
www.gutenberg.org/etext/14954

Keats: Poems Published in 1820 by John Keats - Project Gutenberg
Dec 2, 2007 ... Title, Keats: Poems Published in 1820. Language, English. LoC Class, PR: Language and Literatures: English literature ...
www.gutenberg.org/etext/23684

Faust by Johann Wolfgang von Goethe - Project Gutenberg
Creator, Goethe, Johann Wolfgang von, 1749-1832. Translator, Taylor, Bayard, 1825-1878. Title, Faust. Language, English ...
www.gutenberg.org/etext/14591

The Great English Short-Story Writers, Volume 1 - Project Gutenberg
Contributor, Twain, Mark, 1835-1910. Title, The Great English Short-Story Writers, Volume 1. Contents, The Apparition of Mrs. Veal, by Daniel Defoe -- The ...
www.gutenberg.org/etext/10135


These results are valuable because, by concatenating and saving the HTML of all ten result pages (either manually or with some sort of quick automation), I can get a file which includes up to 1000 unique Project Gutenberg IDs corresponding to the search (in this case, 14954, 23684, 14591, and 10135).
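As a rough illustration of pulling those IDs out of the saved pages, here is a minimal Python sketch. It assumes the book pages show up in the HTML as URLs of the form www.gutenberg.org/etext/NUMBER, as in the sample results; the function name and the sample string are made up for the example, and this is not necessarily the approach I'll describe in the later posts.

```python
import re

# Assumption: each book page appears in the saved HTML as
# gutenberg.org/etext/<digits>, as in the sample search results.
ETEXT_URL = re.compile(r"gutenberg\.org/etext/(\d+)")

def extract_gutenberg_ids(html):
    """Return the unique Gutenberg document IDs in html, in first-seen order."""
    seen = set()
    ids = []
    for match in ETEXT_URL.finditer(html):
        doc_id = int(match.group(1))
        if doc_id not in seen:  # the same book can appear on several pages
            seen.add(doc_id)
            ids.append(doc_id)
    return ids

# Hypothetical fragment standing in for the concatenated result pages.
sample = """
<a href="http://www.gutenberg.org/etext/14954">The Cid</a>
<a href="http://www.gutenberg.org/etext/23684">Keats: Poems Published in 1820</a>
<a href="http://www.gutenberg.org/etext/14954">The Cid</a>
"""
print(extract_gutenberg_ids(sample))  # prints [14954, 23684]
```

Index and bookshelf pages that slip through the search filters don't match the pattern, so they are simply ignored.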

I can repeat the search including additional parameters, e.g. "London," "Edinburgh," or "Romance," in an effort to catch results tied to more specific publication sites or genre information. I'll still have to go in and winnow those results further, but this gives a good first cut.
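If you want to generate those variant searches rather than type them, something like the following would do. It is only a sketch: the base terms come from the search string quoted earlier, while the helper names and the choice to quote multi-word keywords are my own assumptions for the example.

```python
# Base terms taken from the search string quoted earlier in this post.
BASE_TERMS = ["English", "1800..1899", "-browse", "-1900..1999", "-bookshelf"]

def build_query(extra_terms=()):
    """Join the base terms with extra keywords, quoting multi-word ones."""
    quoted = ['"%s"' % term if " " in term else term for term in extra_terms]
    return " ".join(BASE_TERMS + quoted)

def build_queries(keywords):
    """Return one query per keyword, keyed by that keyword."""
    return {keyword: build_query([keyword]) for keyword in keywords}

for keyword, query in build_queries(["London", "Edinburgh", "Romance"]).items():
    print(keyword, "->", query)
```

Each variant can return its own 1000 results, and the ID lists can be merged afterwards, since the same book will often turn up under several keywords.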

To Follow

In the next few posts, I'll describe how I go about stripping out the Gutenberg document numbers, automating the retrieval of the documents they correspond to, and then converting those documents into XML files (in TEI-Analytics format) that capture some of the metadata (including author, title, publication date, and publication place) from those files.

Tuesday, August 25, 2009

Research Project

So, to paint the scene for my next series of posts, my current research involves using semantic indexing, combined with syntactic models, to look for analogies in nineteenth-century works. I'll explain why at a later point -- for now, I'd just like to lay out the software I've been using, and where I'm taking the project.

My current research, which I presented at ACA 2009 (a computational algebra conference), uses two main suites of tools. For the semantic indexing, I used the tools made available at CU-Boulder's LSA lab. Semantic indexing proceeds by tokenizing a large database of words and getting the term-document frequency counts (counting how many times each of the words occurs in each document). Then, using a technique called partial singular value decomposition, this matrix is reduced to a smaller matrix that effectively sifts through the co-occurrence statistics to try and sort out which relationships between the terms are most informative about the structure of the data set. Once you've got this index, you can come up with a rough representation of the meaning of a term or sentence by adding together the singular value vectors for each term. And you can describe differences in meaning in terms of the cosine of the angle between those vectors. The technique has proven very effective at, for instance, naive selection of synonyms.
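To make the counting and comparison steps concrete, here is a toy Python sketch. It is not the CU-Boulder tooling: it leaves out the singular value decomposition entirely and uses the raw term-document counts as stand-ins for the reduced term vectors, so it only illustrates building the counts, summing term vectors for a phrase, and comparing by cosine. The three "documents" and all function names are invented for the example.

```python
import math
from collections import Counter

PUNCTUATION = ".,;:!?\"'()"

def tokenize(text):
    """Lowercase, split on whitespace, and trim surrounding punctuation."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [token for token in tokens if token]

def term_document_counts(documents):
    """Map each term to its vector of counts, one entry per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    return {term: [c[term] for c in counts] for term in vocabulary}

def text_vector(text, term_vectors, n_docs):
    """Represent a phrase by adding together the vectors of its terms."""
    total = [0.0] * n_docs
    for token in tokenize(text):
        for i, value in enumerate(term_vectors.get(token, [0] * n_docs)):
            total[i] += value
    return total

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented three-document "corpus".
docs = [
    "Species vary under domestication.",
    "Natural selection acts on species.",
    "The novel opens in London.",
]
vectors = term_document_counts(docs)
print(cosine(vectors["species"], vectors["selection"]))  # shared document: above 0
print(cosine(vectors["species"], vectors["london"]))     # no shared document: 0.0
```

In a real index the vectors would come out of the decomposition rather than straight from the counts, which is what lets terms that never co-occur directly still end up close together.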

The other tool I used was a part-of-speech tagger called Morphadorner, developed by the MONK project group. Morphadorner is just a small part of the software suite underpinning MONK, which includes a relational database and some built-in analysis tools derived from MEANDRE/SEASR. I like Morphadorner because it's trainable and comes with a preset for tagging nineteenth-century fiction, which is largely what I'm interested in.

In the short term, I used these tools to do an analysis of the distribution of analogies in the 1859 text of On the Origin of Species, in order to investigate whether this approach could add support to some speculations about the role that analogy plays in that work and in scientific writing generally.

But there are several weaker aspects of this work. First, the semantic indexing tools at the CU Boulder site are limited, particularly by the training corpuses used for their singular value tables. I focused upon a general knowledge training set that covers several years of modern undergraduate course readings, because this seemed to include a better mix of both general and specialist knowledge for looking at scientific works. But it's clearly problematic to use this library for analyzing nineteenth-century science, with its particular idioms, vocabulary, and habits of expression. What I need to do is create my own corpus of nineteenth-century works, preferably including a broad swathe of fictional, periodical, and scientific texts. Additionally, it would be nice if I could slice up that corpus in various ways, in order to examine the differences between, say, fictional and scientific corpuses, or earlier and later.

In addition, I need to do some additional training/verification of Morphadorner to make sure it's tagging nineteenth-century scientific works properly, as well as the fiction. Hence the current project.

First Post

At the suggestion of a new-found friend and digital humanities compatriot (Matt Wilkins), I'm starting up a blog to keep track of my d.h. work, reading, and reflections. Over the next week or so, I'll catch the blog up on the various projects I've been fiddling with over the past year, and suggest some of the avenues I'll be pursuing. My goals are twofold: to make the work (and the code associated with it) available to other researchers, and to hash out that work in a less formal environment. We'll just have to see where all of this leads.