We all know that appropriate standards are required if electronic textual scholarship is to become precisely what it claims to be – scholarly. Enter the TEI, the various debates on its use, and the rest is history – we now have a standard for electronic textual encoding. What next? Well, encoding, what else? Textual encoding is a tedious process, particularly if you are working with a large corpus. Thankfully, Michiel Overtoom set about writing a Python script to automate the conversion of Project Gutenberg plain text files to a format more suited to his own purposes (this included removal of the Gutenberg boilerplate). Stanford’s Matt Jockers (@mljockers) took this a step further in terms of textual scholarship, adapting Overtoom’s script so that it converts the Gutenberg text to a TEI-compliant XML file.
For those familiar with Python, or scripting in general, you can locate the file at Matt’s site (linked above), and begin your encoding. For those not familiar with Python or scripting, here’s something of a guide to getting the process underway:
1. Download the script. Again, Matt has it hosted at the following web address: http://www.stanford.edu/~mjockers/cgi-bin/drupal/node/49
2. To download the script, find the link towards the end of the post that reads “it’s better to download it here”, right-click, and select “Save Link As”.
3. The default format for the save file will be gutenbergToTei.py.txt – you need to remove the .txt, so that your system knows that the file is to be handled as a Python script.
4. You should now be left with gutenbergToTei.py on your desktop.
5. You now need a script editor of some shape or form. I’m currently working in Mac OS X, so I’m using Xcode, but anything will do (Notepad, TextEdit etc, or alternatively download an editor from here).
6. You need to tell the script two things: firstly, where the source Gutenberg texts are located, and secondly, where it should place any TEI XML files that it outputs. This is done at the end of the script, where the code reads:
sourcepattern = re.compile(".*.txt$")
sourceDir = "/Path/to/your/ProjectGutenberg/files/"
outputDir = "/Path/to/your/ProjectGutenberg/TEI/Output/files/"
Change the path of both sourceDir and outputDir to reflect your desired input and output folders. For example, I created two folders on my desktop: gutenbergTexts and teiTexts. The folder gutenbergTexts contained all the original plain text files as downloaded from Project Gutenberg – remember when you are downloading these that they should be in .txt format – and the other folder, teiTexts, was empty, as this was where my new TEI-compliant files were to be placed. So my code looked like this:
sourcepattern = re.compile(".*.txt$")
sourceDir = "/Users/james/Desktop/gutenbergTexts/"
outputDir = "/Users/james/Desktop/teiTexts/"
7. Having completed this step, when you run your script, your output folder should become populated with TEI-compliant XML versions of the original Gutenberg .txt files.
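If the output folder stays empty, it can help to check the two paths and see which files the pattern actually picks up before digging any further. What follows is only a rough sketch of my own, not part of Matt’s script: the function name matching_files is made up for illustration, the paths are the example ones from step 6, and I am assuming the script tests each file name with the pattern’s match method, which the excerpt above does not show. One thing worth knowing is that the second dot in ".*.txt$" is unescaped, so it matches any character rather than only a full stop.

```python
import os
import re

# Same pattern as in the script. The second dot is unescaped, so it matches
# any single character, not only a literal full stop.
sourcepattern = re.compile(".*.txt$")


def matching_files(source_dir, pattern=sourcepattern):
    """Return the sorted file names in source_dir that the pattern matches."""
    return sorted(
        name
        for name in os.listdir(source_dir)
        if os.path.isfile(os.path.join(source_dir, name)) and pattern.match(name)
    )


if __name__ == "__main__":
    # Example paths from step 6; change them to your own folders.
    sourceDir = "/Users/james/Desktop/gutenbergTexts/"
    outputDir = "/Users/james/Desktop/teiTexts/"

    # Report any folder that is missing or mistyped.
    for label, path in (("sourceDir", sourceDir), ("outputDir", outputDir)):
        if not os.path.isdir(path):
            print(label + " does not exist: " + path)

    # List the files that would be treated as source texts.
    if os.path.isdir(sourceDir):
        for name in matching_files(sourceDir):
            print(name)
```

If a folder is reported as missing, the most likely culprit is a typo in the path or a missing trailing slash.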
PROBLEMS THAT MAY BE ENCOUNTERED:
There are a number of problems that may be encountered during this process, the first of which is privileges. If the script does not execute as intended, then you may have to grant yourself execution rights. In Windows, either run the file as administrator, or alternatively, change the file’s privileges in its properties. If you are on a UNIX-based OS, such as Mac OS, you can do the same, or you can do it quickly from the terminal if you’re familiar with its use. To run the terminal, just hit cmd+spacebar and type “terminal” (simple enough).
From within the terminal, locate the file using cd (change directory), and the command chmod u+x gutenbergToTei.py will make the file executable. You can then run it using ./gutenbergToTei.py if you’d like to continue working from the terminal.
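If you would rather stay in Python than use the terminal, the same permission change can be made with the standard os and stat modules. This is a small sketch of my own rather than anything in the script itself: the function name make_user_executable is hypothetical, and it assumes gutenbergToTei.py sits in the folder you run it from.

```python
import os
import stat


def make_user_executable(path):
    """Add the owner's execute bit, leaving the other permission bits alone."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR)


if __name__ == "__main__":
    script = "gutenbergToTei.py"  # assumed to be in the current folder
    if os.path.exists(script):
        make_user_executable(script)
```

This does the same job as chmod u+x: it switches on the execute permission for the file’s owner and touches nothing else.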
You may also encounter an error in any UNIX-based OS due to the script’s lack of a shebang (apparently there isn’t an issue in Windows, but I’ve not yet tried it on my Win7 machine). To rectify this issue, simply re-open the script and add #!/usr/bin/python as the first line, giving you the following:
#!/usr/bin/python
# Reformats and renames etexts downloaded from Project Gutenberg.
# Software adapted from Michiel Overtoom, email@example.com, july 2009.
# Modified by Matthew Jockers August 17, 2010 to encode result into TEI based XML
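If you have several copies of the script, or simply don’t fancy editing it by hand, the shebang can be added with a few lines of Python. Treat this as a sketch under my own assumptions: add_shebang is a name I have invented, /usr/bin/python is the interpreter path used above (yours may differ), and the file is read and written as raw bytes so that its contents are otherwise left untouched.

```python
import os

SHEBANG = b"#!/usr/bin/python\n"  # interpreter path assumed; adjust if needed


def add_shebang(path, shebang=SHEBANG):
    """Prepend the shebang unless the file already starts with '#!'.

    Returns True if the file was changed, False otherwise.
    """
    with open(path, "rb") as handle:
        contents = handle.read()
    if contents.startswith(b"#!"):
        return False
    with open(path, "wb") as handle:
        handle.write(shebang + contents)
    return True


if __name__ == "__main__":
    script = "gutenbergToTei.py"  # assumed to be in the current folder
    if os.path.exists(script):
        add_shebang(script)
```

Running it a second time does nothing, since the file will already begin with #!.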
Now, it should run, meaning your encoding process has just become a whole lot easier, freeing up time for more interesting scholarly activities.