First pass at extracting useful data from my dissertation

You'll find context in yesterday's post on the dissertation.

It turns out it wasn't as hard as I anticipated to start extracting useful information from my born-digital-for-printing-on-dead-trees dissertation. Here's a not-yet-perfect XML serialization (borrowing tags from the TEI) of the "instance" information found in the dissertation narrative:

https://github.com/paregorios/demarc/blob/master/xml/instances.xml

Each instance is a historical event (or in some cases a series of events) relating to boundary demarcation or dispute within the empire. Here's a comparison between the original formatting for paper and the XML.

For paper:

XML:
<?xml version="1.0" encoding="UTF-8"?>
<div type="instance" xml:id="INST9">
<idno type="original">INST9</idno>
<head>A Negotiated Boundary between the <placeName
type="ancient">Zamucci</placeName> and the <placeName
type="ancient">Muduciuvi</placeName></head>
<p rend="indent">Burton 2000, no. 78</p>
<p>Date(s): <date>AD 86</date></p>
<p type="treDisputeStatement">This boundary marker was placed in
accordance with the agreement of both parties (<foreign xml:lang="la">ex
conven/tione utrarumque nationum</foreign>), and therefore may be taken as
evidence of a <hi rend="bold">boundary dispute</hi>.</p>
<p rend="indent">This single boundary marker from coastal <placeName
type="modern">Libya</placeName> provides the only evidence for the resolution
of a boundary dispute between these two indigenous peoples. The date of the
demarcation, as calculated from the imperial titulature, places the event in
the same year as the reported ‘destruction’ of the <placeName
type="ancient">Nasamones</placeName> by <placeName type="ancient">Legio III
Augusta</placeName> as a consequence of a tax revolt in which tax collectors
were killed.<note n="286"> Zonaras 11.19. </note> It is not clear whether
the boundary action was related to the conflict, or merely took advantage of
the temporary presence of the legionary legate in what ought to have been
part of the proconsular province. Surviving documentation for proconsuls
during the 80s AD is incomplete, and therefore we cannot say who was
governing <placeName type="ancient">Africa Proconsularis </placeName>at the
time of this demarcation.<note n="287"> Thomasson 1996, 45-48. </note>
Neither party seems to have been related to the <placeName
type="ancient">Nasamones</placeName>; rather, they are thought to be sub-
tribes of the <placeName type="ancient">Macae.</placeName><note
n="288">Mattingly 1994, 27-28, 32, 74, 76.. </note></p>
</div>



One thing that made this a lot easier than it might have been was the way I used styles in Microsoft Word back when I created the original version of the document. Rather than just painting formatting onto my text for headings, paragraphs, strings of characters, and so forth, I created a custom "style" for each type of thing I wanted to paint (e.g., an "instance heading" or a "personal name"). I associated the desired visual formatting with each of these, but the names themselves (since they captured the semantic distinctions that I was interested in) provided hooks today for writing this stuff out as sort-of TEI XML.
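
To illustrate the idea of style names as hooks, here is a minimal Python sketch. It is not the actual conversion code: the style names ("instance heading", "ancient place name", and so on), the lookup tables, and the `paragraph_to_tei` function are all hypothetical, and it assumes the Word text has already been read into (text, character style) pairs by some other means.

```python
import xml.etree.ElementTree as ET

# Hypothetical Word style names mapped to TEI-like tags and attributes.
# The real style names in the dissertation file may well differ.
PARAGRAPH_STYLES = {
    "instance heading": ("head", {}),
    "instance citation": ("p", {"rend": "indent"}),
}
CHARACTER_STYLES = {
    "ancient place name": ("placeName", {"type": "ancient"}),
    "modern place name": ("placeName", {"type": "modern"}),
}


def paragraph_to_tei(paragraph_style, runs):
    """Build one element from a paragraph style name and its text runs.

    `runs` is a list of (text, character_style_or_None) pairs. Runs with a
    known character style become child elements; everything else stays as
    plain text in the right position.
    """
    tag, attrs = PARAGRAPH_STYLES.get(paragraph_style, ("p", {}))
    para = ET.Element(tag, attrs)
    last_child = None
    for text, char_style in runs:
        if char_style in CHARACTER_STYLES:
            child_tag, child_attrs = CHARACTER_STYLES[char_style]
            last_child = ET.SubElement(para, child_tag, child_attrs)
            last_child.text = text
        elif last_child is None:
            # Plain text before the first styled run.
            para.text = (para.text or "") + text
        else:
            # Plain text following a styled run.
            last_child.tail = (last_child.tail or "") + text
    return para


runs = [
    ("A Negotiated Boundary between the ", None),
    ("Zamucci", "ancient place name"),
    (" and the ", None),
    ("Muduciuvi", "ancient place name"),
]
head = paragraph_to_tei("instance heading", runs)
print(ET.tostring(head, encoding="unicode"))
```

The point of the sketch is that nothing in it needs to guess meaning from bold or italic formatting: the style name alone decides which tag gets written.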

There's more to do, obviously, but this was a satisfying first step.