I spent a little time dealing with that set of huge XML files and wrote a crude
Java StAX (Streaming API for XML) parser that constructs objects as it passes
through the file, dumping them into a database. It currently ignores most of
the content and just captures a few fields (by name and partial path) as it
hits them, but it is easy to extend and has the advantage of not having to load
those enormous files into memory all at once. Once the information (or a subset
of it) is in a database, more functionality can be implemented.
Fortunately, at the time, the database model was designed around having any
number of broader or narrower terms... unfortunately, it wasn't really designed
to present such a large hierarchy in a reasonable way.
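For anyone curious about the general approach: below is a minimal sketch (not
the attached parser) of streaming through XML with StAX while tracking the
current element path and capturing text for any element whose path ends with a
given partial path. The element names in the sample XML are made up for
illustration and are not the actual TGN schema.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class StaxPathCapture {
    // Capture the text of every element whose full path ends with
    // partialPath, without ever loading the whole document into memory.
    public static List<String> capture(String xml, String partialPath)
            throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        Deque<String> path = new ArrayDeque<>(); // current ancestor chain
        List<String> hits = new ArrayList<>();
        StringBuilder text = null; // non-null while inside a matching element
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    path.addLast(reader.getLocalName());
                    if (String.join("/", path).endsWith(partialPath)) {
                        text = new StringBuilder();
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (text != null) text.append(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (text != null
                            && String.join("/", path).endsWith(partialPath)) {
                        hits.add(text.toString().trim());
                        text = null;
                    }
                    path.removeLast();
                    break;
            }
        }
        reader.close();
        return hits;
    }

    public static void main(String[] args) throws Exception {
        // Element names here are invented, not the real TGN export schema.
        String xml = "<Vocabulary><Subject><Terms><Preferred_Term>"
                   + "<Term_Text>Boston</Term_Text></Preferred_Term></Terms>"
                   + "</Subject></Vocabulary>";
        System.out.println(capture(xml, "Preferred_Term/Term_Text"));
        // prints [Boston]
    }
}
```

The point of the Deque is that a partial path ("Preferred_Term/Term_Text")
matches regardless of how deeply the record is nested, which is what lets the
parser ignore most of the surrounding structure.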
I've attached the current incarnation of that code, which stuffs some fields
and a ZThes record (using a Castor representation of the ZThes schema) into a
Lucene index (rather than the original database, since this is simpler). It
could pretty easily be adapted to include only terms of a certain type
(potentially excluding hundreds of thousands of rivers and streams) and maybe
even run in a reasonable amount of time. I've commented out all the portions
that require Lucene or Castor, so it only depends on a StAX implementation, and
you could plug in whatever database or output format you desired.
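The type-based adaptation could be as simple as the sketch below: an allow-list
filter applied to records before they are handed to whatever index or database
you use. The record shape and the type strings are hypothetical, not the
actual TGN place-type vocabulary.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class PlaceTypeFilter {
    // Hypothetical minimal record; the real parser would carry many fields.
    record PlaceRecord(String name, String placeType) {}

    // Keep only records whose place type is in the allow-list, so
    // e.g. rivers and streams never reach the index.
    static List<PlaceRecord> keepTypes(List<PlaceRecord> records,
                                       Set<String> allowed) {
        return records.stream()
                .filter(r -> allowed.contains(r.placeType()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<PlaceRecord> in = List.of(
                new PlaceRecord("Boston", "inhabited place"),
                new PlaceRecord("Charles River", "river"));
        System.out.println(
                keepTypes(in, Set.of("inhabited place", "nation", "state")));
        // only the Boston record survives
    }
}
```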
As for the higher semantic and usage issues... we haven't really addressed
those yet.
-Michael Durbin
-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of
Dwiggins David
Sent: Wednesday, February 25, 2009 10:28 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Working with Getty vocabularies
Is there anyone out there with experience processing the raw data files for the
Getty vocabularies (particularly TGN)?
We're adopting AAT and TGN as the primary vocabularies for our new shared
cataloging system for our museum, library and archival collections. I'm
presently trying to come up with some scripts to automate matching of places in
existing databases to places in the TGN taxonomy. But I'm finding that the
Getty data files are very complex, and I haven't yet figured out a foolproof
method to do this. I'm curious if anyone else has traveled this road before,
and if so whether you might be able to share some tips or code snippets.
Since most of our place names are going to be in the US, my gut feeling has
been to first try to extract a list of places in the US and dump things like
state, county, etc. into discrete database fields that I can match against. But
I find myself a bit flummoxed by the polyhierarchical nature of the data (where
one place can belong to multiple higher level places).
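One way to think about matching in a polyhierarchy: store every ancestor chain
for a candidate place, and treat a record as a match if any one of its chains
contains the expected broader place (e.g. the state from your local field).
This is only a sketch of that idea; the example chains are invented, not real
TGN data.

```java
import java.util.List;

public class PolyHierarchyMatch {
    // A place matches if any of its ancestor chains contains the
    // broader place we expect from the local record (e.g. its state).
    static boolean matches(List<List<String>> ancestorChains,
                           String expectedAncestor) {
        return ancestorChains.stream()
                .anyMatch(chain -> chain.contains(expectedAncestor));
    }

    public static void main(String[] args) {
        // e.g. a city listed both under its county and under a metro area
        List<List<String>> chains = List.of(
                List.of("World", "United States", "Massachusetts",
                        "Suffolk County", "Boston"),
                List.of("World", "United States", "Massachusetts",
                        "Boston metropolitan area", "Boston"));
        System.out.println(matches(chains, "Massachusetts")); // prints true
    }
}
```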
Another issue is the wide variety of place types in use in the taxonomy.
England, for example, is a country, but the United States is a nation. This
makes sense to a degree, but it also makes it a bit hard to figure out which
term to match when you're trying to automate matching against data where the
creators were less discerning about this sort of fine distinction.
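One possible workaround for the country/nation problem is to collapse the fine
place types into coarser buckets before matching. The mapping below is a
hypothetical sketch, not an authoritative list of TGN types or equivalences.

```java
import java.util.Map;

public class PlaceTypeNormalizer {
    // Hypothetical coarse buckets so that "nation" (United States) and
    // "country" (England) compare as the same level of the hierarchy.
    static final Map<String, String> COARSE = Map.of(
            "nation", "country-level",
            "country", "country-level",
            "state", "state-level",
            "province", "state-level");

    static String normalize(String placeType) {
        return COARSE.getOrDefault(placeType.toLowerCase(), placeType);
    }

    public static void main(String[] args) {
        System.out.println(
                normalize("nation").equals(normalize("country"))); // prints true
    }
}
```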
I feel like I'm surely not the first person to tackle this, and would love to
exchange notes...
-David Dwiggins
__
David Dwiggins
Systems Librarian/Archivist, Historic New England
141 Cambridge Street, Boston, MA 02114
(617) 227-3956 x 242
ddwigg...@historicnewengland.org
http://www.historicnewengland.org
Visit http://www.LymanEstate.org for information on renting the historic Lyman
Estate for your next event - a very special place for very special occasions.
[Attachment: GettyTGNParser.java]