Re: [CODE4LIB] Working with Getty vocabularies

2009-03-02 Thread Dwiggins David
Michael - Thanks for the code snippet -- I will take a stab at putting it into 
practice when I have a few minutes. 
 
Ed -- The TGN can be searched for free on a one-off basis by going to 
 
http://www.getty.edu/research/conducting_research/vocabularies/tgn/ 
 
Access to the raw data files (such as to load them into a cataloging system, 
provide search capabilities online, or do computerized matching) requires 
purchasing a license. But at least in the case of our organization the terms 
were quite reasonable, and the initial license allows us to receive updates for 
five years and then renew at a reduced rate.
 
-David Dwiggins
 
 
__
 
David Dwiggins
Systems Librarian/Archivist, Historic New England
141 Cambridge Street, Boston, MA 02114
(617) 227-3956 x 242 
ddwigg...@historicnewengland.org 
http://www.historicnewengland.org/


>>> Ed Summers <e...@pobox.com> 2/27/2009 11:06 AM >>>
The TGN is still behind a paywall, right? Not that that means it isn't a
legitimate conversation for this list (because it is) -- just curious
what the current state is.

//Ed



Re: [CODE4LIB] Working with Getty vocabularies

2009-02-27 Thread Ed Summers
The TGN is still behind a paywall, right? Not that that means it isn't a
legitimate conversation for this list (because it is) -- just curious
what the current state is.

//Ed


Re: [CODE4LIB] Working with Getty vocabularies

2009-02-25 Thread Durbin, Michael R
I spent a little time dealing with that set of huge XML files and wrote a crude 
Java StAX (Streaming API for XML) parser that constructs objects as it passes 
through the file, dumping them into a database.  It currently ignores most of 
the content and just captures a few fields (by name and partial path) as it 
hits them, but it is easy to extend and has the advantage of not having to load 
those enormous files into memory all at once.  Once the information (or a 
subset of it) is in a database, more functionality can be built on top.
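
To give a sense of the shape of that approach, here is a minimal sketch (not 
the attached code -- element names like "Subject", "Subject_ID" and "Term_Text" 
are assumptions that should be checked against the actual TGN XML release):

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class TgnStreamSketch {
    public static void main(String[] args) throws Exception {
        // path to the TGN XML file is passed as the first argument
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new FileInputStream(args[0]));

        String subjectId = null;
        String termText = null;
        String current = null;

        while (reader.hasNext()) {
            switch (reader.next()) {
            case XMLStreamConstants.START_ELEMENT:
                current = reader.getLocalName();
                if ("Subject".equals(current)) {
                    subjectId = reader.getAttributeValue(null, "Subject_ID");
                }
                break;
            case XMLStreamConstants.CHARACTERS:
                // simplification: assumes the parser delivers text in one event
                if ("Term_Text".equals(current)) {
                    termText = reader.getText().trim();
                }
                break;
            case XMLStreamConstants.END_ELEMENT:
                if ("Subject".equals(reader.getLocalName())) {
                    // one complete record -- hand off to a database or index here
                    System.out.println(subjectId + " -> " + termText);
                    subjectId = null;
                    termText = null;
                }
                current = null;
                break;
            }
        }
        reader.close();
    }
}

Because only the current record is held in memory, the file size doesn't matter.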

Fortunately, at the time the database model was designed around having any 
number of broader or narrower terms... unfortunately, it wasn't really designed 
to present such a large hierarchy in a reasonable way.

I've attached the current incarnation of that code, which stuffs some fields and 
a ZThes record (using a Castor representation of the ZThes schema) into a 
Lucene index (as opposed to the original database, since this is simpler).  It 
could pretty easily be adapted to include only terms of a certain type 
(potentially excluding hundreds of thousands of rivers and streams) and maybe 
even run in a reasonable amount of time.  I've commented out all the portions 
that require Lucene or Castor, so it only depends on a StAX implementation, and 
you could plug in whatever database or output format you desire.
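
Roughly, the filter-then-index step could look like the following (a sketch 
against a recent Lucene API rather than the 2009-era one, with illustrative 
field names and made-up IDs -- the attached code differs in the details):

import java.nio.file.Paths;
import java.util.Set;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class TgnIndexSketch {
    // place types to leave out of the index (illustrative list)
    static final Set<String> SKIP_TYPES = Set.of("river", "stream");

    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("tgn-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            // in practice these calls come from the StAX loop, record by record
            addPlace(writer, "1000001", "Boston", "inhabited place");
            addPlace(writer, "1000002", "Some Creek", "stream"); // filtered out
        }
    }

    static void addPlace(IndexWriter writer, String id, String term,
                         String placeType) throws Exception {
        if (SKIP_TYPES.contains(placeType.toLowerCase())) {
            return; // skip unwanted types before they ever hit the index
        }
        Document doc = new Document();
        doc.add(new StringField("subjectId", id, Field.Store.YES));
        doc.add(new TextField("term", term, Field.Store.YES));
        doc.add(new StringField("placeType", placeType, Field.Store.YES));
        writer.addDocument(doc);
    }
}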

As for the higher-level semantic and usage issues... we haven't really 
addressed those yet.

-Michael Durbin

-Original Message-
From: Code for Libraries [mailto:code4...@listserv.nd.edu] On Behalf Of 
Dwiggins David
Sent: Wednesday, February 25, 2009 10:28 AM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: [CODE4LIB] Working with Getty vocabularies

Is there anyone out there with experience processing the raw data files for the 
Getty vocabularies (particularly TGN)?
 
We're adopting AAT and TGN as the primary vocabularies for our new shared 
cataloging system for our museum, library and archival collections. I'm 
presently trying to come up with some scripts to automate matching of places in 
existing databases to places in the TGN taxonomy. But I'm finding that the 
Getty data files are very complex, and I haven't yet figured out a foolproof 
method to do this. I'm curious if anyone else has traveled this road before, 
and if so whether you might be able to share some tips or code snippets.
 
Since most of our place names are going to be in the US, my gut feeling has 
been to first try to extract a list of places in the US and dump things like 
state, county, etc. into discrete database fields that I can match against. But 
I find myself a bit flummoxed by the polyhierarchical nature of the data (where 
one place can belong to multiple higher-level places).
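
To make the problem concrete, here's a rough sketch (not working code against 
the real Getty schema) of the shape I'm wrestling with -- each place can carry 
several broader IDs, so matching has to consider every ancestor chain:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// a place with any number of broader (parent) places
class Place {
    final String id;          // TGN subject ID
    final String term;        // preferred name
    final String placeType;   // e.g. "county", "nation"
    final List<String> broaderIds = new ArrayList<>();

    Place(String id, String term, String placeType) {
        this.id = id;
        this.term = term;
        this.placeType = placeType;
    }
}

class Hierarchy {
    final Map<String, Place> byId = new HashMap<>();

    // expand a place into every ancestor chain it participates in, so each
    // chain can be matched against flat "city / county / state" source data
    List<List<Place>> ancestorPaths(Place p) {
        List<List<Place>> paths = new ArrayList<>();
        if (p.broaderIds.isEmpty()) {
            List<Place> single = new ArrayList<>();
            single.add(p);
            paths.add(single);
            return paths;
        }
        for (String parentId : p.broaderIds) {
            Place parent = byId.get(parentId);
            if (parent == null) continue; // parent not loaded (yet)
            for (List<Place> tail : ancestorPaths(parent)) {
                List<Place> path = new ArrayList<>();
                path.add(p);
                path.addAll(tail);
                paths.add(path);
            }
        }
        return paths;
    }
}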
 
Another issue is the wide variety of place types in use in the taxonomy. 
England, for example, is a "country", but the United States is a "nation". This 
makes sense to a degree, but it makes it a bit hard to figure out which term to 
match when you're trying to automate matching against data whose creators were 
less discerning about this sort of fine distinction.
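
One idea would be to collapse the fine-grained types into coarse buckets before 
matching, along these lines (the bucket names are invented for illustration, 
and the list of TGN types would need checking against the real data):

import java.util.Map;

class PlaceTypeBuckets {
    // map TGN's fine-grained place types onto coarse matching levels, so that
    // "nation" (United States) and "country" (England) compare the same way
    static final Map<String, String> BUCKETS = Map.of(
        "nation", "country-level",
        "country", "country-level",
        "state", "state-level",
        "province", "state-level",
        "county", "county-level",
        "inhabited place", "city-level");

    static String bucket(String tgnType) {
        return BUCKETS.getOrDefault(tgnType.toLowerCase(), "other");
    }
}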
 
I feel like I'm surely not the first person to tackle this, and would love to 
exchange notes...
 
-David Dwiggins
 
__
 
David Dwiggins
Systems Librarian/Archivist, Historic New England
141 Cambridge Street, Boston, MA 02114
(617) 227-3956 x 242 
ddwigg...@historicnewengland.org 
http://www.historicnewengland.org/



[Attachment: GettyTGNParser.java]