Re: [CODE4LIB] Query LCSH terms at id.loc.gov by modification date

2010-05-14 Thread Ethan Gruber
Thanks for the help.  It should be doable.  Do you know if it's possible to
control the number of entries per page, or is that locked?

Ethan

On Thu, May 13, 2010 at 6:11 PM, Ed Summers e...@pobox.com wrote:

 As Kevin said, I think you can use the Atom feed to page backwards
 through time. Basically this amounts to programmatically following the
 <link rel="next"> links in the feed, applying creates, updates, and
 deletes as you go until you make it to Feb. 15, 2010.

 Currently this would involve walking from:

  http://id.loc.gov/authorities/feed/

 to:

  http://id.loc.gov/authorities/feed/page/2/

 all the way to:

  http://id.loc.gov/authorities/feed/page/96/

 Then in a month's time or whatever you can run the same process again.
 I think you can either walk through the feed pages until a known last
 harvest date, or until you see a record with an atom:id and
 atom:updated you already know about. I think the latter could be a bit
 simpler, assuming you are keeping track of what you have.

 Ever since reading the OAI-ORE specs on Atom [1] I've become a bit
 taken with the idea of using Atom syndication as a drop-in replacement
 for OAI-PMH--which is the spec that most people in the library
 community reach for when they want to do metadata synchronization. The
 advantage of Atom is that it fits so nicely into the syndication world
 and its ecosystem of tools and services.

 //Ed

 [1] http://www.openarchives.org/ore/1.0/atom


 On Thu, May 13, 2010 at 4:53 PM, Kevin Ford k...@loc.gov wrote:
  The short answer to your question is no, there's no way to query terms
 based on last modification date.  However (this feature still needs to be
 documented on the website), there is an Atom feed that exposes the change
 activity for the subject headings:
 
  http://id.loc.gov/authorities/feed/
 
  You can page through it (feed/page/1, feed/page/2).
 
  There is also a page that shows when each load was performed:
 
  http://id.loc.gov/authorities/loads/
 
  It too has an Atom feed (http://id.loc.gov/authorities/loads/feed).
 
  HTH,
  Kevin
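
A minimal sketch of the feed-walking approach Ed describes, using only the
Python standard library (this is an illustration, not code from the thread;
the cutoff date is an example, and Atom tombstone deletions are not handled):

    # Walk the id.loc.gov Atom feed backwards in time by following the
    # rel="next" links until entries older than the cutoff are reached.
    import urllib.request
    import xml.etree.ElementTree as ET
    from datetime import datetime, timezone

    ATOM = "{http://www.w3.org/2005/Atom}"
    CUTOFF = datetime(2010, 2, 15, tzinfo=timezone.utc)  # last harvest date

    url = "http://id.loc.gov/authorities/feed/"
    done = False
    while url and not done:
        with urllib.request.urlopen(url) as resp:
            feed = ET.parse(resp).getroot()
        for entry in feed.findall(ATOM + "entry"):
            updated = datetime.fromisoformat(
                entry.findtext(ATOM + "updated").replace("Z", "+00:00"))
            if updated < CUTOFF:
                done = True          # reached records already harvested
                break
            # apply the create/update locally here
            print(entry.findtext(ATOM + "id"), updated.isoformat())
        # follow the feed-level rel="next" link to the next (older) page
        nexts = [l.get("href") for l in feed.findall(ATOM + "link")
                 if l.get("rel") == "next"]
        url = nexts[0] if nexts else None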



[CODE4LIB] OpenETD, web-based ETD Management Utility now available

2010-05-14 Thread Kalaivani Ananthan

Dear Code4Lib Community,

The Rutgers University Libraries are pleased to announce the 
availability of OpenETD, a web-based software application for managing 
the submission, approval, and distribution of electronic theses and 
dissertations (ETDs). OpenETD is the open source release of the Rutgers 
University Libraries’ RUetd application and will be maintained on the 
RUetd annual release schedule. Releases will include fixes for known 
problems and recommendations for enhancements received from internal 
projects and from the user community at large.


OpenETD can be used as either a standalone ETD submission system, or it 
can be implemented as a component of an institutional repository by 
using its METS/XML export functionality. Using the METS/XML export 
functionality, native to OpenETD, implementers can export acquired ETDs 
to their local institutional repositories for preservation and 
presentation purposes. Highlights of the software include UTF-8 
compliance, configuration and administration support for multiple 
graduate schools, status notifications for student users, support for 
supplementary files, graduation reports, automated page and margin 
validation, and UMI/Proquest delivery support.


To learn more about product features, to view the license, and to 
download the application, please visit:
http://rucore.libraries.rutgers.edu/open/projects/openetd/index.php?sec=press. 



Thanks,
Kalaivani Ananthan

--
Kalaivani Ananthan
Digital Library Applications Manager
Rutgers University Libraries, Systems Department
Technical and Automated Services Building
47 Davidson Road, 
Piscataway, New Jersey 08854 


anant...@rci.rutgers.edu
Phone:(732) 445-5896


Re: [CODE4LIB] Query LCSH terms at id.loc.gov by modification date

2010-05-14 Thread Kevin Ford
Hard-coded.  There's currently no way to pass any kind of count parameter.

Cordially,
Kevin


 Ethan Gruber ewg4x...@gmail.com 05/14/10 9:58 AM 
Thanks for the help.  It should be doable.  Do you know if it's possible to
control the number of entries per page, or is that locked?

Ethan

On Thu, May 13, 2010 at 6:11 PM, Ed Summers e...@pobox.com wrote:

 As Kevin said, I think you can use the Atom feed to page backwards
 through time. Basically this amounts to programmatically following the
 <link rel="next"> links in the feed, applying creates, updates, and
 deletes as you go until you make it to Feb. 15, 2010.

 Currently this would involve walking from:

  http://id.loc.gov/authorities/feed/

 to:

  http://id.loc.gov/authorities/feed/page/2/

 all the way to:

  http://id.loc.gov/authorities/feed/page/96/

 Then in a month's time or whatever you can run the same process again.
 I think you can either walk through the feed pages until a known last
 harvest date, or until you see a record with an atom:id and
 atom:updated you already know about. I think the latter could be a bit
 simpler, assuming you are keeping track of what you have.

 Ever since reading the OAI-ORE specs on Atom [1] I've become a bit
 taken with the idea of using Atom syndication as a drop-in replacement
 for OAI-PMH--which is the spec that most people in the library
 community reach for when they want to do metadata synchronization. The
 advantage of Atom is that it fits so nicely into the syndication world
 and its ecosystem of tools and services.

 //Ed

 [1] http://www.openarchives.org/ore/1.0/atom


 On Thu, May 13, 2010 at 4:53 PM, Kevin Ford k...@loc.gov wrote:
  The short answer to your question is no, there's no way to query terms
 based on last modification date.  However (this feature still needs to be
 documented on the website), there is an Atom feed that exposes the change
 activity for the subject headings:
 
  http://id.loc.gov/authorities/feed/
 
  You can page through it (feed/page/1, feed/page/2).
 
  There is also a page that shows when each load was performed:
 
  http://id.loc.gov/authorities/loads/
 
  It too has an Atom feed (http://id.loc.gov/authorities/loads/feed).
 
  HTH,
  Kevin



Re: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?

2010-05-14 Thread Király Péter

Hi Fernando,

I have started experimenting with MARC in MongoDB. I have imported ~6 million
MARC records (auth and bib) into MongoDB. The steps I took:

1) the source was MARCXML I created with the XC OAI Toolkit.
2) I created an XSLT file which generates MARC-JSON from MARCXML.
I followed the MARC-JSON draft
(http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11), not Bill
Dueber's MARC-HASH. The conversion is not 100% perfect, but of the
6 million records only 20 were converted with errors, which is a low
enough error rate for a home-made project.
3) imported the files
4) indexed the files

Lessons learned:
- the import process is much quicker than any other part of the workflow:
the 6 million records were imported in about 30 minutes, while indexing
took 3 hours.
- count() is a very slow method for complex queries, even after intensive
indexing, but iterating over the results is much quicker.
- there is no way to index only part of a string (e.g. splitting the Leader
or the 006/007/008 fields)
- full text search is not too quick
- before indexing the data size was 9 GB; after full indexing it was 28 GB
(I should note that on a 32-bit operating system the maximum size of a
MongoDB database is about 2 GB).

Conclusions:
- the MARC-JSON format is good for data exchange, but it is not precise
enough for searching, since (a MARC heritage) distinct pieces of information
are combined into single fields (Leader, 008, etc.). We should split them
into smaller chunks of information before indexing (see the sketch after
this list).
- I should learn more about the possibilities of MongoDB
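
A minimal sketch of that splitting idea, assuming pymongo and made-up
database, collection, and key names (the slice positions follow the MARC 21
bibliographic 008 definition; this is an illustration, not Péter's code):

    # Split a bibliographic 008 fixed field into named chunks and store them
    # as top-level keys so MongoDB can index them directly.
    from pymongo import MongoClient

    def split_008(f008):
        """Break an 008 string into separately indexable pieces."""
        return {
            "date_entered": f008[0:6],    # 00-05 date entered on file
            "date_type":    f008[6],      # 06    type of date
            "date1":        f008[7:11],   # 07-10 date 1
            "date2":        f008[11:15],  # 11-14 date 2
            "place":        f008[15:18],  # 15-17 place of publication
            "language":     f008[35:38],  # 35-37 language code
        }

    client = MongoClient()
    coll = client["marc"]["bib"]

    # A toy record standing in for one converted MARC-JSON document; the
    # 008 value is padded out to its fixed 40-character length.
    record = {"f008": "860506s1986    nju" + " " * 17 + "eng d"}
    record.update(split_008(record["f008"]))

    coll.insert_one(record)
    coll.create_index("language")  # e.g. find all records in a given language
    coll.create_index("date1")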

I can give you more technical details if you're interested.

Péter
eXtensible Catalog


- Original Message - 
From: Fernando Gómez fjgo...@gmail.com

To: CODE4LIB@LISTSERV.ND.EDU
Sent: Thursday, May 13, 2010 2:59 PM
Subject: [CODE4LIB] Indexing MARC(-JSON) with MongoDB?



There's been some talk in code4lib about using MongoDB to store MARC
records in some kind of JSON format. I'd like to know if you have
experimented with indexing those documents in MongoDB. From my limited
exposure to MongoDB, it seems difficult, unless MongoDB supports some
kind of custom indexing functionality.

According to the MongoDB docs [1], you can create an index by calling
the ensureIndex() function, and providing a document that specifies
one or more keys to index. Examples of this are:

    db.things.ensureIndex({city: 1})
    db.things.ensureIndex({"address.city": 1})

That is, you specify the keys by giving a path from the root of the
document to the data element you are interested in. Such a path acts
both as the index's name and as a specification of how to get the
keys' values.

In the case of the two proposed MARC-JSON formats [2, 3], I can't see such
a path. For example, say you want an index on field 001. Simplifying,
the JSON docs would look like this

   { "fields" : [ ["001", "001 value"], ... ] }

or this

   { "controlfield" : [ { "tag" : "001", "data" : "fst01312614" }, ... ] }

How would you specify field 001 to MongoDB?

It would be nice to have some kind of custom indexing, where one could
provide an index name and separately a JavaScript function specifying
how to obtain the keys' values for that index.

Any suggestions? Do other document oriented databases offer a better
solution for this?
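
One possible answer, sketched below with pymongo against the MARC-JSON draft
shape and invented database/collection names: MongoDB builds multikey indexes
over paths into arrays of subdocuments, and $elemMatch keeps tag and data
pinned to the same array element. Whether the planner uses those indexes
efficiently for this query is something to verify; precomputing a top-level
key per interesting field (as discussed above) is the other common workaround.

    # Rough sketch against the MARC-JSON draft shape shown above.
    from pymongo import MongoClient

    client = MongoClient()
    coll = client["marc"]["bib"]

    coll.insert_one({
        "controlfield": [{"tag": "001", "data": "fst01312614"}],
    })

    # Multikey indexes: MongoDB indexes each array element's value separately.
    coll.create_index("controlfield.tag")
    coll.create_index("controlfield.data")

    # $elemMatch ties "tag" and "data" to the same array element, which is
    # what "field 001 equals fst01312614" really means here.
    doc = coll.find_one({
        "controlfield": {"$elemMatch": {"tag": "001", "data": "fst01312614"}}
    })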


BTW, I fed MongoDB with the example MARC records in [2] and [3], and
it choked on them. Both are missing some commas :-)


[1] http://www.mongodb.org/display/DOCS/Indexes
[2] http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/
[3] http://worldcat.org/devnet/wiki/MARC-JSON_Draft_2010-03-11


--
Fernando Gómez
Biblioteca Antonio Monteiro
INMABB (Conicet / Universidad Nacional del Sur)
Av. Alem 1253
B8000CPB Bahía Blanca, Argentina
Tel. +54 (291) 459 5116
http://inmabb.criba.edu.ar/



[CODE4LIB] internet archive experiment

2010-05-14 Thread Eric Lease Morgan
We are doing a tiny experiment here at Notre Dame with the Internet Archive;
specifically, we are determining whether or not we can supplement a special 
collection with full text content.

We are hosting a site colloquially called the Catholic Portal -- a collection
of rare, infrequently held, and uncommon materials of a Catholic nature. [1] 
Much of the content of the Portal is metadata -- MARC and EAD records/files. I 
think the Portal would be more useful if it contained full text content. If it 
did, then indexing would be improved and services against the texts could be 
implemented.

How can we get full text content? This is what we are going to try:

  1. parse out identifying information from
 metadata (author names, titles, dates,
 etc.)

  2. construct a URL in the form of an
     Advanced Search query [2] and send it to
     the Archive (see the sketch after this list)

  3. get back a list of matches in an XML
 format

  4. parse the result looking for the best
 matches

  5. save Internet Archive keys identifying
 full text items

  6. mirror Internet Archive content locally
 using keys as pointers

  7. update local metadata files pointing to
 Archive content as well as locally
 mirrored content

  8. re-index local metadata
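
A minimal sketch of steps 2-4 (not our actual code), using the Archive's
advanced search endpoint; the query, parameter names, and JSON response
shape are assumptions to verify against the Archive's documentation [2]:

    # Build an Advanced Search query from parsed metadata and pull back
    # candidate Internet Archive identifiers (steps 2-4).
    import json
    import urllib.parse
    import urllib.request

    def ia_candidates(author, title, rows=10):
        query = 'creator:("%s") AND title:("%s") AND mediatype:texts' % (author, title)
        params = urllib.parse.urlencode(
            {"q": query, "fl[]": "identifier", "rows": rows, "output": "json"})
        url = "http://www.archive.org/advancedsearch.php?" + params
        with urllib.request.urlopen(url) as resp:
            result = json.load(resp)
        # steps 4-5: keep the identifiers (keys) of the best-looking matches
        return [doc["identifier"] for doc in result["response"]["docs"]]

    print(ia_candidates("Newman, John Henry", "Apologia pro vita sua"))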

If we are (somewhat) successful, then search results would not only have 
pointers to the physical items, but they would also have pointers to the 
digitized items. Not only could they have pointers to the digitized items, but 
they could also have pointers to services against the texts such as "make word
cloud", "display concordance", "plot word/phrase frequency", etc. These latter
services are spaces where I think there is great potential for librarianship.

Frankly, because of the Portal's collection policy, I don't expect to find very 
much material. On the other hand, the same process could be applied to more 
generic library collections where more content may have already been digitized. 

Wish us luck.

[1] Catholic Portal - http://www.catholicresearch.net/
[2] Advanced search - http://www.archive.org/advancedsearch.php

-- 
Eric Lease Morgan
University of Notre Dame


Re: [CODE4LIB] Query LCSH terms at id.loc.gov by modification date

2010-05-14 Thread Ethan Gruber
Thanks for the tips.  I got it working with an XSLT stylesheet, which I have
attached for those who are interested.

You can generate a test XML file from the command line:

java -jar /path/to/saxon9.jar -s http://id.loc.gov/authorities/feed/page/1/ \
  -xsl:/path/to/update-lcsh.xsl date=2010-04-15 > test.xml

where date=2010-04-15 is a parameter that you can change.  It is passed to
the stylesheet and Saxon steps through the pages of the feed, extracting
those entries created after 2010-04-14.  I found it to be pretty fast.  I
can easily integrate this into an Orbeon pipeline to keep my Solr index of
LCSH terms up to date.

Ethan

On Fri, May 14, 2010 at 10:45 AM, Kevin Ford k...@loc.gov wrote:

 Hard-coded.  There's currently no way to pass any kind of count parameter.

 Cordially,
 Kevin


  Ethan Gruber ewg4x...@gmail.com 05/14/10 9:58 AM 
 Thanks for the help.  It should be doable.  Do you know if it's possible to
 control the number of entries per page, or is that locked?

 Ethan

 On Thu, May 13, 2010 at 6:11 PM, Ed Summers e...@pobox.com wrote:

  As Kevin said, I think you can use the Atom feed to page backwards
  through time. Basically this amounts to programmatically following the
  <link rel="next"> links in the feed, applying creates, updates, and
  deletes as you go until you make it to Feb. 15, 2010.
 
  Currently this would involve walking from:
 
   http://id.loc.gov/authorities/feed/
 
  to:
 
   http://id.loc.gov/authorities/feed/page/2/
 
  all the way to:
 
   http://id.loc.gov/authorities/feed/page/96/
 
  Then in a month's time or whatever you can run the same process again.
  I think you can either walk through the feed pages until a known last
  harvest date, or until you see a record with an atom:id and
  atom:updated you already know about. I think the latter could be a bit
  simpler, assuming you are keeping track of what you have.
 
  Ever since reading the OAI-ORE specs on Atom [1] I've become a bit
  taken with the idea of using Atom syndication as a drop-in replacement
  for OAI-PMH--which is the spec that most people in the library
  community reach for when they want to do metadata synchronization. The
  advantage of Atom is that it fits so nicely into the syndication world
  and its ecosystem of tools and services.
 
  //Ed
 
  [1] http://www.openarchives.org/ore/1.0/atom
 
 
  On Thu, May 13, 2010 at 4:53 PM, Kevin Ford k...@loc.gov wrote:
   The short answer to your question is no, there's no way to query terms
  based on last modification date.  However (this feature still needs to be
  documented on the website), there is an Atom feed that exposes the change
  activity for the subject headings:
  
   http://id.loc.gov/authorities/feed/
  
   You can page through it (feed/page/1, feed/page/2).
  
   There is also a page that shows when each load was performed:
  
   http://id.loc.gov/authorities/loads/
  
   It too has an Atom feed (http://id.loc.gov/authorities/loads/feed).
  
   HTH,
   Kevin
 

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:atom="http://www.w3.org/2005/Atom" xmlns:xs="http://www.w3.org/2001/XMLSchema"
	xmlns:dcterms="http://purl.org/dc/terms/" xmlns:at="http://purl.org/atompub/tombstones/1.0"
	exclude-result-prefixes="xs dcterms at atom" version="2.0">

	<xsl:output method="xml" encoding="UTF-8"/>

	<xsl:param name="date"/>

	<xsl:param name="year" select="substring-before($date, '-')"/>
	<xsl:param name="month" select="substring-before(substring-after($date, '-'), '-')"/>
	<xsl:param name="day" select="substring-after(substring-after($date, '-'), '-')"/>

	<xsl:template match="/">
		<add>
			<xsl:apply-templates select="//node()[local-name() = 'entry']"/>
		</add>
	</xsl:template>

	<xsl:template match="*[local-name() = 'entry']">
		<xsl:variable name="local-date" select="substring-before(node()[local-name() = 'updated'], 'T')"/>
		<xsl:variable name="local-year" select="substring-before($local-date, '-')"/>
		<xsl:variable name="local-month"
			select="substring-before(substring-after($local-date, '-'), '-')"/>
		<xsl:variable name="local-day"
			select="substring-after(substring-after($local-date, '-'), '-')"/>
		<!-- Note: comparing year, month, and day separately assumes the cutoff
		     and the entry dates fall within the same month and year. -->
		<xsl:if test="$local-year &gt;= $year">
			<xsl:if test="$local-month &gt;= $month">
				<xsl:if test="$local-day &gt;= $day">
					<doc>
						<field name="id">
							<xsl:value-of select="substring-after(node()[local-name() = 'id'], 'authorities/')"/>
						</field>
						<field name="subject">
							<xsl:value-of select="node()[local-name() = 'title']"/>
						</field>
						<field name="created">
							<xsl:value-of select="dcterms:created"/>
						</field>
						<field name="modified">
							<xsl:value-of select="node()[local-name() = 'updated']"/>
						</field>
					</doc>
					<!-- At the last entry on the page, follow the rel="next" link
					     and keep processing the next (older) page of the feed. -->
					<xsl:if test="position() = last()">
						<xsl:variable name="next" select="//node()[local-name() = 'link'][@rel='next']/@href"/>
						<xsl:apply-templates select="document($next)//node()[local-name() = 'entry']"/>
					</xsl:if>
				</xsl:if>
			</xsl:if>
		</xsl:if>
	</xsl:template>

</xsl:stylesheet>


Re: [CODE4LIB] internet archive experiment

2010-05-14 Thread Graham Stewart

Hi,

I may be able to assist you with the content mirroring part of this. 
The University of Toronto Libraries hosts one of the Internet Archive 
scanning operations through the Open Content Alliance and we host 
content originally scanned by the Archive through the OCUL 
Scholarsportal project at this URL:  http://books.scholarsportal.info


In order to retrieve content from the IA (since it is sent immediately 
to San Francisco as it is scanned) I've written a set of scripts that 
download content based on various parameters.


- the starting point is a list of IA identifiers and other metadata
pulled from an advanced search query.


- from that list, you can specify which file types to download (*.pdf,
*_marc.xml, *.djvu, *_meta.xml, etc.).


- the downloads are then queued and retrieved to specified local file
systems.


The system uses a MySQL backend, Perl, and curl for HTTP downloads, with
an option for rsync, and is designed to run on Linux systems.  It contains
fairly sophisticated tools for checking download success, comparing file
sizes with the Archive, and verifying md5 checksums; it can be re-run
against the Archive in case content changes, and can be adapted to a
variety of needs.
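
For the flavour of the download-and-verify step, here is a rough sketch in
Python (not Graham's scripts; the identifier and suffix list are placeholders,
and the *_files.xml manifest layout should be checked against the Archive's
documentation):

    # Fetch an item's *_files.xml manifest, download selected file types,
    # and check the md5 checksum the Archive publishes for each file.
    import hashlib
    import urllib.parse
    import urllib.request
    import xml.etree.ElementTree as ET

    def mirror_item(identifier, suffixes=(".pdf", "_marc.xml", "_meta.xml")):
        base = "http://www.archive.org/download/%s/" % identifier
        with urllib.request.urlopen(base + identifier + "_files.xml") as resp:
            manifest = ET.parse(resp).getroot()
        for f in manifest.findall("file"):
            name = f.get("name")
            if not name.endswith(suffixes):
                continue
            data = urllib.request.urlopen(base + urllib.parse.quote(name)).read()
            expected = f.findtext("md5")
            if expected and hashlib.md5(data).hexdigest() != expected:
                print("checksum mismatch, skipping:", name)
                continue
            with open(name.replace("/", "_"), "wb") as out:
                out.write(data)

    mirror_item("some_ia_identifier")   # placeholder identifier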


So far we've downloaded about 400,000 PDFs and associated metadata
(about 14 TB altogether).  It could, however, be used to, for example,
just download MARC records for integration into an ILS (a separate
challenge, of course), and to build pointers to the Archive's content
for the full text.


We've had plans to open source it for some time, but other work always
gets in the way.  If you (or anyone) want to take a look and try it out,
just let me know.


--
Graham Stewart  graham.stew...@utoronto.ca  416-550-2806
Network and Storage Services Manager, Information Technology Services
University of Toronto Libraries
130 St. George Street
Toronto, Ontario, Canada M5S 1A5

On 10-05-14 03:34 PM, Eric Lease Morgan wrote:

We are doing a tiny experiment here at Notre Dame with the Internet Archive;
specifically, we are determining whether or not we can supplement a special 
collection with full text content.

We are hosting a site colloquially called the Catholic Portal -- a collection
of rare, infrequently held, and uncommon materials of a Catholic nature. [1] 
Much of the content of the Portal is metadata -- MARC and EAD records/files. I 
think the Portal would be more useful if it contained full text content. If it 
did, then indexing would be improved and services against the texts could be 
implemented.

How can we get full text content? This is what we are going to try:

   1. parse out identifying information from
  metadata (author names, titles, dates,
  etc.)

   2. construct a URL in the form of an
  Advanced Search query and send it to the
  Archive

   3. get back a list of matches in an XML
  format

   4. parse the result looking for the best
  matches

   5. save Internet Archive keys identifying
  full text items

   6. mirror Internet Archive content locally
  using keys as pointers

   7. update local metadata files pointing to
  Archive content as well as locally
  mirrored content

   8. re-index local metadata

If we are (somewhat) successful, then search results would not only have pointers to the 
physical items, but they would also have pointers to the digitized items. Not only could 
they have pointers to the digitized items, but they could also have pointers to 
services against the texts such as make word cloud, display concordance, plot 
word/phrase frequency, etc. These later services are spaces where I think there is great 
potential for librarianship.

Frankly, because of the Portal's collection policy, I don't expect to find very 
much material. On the other hand, the same process could be applied to more 
generic library collections where more content may have already been digitized.

Wish us luck.

[1] Catholic Portal - http://www.catholicresearch.net/
[2] Advanced search - http://www.archive.org/advancedsearch.php