Re: [CODE4LIB] PDF->text extraction

2011-06-21 Thread Bill Janssen
Simon Spero wrote: > Another option is to use the ABBYY FineReader > SDK. > Annoyingly, the linux version is one release behind the windows SDK (which > has improved support for multi core processing of single document). Since > Owen's problem is e

Re: [CODE4LIB] PDF->text extraction

2011-06-21 Thread Simon Spero
PDFBox (and hence Tika) get worse the more recent a version of the PDF format you use. One fun trick they can do is get a tad confused and think there are control characters in extracted metadata fields. Great fun when those characters are then inserted into an XML CMIS response. (Why do I seem t
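
A minimal sketch of one way to guard against the problem described above: strip characters that XML 1.0 forbids from an extracted metadata value before it goes into an XML response. The function name `clean_for_xml` and the sample title are made up for illustration; this is not how PDFBox, Tika, or any CMIS server handles it.

```python
import re
from xml.sax.saxutils import escape

# Characters XML 1.0 does not allow in character data: C0 controls other than
# tab, LF and CR, plus surrogates and U+FFFE/U+FFFF.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")


def clean_for_xml(value: str) -> str:
    """Drop characters that are illegal in XML 1.0, then escape markup characters."""
    return escape(_XML_ILLEGAL.sub("", value))


# Hypothetical metadata value, as it might come back from a PDF extractor.
title = "Annual\x00 Report\x0c 2010 & Appendix"
print(clean_for_xml(title))
```

Dropping the characters silently is a design choice; logging the affected document instead may be more useful when the goal is to find the bad PDFs.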

[CODE4LIB] Presentation on Linked Data in LC's Authorities and Vocabularies Web Service

2011-06-21 Thread Guenther, Rebecca
Note that this was previously announced but we are adding an additional session. LC's Authorities and Vocabularies Web Service: experimenting with Linked Data Rebecca Guenther of the Library of Congress will give a presentation about LC's exploration of controlled vocabularies as Linked Data in

Re: [CODE4LIB] PDF->text extraction

2011-06-21 Thread Bill Janssen
Boheemen, Peter van wrote: > The most used open source software for this (and many other mime > types) is tika: http://tika.apache.org/ While I'm sure it's widely used, it's also relatively immature. For PDF, it just punts to PDFBox (which is also relatively immature). The most widely used com

Re: [CODE4LIB] PDF->text extraction

2011-06-21 Thread Pottinger, Hardy J.
On 6/21/11 12:36 PM, "Boheemen, Peter van" wrote: >The most used open source software for this (and many other mime types) >is tika: http://tika.apache.org/ Thanks for this link, Tika looks great! -- HARDY POTTINGER University of Missouri Library Systems http://lso.umsystem.edu/~pottingerhj/

Re: [CODE4LIB] PDF->text extraction

2011-06-21 Thread Boheemen, Peter van
The most used open source software for this (and many other mime types) is tika: http://tika.apache.org/ From: Code for Libraries [CODE4LIB@LISTSERV.ND.EDU] on behalf of Bill Janssen [jans...@parc.com] Sent: Tuesday, 21 June 2011 19:19 To: CODE4LIB@LISTSERV.

Re: [CODE4LIB] OIA Feeds > OAI feeds

2011-06-21 Thread McAulay, Elizabeth
Hi all, UCLA Digital Library Program is running a static repository gateway for the Sheet Music Consortium. It's online at: http://oaigateway.library.ucla.edu/. If anyone wants more information, just let me know. Thanks, Lisa - Elizabeth "Lisa" McAulay Libr

Re: [CODE4LIB] PDF->text extraction

2011-06-21 Thread Bill Janssen
Owen Stephens wrote: > The CORE project at The Open University in the UK is doing some work on > finding similarity between papers in institutional repositories (see > http://core-project.kmi.open.ac.uk/ for more info). The first step in the > process is extracting text from the (mainly) pdf

Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Eoghan Ó Carragáin
Hi, As Demian mentioned, we're using VuFind + Virtua at the National Library of Ireland. We use a subset of the USQ driver methods with minor modifications to accommodate our data (e.g. item classes) and our version of Virtua. I'm happy to answer any questions. The Virtua OAI-PMH add-on is pretty

Re: [CODE4LIB] OIA Feeds > OAI feeds

2011-06-21 Thread Robert Robertson
Hi, A number of years ago the now defunct Centre for Digital Library Research at the University of Strathclyde ran a JISC-funded project to investigate the use of Static Repositories. Details and guidance are (currently) available here: http://cdlr.strath.ac.uk/stargate/ there are some signif

Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Nathan Tallman
Hmmm, I did call VTLS Support and they told me I would have to upgrade our catalog interface to their latest product to get OAI-PMH. I'll have to read this more closely and call them back. Thanks! Nathan On Tue, Jun 21, 2011 at 11:41 AM, Roy Tennant wrote: > According to this: > > < > http://

Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Roy Tennant
According to this: Virtua already supports OAI-PMH. Are you sure you just haven't poked around enough? Or called VTLS suppo

Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Demian Katz
I'm aware of at least two Virtua libraries currently using VuFind: USQ (http://library.usq.edu.au/) and the National Library of Ireland (http://catalogue.nli.ie/). VuFind is currently bundled with a copy of USQ's Virtua driver, so it should be possible to get things up and running with minima

Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Cook, Randall
You might want to take a look at the XC OAI Toolkit. http://code.google.com/p/xcoaitoolkit/ The Toolkit provides an infrastructure. It was designed to take MARC and convert to MARCXML and serve that data up in an OAI repository, and in XC's end to end system this is the entry point for ILS data

Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Timothy Cornwell
I have tangential experience with the free, java-based OAI service from UCAR/DLS. Info here: http://www.ncdc.noaa.gov/oai/ I believe this service will take your xml metadata files and serve them in a number of configurable ways. There are, of course, other implementation details that may be

Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Nathan Tallman
Thank you everyone for your replies. Right now, I'm just exploring the options for a potential project. We need to make our MARC records available as Dublin Core via OAI-PMH. We don't have a digital repository or similar infrastructure at the moment, so I'll take a look at the OAI Static Repositor
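
As a rough illustration of exposing MARC data as Dublin Core, the sketch below builds an `oai_dc` metadata element from a handful of MARC fields. The `CROSSWALK` table is a deliberately tiny, illustrative mapping rather than a complete or authoritative crosswalk, and the `(tag, subfields)` record layout is a hypothetical stand-in for whatever a real MARC parser returns.

```python
import xml.etree.ElementTree as ET

OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC)
ET.register_namespace("dc", DC)

# Illustrative (MARC tag, subfield code) -> DC element mapping.
# Assumption: a real crosswalk covers many more fields and subfields.
CROSSWALK = [
    ("245", "a", "title"),
    ("100", "a", "creator"),
    ("260", "b", "publisher"),
    ("260", "c", "date"),
    ("650", "a", "subject"),
]


def marc_to_oai_dc(fields):
    """Build an oai_dc:dc element from a list of (tag, {subfield: value}) pairs."""
    root = ET.Element(f"{{{OAI_DC}}}dc")
    for tag, code, element in CROSSWALK:
        for field_tag, subfields in fields:
            if field_tag == tag and code in subfields:
                ET.SubElement(root, f"{{{DC}}}{element}").text = subfields[code]
    return root


# Made-up record for demonstration only.
record = [
    ("100", {"a": "Doe, Jane"}),
    ("245", {"a": "Letters from the archive"}),
    ("260", {"b": "Example Press", "c": "1950"}),
    ("650", {"a": "Correspondence"}),
    ("650", {"a": "Archives"}),
]
print(ET.tostring(marc_to_oai_dc(record), encoding="unicode"))
```

Repeatable fields such as 650 simply produce repeated `dc:subject` elements, which unqualified Dublin Core allows.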

Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Johnston, Leslie
One can set up an OAI "static repository" without a repository infrastructure. It is not without ongoing costs in staff time, exporting metadata records from their source, converting to appropriate XML or other format files, and keeping it updated and synced. There is some static repository ga

Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Habing, Thomas Gerald
The University of Illinois Library is still running an OAI static gateway. You can initiate a static repository from here: http://imlsdcc.grainger.uiuc.edu/gateway.net/oai.aspx?initiate=http... Regards, Tom Thomas G. Habing University of Illinois at Urbana-Champaign > -Original Message-

Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Bigwood, David
Nathan, I think what you want is an OAI Static Repository. Info here: http://www.openarchives.org/OAI/2.0/guidelines-static-repository.htm If I remember right, you will then need someone else to read your files. Not sure if anyone is still doing that. Sincerely, David Bigwood dbigw...@hou.usra.ed

Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Robert Sanderson
Without /any/ infrastructure it would be a challenge, but a simple database that has timestamps and basic metadata would be sufficient. The timestamps are the most important, obviously, to populate the feed correctly and handle the time slicing. Rob On Tue, Jun 21, 2011 at 8:55 AM, Eric Lease Mor
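
The "simple database with timestamps" idea can be sketched with SQLite: each record carries a datestamp, and a harvester's `from`/`until` arguments become a range query. The table layout, the `list_identifiers` helper, and the `oai:example.org:` identifiers are all hypothetical; the sketch also assumes the arguments use the same granularity as the stored datestamps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records ("
    "identifier TEXT PRIMARY KEY, datestamp TEXT NOT NULL, title TEXT)"
)
# Made-up rows; datestamps are ISO 8601 UTC strings, which sort correctly as text.
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("oai:example.org:1", "2011-05-30T09:00:00Z", "First finding aid"),
        ("oai:example.org:2", "2011-06-10T14:30:00Z", "Second finding aid"),
        ("oai:example.org:3", "2011-06-21T08:15:00Z", "Third finding aid"),
    ],
)


def list_identifiers(conn, from_=None, until=None):
    """Return (identifier, datestamp) rows within [from_, until], both bounds inclusive.

    Assumption: from_ and until use the same granularity as the stored datestamps.
    """
    sql = "SELECT identifier, datestamp FROM records WHERE 1=1"
    params = []
    if from_ is not None:
        sql += " AND datestamp >= ?"
        params.append(from_)
    if until is not None:
        sql += " AND datestamp <= ?"
        params.append(until)
    sql += " ORDER BY datestamp"
    return conn.execute(sql, params).fetchall()


print(list_identifiers(conn, from_="2011-06-01T00:00:00Z"))
```

Updating a record's datestamp whenever its metadata changes is what lets an incremental harvest pick it up again.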

Re: [CODE4LIB] PDF->text extraction

2011-06-21 Thread Demian Katz
Have you tried Aperture (http://aperture.sourceforge.net/)? It's a Java library for extracting content from various document formats including PDF. It comes with command-line scripts that allow you to use it as a stand-alone utility. If performance is your main concern, this may not be the be

[CODE4LIB] Fwd: Software Developer/Research Assistant for R&D in the area of Semantic Web in Life Sciences (DERI)

2011-06-21 Thread Jodi Schneider
Of possible interest... -J Begin forwarded message: > Resent-From: public-semweb-life...@w3.org > From: "Deus, Helena" > Date: 8 June 2011 12:49:59 GMT+01:00 > Subject: Software Developer/Research Assistant for R&D in the area of > Semantic Web in L

Re: [CODE4LIB] PDF->text extraction

2011-06-21 Thread Andreas Walker
I'm using Docsplit (http://documentcloud.github.com/docsplit/), due to its Ruby bindings. It includes OCR if it fails at extracting the text, but it also requires you to install a bunch of other (open source) software. Results seem fine to me so far. Best, Andreas On 21.06.2011 at 16:23, wrote

[CODE4LIB] PDF->text extraction

2011-06-21 Thread Owen Stephens
The CORE project at The Open University in the UK is doing some work on finding similarity between papers in institutional repositories (see http://core-project.kmi.open.ac.uk/ for more info). The first step in the process is extracting text from the (mainly) pdf documents harvested from repos

Re: [CODE4LIB] OIA Feeds

2011-06-21 Thread Mark A. Matienzo
What are you trying to do? Or, more appropriately, what kind of data are you intending to put into your feed? Mark A. Matienzo Digital Archivist, Manuscripts and Archives Yale University Library On Tue, Jun 21, 2011 at 9:50 AM, Nathan Tallman wrote: > Greetings list, > > Can anyone direct me to

[CODE4LIB] OIA Feeds

2011-06-21 Thread Nathan Tallman
Greetings list, Can anyone direct me towards documentation on creating an OAI feed from scratch, without a repository infrastructure? Many thanks! Nathan Tallman Associate Archivist American Jewish Archives

Re: [CODE4LIB] MODS in Dspace

2011-06-21 Thread Cary Gordon
Does this help? http://wiki.surffoundation.nl/display/standards/DSpace+to+MODS+mappings On Mon, Jun 20, 2011 at 11:33 PM, david wrote: > Hi to all community list > > We were wondering if there is a way to integrate MODS schema into dspace > same way as it does with METS or PREMIS. > They are aujt

[CODE4LIB] Reference string parsing and document logical structure software available: ParsCit 110505

2011-06-21 Thread Do Hoang Nhat Huy
Dear all: The ParsCit team has also been updating the ParsCit package, and is happy to announce a new version that improves on classification accuracy, especially for general science journals. This version also adds a module that further processes XML files that are the output of the commercial