Re: [CODE4LIB] Open Data Specialist

2013-10-11 Thread Jodi Schneider
Nate, Sounds cool! For the collection development angle (and any ancillary data curation aspects) I'd suggest checking with IASSIST [1]. You might also see their jobs repository [2], a record of job descriptions posted to the members' email list from 2005 to the present. The OKFN might be another

Re: [CODE4LIB] What can be done to stop deleting of records belonging to users of our Minuteman Library Network in Massachusetts?

2013-10-11 Thread Kyle Banerjee
The question is how the accounts are being deleted. The only ways I can think of require staff action -- either deleting them directly or overlaying with bad data. While it's conceivable that something else is going on, it's more likely that you either are dealing with a careless or disgruntled

Re: [CODE4LIB] What can be done to stop deleting of records belonging to users of our Minuteman Library Network in Massachusetts?

2013-10-11 Thread Matt Amory
I work at a Minuteman Library and I have been in touch with Mr. Saklad offlist. Accounts are not being deleted by the careless or disgruntled. We do have an annual process for deleting inactive accounts based on long-established criteria. What Mr. Saklad is observing is the effects of the deletion

Re: [CODE4LIB] What can be done to stop deleting of records belonging to users of our Minuteman Library Network in Massachusetts?

2013-10-11 Thread don warner saklad
a) Forensic studies deal with how to retrieve deleted unarchived data. So-called deleted data is actually available. b) Set up the system not to delete records belonging to users. Let users keep their information saved for follow-up. Or at the very least notify users beforehand. Example. Title

[CODE4LIB] pdf2txt

2013-10-11 Thread Eric Lease Morgan
For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 PDF2TXT extracts the text from an OCRed PDF document and then does some rudimentary distant reading against the text in the form of word clouds, readability scores,

[CODE4LIB] Job: Digital Repository Software Developer at Princeton University

2013-10-11 Thread Jon Stroop
Note: this job is in Academic Services at Princeton, not in the Library, though we do work together from time to time. The full posting is here: http://jobs.princeton.edu/applicants/Central?quickFind=64011 Cross-posted. Please excuse any duplicate copies you receive. *Princeton University

Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Peter Murray
Very neat. I couldn't get the 'network diagram' link to work (from http://dh.crc.nd.edu/sandbox/pdf2txt/pdf2txt.cgi?cmd=searchid=1381506693query=public%20library). How hard do you think it would be to do stemming before some of the subsequent processing? The bi-grams public libraries and

Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Sean Hannan
Very cool. But, why only for a limited period of time? -Sean On 10/11/13 11:16 AM, Eric Lease Morgan emor...@nd.edu wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 PDF2TXT extracts the text from an OCRed PDF

Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Eric Lease Morgan
On Oct 11, 2013, at 11:57 AM, Peter Murray peter.mur...@lyrasis.org wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT --http://bit.ly/1bJRyh8 Very neat. I couldn't get the 'network diagram' link to work (from

Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Eric Lease Morgan
On Oct 11, 2013, at 12:57 PM, Sean Hannan shan...@jhu.edu wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 Very cool. But, why only for a limited period of time? Sean, thank you for your support. I'm making it

Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Matthew Sherman
Very slick, good work. I can see where this tool can be very helpful. It does have some issues with some characters, but this is rather common with most systems. On Fri, Oct 11, 2013 at 11:16 AM, Eric Lease Morgan emor...@nd.edu wrote: For a limited period of time I am making publicly

Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Eric Lease Morgan
On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 Very slick, good work. I can see where this tool can be very helpful. It does have some

Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Mark Pernotto
Very cool tool, thank you! Putting my devil's advocate hat on, it doesn't parse foreign documents well (I got it to break!). I also got inconsistent results feeding it PDF files with tables embedded (but haven't been able to figure out what it is about them it doesn't like). Just from a

Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Andrew Cunningham
You may want to consider how best to handle PDF files where the text would contain ligatures and glyph ids rather than the underlying characters. A. On 12/10/2013 4:58 AM, Eric Lease Morgan emor...@nd.edu wrote: On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com wrote:
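
A minimal sketch of one way to handle the ligature part of this suggestion, written in Python rather than whatever PDF2TXT itself uses (the thread does not say). It assumes the extracted text already contains Unicode presentation-form ligatures such as U+FB01; the function name `expand_ligatures` is hypothetical. It does not address raw glyph ids, which cannot be recovered this way when the PDF lacks a character mapping.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Replace presentation-form ligatures with their underlying letters.

    NFKC compatibility normalization maps characters such as U+FB01
    (LATIN SMALL LIGATURE FI) to the plain sequence "fi". It also folds
    other compatibility characters, so it is a blunt instrument.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    # Two ligatures: U+FB01 (fi) and U+FB03 (ffi)
    print(expand_ligatures("\ufb01le o\ufb03ce"))
```

Running a step like this before building word clouds or bi-grams would keep a ligated word from being counted separately from its plain spelling.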

Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Andrew Cunningham
Hi Mark, I suspect the tool will only be able to handle select languages, and it is very doubtful you could develop a tool to handle non-LCG text. For a fully internationalised tool, you would have to ignore all text layers in a PDF and run all PDFs through OCR to generate text. Then you'd need to

Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Andrew Cunningham
Perl has its own encoding model: strings could be Unicode or a legacy encoding. Unicode is indicated by the presence of a flag on a string, but it's decided on a string-by-string basis. If it is a legacy encoding, then it could be any legacy encoding. If your data is truly multilingual,
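
The "could be any legacy encoding" problem can be illustrated with a small sketch. It is in Python, not Perl, and is not taken from the thread: it tries UTF-8 first and falls back to a single assumed legacy encoding. The function name `decode_bytes` and the default of `cp1252` are hypothetical choices for illustration; real multilingual data would need a per-source decision about the fallback.

```python
def decode_bytes(raw: bytes, legacy_encoding: str = "cp1252") -> str:
    """Decode bytes as UTF-8, falling back to one assumed legacy encoding.

    The fallback is a guess: bytes that are not valid UTF-8 could come
    from any legacy encoding, so the caller has to supply the right one.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Undecodable bytes are replaced with U+FFFD rather than raising.
        return raw.decode(legacy_encoding, errors="replace")


if __name__ == "__main__":
    print(decode_bytes("café".encode("utf-8")))
    print(decode_bytes("café".encode("cp1252")))
```

Strict UTF-8 decoding is tried first because text in a legacy encoding rarely happens to be valid UTF-8, whereas the reverse guess silently produces mojibake.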