Nate,
Sounds cool! For the collection development angle (and any ancillary data
curation aspects) I'd suggest checking with IASSIST [1]. You might also see
their jobs repository [2], a record of job descriptions posted to the
members' email list from 2005 to the present.
The OKFN might be another
The question is how the accounts are being deleted. The only ways I can
think of require staff action -- either deleting them directly or
overlaying with bad data.
While it's conceivable that something else is going on, it's more likely
that you either are dealing with a careless or disgruntled
I work at a Minuteman Library and I have been in touch with Mr. Saklad
offlist.
Accounts are not being deleted by the careless or disgruntled. We do have
an annual process for deleting inactive accounts based on long-established
criteria.
What Mr. Saklad is observing is the effects of the deletion
a) Forensic studies deal with how to retrieve deleted, unarchived
data. So-called deleted data is often still available.
b) Set up the system not to delete records belonging to users. Let
users keep their information saved for follow-up. Or, at the very least,
notify users beforehand.
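Point (b) amounts to a soft delete: flag the record instead of removing it, and only purge after the patron has been notified. A minimal sketch of that idea, assuming a hypothetical account schema (the names `Account`, `mark_inactive`, and `purge_candidates` are illustrative, not from any real ILS):

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Account:
    patron_id: str
    last_active: date
    inactive: bool = False   # soft-delete flag instead of a real DELETE
    notified: bool = False   # patron warned before any purge

def mark_inactive(accounts, cutoff):
    """Flag accounts idle since before `cutoff`; never remove the record."""
    for a in accounts:
        if a.last_active < cutoff:
            a.inactive = True

def purge_candidates(accounts):
    """Only accounts that were flagged AND notified are eligible for purging."""
    return [a for a in accounts if a.inactive and a.notified]
```

The point of the two-step design is that no record disappears until both conditions hold, so the patron keeps their information for follow-up in the meantime.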
For a limited period of time I am making publicly available a Web-based program
called PDF2TXT -- http://bit.ly/1bJRyh8
PDF2TXT extracts the text from an OCRed PDF document and then does some
rudimentary distant reading against the text in the form of word clouds,
readability scores,
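A readability score of the kind mentioned can be approximated with the Flesch reading-ease formula; the sketch below uses a crude vowel-group syllable counter and is not necessarily what PDF2TXT itself computes:

```python
import re

def count_syllables(word):
    # Crude heuristic: count runs of consecutive vowels (minimum one).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / sentences) - 84.6 * (syllables / len(words))
```

Higher scores mean easier text; the syllable heuristic is the weak point, which is why published implementations use dictionary-based counters.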
Note: this job is in Academic Services at Princeton, not in the Library,
though we do work together from time to time. The full posting is here:
http://jobs.princeton.edu/applicants/Central?quickFind=64011
Cross-posted. Please excuse any duplicate copies you receive.
Princeton University
Very neat. I couldn't get the 'network diagram' link to work (from
http://dh.crc.nd.edu/sandbox/pdf2txt/pdf2txt.cgi?cmd=search&id=1381506693&query=public%20library).
How hard do you think it would be to do stemming before some of the
subsequent processing? The bi-grams public libraries and
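Stemming before the n-gram step would fold variants like "library" and "libraries" into one bigram. A minimal sketch, using a toy suffix stripper where a real implementation would use a Porter stemmer:

```python
from collections import Counter

def naive_stem(word):
    """Toy suffix stripper standing in for a real stemmer (e.g. Porter).
    Enough to fold 'libraries' and 'library' into the same token."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 5:
        return w[:-3] + "y"
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w

def stemmed_bigrams(tokens):
    """Stem each token first, then count adjacent pairs."""
    stems = [naive_stem(t) for t in tokens]
    return Counter(zip(stems, stems[1:]))
```

With stemming applied first, "public libraries" and "public library" count as the same bigram instead of splitting the frequency between two entries.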
Very cool.
But, why only for a limited period of time?
-Sean
On 10/11/13 11:16 AM, Eric Lease Morgan emor...@nd.edu wrote:
> For a limited period of time I am making publicly available a Web-based
> program called PDF2TXT -- http://bit.ly/1bJRyh8
> PDF2TXT extracts the text from an OCRed PDF
On Oct 11, 2013, at 11:57 AM, Peter Murray peter.mur...@lyrasis.org wrote:
>> For a limited period of time I am making publicly available a Web-based
>> program called PDF2TXT -- http://bit.ly/1bJRyh8
> Very neat. I couldn't get the 'network diagram' link to work (from
On Oct 11, 2013, at 12:57 PM, Sean Hannan shan...@jhu.edu wrote:
>> For a limited period of time I am making publicly available a Web-based
>> program called PDF2TXT -- http://bit.ly/1bJRyh8
> Very cool. But, why only for a limited period of time?
Sean, thank you for your support. I'm making it
Very slick, good work. I can see where this tool can be very helpful. It
does have some issues with some characters, but this is rather common with
most systems.
On Fri, Oct 11, 2013 at 11:16 AM, Eric Lease Morgan emor...@nd.edu wrote:
> For a limited period of time I am making publicly
On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com wrote:
>> For a limited period of time I am making publicly available a Web-based
>> program called PDF2TXT -- http://bit.ly/1bJRyh8
> Very slick, good work. I can see where this tool can be very helpful. It
> does have some
Very cool tool, thank you!
Putting my devil's advocate hat on, it doesn't parse foreign documents well
(I got it to break!). I also got inconsistent results feeding it PDF files
with tables embedded (but haven't been able to figure out what it is about
them it doesn't like).
Just from a
You may want to consider how best to handle PDF files where the text would
contain ligatures and glyph ids rather than the underlying characters.
A.
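One way to handle the ligature case raised above (not necessarily what PDF2TXT should do): Unicode NFKC normalization decomposes compatibility code points such as the "fi" ligature back into plain characters. Glyph IDs with no Unicode mapping are a harder problem and would still require OCR or the font's cmap table.

```python
import unicodedata

def fix_ligatures(text):
    """Decompose compatibility characters, e.g. U+FB01 (fi ligature) -> 'fi'."""
    return unicodedata.normalize("NFKC", text)
```

Note that NFKC also normalizes other compatibility forms (superscripts, fullwidth letters), so it is a blunt instrument; for extracted PDF text that is usually acceptable.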
On 12/10/2013 4:58 AM, Eric Lease Morgan emor...@nd.edu wrote:
> On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com
> wrote:
Hi Mark,
I suspect the tool will only be able to handle select languages, and it is
very doubtful you could develop a tool to handle non-LCG text.
For a fully internationalised tool, you would have to ignore all text
layers in a PDF and run all PDFs through OCR to generate text.
Then you'd need to
Perl has its own encoding model: strings can be Unicode or a legacy
encoding. Unicode is indicated by the presence of a flag on the string,
so it is decided on a string-by-string basis.
If it is a legacy encoding, then it could be any legacy encoding.
If your data is truly multilingual,
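Since "it could be any legacy encoding," a common fallback strategy is to try UTF-8 first and then walk a list of candidate legacy encodings. A hedged Python sketch of that idea (the candidate list here is an assumption; real pipelines often use a detector like chardet instead):

```python
def decode_guessing(raw, candidates=("utf-8", "cp1252", "latin-1")):
    """Try UTF-8 first, then fall back through candidate legacy encodings.
    Returns (decoded_text, encoding_used)."""
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte value, so this is only reached if it was
    # removed from the candidate list.
    return raw.decode("latin-1"), "latin-1"
```

The order matters: latin-1 accepts any byte sequence, so it must come last or it will mask genuinely UTF-8 input.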