Re: [CODE4LIB] Open Data Specialist

2013-10-11 Thread Jodi Schneider
Nate,

Sounds cool! For the collection development angle (and any ancillary data
curation aspects) I'd suggest checking with IASSIST [1]. You might also see
their jobs repository [2], a record of job descriptions posted to the
members' email list from 2005 to the present.

The OKFN might be another good resource, for instance looking through
personnel descriptions [3].

Since you're in a public library context, checking with *municipalities*
that are leading the open data movement might also make sense. So one
question would be: which *local governments* are working on open data? One
is Rotterdam, in the Netherlands [4][5]; I don't know anybody there but you
might ping them.

Anybody know of other *local* open data projects?

-Jodi
[1] http://www.iassistdata.org/about/index.html
[2] http://www.iassistdata.org/resources/jobs/all
[3] http://okfn.org/about/team/
[4] http://www.almende.com/rod-2.0
[5] http://www.rotterdamopendata.nl/


On Tue, Oct 8, 2013 at 1:49 AM, Nate Hill nathanielh...@gmail.com wrote:

 Thanks Ranti!
 I think this is more of an academic library position, right? If anyone on
 the list has a data services librarian in their world I'd love to speak to
 them.
 N

 On Monday, October 7, 2013, Ranti Junus wrote:

  Nate,
 
  For classic collection development activities, you might want to explore
  job descriptions for Data Services librarians.
 
 
  ranti.
 
 
  On Mon, Oct 7, 2013 at 3:47 PM, Nate Hill nathanielh...@gmail.com wrote:
 
   Thanks Toby... that is exactly where I'm starting to look.
   This is a great resource:
   http://project-open-data.github.io/cdo/
   What is complicated is finding anything that relates this to classic
   collection development activities.
   I might have to just make that up!
   If you see anything out there, let me know!
   Cheers
  
  
   On Mon, Oct 7, 2013 at 3:45 PM, Toby Greenwalt theanalogdiv...@gmail.com wrote:

    Nate,

    I'm guessing you're venturing into uncharted territory - at least as the
    library field is concerned. It's more likely that you'll find more relevant
    descriptions in either the urban planning or journalism fields. Let me poke
    around and see if I can dig something up.

    Toby


    On Mon, Oct 7, 2013 at 2:40 PM, Nate Hill nathanielh...@gmail.com wrote:
   
 Hi all,

 I'm working on a job description for an Open Data Specialist at my public
 library.

 This person would work for the public library as a builder/maintainer for
 our open data portal (currently an instance of DKAN) which serves civic
 (and eventually other) public domain data.

 They would also be an open data evangelist and expert, working with other
 city departments to get/keep them involved as contributors of useful,
 useable data, etc.

 I'd also like to highlight the collection development-like aspects of the
 job.

 Has anyone seen a similar job description?

 Thanks

 Nate


 --
 Nate Hill
 nathanielh...@gmail.com
 http://4thfloor.chattlibrary.org/
 http://www.natehill.net

   
  
  
  
   --
   Nate Hill
   nathanielh...@gmail.com
   http://4thfloor.chattlibrary.org/
   http://www.natehill.net
  
 
 
 
  --
  Bulk mail.  Postage paid.
 


 --
 Nate Hill
 nathanielh...@gmail.com
 http://4thfloor.chattlibrary.org/
 http://www.natehill.net



Re: [CODE4LIB] What can be done to stop deleting of records belonging to users of our Minuteman Library Network in Massachusetts?

2013-10-11 Thread Kyle Banerjee
The question is how the accounts are being deleted. The only ways I can
think of require staff action -- either deleting them directly or
overlaying with bad data.

While it's conceivable that something else is going on, it's more likely
that you either are dealing with a careless or disgruntled current or
former staff member.

kyle


On Thu, Oct 10, 2013 at 5:45 PM, don warner saklad
warnersak...@gmail.com wrote:

 Any other online forums, groups, email lists about difficulties with
 Innovative Interfaces software?...

 Innovative Interfaces Incorporated http://www.iii.com/ is the Integrated
 Library System http://en.wikipedia.org/wiki/Integrated_library_system
 provider for Minuteman Library Network
 http://www.mln.lib.ma.us/about/about.htm
 Webmaster scripted replies to
 concerns fail, aren't responsive. Libraries' attempts fail, give up
 attempting to resolve concerns about software.

 Users' records get deleted. No notification before some entries get deleted
 at My Lists  http://www.mln.lib.ma.us/catalog/faq_account.htm#ma50


 On Thu, Oct 10, 2013 at 7:49 PM, Kyle Banerjee kyle.baner...@gmail.com
 wrote:

  I thought you guys have Millennium.
 
  If that is correct, you won't be able to change the behavior of the
 system
  and the only thing you can do is revoke delete permissions for whoever is
  doing it.
 
  kyle
 




 |  What can be done to stop deleting of records belonging to users of our
 Minuteman Library Network in Massachusetts? Or at least notification needs
 to be made before deleting.



Re: [CODE4LIB] What can be done to stop deleting of records belonging to users of our Minuteman Library Network in Massachusetts?

2013-10-11 Thread Matt Amory
I work at a Minuteman Library and I have been in touch with Mr. Saklad
offlist.

Accounts are not being deleted by the careless or disgruntled. We do have
an annual process for deleting inactive accounts based on long-established
criteria.
What Mr Saklad is observing is the effects of the deletion of bibliographic
records in our system.

He asserts that a list of titles which he created before the deletion of
these bib records should remain accessible to him after record deletion.
I myself am no database professional or policy setter, so I am not
qualified to speak to the actual or ideal merit of his assertion, but I am
assured by my technical staff that the titles and other associated metadata
are no longer accessible based on the deleted bib record number.

Based on our previous interactions with Mr. Saklad we have elected not to
bring his comments forward to Innovative Interfaces, nor to make changes to
our policies or processes.


Matt Amory


Matt Amory
(917) 771-4157
matt.am...@gmail.com
http://www.linkedin.com/pub/matt-amory/8/515/239


 



Re: [CODE4LIB] What can be done to stop deleting of records belonging to users of our Minuteman Library Network in Massachusetts?

2013-10-11 Thread don warner saklad
a) Forensic studies deal with how to retrieve deleted, unarchived
data. So-called deleted data is actually available.

b) Set up the system not to delete records belonging to users. Let
users keep their information saved for followup. Or at the very least
notify users beforehand.

Example: a title wrongfully deleted. The title is replaced by an
8-character record number, a lowercase letter followed by 7 digits.

| You are logged in
https://library.minlib.net/patroninfo~S1/1196412/mylists

My Lists  organize130808 ( 29 )
https://library.minlib.net/patroninfo~S1/1196412/mylists?listNum=31980

Mark   Title   Author   Date Added
[_] Record b2491348 is not available 03-12-2013
[_] Record b2522793 is not available 03-12-2013

[_] Record b2926646 is not available 10-29-2011
[_] Record b2948837 is not available 10-25-2011


[CODE4LIB] pdf2txt

2013-10-11 Thread Eric Lease Morgan

For a limited period of time I am making publicly available a Web-based program 
called PDF2TXT -- http://bit.ly/1bJRyh8

PDF2TXT extracts the text from an OCRed PDF document and then does some 
rudimentary distant reading against the text in the form of word clouds, 
readability scores, concordance features, and maps (histograms) illustrating 
where terms appear in a text.
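
The concordance and term-location features described above can be sketched in a few lines. This is a minimal Python illustration, not PDF2TXT's actual code; the function names (`tokenize`, `concordance`, `term_histogram`) and the sample sentence are assumptions made up for the example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def concordance(text, term, width=20):
    """Return each occurrence of `term` with `width` characters of context."""
    pattern = r"\b%s\b" % re.escape(term)
    lines = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        lines.append(text[start:end].replace("\n", " "))
    return lines

def term_histogram(text, term, bins=10):
    """Count occurrences of `term` in each of `bins` equal slices of the text."""
    tokens = tokenize(text)
    counts = [0] * bins
    for position, token in enumerate(tokens):
        if token == term.lower():
            # Map the token position onto one of the equal-width bins.
            counts[position * bins // len(tokens)] += 1
    return counts

# Hypothetical sample text, standing in for text extracted from a PDF.
sample = ("The library scans books. Readers visit the library to read. "
          "A library lends books.")
for line in concordance(sample, "library", width=10):
    print(line)
print(term_histogram(sample, "library", bins=3))
```

The histogram is just a list of counts per slice of the document, so it can be handed to any plotting tool to show where a term clusters.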

Here is the idea behind the application:

  1. In the Libraries I see people scanning, scanning, and
 scanning. I suppose these people then go home and read the
 document. They might even print it. These documents are long.
 Moreover, I'll bet they have multiple documents.

  2. Text mining requires digitized text, but PDF documents are
 frequently full of formatting. At the same time, they often
 have the text underneath. Our scanning software does OCR.

  3. By extracting the text from PDF documents, I can facilitate
 a different -- additional -- type of analysis against sets of
 one or more documents. PDF2TXT is the first step in this
 process.

What is really cool is that PDF2TXT works for many of the articles downloadable 
from the Libraries' article indexes. Search an article index. Download a full 
text, PDF version of the article. Feed it to PDF2TXT. Get more out of your 
article.

PDF2TXT currently has creeping featuritis -- meaning that it is growing in 
weird directions. Your feedback is more than welcome. (I know. The output is 
ugly.) Also, please be gentle with it because it does not process things the 
size of the Bible.

--
Eric Lease Morgan
Digital Initiatives Librarian

University of Notre Dame
Room 131, Hesburgh Libraries
Notre Dame, IN 46556
o: 574-631-8604
e: emor...@nd.edu

[CODE4LIB] Job: Digital Repository Software Developer at Princeton University

2013-10-11 Thread Jon Stroop
Note: this job is in Academic Services at Princeton, not in the Library, 
though we do work together from time to time. The full posting is here:


http://jobs.princeton.edu/applicants/Central?quickFind=64011

Cross-posted. Please excuse any duplicate copies you receive.

*Princeton University seeks Digital Repository Software Developer*

In September of 2011 the Faculty of Princeton University approved an 
open access policy intended to make faculty's scholarly articles 
available to a wider public. Princeton is now in the process of ramping 
up its efforts to implement the policy. These efforts will include the 
development of the repository that will hold the scholarly articles. The 
Office of Information Technology seeks a Digital Repository Software 
Developer to establish and enhance digital repositories to house 
academic publications, research data, and related digital assets.  The 
primary focus of the position will be to develop software and systems 
for collecting and depositing academic journal articles subject to 
Princeton University's Open Access Policy for Faculty Publications into 
an open access repository.  This repository will enhance both the 
preservation and dissemination of scholarship at Princeton.


The Digital Repository Software Developer will report to the Digital 
Repository Architect and will work closely with the University's 
Scholarly Communications Librarian and other IT and Library staff.


--
Jon Stroop
Digital Initiatives Programmer/Analyst
Princeton University Library
jstr...@princeton.edu


Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Peter Murray
Very neat.  I couldn't get the 'network diagram' link to work (from 
http://dh.crc.nd.edu/sandbox/pdf2txt/pdf2txt.cgi?cmd=search&id=1381506693&query=public%20library).
How hard do you think it would be to do stemming before some of the 
subsequent processing?  The bi-grams "public libraries" and "public library" 
are usually the same thing.
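
To make the stemming suggestion concrete, here is a rough Python sketch of stemming tokens before counting bi-grams. The `naive_stem` rule is a deliberately crude, made-up plural stripper used only for illustration; a real stemmer (Porter, for example) covers far more cases, and nothing here reflects how PDF2TXT is written.

```python
import re
from collections import Counter

def naive_stem(word):
    """Strip common English plural endings; a toy stand-in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def bigram_counts(text, stem=True):
    """Count adjacent word pairs, optionally stemming each word first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if stem:
        tokens = [naive_stem(token) for token in tokens]
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical sample text.
text = "Public libraries serve cities. A public library serves a city."
print(bigram_counts(text, stem=False)[("public", "library")])
print(bigram_counts(text, stem=True)[("public", "library")])
```

With stemming switched on, both spellings collapse into the single bi-gram ("public", "library"), which is the merging behaviour asked about.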


Peter


--
Peter Murray
Assistant Director, Technology Services Development
LYRASIS
peter.mur...@lyrasis.org
+1 678-235-2955
800.999.8558 x2955


Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Sean Hannan
Very cool.

But, why only for a limited period of time?

-Sean



Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Eric Lease Morgan
On Oct 11, 2013, at 11:57 AM, Peter Murray peter.mur...@lyrasis.org wrote:

 For a limited period of time I am making publicly available a Web-based 
 program called PDF2TXT -- http://bit.ly/1bJRyh8
 
 Very neat.  I couldn't get the 'network diagram' link to work (from 
 http://dh.crc.nd.edu/sandbox/pdf2txt/pdf2txt.cgi?cmd=search&id=1381506693&query=public%20library).
 How hard do you think it would be to do stemming before some of the 
 subsequent processing?  The bi-grams "public libraries" and "public library" 
 are usually the same thing.

Peter, alas, the network diagram is not functional, yet. Stemming beforehand, 
hmmm… I suppose that is possible. --Eric Morgan


Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Eric Lease Morgan
On Oct 11, 2013, at 12:57 PM, Sean Hannan shan...@jhu.edu wrote:

 For a limited period of time I am making publicly available a Web-based
 program called PDF2TXT -- http://bit.ly/1bJRyh8
 
 Very cool. But, why only for a limited period of time?


Sean, thank you for your support. I'm making it available temporarily because 
allowing readers to upload files to my server is always a scary proposition. 
--ELM


Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Matthew Sherman
Very slick, good work.  I can see where this tool can be very helpful.  It
does have some issues with some characters, but this is rather common with
most systems.





Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Eric Lease Morgan
On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com wrote:

 For a limited period of time I am making publicly available a Web-based
 program called PDF2TXT -- http://bit.ly/1bJRyh8
 
 Very slick, good work.  I can see where this tool can be very helpful.  It
 does have some issues with some characters, but this is rather common with
 most systems.

Again, thank you for the support. Yes, there are some escaping issues to be 
resolved. Release early. Release often. I need help with the graphic design 
in general. 

Here's an enhancement I thought of:

  1. allow readers to authenticate
  2. allow readers to upload documents
  3. documents get saved in readers' cache
  4. allow interface to list documents in the cache
  5. provide text mining services against reader-selected documents
  6. go to Step #1

It would also be cool if I could figure out how to finish the installation of 
Tesseract to enable OCRing. [1]

[1] OCRing - 
http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html

--
Eric Morgan


Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Mark Pernotto
Very cool tool, thank you!

Putting my devil's advocate hat on, it doesn't parse foreign documents well
(I got it to break!).  I also got inconsistent results feeding it PDF files
with tables embedded (but haven't been able to figure out what it is about
them it doesn't like).

Just from a curiosity standpoint, what encoding is being utilized?  I know
nothing about Perl.  It seemed to have no problem parsing a dash (-) if it
was up against another character (2007-2012), but barfs when it's by itself
(2007 – 2012). I'm only referring to 'extracted text' mode.
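
One plausible explanation for the dash behaviour, offered as a guess rather than a diagnosis of PDF2TXT: the attached dash is a plain ASCII hyphen, while the free-standing one is an en dash stored in a single-byte legacy encoding such as Windows-1252 and then read as UTF-8. The Python sketch below reproduces that effect under those assumptions.

```python
# Assumption: the source text was encoded as Windows-1252 (cp1252),
# where an en dash (U+2013) is the single byte 0x96.
raw = "2007 \u2013 2012".encode("cp1252")

# Reading those bytes as UTF-8 fails on 0x96; with errors="replace"
# the decoder substitutes U+FFFD, the replacement character.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were actually written in
# recovers the en dash.
right = raw.decode("cp1252")

# An ASCII hyphen is the same byte in both encodings, so it survives.
hyphen = "2007-2012".encode("cp1252").decode("utf-8")

print(repr(wrong))
print(repr(right))
print(repr(hyphen))
```

If this is what is happening, the fix is to decode the extracted bytes with the correct encoding before any further processing.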

If it helps, I can send along *most* of my test PDF files used.

Thank you!
.m








Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Andrew Cunningham
You may want to consider how best to handle PDF files where the text would
contain ligatures and glyph ids rather than the underlying characters.
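
For the ligature half of this, a small Python sketch shows one common approach: Unicode compatibility normalization (NFKC) expands ligature code points such as U+FB01 into their component letters. This only helps when the PDF's text layer maps glyphs to Unicode presentation forms; raw glyph ids with no Unicode mapping cannot be recovered this way. The helper name `expand_ligatures` and the sample string are made up for the example.

```python
import unicodedata

def expand_ligatures(text):
    """Apply NFKC normalization, which decomposes ligature code points."""
    return unicodedata.normalize("NFKC", text)

# Hypothetical extracted text containing the fi, ffi, and fl ligatures.
extracted = "The \ufb01rst o\ufb03ce \ufb02oor"
print(expand_ligatures(extracted))
```

NFKC also folds other compatibility characters (full-width forms, superscripts), so it suits search and text mining better than faithful reproduction of the original text.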

A.



Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Andrew Cunningham
Hi Mark,

I suspect the tool will only be able to handle select languages, and it is
very doubtful you could develop a tool to handle non-LCG text.

For a fully internationalised tool, you would have to ignore all text
layers in a PDF and run all PDFs through OCR to generate text.

Then you'd need to apply very sophisticated word boundary identification
routines.

A.



Re: [CODE4LIB] pdf2txt

2013-10-11 Thread Andrew Cunningham
Perl has its own encoding model: strings could be Unicode or a legacy
encoding. Unicode is indicated by the presence of a flag on a
string, but it's decided on a string-by-string basis.

If it is a legacy encoding, then it could be any legacy encoding.
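
The "could be any legacy encoding" problem is usually handled by decoding at the input boundary with an ordered list of candidate encodings. The sketch below uses Python rather than Perl, purely for illustration; the function name `decode_bytes` and the candidate list are assumptions, and guessing of this kind can silently pick the wrong legacy encoding.

```python
def decode_bytes(raw, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so it never fails;
    # it is a last resort, not a claim about the real encoding.
    return raw.decode("latin-1"), "latin-1"

print(decode_bytes("caf\u00e9".encode("utf-8")))
print(decode_bytes(b"caf\xe9"))
print(decode_bytes(b"\x81"))
```

Trying strict UTF-8 first works because text in a legacy single-byte encoding rarely happens to be valid UTF-8, whereas the reverse mistake is easy to make.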

If your data is truly multilingual, multiscript and in a variety of
encodings, it becomes a challenge to manage it in Perl.

In our own projects we found Perl modules to be inadequate and needed our
own internal modules to handle encoding issues, especially when you factor in
the fact that some CPAN modules have the nasty habit of stripping the
Unicode flag from strings.

Although that said, Perl still has better Unicode support than most
languages.

A.