Re: [CODE4LIB] Open Data Specialist
Nate, Sounds cool! For the collection development angle (and any ancillary data curation aspects) I'd suggest checking with IASSIST [1]. You might also see their jobs repository [2], a record of job descriptions posted to the members' email list from 2005 to the present. The OKFN might be another good resource, for instance looking through personnel descriptions [3]. Since you're in a public library context, checking with *municipalities* that are leading the open data movement might also make sense. So one question would be: which *local governments* are working on open data? One is Rotterdam, in the Netherlands [4][5]; I don't know anybody there but you might ping them. Anybody know of other *local* open data projects? -Jodi [1] http://www.iassistdata.org/about/index.html [2] http://www.iassistdata.org/resources/jobs/all [3] http://okfn.org/about/team/ [4] http://www.almende.com/rod-2.0 [5] http://www.rotterdamopendata.nl/ On Tue, Oct 8, 2013 at 1:49 AM, Nate Hill nathanielh...@gmail.com wrote: Thanks Ranti! I think this is more of an academic library position, right? If anyone on the list has a data services librarian in their world I'd love to speak to them. N On Monday, October 7, 2013, Ranti Junus wrote: Nate, For classic collection development activities, you might want to explore job descriptions for Data Services librarians. ranti. On Mon, Oct 7, 2013 at 3:47 PM, Nate Hill nathanielh...@gmail.com wrote: Thanks Toby... that is exactly where I'm starting to look. This is a great resource: http://project-open-data.github.io/cdo/ What is complicated is finding anything that relates this to classic collection development activities. I might have to just make that up! If you see anything out there, let me know! Cheers On Mon, Oct 7, 2013 at 3:45 PM, Toby Greenwalt theanalogdiv...@gmail.com wrote: Nate, I'm guessing you're venturing into uncharted territory - at least as the library field is concerned. 
It's more likely that you'll find more relevant descriptions in either the urban planning or journalism fields. Let me poke around and see if I can dig something up. Toby On Mon, Oct 7, 2013 at 2:40 PM, Nate Hill nathanielh...@gmail.com wrote: Hi all, I'm working on a job description for an Open Data Specialist at my public library. This person would work for the public library as a builder/maintainer for our open data portal (currently an instance of DKAN) which serves civic (and eventually other) public domain data. They would also be an open data evangelist and expert, working with other city departments to get/keep them involved as contributors of useful, useable data, etc. I'd also like to highlight the collection development-like aspects of the job. Has anyone seen a similar job description? Thanks Nate -- Nate Hill nathanielh...@gmail.com http://4thfloor.chattlibrary.org/ http://www.natehill.net -- Nate Hill nathanielh...@gmail.com http://4thfloor.chattlibrary.org/ http://www.natehill.net -- Bulk mail. Postage paid. -- Nate Hill nathanielh...@gmail.com http://4thfloor.chattlibrary.org/ http://www.natehill.net
Re: [CODE4LIB] What can be done to stop deleting of records belonging to users of our Minuteman Library Network in Massachusetts?
The question is how the accounts are being deleted. The only ways I can think of require staff action -- either deleting them directly or overlaying with bad data. While it's conceivable that something else is going on, it's more likely that you are dealing with a careless or disgruntled current or former staff member. kyle On Thu, Oct 10, 2013 at 5:45 PM, don warner saklad warnersak...@gmail.com wrote: Any other online forums, groups, email lists about difficulties with Innovative Interfaces software?... Innovative Interfaces Incorporated http://www.iii.com/ is the Integrated Library System http://en.wikipedia.org/wiki/Integrated_library_system provider for Minuteman Library Network http://www.mln.lib.ma.us/about/about.htm Webmaster scripted replies to concerns fail, aren't responsive. Libraries' attempts fail, give up attempting to resolve concerns about software. Users' records get deleted. No notification before some entries get deleted at My Lists http://www.mln.lib.ma.us/catalog/faq_account.htm#ma50 On Thu, Oct 10, 2013 at 7:49 PM, Kyle Banerjee kyle.baner...@gmail.com wrote: I thought you guys have Millennium. If that is correct, you won't be able to change the behavior of the system and the only thing you can do is revoke delete permissions for whoever is doing it. kyle | What can be done to stop deleting of records belonging to users of our Minuteman Library Network in Massachusetts? Or at least notification needs to be made before deleting.
Re: [CODE4LIB] What can be done to stop deleting of records belonging to users of our Minuteman Library Network in Massachusetts?
I work at a Minuteman Library and I have been in touch with Mr. Saklad offlist. Accounts are not being deleted by the careless or disgruntled. We do have an annual process for deleting inactive accounts based on long-established criteria. What Mr. Saklad is observing is the effect of the deletion of bibliographic records in our system. He asserts that a list of titles which he created before the deletion of these bib records should remain accessible to him after record deletion. I myself am no database professional or policy setter, so I am not qualified to speak to the actual or ideal merit of his assertion, but I am assured by my technical staff that the titles and other associated metadata are no longer accessible based on the deleted bib record number. Based on our previous interactions with Mr. Saklad we have elected not to bring his comments forward to Innovative Interfaces, nor to make changes to our policies or processes. Matt Amory (917) 771-4157 matt.am...@gmail.com http://www.linkedin.com/pub/matt-amory/8/515/239 On Fri, Oct 11, 2013 at 9:01 AM, Kyle Banerjee kyle.baner...@gmail.com wrote: The question is how the accounts are being deleted. The only ways I can think of require staff action -- either deleting them directly or overlaying with bad data. While it's conceivable that something else is going on, it's more likely that you are dealing with a careless or disgruntled current or former staff member. kyle On Thu, Oct 10, 2013 at 5:45 PM, don warner saklad warnersak...@gmail.com wrote: Any other online forums, groups, email lists about difficulties with Innovative Interfaces software?... Innovative Interfaces Incorporated http://www.iii.com/ is the Integrated Library System http://en.wikipedia.org/wiki/Integrated_library_system provider for Minuteman Library Network http://www.mln.lib.ma.us/about/about.htm Webmaster scripted replies to concerns fail, aren't responsive. 
Libraries' attempts fail, give up attempting to resolve concerns about software. Users' records get deleted. No notification before some entries get deleted at My Lists http://www.mln.lib.ma.us/catalog/faq_account.htm#ma50 On Thu, Oct 10, 2013 at 7:49 PM, Kyle Banerjee kyle.baner...@gmail.com wrote: I thought you guys have Millennium. If that is correct, you won't be able to change the behavior of the system and the only thing you can do is revoke delete permissions for whoever is doing it. kyle | What can be done to stop deleting of records belonging to users of our Minuteman Library Network in Massachusetts? Or at least notification needs to be made before deleting.
Re: [CODE4LIB] What can be done to stop deleting of records belonging to users of our Minuteman Library Network in Massachusetts?
a) Forensic studies deal with how to retrieve deleted, unarchived data. So-called deleted data is actually still available.
b) Set up the system not to delete records belonging to users. Let users keep their information saved for follow-up. Or at the very least notify users beforehand.
Example: a title wrongfully deleted. The title is replaced by an 8-character record number (a lowercase letter followed by 7 digits).
| You are logged in https://library.minlib.net/patroninfo~S1/1196412/mylists
My Lists: organize130808 ( 29 ) https://library.minlib.net/patroninfo~S1/1196412/mylists?listNum=31980
Mark | Title | Author | Date Added
[_] | Record b2491348 is not available | | 03-12-2013
[_] | Record b2522793 is not available | | 03-12-2013
[_] | Record b2926646 is not available | | 10-29-2011
[_] | Record b2948837 is not available | | 10-25-2011
[CODE4LIB] pdf2txt
For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 PDF2TXT extracts the text from an OCRed PDF document and then does some rudimentary distant reading against the text in the form of word clouds, readability scores, concordance features, and maps (histograms) illustrating where terms appear in a text. Here is the idea behind the application: 1. In the Libraries I see people scanning, scanning, and scanning. I suppose these people then go home and read the document. They might even print it. These documents are long. Moreover, I'll bet they have multiple documents. 2. Text mining requires digitized text, but PDF documents are frequently full of formatting. At the same time, they often have the text underneath. Our scanning software does OCR. 3. By extracting the text from PDF documents, I can facilitate a different -- additional -- type of analysis against sets of one or more documents. PDF2TXT is the first step in this process. What is really cool is that PDF2TXT works for many of the articles downloadable from the Libraries's article indexes. Search an article index. Download a full text, PDF version of the article. Feed it to PDF2TXT. Get more out of your article. PDF2TXT currently has creeping featuritis -- meaning that it is growing in weird directions. Your feedback is more than welcome. (I know. The output is ugly.) Also, please be gentle with it because it does not process things the size of the Bible. -- Eric Lease Morgan Digital Initiatives Librarian University of Notre Dame Room 131, Hesburgh Libraries Notre Dame, IN 46556 o: 574-631-8604 e: emor...@nd.edu
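For readers curious what "concordance features" and "maps (histograms) illustrating where terms appear in a text" might look like in practice, here is a minimal Python sketch. It is not the PDF2TXT code (which is not shown in this message); the function names tokenize, concordance, and term_histogram are made up for illustration, and the tokenizer is deliberately naive.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def concordance(tokens, term, width=3):
    """Keyword-in-context: `width` tokens on each side of every hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == term:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append(" ".join(left + [tok] + right))
    return hits


def term_histogram(tokens, term, bins=10):
    """Count hits of `term` in each of `bins` equal slices of the text."""
    counts = [0] * bins
    n = len(tokens)
    for i, tok in enumerate(tokens):
        if tok == term:
            # Map the token position onto a bin index in 0..bins-1.
            counts[min(bins - 1, i * bins // n)] += 1
    return counts


if __name__ == "__main__":
    sample = "The public library serves the public and the library lends books"
    words = tokenize(sample)
    print(concordance(words, "library", width=1))
    print(term_histogram(words, "library", bins=2))
```

The histogram is just a list of counts per slice of the document, so it can be handed to any plotting tool, or printed as rows of asterisks.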
[CODE4LIB] Job: Digital Repository Software Developer at Princeton University
Note: this job is in Academic Services at Princeton, not in the Library, though we do work together from time to time. The full posting is here: http://jobs.princeton.edu/applicants/Central?quickFind=64011 Cross-posted. Please excuse any duplicate copies you receive. *Princeton University seeks Digital Repository Software Developer* In September of 2011 the Faculty of Princeton University approved an open access policy intended to make faculty's scholarly articles available to a wider public. Princeton is now in the process of ramping up its efforts to implement the policy. These efforts will include the development of the repository that will hold the scholarly articles. The Office of Information Technology seeks a Digital Repository Software Developer to establish and enhance digital repositories to house academic publications, research data, and related digital assets. The primary focus of the position will be to develop software and systems for collecting and depositing academic journal articles subject to Princeton University's Open Access Policy for Faculty Publications into an open access repository. This repository will enhance both the preservation and dissemination of scholarship at Princeton. The Digital Repository Software Developer will report to the Digital Repository Architect and will work closely with the University's Scholarly Communications Librarian and other IT and Library staff. -- Jon Stroop Digital Initiatives Programmer/Analyst Princeton University Library jstr...@princeton.edu
Re: [CODE4LIB] pdf2txt
Very neat. I couldn't get the 'network diagram' link to work (from http://dh.crc.nd.edu/sandbox/pdf2txt/pdf2txt.cgi?cmd=searchid=1381506693query=public%20library). How hard do you think it would be to do stemming before some of the subsequent processing? The bi-grams "public libraries" and "public library" are usually the same thing. Peter On Oct 11, 2013, at 11:16 AM, Eric Lease Morgan emor...@nd.edu wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 PDF2TXT extracts the text from an OCRed PDF document and then does some rudimentary distant reading against the text in the form of word clouds, readability scores, concordance features, and maps (histograms) illustrating where terms appear in a text. Here is the idea behind the application: 1. In the Libraries I see people scanning, scanning, and scanning. I suppose these people then go home and read the document. They might even print it. These documents are long. Moreover, I'll bet they have multiple documents. 2. Text mining requires digitized text, but PDF documents are frequently full of formatting. At the same time, they often have the text underneath. Our scanning software does OCR. 3. By extracting the text from PDF documents, I can facilitate a different -- additional -- type of analysis against sets of one or more documents. PDF2TXT is the first step in this process. What is really cool is that PDF2TXT works for many of the articles downloadable from the Libraries's article indexes. Search an article index. Download a full text, PDF version of the article. Feed it to PDF2TXT. Get more out of your article. PDF2TXT currently has creeping featuritis -- meaning that it is growing in weird directions. Your feedback is more than welcome. (I know. The output is ugly.) Also, please be gentle with it because it does not process things the size of the Bible. 
-- Eric Lease Morgan Digital Initiatives Librarian University of Notre Dame Room 131, Hesburgh Libraries Notre Dame, IN 46556 o: 574-631-8604 e: emor...@nd.edu -- Peter Murray Assistant Director, Technology Services Development LYRASIS peter.mur...@lyrasis.org +1 678-235-2955 800.999.8558 x2955
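To make the stemming suggestion concrete, here is a rough Python sketch of folding "public libraries" and "public library" into one bi-gram before counting. The crude_stem function is a toy plural stripper invented for this example, not a real stemmer; a production tool would more likely use an established algorithm such as Porter's. The function names are assumptions, not anything from PDF2TXT.

```python
import re
from collections import Counter


def crude_stem(word):
    """Toy plural stripper for illustration only; it will over-stem some words."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"          # libraries -> library
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]                # books -> book
    return word


def bigram_counts(text, stem=True):
    """Count adjacent word pairs, optionally stemming each word first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    return Counter(zip(tokens, tokens[1:]))


if __name__ == "__main__":
    sample = "Public libraries matter. A public library lends books."
    print(bigram_counts(sample, stem=False).most_common(3))
    print(bigram_counts(sample, stem=True).most_common(3))
```

Stemming before building the bi-grams (rather than after) means the counts merge naturally, at the cost of displaying stems instead of the surface forms a reader actually saw.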
Re: [CODE4LIB] pdf2txt
Very cool. But, why only for a limited period of time? -Sean On 10/11/13 11:16 AM, Eric Lease Morgan emor...@nd.edu wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 PDF2TXT extracts the text from an OCRed PDF document and then does some rudimentary distant reading against the text in the form of word clouds, readability scores, concordance features, and maps (histograms) illustrating where terms appear in a text. Here is the idea behind the application: 1. In the Libraries I see people scanning, scanning, and scanning. I suppose these people then go home and read the document. They might even print it. These documents are long. Moreover, I'll bet they have multiple documents. 2. Text mining requires digitized text, but PDF documents are frequently full of formatting. At the same time, they often have the text underneath. Our scanning software does OCR. 3. By extracting the text from PDF documents, I can facilitate a different -- additional -- type of analysis against sets of one or more documents. PDF2TXT is the first step in this process. What is really cool is that PDF2TXT works for many of the articles downloadable from the Libraries's article indexes. Search an article index. Download a full text, PDF version of the article. Feed it to PDF2TXT. Get more out of your article. PDF2TXT currently has creeping featuritis -- meaning that it is growing in weird directions. Your feedback is more than welcome. (I know. The output is ugly.) Also, please be gentle with it because it does not process things the size of the Bible. -- Eric Lease Morgan Digital Initiatives Librarian University of Notre Dame Room 131, Hesburgh Libraries Notre Dame, IN 46556 o: 574-631-8604 e: emor...@nd.edu
Re: [CODE4LIB] pdf2txt
On Oct 11, 2013, at 11:57 AM, Peter Murray peter.mur...@lyrasis.org wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 Very neat. I couldn't get the 'network diagram' link to work (from http://dh.crc.nd.edu/sandbox/pdf2txt/pdf2txt.cgi?cmd=searchid=1381506693query=public%20library). How hard do you think it would be to do stemming before some of the subsequent processing? The bi-grams "public libraries" and "public library" are usually the same thing. Peter, alas, the network diagram is not functional, yet. Stemming beforehand, hmmm… I suppose that is possible. --Eric Morgan
Re: [CODE4LIB] pdf2txt
On Oct 11, 2013, at 12:57 PM, Sean Hannan shan...@jhu.edu wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 Very cool. But, why only for a limited period of time? Sean, thank you for your support. I'm making it available temporarily because allowing readers to upload files to my server is always a scary proposition. --ELM
Re: [CODE4LIB] pdf2txt
Very slick, good work. I can see where this tool can be very helpful. It does have some issues with some characters, but this is rather common with most systems. On Fri, Oct 11, 2013 at 11:16 AM, Eric Lease Morgan emor...@nd.edu wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 PDF2TXT extracts the text from an OCRed PDF document and then does some rudimentary distant reading against the text in the form of word clouds, readability scores, concordance features, and maps (histograms) illustrating where terms appear in a text. Here is the idea behind the application: 1. In the Libraries I see people scanning, scanning, and scanning. I suppose these people then go home and read the document. They might even print it. These documents are long. Moreover, I'll bet they have multiple documents. 2. Text mining requires digitized text, but PDF documents are frequently full of formatting. At the same time, they often have the text underneath. Our scanning software does OCR. 3. By extracting the text from PDF documents, I can facilitate a different -- additional -- type of analysis against sets of one or more documents. PDF2TXT is the first step in this process. What is really cool is that PDF2TXT works for many of the articles downloadable from the Libraries's article indexes. Search an article index. Download a full text, PDF version of the article. Feed it to PDF2TXT. Get more out of your article. PDF2TXT currently has creeping featuritis -- meaning that it is growing in weird directions. Your feedback is more than welcome. (I know. The output is ugly.) Also, please be gentle with it because it does not process things the size of the Bible. 
-- Eric Lease Morgan Digital Initiatives Librarian University of Notre Dame Room 131, Hesburgh Libraries Notre Dame, IN 46556 o: 574-631-8604 e: emor...@nd.edu
Re: [CODE4LIB] pdf2txt
On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 Very slick, good work. I can see where this tool can be very helpful. It does have some issues with some characters, but this is rather common with most systems. Again, thank you for the support. Yes, there are some escaping issues to be resolved. Release early. Release often. I need help with the graphic design in general. Here's an enhancement I thought of: 1. allow readers to authenticate 2. allow readers to upload documents 3. documents get saved in readers' cache 4. allow interface to list documents in the cache 5. provide text mining services against reader-selected documents 6. go to Step #1 It would also be cool if I could figure out how to finish the installation of Tesseract to enable OCRing. [1] [1] OCRing - http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html -- Eric Morgan
Re: [CODE4LIB] pdf2txt
Very cool tool, thank you! Putting my devil's advocate hat on, it doesn't parse foreign documents well (I got it to break!). I also got inconsistent results feeding it PDF files with tables embedded (but haven't been able to figure out what it is about them it doesn't like). Just from a curiosity standpoint, what encoding is being utilized? I know nothing about Perl. It seemed to have no problem parsing a dash (-) if it was up against another character (2007-2012), but barfs when it's by itself (2007 � 2012). I'm only referring to 'extracted text' mode. If it helps, I can send along *most* of my test PDF files used. Thank you! .m On Fri, Oct 11, 2013 at 10:58 AM, Eric Lease Morgan emor...@nd.edu wrote: On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 Very slick, good work. I can see where this tool can be very helpful. It does have some issues with some characters, but this is rather common with most systems. Again, thank you for the support. Yes, there are some escaping issues to be resolved. Release early. Release often. I need help with the graphic design in general. Here's an enhancement I thought of: 1. allow readers to authenticate 2. allow readers to upload documents 3. documents get saved in readers' cache 4. allow interface to list documents in the cache 5. provide text mining services against reader-selected documents 6. go to Step #1 It would also be cool if I could figure out how to finish the installation of Tesseract to enable OCRing. [1] [1] OCRing - http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html -- Eric Morgan
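One common cause of the symptom Mark describes (a plain hyphen survives, a spaced dash turns into a replacement character) is a typographic en dash stored in a single-byte encoding and then decoded as UTF-8. Whether that is what actually happens inside PDF2TXT is an assumption; the Python sketch below only reproduces the symptom.

```python
EN_DASH = "\u2013"  # the typographic dash often used in ranges like "2007 - 2012"

# Encode the range in Windows-1252, where the en dash is the single byte 0x96.
raw = f"2007 {EN_DASH} 2012".encode("cp1252")

# Decoding those bytes as UTF-8 fails on 0x96; errors="replace" substitutes
# U+FFFD, the replacement character.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the encoding the bytes were actually written in works.
right = raw.decode("cp1252")

# An ASCII hyphen is the same byte in both encodings, so it is unaffected.
plain = "2007-2012".encode("cp1252").decode("utf-8")

if __name__ == "__main__":
    print(repr(wrong))
    print(repr(right))
    print(repr(plain))
```

If this is the cause, the fix is usually to find out which encoding the extraction step emits and decode with that, rather than to patch individual characters afterwards.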
Re: [CODE4LIB] pdf2txt
You may want to consider how best to handle PDF files where the text would contain ligatures and glyph ids rather than the underlying characters. A. On 12/10/2013 4:58 AM, Eric Lease Morgan emor...@nd.edu wrote: On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 Very slick, good work. I can see where this tool can be very helpful. It does have some issues with some characters, but this is rather common with most systems. Again, thank you for the support. Yes, there are some escaping issues to be resolved. Release early. Release often. I need help with the graphic design in general. Here's an enhancement I thought of: 1. allow readers to authenticate 2. allow readers to upload documents 3. documents get saved in readers' cache 4. allow interface to list documents in the cache 5. provide text mining services against reader-selected documents 6. go to Step #1 It would also be cool if I could figure out how to finish the installation of Tesseract to enable OCRing. [1] [1] OCRing - http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html -- Eric Morgan
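For the ligature half of this point, one possible approach is Unicode compatibility normalization, which expands presentation-form ligatures such as U+FB01 into their underlying letters. The Python sketch below uses the standard unicodedata module; the function name expand_ligatures is made up here. It does nothing for PDFs that expose only glyph ids without a mapping back to characters, which is a separate and harder problem.

```python
import unicodedata


def expand_ligatures(text):
    """Apply NFKC normalization, which decomposes compatibility ligatures."""
    # NFKC also folds other compatibility characters (e.g. superscript digits),
    # so apply it only where that loss of distinction is acceptable.
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    extracted = "The o\ufb03ce \ufb01led the \ufb02oor plans."
    print(expand_ligatures(extracted))
```

Normalizing before tokenizing matters for text mining because otherwise a word typeset with a ligature and the same word typeset without one count as two different terms.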
Re: [CODE4LIB] pdf2txt
Hi Mark, I suspect the tool will only be able to handle select languages, and it is very doubtful you could develop a tool to handle non-LCG text. For a fully internationalised tool, you would have to ignore all text layers in a PDF and run all PDFs through OCR to generate text. Then you'd need to apply very sophisticated word boundary identification routines. A. On 12/10/2013 9:40 AM, Mark Pernotto mark.perno...@gmail.com wrote: Very cool tool, thank you! Putting my devil's advocate hat on, it doesn't parse foreign documents well (I got it to break!). I also got inconsistent results feeding it PDF files with tables embedded (but haven't been able to figure out what it is about them it doesn't like). Just from a curiosity standpoint, what encoding is being utilized? I know nothing about Perl. It seemed to have no problem parsing a dash (-) if it was up against another character (2007-2012), but barfs when it's by itself (2007 � 2012). I'm only referring to 'extracted text' mode. If it helps, I can send along *most* of my test PDF files used. Thank you! .m On Fri, Oct 11, 2013 at 10:58 AM, Eric Lease Morgan emor...@nd.edu wrote: On Oct 11, 2013, at 1:49 PM, Matthew Sherman matt.r.sher...@gmail.com wrote: For a limited period of time I am making publicly available a Web-based program called PDF2TXT -- http://bit.ly/1bJRyh8 Very slick, good work. I can see where this tool can be very helpful. It does have some issues with some characters, but this is rather common with most systems. Again, thank you for the support. Yes, there are some escaping issues to be resolved. Release early. Release often. I need help with the graphic design in general. Here's an enhancement I thought of: 1. allow readers to authenticate 2. allow readers to upload documents 3. documents get saved in readers' cache 4. allow interface to list documents in the cache 5. provide text mining services against reader-selected documents 6. 
go to Step #1 It would also be cool if I could figure out how to finish the installation of Tesseract to enable OCRing. [1] [1] OCRing - http://serials.infomotions.com/code4lib/archive/2013/201303/1554.html -- Eric Morgan
Re: [CODE4LIB] pdf2txt
Perl has its own encoding model: strings can be either Unicode or in a legacy encoding, and Unicode is indicated by the presence of a flag on the string. So it's decided on a string-by-string basis. If it is a legacy encoding, then it could be any legacy encoding. If your data is truly multilingual, multiscript and in a variety of encodings, it becomes a challenge to manage it in Perl. In our own projects we found the available Perl modules to be inadequate and needed our own internal modules to handle encoding issues, especially when you factor in the fact that some CPAN modules have the nasty habit of stripping the Unicode flag from strings. That said, Perl still has better Unicode support than most languages. A.
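The usual way to cope with input in a variety of encodings is to decode everything to Unicode once, at the point where bytes enter the program, and keep track of which encoding was used. The sketch below shows that idea in Python rather than Perl (Python 3 separates bytes and str by type instead of by a per-string flag). The function name to_text and the candidate list are assumptions for illustration; real multilingual data would need a better-informed guess than trial decoding.

```python
def to_text(data, candidates=("utf-8", "cp1252")):
    """Decode bytes by trying each candidate encoding in order.

    Returns (text, encoding_used). Falls back to latin-1, which maps every
    byte to a code point and therefore never fails (but may be wrong).
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1"), "latin-1"


if __name__ == "__main__":
    for raw in ("caf\u00e9".encode("utf-8"), "caf\u00e9".encode("cp1252")):
        print(to_text(raw))
```

Trying UTF-8 first is a deliberate choice: text in a single-byte legacy encoding rarely happens to be valid UTF-8, whereas almost any byte sequence decodes "successfully" as a single-byte encoding, so the order of candidates matters.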