Re: [CODE4LIB] indexing pdf files
Eric Morgan wrote: http://infomotions.com/highlights/

Rosalyn Metz wrote: I have librarians that would kill for this. In fact I was talking to one about it the other day. She felt there must be a way to handle active reading and make it portable. This would be great in conjunction with RefWorks or Zotero or something along those lines.

Yep, when I was creating this application for myself I wondered what it would be like if a whole group -- say, an academic department -- were to systematically contribute to such a thing. I think the output would be pretty exciting.

Mark A. Matienzo wrote: Have you considered using Solr's ExtractingRequestHandler [1] for the PDFs? We're using it at NYPL with pretty great success. [1] http://wiki.apache.org/solr/ExtractingRequestHandler

Nope, I hadn't seen that before. Thanks for the pointer.

Peter Kiraly wrote: I would like to suggest an API for extracting text (including highlighted or annotated text) from PDF: iText (http://www.lowagie.com/iText/). This is a Java API (with a C# port), and it helped me a lot when we worked with extraordinary PDF files.

More tools! Thank you.

Danielle Plumer wrote: My (much more primitive) version of the same thing involves reading and annotating articles using my Tablet PC. Although I do get a variety of print publications, I find I don't tend to annotate them as much anymore. I used to use EndNote to do the metadata, then I switched to Zotero. I hadn't thought to try to create a full-text search of the articles -- hmm.

Yes, for a growing number of the tools I create I need to be thinking about Zotero as a way of remembering content. Thanks for... reminding me.

Erik Hatcher wrote: Here's a post on how easy it is to send PDF documents to Solr from Java: http://www.lucidimagination.com/blog/2009/09/14/posting-rich-documents-to-apache-solr-using-solrj-and-solr-cell-apache-tika/

I'm looking forward to the arrival of my Solr books any day now. After reading them I hope to have a better handle on the guts of Solr, as well as to increase my ability to do the sorts of things discussed at the URL above. Thank you, one and all, for your replies. -- Eric Morgan
Re: [CODE4LIB] indexing pdf files
We're just talking about creating an index, not a separate copy of the works, right? Because I imagine that copyright has a lot to do with why this type of thing doesn't already exist.

On Wed, Sep 16, 2009 at 3:08 PM, Eric Lease Morgan emor...@nd.edu wrote: [Eric's message quoted in full; trimmed -- see above]

-- Cindy Harper, Systems Librarian Colgate University Libraries char...@colgate.edu 315-228-7363
Re: [CODE4LIB] indexing pdf files
On Sep 16, 2009, at 4:01 PM, Cindy Harper wrote: http://infomotions.com/highlights/ We're just talking about creating an index, not a separate copy of the works, right? Because I imagine that copyright has a lot to do with why this type of thing doesn't already exist.

No, not just an index, but usually the real thing as well. The stuff I read falls roughly into two categories: 1) open to the public, and 2) restricted.

In my case, the former includes printed articles from Wikipedia, articles from things like D-Lib Magazine, and reports from things like the DLF intended for public review and consumption. All of these things I read, highlight, annotate, and eventually plan to place in my system. Search my system. Find an article. Download the original and/or download the annotated version. The choice is the user's.

Items from the second category include closed-access articles, articles from Encyclopedia Britannica, internal library reports not intended for outside readers, etc. These items will be read, annotated, and lightly reviewed -- just like the other items -- but they will be saved in a restricted space. The user will (usually) be given a URL to the original document, and if they are authorized, they will be able to get the item.

Even though I believe my annotations constitute a derivative work -- like The Annotated Alice -- I don't have the chutzpah to make them freely available. In my system I plan to add value to the articles and redistribute them, as well as provide links to the originals. -- Eric Lease Morgan
[CODE4LIB] indexing pdf files
I have been having fun recently indexing PDF files. For the past six months or so I have been keeping the articles I've read in a pile, and I was rather amazed at the size of the pile. It was about a foot tall. When I read these articles I actively read them -- meaning, I write, scribble, highlight, and annotate the text with my own special notation denoting names, keywords, definitions, citations, quotations, list items, examples, etc. This active reading process: 1) makes for better comprehension on my part, and 2) makes the articles easier to review and pick out the ideas I thought were salient. Being the librarian I am, I thought it might be cool (kewl) to make the articles into a collection. Thus, the beginnings of Highlights Annotations: A Value-Added Reading List. The techno-weenie process for creating and maintaining the content is something this community might find interesting:

1. Print the article and read it actively.
2. Convert the printed article into a PDF file -- complete with embedded OCR -- with my handy-dandy ScanSnap scanner. [1]
3. Use MyLibrary to create metadata (author, title, date published, date read, note, keywords, facet/term combinations, local and remote URLs, etc.) describing the article. [2]
4. Save the PDF to my file system.
5. Use pdftotext to extract the OCRed text from the PDF and index it along with the MyLibrary metadata using Solr. [3, 4]
6. Provide a searchable/browsable user interface to the collection through a mod_perl module. [5, 6]

Software is never done, and if it were then it would be called hardware. Accordingly, I know there are some things I need to do before I can truly deem the system version 1.0. At the same time my excitement is overflowing and I thought I'd share some geekdom with my fellow hackers. Fun with PDF files and open source software.
[1] ScanSnap - http://tinyurl.com/oafgwe
[2] MyLibrary screen dump - http://infomotions.com/tmp/mylibrary.png
[3] pdftotext - http://www.foolabs.com/xpdf/
[4] Solr - http://lucene.apache.org/solr/
[5] module source code - http://infomotions.com/highlights/Highlights.pl
[6] user interface - http://infomotions.com/highlights/highlights.cgi

-- Eric Lease Morgan Head, Digital Access and Information Architecture Department Hesburgh Libraries, University of Notre Dame (574) 631-8604
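[Step 5 of the workflow above -- extracting the OCRed text with pdftotext and combining it with descriptive metadata for Solr -- can be sketched in a few lines. This is a minimal illustration, not the actual Highlights code: the field names, file names, and document id are assumptions, and the real system uses MyLibrary and a mod_perl module rather than Python.]

```python
import subprocess

def extract_text(pdf_path):
    """Extract the embedded OCR text from a PDF using the xpdf
    pdftotext tool; the "-" argument sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def build_solr_doc(doc_id, metadata, fulltext):
    """Combine MyLibrary-style metadata with the extracted full text
    into one dictionary ready to post to Solr's update handler.
    All field names here are illustrative assumptions."""
    doc = {"id": doc_id, "fulltext": fulltext}
    doc.update(metadata)
    return doc

if __name__ == "__main__":
    text = extract_text("article.pdf")  # hypothetical scanned article
    doc = build_solr_doc(
        "highlights-001",  # hypothetical identifier
        {"title": "Indexing PDF files", "keyword": ["solr", "pdf"]},
        text,
    )
    # ...then POST `doc` to Solr's update handler to index it...
```

[The point of merging metadata and full text into one document is that a single Solr query can then match on either, which is what makes the collection searchable by both description and content.]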
Re: [CODE4LIB] indexing pdf files
Eric, I have librarians that would kill for this. In fact I was talking to one about it the other day. She felt there must be a way to handle active reading and make it portable. This would be great in conjunction with RefWorks or Zotero or something along those lines. Rosalyn

On Tue, Sep 15, 2009 at 9:31 AM, Eric Lease Morgan emor...@nd.edu wrote: [original post quoted in full; trimmed -- see above]
Re: [CODE4LIB] indexing pdf files
Eric,

5. Use pdftotext to extract the OCRed text from the PDF and index it along with the MyLibrary metadata using Solr. [3, 4]

Have you considered using Solr's ExtractingRequestHandler [1] for the PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler

Mark A. Matienzo Applications Developer, Digital Experience Group The New York Public Library
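[For the curious, the ExtractingRequestHandler (Solr Cell) accepts a PDF directly and runs Tika server-side, so no local pdftotext step is needed. A minimal sketch, assuming a local Solr instance at the default port and an illustrative document id; the /update/extract path is the handler's conventional mount point, but your solrconfig.xml may differ:]

```python
from urllib.parse import urlencode

SOLR_URL = "http://localhost:8983/solr"  # assumed local Solr instance

def extract_request_url(doc_id, commit=True):
    """Build a URL for Solr's ExtractingRequestHandler. The literal.*
    parameters are passed through as stored fields on the document."""
    params = {"literal.id": doc_id, "commit": str(commit).lower()}
    return SOLR_URL + "/update/extract?" + urlencode(params)

if __name__ == "__main__":
    import urllib.request
    url = extract_request_url("highlights-001")  # hypothetical id
    with open("article.pdf", "rb") as f:         # hypothetical article
        req = urllib.request.Request(
            url, data=f.read(),
            headers={"Content-Type": "application/pdf"})
    # Solr/Tika extracts the text server-side and indexes it:
    # urllib.request.urlopen(req)
```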
Re: [CODE4LIB] indexing pdf files
Hi all,

I would like to suggest an API for extracting text (including highlighted or annotated text) from PDF: iText (http://www.lowagie.com/iText/). This is a Java API (with a C# port), and it helped me a lot when we worked with extraordinary PDF files.

Solr uses Tika (http://lucene.apache.org/tika) for extracting text from documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/) to extract text from PDF files. PDFBox is a great tool for normal PDF files, but it has (or at least had) some shortcomings I wasn't satisfied with:

- it consumed more memory than iText, and couldn't read files above a given size (the limit was large, about 1 GB, but we had even larger files)
- it couldn't correctly handle conditional hyphens at the ends of lines
- it had poorer documentation than iText, and its API was also poorer (by that time Manning had published the iText in Action book)

Our PDF files were double-layered (original hi-res image + OCRed text) documents several thousand pages long (Hungarian scientific journals, the diary of the Houses of Parliament from the 19th century, etc.). We indexed the content with Lucene, and in the UI we showed one page per screen, so the user didn't need to download the full PDF. We extracted the table of contents from the PDF as well and implemented it in the web UI, so the user can browse pages according to the full file's TOC.

This project happened two years ago, so it is possible that a lot has changed since then.

Király Péter http://eXtensibleCatalog.org

----- Original Message ----- From: Mark A. Matienzo m...@matienzo.org To: CODE4LIB@LISTSERV.ND.EDU Sent: Tuesday, September 15, 2009 3:56 PM Subject: Re: [CODE4LIB] indexing pdf files [Mark's message quoted in full; trimmed -- see above]
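[The conditional-hyphen problem Péter mentions can also be patched up after extraction. A naive sketch of such a post-processing pass -- note it will wrongly join genuinely hyphenated compounds that happen to break at a line end, which is exactly why proper extractor support matters:]

```python
import re

def dehyphenate(text):
    """Rejoin words split by a hyphen at a line break in extracted
    text: 'exam-\\nple' becomes 'example'. Hyphens not followed by a
    newline are left alone."""
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)

# Example: dehyphenate("exam-\nple") returns "example", while
# "well-known usage" passes through unchanged.
```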
Re: [CODE4LIB] indexing pdf files
My (much more primitive) version of the same thing involves reading and annotating articles using my Tablet PC. Although I do get a variety of print publications, I find I don't tend to annotate them as much anymore. I used to use EndNote to do the metadata, then I switched to Zotero. I hadn't thought to try to create a full-text search of the articles -- hmm.

-- Danielle Cunniff Plumer, Coordinator Texas Heritage Digitization Initiative Texas State Library and Archives Commission 512.463.5852 (phone) / 512.936.2306 (fax) dplu...@tsl.state.tx.us dcplu...@gmail.com

On Tue, Sep 15, 2009 at 8:31 AM, Eric Lease Morgan emor...@nd.edu wrote: [original post quoted in full; trimmed -- see above]
Re: [CODE4LIB] indexing pdf files
Here's a post on how easy it is to send PDF documents to Solr from Java: http://www.lucidimagination.com/blog/2009/09/14/posting-rich-documents-to-apache-solr-using-solrj-and-solr-cell-apache-tika/

Not only can you post PDF (and other rich content) files to Solr for indexing; you can also, as shown in that blog entry, extract the text from such files and have it returned to the client. This Solr capability makes the tool chain a bit simpler.

Erik

On Sep 15, 2009, at 10:31 AM, Peter Kiraly wrote: [Péter's message quoted in full; trimmed -- see above]
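[The extract-without-indexing capability Erik describes is driven by Solr Cell's extractOnly parameter: the response carries the Tika-extracted text back to the client instead of committing anything to the index. A minimal sketch, assuming a local Solr instance at the default port:]

```python
import urllib.request

SOLR_EXTRACT = "http://localhost:8983/solr/update/extract"  # assumed

def extract_only_url(base=SOLR_EXTRACT):
    """URL asking Solr Cell to return extracted text to the client
    rather than indexing the posted document."""
    return base + "?extractOnly=true"

def extract_text_via_solr(pdf_path):
    """POST a PDF and return Solr's raw response body (XML by
    default), which contains the Tika-extracted text."""
    with open(pdf_path, "rb") as f:
        req = urllib.request.Request(
            extract_only_url(), data=f.read(),
            headers={"Content-Type": "application/pdf"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```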