Re: [CODE4LIB] indexing pdf files

2009-09-16 Thread Eric Lease Morgan

Eric Morgan wrote:


http://infomotions.com/highlights/




Rosalyn Metz wrote:


I have librarians that would kill for this.  In fact I was talking to
one about it the other day.  She felt there must be a way to handle
active reading and make it portable.  This would be great in
conjunction with RefWorks or Zotero or something along those lines.



Yep, when I was creating this application for myself I was wondering  
what it would be like if a whole group, say, an academic department,  
were to systematically contribute to such a thing? I thought the  
output would be pretty exciting.



Mark A. Matienzo wrote:


Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler


Nope, never saw that previously. Thanks for the pointer.


Peter Kiraly wrote:

I would like to suggest an API for extracting text (including highlighted or
annotated text) from PDF: iText (http://www.lowagie.com/iText/).
This is a Java API (it has a C# port), and it helped me a lot when we
worked with extraordinary PDF files.


More tools! Thank you.


danielle plumer wrote:

My (much more primitive) version of the same thing involves reading and
annotating articles using my Tablet PC. Although I do get a variety of print
publications, I find I don't tend to annotate them as much anymore. I used
to use EndNote to do the metadata, then I switched to Zotero. I hadn't
thought to try to create a full-text search of the articles -- hmm.


Yes, for a growing number of the tools I create I need to be thinking
about Zotero as a way of remembering content. Thanks for... reminding me.



Erik Hatcher wrote:

Here's a post on how easy it is to send PDF documents to Solr from Java:

http://www.lucidimagination.com/blog/2009/09/14/posting-rich-documents-to-apache-solr-using-solrj-and-solr-cell-apache-tika/


I'm looking forward to the arrival of my Solr book any day now. After
reading it I hope to have a better handle on the guts of Solr as well
as increase my ability to do the sorts of things discussed at the
URL above.



Thank you, one and all for your replies.

--
Eric Morgan


Re: [CODE4LIB] indexing pdf files

2009-09-16 Thread Cindy Harper
We're just talking about creating an index, not a separate copy of the
works, right? Because I imagine that copyright has a lot to do with why
this type of thing doesn't already exist.


-- 
Cindy Harper, Systems Librarian
Colgate University Libraries
char...@colgate.edu
315-228-7363


Re: [CODE4LIB] indexing pdf files

2009-09-16 Thread Eric Lease Morgan

On Sep 16, 2009, at 4:01 PM, Cindy Harper wrote:

http://infomotions.com/highlights/

We're just talking about creating an index, not a separate copy of the
works, right? Because I imagine that copyright has a lot to do with why
this type of thing doesn't already exist.



No, not just an index, but the real thing as well, usually.

The stuff I read falls roughly into two categories: 1) open to the
public, and 2) restricted. In my case, the former includes printed
articles from Wikipedia, articles from things like D-Lib Magazine, and
reports from things like the DLF intended for public review and consumption.
All of these things I read, highlight & annotate, and eventually plan
to place in my system. Search my system. Find article. Download
original and/or download annotated version. The choice is the user's.


Items from the second category include closed-access articles,
articles from Encyclopedia Britannica, internal library reports not
intended for outside readers, etc. These items will be read,
annotated, and lightly reviewed -- just like the other items -- but they
will be saved in a restricted space. The user will (usually) be given
a URL to the original document, and if they are authorized, then they
will be able to get the item. Even though I believe my annotations
constitute a derivative work -- like The Annotated Alice -- I don't have
the chutzpah to make them freely available.


In my system I plan to add value to the articles and redistribute  
them as well as provide links to the original.


--
Eric Lease Morgan


[CODE4LIB] indexing pdf files

2009-09-15 Thread Eric Lease Morgan

I have been having fun recently indexing PDF files.

For the past six months or so I have been keeping the articles I've
read in a pile, and I was rather amazed at the size of the pile. It
was about a foot tall. When I read these articles I actively read
them -- meaning, I write, scribble, highlight, and annotate the text
with my own special notation denoting names, keywords, definitions,
citations, quotations, list items, examples, etc. This active reading
process: 1) makes for better comprehension on my part, and 2) makes
the articles easier to review and pick out the ideas I thought were
salient. Being the librarian I am, I thought it might be cool (kewl)
to make the articles into a collection. Thus, the beginnings of
Highlights & Annotations: A Value-Added Reading List.


The techno-weenie process for creating and maintaining the content is  
something this community might find interesting:


 1. Print article and read it actively.

 2. Convert the printed article into a PDF
file -- complete with embedded OCR --
with my handy-dandy ScanSnap scanner. [1]

 3. Use MyLibrary to create metadata (author,
title, date published, date read, note,
keywords, facet/term combinations, local
and remote URLs, etc.) describing the
article. [2]

 4. Save the PDF to my file system.

 5. Use pdftotext to extract the OCRed text
from the PDF and index it along with
the MyLibrary metadata using Solr. [3, 4]

 6. Provide a searchable/browsable user
interface to the collection through a
mod_perl module. [5, 6]
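Steps 5 and 6 above boil down to a small amount of glue code. Here is a
minimal sketch in Python (the actual system uses Perl, per [5]); the
pdftotext invocation mirrors [3], while the Solr field names (id, fulltext)
are hypothetical stand-ins for the real MyLibrary schema:

```python
import subprocess
import xml.etree.ElementTree as ET

def extract_text(pdf_path):
    """Extract the embedded OCR text from a PDF with pdftotext [3]."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],      # "-" sends the text to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def build_solr_add(doc_id, metadata, fulltext):
    """Combine descriptive metadata and extracted text into a Solr add doc."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = dict(metadata, id=doc_id, fulltext=fulltext)
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")
```

The resulting XML string would then be POSTed to Solr's /update handler,
followed by a commit.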

Software is never done, and if it were then it would be called
hardware. Accordingly, I know there are some things I need to do
before I can truly deem the system version 1.0. At the same time my
excitement is overflowing and I thought I'd share some geekdom with my
fellow hackers. Fun with PDF files and open source software.



[1] ScanSnap - http://tinyurl.com/oafgwe
[2] MyLibrary screen dump - http://infomotions.com/tmp/mylibrary.png
[3] pdftotext - http://www.foolabs.com/xpdf/
[4] Solr - http://lucene.apache.org/solr/
[5] module source code - http://infomotions.com/highlights/Highlights.pl
[6] user interface - http://infomotions.com/highlights/highlights.cgi

--
Eric Lease Morgan
Head, Digital Access and Information Architecture Department
Hesburgh Libraries, University of Notre Dame

(574) 631-8604


Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Rosalyn Metz
Eric,

I have librarians that would kill for this.  In fact I was talking to
one about it the other day.  She felt there must be a way to handle
active reading and make it portable.  This would be great in
conjunction with RefWorks or Zotero or something along those lines.

Rosalyn





Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Mark A. Matienzo
Eric,

  5. Use pdftotext to extract the OCRed text
from the PDF and index it along with
the MyLibrary metadata using Solr. [3, 4]


Have you considered using Solr's ExtractingRequestHandler [1] for the
PDFs? We're using it at NYPL with pretty great success.

[1] http://wiki.apache.org/solr/ExtractingRequestHandler
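In practice the handler is driven by POSTing the raw PDF bytes to Solr's
/update/extract URL. A hedged Python sketch of building that request URL --
literal.* and commit are standard ExtractingRequestHandler parameters, but
the host, port, and id value in the example are made up:

```python
from urllib.parse import urlencode

def extract_update_url(solr_base, doc_id, commit=True):
    """Build the URL for Solr's ExtractingRequestHandler (Solr Cell).

    The raw PDF is POSTed to this URL; Tika extracts the text server-side,
    so no local pdftotext step is needed. Stored fields are supplied as
    literal.* request parameters.
    """
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"          # commit right after the add
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)

# e.g. POST article.pdf (Content-Type: application/pdf) to
# extract_update_url("http://localhost:8983/solr", "article-001")
```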

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library


Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Peter Kiraly

Hi all,

I would like to suggest an API for extracting text (including highlighted or
annotated text) from PDF: iText (http://www.lowagie.com/iText/).
This is a Java API (it has a C# port), and it helped me a lot when we worked
with extraordinary PDF files.

Solr uses Tika (http://lucene.apache.org/tika) for extracting text from
documents, and Tika uses PDFBox (http://incubator.apache.org/pdfbox/)
to extract text from PDF files. PDFBox is a great tool for normal PDF files,
but it has (or at least had) some shortcomings I wasn't satisfied with:

- it consumed more memory compared with iText, and couldn't
read files above a given size (the limit was large, about 1 GB, but we
had even larger files)

- it couldn't correctly handle conditional hyphens at the end of
a line

- it had poorer documentation than iText, and its API was also
poorer (by that time Manning had published the iText in Action book)

Our PDF files were double-layered (original hi-res image + OCRed text)
documents several thousand pages long (Hungarian scientific journals,
the diary of the Houses of Parliament from the 19th century, etc.). We
indexed the content with Lucene, and in the UI we showed one page per screen,
so the user didn't need to download the full PDF. We extracted the
table of contents from the PDF as well and implemented it in the web UI,
so the user can browse pages according to the full file's TOC.

This project happened two years ago, so it is possible that a lot has
changed since then.

Király Péter
http://eXtensibleCatalog.org




Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread danielle plumer
My (much more primitive) version of the same thing involves reading and
annotating articles using my Tablet PC. Although I do get a variety of print
publications, I find I don't tend to annotate them as much anymore. I used
to use EndNote to do the metadata, then I switched to Zotero. I hadn't
thought to try to create a full-text search of the articles -- hmm.

-- 
Danielle Cunniff Plumer, Coordinator
Texas Heritage Digitization Initiative
Texas State Library and Archives Commission
512.463.5852 (phone) / 512.936.2306 (fax)
dplu...@tsl.state.tx.us
dcplu...@gmail.com





Re: [CODE4LIB] indexing pdf files

2009-09-15 Thread Erik Hatcher

Here's a post on how easy it is to send PDF documents to Solr from Java:

http://www.lucidimagination.com/blog/2009/09/14/posting-rich-documents-to-apache-solr-using-solrj-and-solr-cell-apache-tika/

Not only can you post PDF (and other rich content) files to Solr for
indexing, you can also, as shown in that blog entry, extract the text
from such files and have it returned to the client. This Solr
capability makes the tool chain a bit simpler.
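That extraction-only mode is driven by request parameters: extractOnly=true
tells Solr Cell to return Tika's output to the client instead of indexing it,
and extractFormat=text asks for plain text rather than the default XHTML. A
hedged Python sketch (the server URL in the comment is an assumption):

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

def extract_only_params():
    """Query string asking Solr Cell to return extracted text, not index it."""
    return urlencode({"extractOnly": "true",   # don't index; return the extraction
                      "extractFormat": "text", # plain text instead of XHTML
                      "wt": "json"})           # JSON response writer

def extract_text_via_solr(solr_base, pdf_bytes):
    """POST PDF bytes to Solr Cell and return the parsed JSON response."""
    url = solr_base.rstrip("/") + "/update/extract?" + extract_only_params()
    req = Request(url, data=pdf_bytes,
                  headers={"Content-Type": "application/pdf"})
    with urlopen(req) as resp:                 # requires a running Solr instance
        return json.load(resp)

# e.g. extract_text_via_solr("http://localhost:8983/solr",
#                            open("article.pdf", "rb").read())
```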


Erik

