We are aware of DOM limitations/memory problems, but I am using SAX to parse
the file and index elements and attributes in my content handler.
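
For anyone following the thread, here is a minimal sketch of the SAX approach described above, using only the JDK's built-in parser. The class name, the `collect` helper, and the "element@attribute=value" naming are assumptions made for illustration; in a real indexer each pair would presumably be added as a field of a Lucene Document rather than to a list. Because SAX delivers events as it reads, no tree of the whole file is built in memory.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxIndexSketch {

    /** Collects name/value pairs; a real indexer would add them to a Lucene Document (assumption). */
    static class CollectingHandler extends DefaultHandler {
        final List<String> fields = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Reset the buffer, so text seen before a child element is dropped (simplification).
            text.setLength(0);
            for (int i = 0; i < atts.getLength(); i++) {
                fields.add(qName + "@" + atts.getQName(i) + "=" + atts.getValue(i));
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several chunks, so accumulate.
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            String value = text.toString().trim();
            if (!value.isEmpty()) {
                fields.add(qName + "=" + value);
            }
            text.setLength(0);
        }
    }

    /** Parses the XML string with SAX and returns the collected pairs. */
    static List<String> collect(String xml) throws Exception {
        CollectingHandler handler = new CollectingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.fields;
    }

    public static void main(String[] args) throws Exception {
        // prints [doc@id=7, title=Hello]
        System.out.println(collect("<doc id=\"7\"><title>Hello</title></doc>"));
    }
}
```

Note that the handler still buffers the text of one element at a time, so a single very large text node can use a lot of heap even with SAX.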
Thanks,
Rob
-----Original Message-----
From: Tatu Saloranta [mailto:[EMAIL PROTECTED]]
Sent: Friday, February 14, 2003 8:18 PM
To: Lucene Users List
I am having a similar problem, but when indexing PDF documents using the PDFBox parser
(available at www.pdfbox.com). I get the following exception:
Exception in thread "main" java.lang.OutOfMemoryError
Has anybody implemented the above code? Any help appreciated.
Thanks!
PI
Rob Outar [EMAIL PROTECTED]
Rob,
We ran into this problem too, and our solution was to use a native PDF
text extractor (PDFBox just can't seem to handle large PDFs well).
Basically, we try to parse with the native app first, and if that fails,
we parse with PDFBox. We used:
http://www.foolabs.com/xpdf/
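
The "native first, then fall back" idea can be sketched roughly as follows. This is not the poster's code; the class and interface names are hypothetical. It assumes xpdf's `pdftotext` is on the PATH and Java 9 or later. The second extractor is left as a placeholder where a PDFBox-based implementation would go. Running the native tool in a separate process also keeps its memory use out of the JVM heap.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

class FallbackExtractSketch {

    /** One way of turning a PDF file into plain text (hypothetical interface). */
    interface TextExtractor {
        String extract(File pdf) throws Exception;
    }

    /** Shells out to xpdf's pdftotext; "-" as the output file sends the text to stdout. */
    static class NativeExtractor implements TextExtractor {
        @Override
        public String extract(File pdf) throws Exception {
            Process p = new ProcessBuilder("pdftotext", "-enc", "UTF-8", pdf.getPath(), "-")
                    .redirectError(ProcessBuilder.Redirect.DISCARD)
                    .start();
            // Read all of stdout before waiting, so the child cannot block on a full pipe.
            byte[] out = p.getInputStream().readAllBytes();
            if (p.waitFor() != 0) {
                throw new IOException("pdftotext failed for " + pdf);
            }
            return new String(out, StandardCharsets.UTF_8);
        }
    }

    /** Tries each extractor in order and returns the first successful result. */
    static String extractWithFallback(File pdf, List<TextExtractor> extractors) throws Exception {
        Exception last = null;
        for (TextExtractor extractor : extractors) {
            try {
                return extractor.extract(pdf);
            } catch (Exception ex) {
                last = ex; // remember the failure and try the next extractor
            }
        }
        throw new IOException("all extractors failed for " + pdf, last);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder: a PDFBox-backed extractor would be plugged in here (assumption).
        TextExtractor pdfBoxPlaceholder = pdf -> {
            throw new UnsupportedOperationException("plug in a PDFBox extractor");
        };
        List<TextExtractor> chain = Arrays.asList(new NativeExtractor(), pdfBoxPlaceholder);
        System.out.println(extractWithFallback(new File(args[0]), chain));
    }
}
```

The loop only catches `Exception`, so an `OutOfMemoryError` thrown by an in-process extractor would still propagate. That is one reason to put the out-of-process extractor first in the chain.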
A code snippet for
I am aware of the issues with parsing certain PDF documents. I am
currently working on refactoring PDFBox to deal with large documents. You
will see this in the next release. I would like to thank people for
feedback and sending problem documents.
Ben Litchfield
http://www.pdfbox.org
On