Extract Text from complex PDF

Raymi Mon, 07 Dec 2009 05:49:00 -0800

Hi there,

I'm using PDFTextStripper to extract text from PDFs. Among these PDFs there are some documents that represent maps. Each of these documents is about 90 MB in size. The maps contain very little text but many small graphic objects (I don't know how to verify this, but that is how it looks when I open a document in Adobe Reader). This causes the PDFParser to create millions of COSFloat objects, and the JVM finally crashes with an OutOfMemoryError.

While I understand that it is not possible to extract text without prior parsing (as noted in the FAQ), I wonder whether it would be possible to simply skip objects that contain no textual content? The PDF tree would be incomplete, but I only want to extract the text.
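
A rough sketch of the kind of skipping meant here, in plain Java. This is not PDFBox code and uses no PDFBox API: the class TextOnlyScan and the method extractShownStrings are hypothetical names made up for illustration. It assumes the content stream has already been decoded to text, and it only handles literal strings passed to the text-showing operators Tj, TJ, ' and ". Hex strings, font encodings, octal escapes and inline images are ignored. Numeric operands are stepped over character by character, so no object is created for them.

```java
import java.util.ArrayList;
import java.util.List;

class TextOnlyScan {

    // Characters that end a keyword or name token in this simplified scanner.
    static boolean isDelimiter(char c) {
        return Character.isWhitespace(c) || "()[]<>/%".indexOf(c) >= 0;
    }

    // Returns the literal strings that are operands of a text-showing operator.
    static List<String> extractShownStrings(String content) {
        List<String> shown = new ArrayList<>();
        List<String> pending = new ArrayList<>(); // strings seen since the last operator
        int i = 0;
        int n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (c == '(') {
                // Literal string: read up to the matching ')' and honor nesting.
                StringBuilder sb = new StringBuilder();
                int depth = 1;
                i++;
                while (i < n && depth > 0) {
                    char d = content.charAt(i++);
                    if (d == '\\' && i < n) {
                        sb.append(content.charAt(i++)); // simplified: keep the escaped char as-is
                    } else if (d == '(') {
                        depth++;
                        sb.append(d);
                    } else if (d == ')') {
                        depth--;
                        if (depth > 0) sb.append(d);
                    } else {
                        sb.append(d);
                    }
                }
                pending.add(sb.toString());
            } else if (c == '/') {
                // Name operand such as /F1: skip it entirely.
                i++;
                while (i < n && !isDelimiter(content.charAt(i))) i++;
            } else if (c == '%') {
                // Comment: skip to the end of the line.
                while (i < n && content.charAt(i) != '\n') i++;
            } else if (Character.isLetter(c) || c == '\'' || c == '"') {
                // Operator keyword: keep pending strings only for text-showing operators.
                int start = i;
                while (i < n && !isDelimiter(content.charAt(i))) i++;
                String op = content.substring(start, i);
                if (op.equals("Tj") || op.equals("TJ") || op.equals("'") || op.equals("\"")) {
                    shown.addAll(pending);
                }
                pending.clear();
            } else {
                i++; // digits, signs, dots, whitespace, brackets: nothing is allocated
            }
        }
        return shown;
    }

    public static void main(String[] args) {
        String sample = "1 0 0 RG 10 10 m 20 20 l S BT /F1 12 Tf (Hello) Tj [(Wor) -20 (ld)] TJ ET";
        System.out.println(extractShownStrings(sample)); // prints [Hello, Wor, ld]
    }
}
```

The path-drawing operands in the sample (the numbers before RG, m and l) never become objects; only the three strings are kept. Whether something like this could be hooked into the existing parser is exactly the open question.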


Thanks in advance,

Dominik

P.S.: Unfortunately I cannot provide an example of such a document because they contain confidential content.
