Hi there,

I'm using PDFTextStripper to extract text from PDFs. Some of these PDFs are maps, and each of them is about 90 MB in size. The maps contain very little text but a great many small graphic objects (I don't know how to verify this, but that is how it looks when I open a document in Adobe Reader). As a result, the PDFParser creates millions of COSFloat objects and the JVM finally crashes with an OutOfMemoryError.

While I understand that it is not possible to extract text without prior parsing (as noted in the FAQ), I wonder whether it would be possible to simply skip objects that contain no textual content? The PDF tree would be incomplete, but I only want to extract the text.
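
To make the idea more concrete, here is a rough sketch of what I have in mind. It is not PDFBox code: the class TextOnlyScanner and its method are made up for illustration, and it assumes an already decoded content stream given as a String. It skips numeric operands and names without creating an object for them and only keeps the literal strings passed to the text-showing operators (Tj, TJ, ' and "). Hex strings, octal escapes, comments, inline images and font encodings are all ignored, so this only shows the principle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in PDFBox.
final class TextOnlyScanner {

    /** Returns the literal strings passed to the operators Tj, TJ, ' and ". */
    static List<String> extractShownStrings(String content) {
        List<String> shown = new ArrayList<>();
        List<String> pending = new ArrayList<>(); // string operands since the last operator
        int i = 0;
        int n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (c == '(') {
                StringBuilder sb = new StringBuilder();
                i = readLiteralString(content, i + 1, sb);
                pending.add(sb.toString());
            } else if (isDelimiter(c)) {
                i++; // whitespace, array brackets, hex-string brackets
            } else {
                int start = i;
                while (i < n && !isDelimiter(content.charAt(i))) {
                    i++;
                }
                char first = content.charAt(start);
                boolean isNumber = Character.isDigit(first)
                        || first == '-' || first == '+' || first == '.';
                if (isNumber || first == '/') {
                    continue; // numeric or name operand: skipped, no object is created
                }
                String op = content.substring(start, i);
                if (op.equals("Tj") || op.equals("TJ") || op.equals("'") || op.equals("\"")) {
                    shown.addAll(pending);
                }
                pending.clear(); // every operator consumes its operands
            }
        }
        return shown;
    }

    private static boolean isDelimiter(char c) {
        return Character.isWhitespace(c) || "()[]<>{}".indexOf(c) >= 0;
    }

    /** Reads up to the matching ')' and returns the index just after it. */
    private static int readLiteralString(String s, int i, StringBuilder out) {
        int depth = 1;
        while (i < s.length()) {
            char c = s.charAt(i++);
            if (c == '\\' && i < s.length()) {
                char e = s.charAt(i++);
                switch (e) {
                    case 'n': out.append('\n'); break;
                    case 'r': out.append('\r'); break;
                    case 't': out.append('\t'); break;
                    default: out.append(e); // covers \( \) and \\; octal escapes are not decoded
                }
            } else if (c == '(') {
                depth++;
                out.append(c);
            } else if (c == ')') {
                if (--depth == 0) {
                    return i;
                }
                out.append(c);
            } else {
                out.append(c);
            }
        }
        return i; // unterminated string: stop at the end of the input
    }

    public static void main(String[] args) {
        String stream = "0.5 0 0 RG 10 10 m 20 20 l S "
                + "BT /F1 12 Tf 72 700 Td (Main Street) Tj [(Ri) -20 (ver)] TJ ET";
        System.out.println(extractShownStrings(stream));
    }
}
```

In this sketch the path operators (m, l, S, re, f and so on) and all their numeric operands are simply stepped over, which is the behaviour I am hoping could be achieved inside the parser.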

Thanks in advance,

Dominik
P.S.: Unfortunately, I cannot provide an example of such a document because they contain confidential content.
