Hi there,
I'm using PDFTextStripper to extract text from PDFs. Some of these PDFs
are maps, each about 90 MB in size. The maps contain very little text
but a great many small graphic objects (I don't know how to verify
this, but that is how they look when I open them in Adobe Reader).
This causes the PDFParser to create
millions of COSFloat objects, which finally crashes the JVM with an
OutOfMemoryError.
While I understand that it is not possible to extract text without prior
parsing (as noted in the FAQ), I wonder whether it would be possible to
simply skip objects that contain no textual content? The PDF tree would
be incomplete, but I only want to extract the text.
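To illustrate what I have in mind, here is a minimal sketch. It does not
use the PDFBox API at all: TextOnlyFilter and keepTextObjects are
hypothetical names, and the content stream is assumed to be already
split into string tokens. It keeps only what lies between the BT (begin
text) and ET (end text) operators and drops everything else, such as
path drawing operands. This is a simplification, since a real filter
would probably also have to keep graphics state operators (q, Q, cm)
that affect where the text ends up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TextOnlyFilter {

    // Hypothetical helper, not part of PDFBox: keeps only the tokens that
    // belong to text objects, i.e. everything from BT up to and including ET.
    public static List<String> keepTextObjects(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        boolean inText = false;
        for (String token : tokens) {
            if (token.equals("BT")) {
                inText = true;          // a text object starts here
            }
            if (inText) {
                kept.add(token);        // operands and operators inside BT..ET
            }
            if (token.equals("ET")) {
                inText = false;         // the text object ends here
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up token sequence: a stroked line, one text object, a filled rectangle.
        List<String> stream = Arrays.asList(
                "10", "10", "m", "200", "10", "l", "S",
                "BT", "/F1", "12", "Tf", "(Main Street)", "Tj", "ET",
                "0", "0", "50", "50", "re", "f");
        // prints [BT, /F1, 12, Tf, (Main Street), Tj, ET]
        System.out.println(keepTextObjects(stream));
    }
}
```

The numeric operands of the path operators are discarded as plain
strings here, so no number objects would be created for them.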
Thanks in advance,
Dominik
P.S.: Unfortunately, I cannot provide an example of such a document
because these documents contain confidential content.