Hi,
On 12.05.2011 18:19, Antoni Mylka wrote:
Hello,
I used the current PDFBox trunk (1101911) to extract text and pictures from a
collection of 65 thousand PDF files of various sizes. I didn't use PDFBox 1.5.0
because I experienced the performance regression described in PDFBOX-1005. The
performance of the current trunk is MUCH better than 1.5.0.
Good to know :-)) Thanks for the feedback.
The extraction failed after some time with an OutOfMemoryError. When I
analyzed the heap dump it turned out that the PDFont.cmapObjects map takes more
than 750 megabytes of memory.
1. Is it known already? (I'm not subscribed to the dev list, and it seems like a
user issue)
No.
2. Is there any user-available way to clear this map periodically? It seems to
me like a cache of some sort.
A workaround could be to call the static method PDFont#clearResources to clear
the cache.
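
A minimal sketch of how a batch run could apply this workaround every N
documents. The class name PeriodicCacheClearer, its method names and the
interval are hypothetical, not part of PDFBox; the clear action is passed in
as a Runnable so the helper compiles without PDFBox on the classpath.

```java
/**
 * Runs a cache-clearing action after every N processed documents.
 * Hypothetical helper for illustration only; not part of PDFBox.
 */
final class PeriodicCacheClearer {
    private final int interval;        // documents between two clears
    private final Runnable clearAction; // e.g. a call to PDFont.clearResources()
    private int sinceLastClear = 0;

    PeriodicCacheClearer(int interval, Runnable clearAction) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        if (clearAction == null) {
            throw new IllegalArgumentException("clearAction must not be null");
        }
        this.interval = interval;
        this.clearAction = clearAction;
    }

    /** Call once after each document has been processed and closed. */
    void documentProcessed() {
        sinceLastClear++;
        if (sinceLastClear >= interval) {
            clearAction.run();
            sinceLastClear = 0;
        }
    }
}
```

In the extraction loop, the clear action would be a Runnable whose run() method
calls PDFont.clearResources(), and documentProcessed() would be called after
each PDF is closed. A suitable interval depends on the heap size and on the
documents, so it would have to be found by experiment.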
If not, I'll try to investigate and submit some patch. Just wanted to ask if I'm
not reinventing the wheel.
I limited the cache to external CMaps in revision 1102424 as IMO it doesn't
make sense to cache embedded CMaps. This should solve the memory issue. See [1]
for further details.
Thanks for reporting and analyzing this issue!
BR
Andreas Lehmkühler
[1] https://issues.apache.org/jira/browse/PDFBOX-1009