Hello,

I used the current PDFBox trunk (r1101911) to extract text and images from a collection of 65 thousand PDF files of various sizes. I didn't use PDFBox 1.5.0 because I ran into the performance regression described in PDFBOX-1005. The current trunk performs MUCH better than 1.5.0.

After running for some time, the extraction failed with an OutOfMemoryError. When I analyzed the heap dump, it turned out that the PDFont.cmapObjects map was holding more than 750 megabytes of memory.

1. Is this already a known issue? (I'm not subscribed to the dev list, and it seems like a user issue.)
2. Is there any user-available way to clear this map periodically? It looks to me like a cache of some sort.

If not, I'll try to investigate and submit a patch. I just wanted to check first that I'm not reinventing the wheel.
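
To make the idea concrete, here is a minimal sketch of the kind of change I have in mind: replacing an unbounded map with one that evicts its least recently used entry once a size limit is reached. This is not PDFBox code. I have not yet looked at how cmapObjects is populated, so the class name BoundedCache, the size limit, and the assumption that an evicted entry can simply be rebuilt on demand are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical bounded cache, not part of PDFBox.
 * Keeps at most maxEntries entries and drops the least recently
 * used one when a new entry would exceed that limit.
 */
class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        // The third argument (true) selects access order, so get() and put()
        // move an entry to the "most recently used" end of the map.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each insertion; returning true
        // removes the least recently used entry.
        return size() > maxEntries;
    }
}
```

LinkedHashMap is not thread-safe, so a map shared between threads would also need to be wrapped, for example with Collections.synchronizedMap.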

Antoni Myłka
[email protected]
