Hello,
I used the current PDFBox trunk (1101911) to extract text and pictures
from a collection of 65 thousand PDF files of various sizes. I didn't use
PDFBox 1.5.0 because I ran into the performance regression described
in PDFBOX-1005. The performance of the current trunk is MUCH better than
that of 1.5.0.
The extraction failed after some time with an OutOfMemoryError. When
I analyzed the heap dump, it turned out that the PDFont.cmapObjects map
was taking up more than 750 megabytes of memory.
1. Is this already known? (I'm not subscribed to the dev list, and it
seems like a user issue.)
2. Is there any user-available way to clear this map periodically? It
seems to me like a cache of some sort.
If not, I'll try to investigate and submit a patch. I just wanted to
check that I'm not reinventing the wheel.
Antoni Myłka