We've been struggling with hangs in the Solr process that indexes incoming PDF documents. TL;DR: I think PDFBox needs to have COSName.clearResources() called on it if the Solr indexer is expected to keep running indefinitely. Is that likely? Is anybody on this list doing PDF extraction in a long-running process and having it work?

The thread dump of a hung process often shows lots of threads blocked like this:

    java.lang.Thread.State: BLOCKED (on object monitor)
        at java.util.Collections$SynchronizedMap.get(Collections.java:1975)
        - waiting to lock <0x000000072551f908> (a java.util.Collections$SynchronizedMap)
        at org.apache.pdfbox.util.PDFOperator.getOperator(PDFOperator.java:68)
        at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:441)

And the heap is almost full:

    Heap
     PSYoungGen      total 796416K, used 386330K
      eden space 398208K, 97% used
      from space 398208K, 0% used
      to   space 398208K, 0% used
     PSOldGen
      object space 2389376K, 99% used
     PSPermGen
      object space 53824K, 99% used

Using Eclipse MAT to look at the heap dump of a hung process shows that one of the chief memory leak suspects is PDFBox's COSName class:

    The class "org.apache.pdfbox.cos.COSName", loaded by
    "java.net.FactoryURLClassLoader @ 0x725a1a230", occupies
    151,183,360 (16.64%) bytes. The memory is accumulated in one
    instance of "java.util.concurrent.ConcurrentHashMap$Segment[]"
    loaded by "<system class loader>".

The "Shortest Paths To the Accumulation Point" graph for that looks like this:

    Class Name                                              Shallow Heap  Retained Heap
    java.util.concurrent.ConcurrentHashMap$Segment[16]                80    151,160,680
    segments java.util.concurrent.ConcurrentHashMap                   48    151,160,728
    nameMap class org.apache.pdfbox.cos.COSName                    1,184    151,183,360
    [123] java.lang.Object[2560]                                  10,256     26,004,368
    elementData java.util.Vector                                      32     26,004,400
    classes java.net.FactoryURLClassLoader                            72     26,228,440
    <classloader> class org.apache.pdfbox.cos.COSDocument              8              8
    <class> org.apache.pdfbox.cos.COSDocument                         64      1,703,704
    referent java.lang.ref.Finalizer                                  40      1,703,744

And the "Dominator Tree" chart looks like this:

    26.69%  org.apache.solr.core.SolrCore
    16.64%  class org.apache.pdfbox.cos.COSName
     2.89%  java.net.FactoryURLClassLoader

Now the implementation of COSName says this:

    /**
     * Not usually needed except if resources need to be reclaimed in a long
     * running process.
     * Patch provided by fles...@gmail.com
     * incorporated 5/23/08, danielwil...@users.sourceforge.net
     */
    public static synchronized void clearResources()
    {
        // Clear them all
        nameMap.clear();
    }

I *don't* see a call to clearResources anywhere in Solr or Tika, and I think that's the problem. The implementation puts all the COSNames in a class-level static map (a ConcurrentHashMap, going by the heap dump above), which never gets emptied and apparently keeps growing forever. I suspect that the URLClassLoader's involvement in that graph to the COSName class is what's filling up the PermGen space in the heap.

Does that sound likely? Possible? Can anyone speak to that? Does anyone have suggested next steps for us, besides restarting our Solr indexer process every couple of hours?
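
For concreteness, here is a rough sketch of the kind of workaround I have in mind: run a cleanup hook once every N indexed documents. PeriodicCleanup, its interval, and the stand-in hook in main are all made up for illustration; none of this exists in Solr, Tika, or PDFBox. In the real indexer the hook would simply call COSName.clearResources(). I have not verified that clearing the map is safe while another thread is in the middle of parsing a PDF, so treat this as an untested idea rather than a fix.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical helper: runs a cleanup hook once every `interval`
 * processed documents. Not part of Solr, Tika, or PDFBox.
 */
class PeriodicCleanup {
    private final long interval;
    private final Runnable cleanup;
    private final AtomicLong processed = new AtomicLong();

    public PeriodicCleanup(long interval, Runnable cleanup) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        this.interval = interval;
        this.cleanup = cleanup;
    }

    /** Call after each document; returns true if the hook ran this time. */
    public boolean documentProcessed() {
        if (processed.incrementAndGet() % interval == 0) {
            cleanup.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in hook so the sketch runs without PDFBox on the classpath.
        // In the indexer, the body of run() would be (assumption, untested):
        //     org.apache.pdfbox.cos.COSName.clearResources();
        Runnable hook = new Runnable() {
            @Override
            public void run() {
                System.out.println("cleanup hook ran");
            }
        };

        PeriodicCleanup guard = new PeriodicCleanup(3, hook);
        for (int doc = 1; doc <= 7; doc++) {
            // ... extract and index document `doc` here ...
            guard.documentProcessed();
        }
    }
}
```

With an interval of 3 and 7 documents, the hook runs after documents 3 and 6. The interval would need tuning: too small and the name cache is thrown away constantly, too large and the map grows back to the sizes shown in the heap dump.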