We've been struggling with hangs in the Solr process that indexes
incoming PDF documents. TL;DR: I think PDFBox needs to have
COSName.clearResources() called on it periodically if the Solr
indexer is expected to keep running indefinitely. Is that likely?
Is there anybody on this list who is doing PDF extraction in a
long-running process and having it work?
The thread dump of a hung process often shows lots of threads hanging on this:
java.lang.Thread.State: BLOCKED (on object monitor)
    at java.util.Collections$SynchronizedMap.get(Collections.java:1975)
    - waiting to lock 0x00072551f908 (a java.util.Collections$SynchronizedMap)
    at org.apache.pdfbox.util.PDFOperator.getOperator(PDFOperator.java:68)
    at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:441)
And the heap is almost full:
Heap
PSYoungGen total 796416K, used 386330K
eden space 398208K, 97% used
from space 398208K, 0% used
to space 398208K, 0% used
PSOldGen
object space 2389376K, 99% used
PSPermGen
object space 53824K, 99% used
Using Eclipse MAT to look at the heap dump of a hung process shows
that one of the chief memory leak suspects is PDFBox's COSName class:
The class org.apache.pdfbox.cos.COSName, loaded by
java.net.FactoryURLClassLoader @ 0x725a1a230, occupies 151,183,360
(16.64%) bytes. The memory is accumulated in one instance of
java.util.concurrent.ConcurrentHashMap$Segment[] loaded by system
class loader.
and the Shortest Paths To the Accumulation Point graph for that
looks like this:
Class Name                                          Shallow Heap  Retained Heap
java.util.concurrent.ConcurrentHashMap$Segment[16]            80    151,160,680
segments java.util.concurrent.ConcurrentHashMap               48    151,160,728
nameMap class org.apache.pdfbox.cos.COSName                1,184    151,183,360
[123] java.lang.Object[2560]                              10,256     26,004,368
elementData java.util.Vector                                  32     26,004,400
classes java.net.FactoryURLClassLoader                        72     26,228,440
classloader class org.apache.pdfbox.cos.COSDocument            8              8
class org.apache.pdfbox.cos.COSDocument                       64      1,703,704
referent java.lang.ref.Finalizer                              40      1,703,744
And the Dominator Tree chart looks like this:
26.69% org.apache.solr.core.SolrCore
16.64% class org.apache.pdfbox.cos.COSName
2.89% java.net.FactoryURLClassLoader
Now the implementation of COSName says this:
/**
* Not usually needed except if resources need to be reclaimed in a long
* running process.
* Patch provided by fles...@gmail.com
* incorporated 5/23/08, danielwil...@users.sourceforge.net
*/
public static synchronized void clearResources()
{
// Clear them all
nameMap.clear();
}
I *don't* see a call to clearResources() anywhere in Solr or Tika, and
I think that's the problem. The implementation puts all the COSNames
in a class-level static map (nameMap, which the heap dump shows to be
a ConcurrentHashMap), which never gets emptied, and apparently keeps
growing forever. I suspect the fact that the URLClassLoader is
involved in that graph to the COSName class is what's filling up the
PermGen space in the heap.
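
To make the suspected pattern concrete, here is a minimal sketch of a
class that interns names in a static map and only gives the memory back
when it is cleared explicitly. InternedName, of() and cachedCount() are
hypothetical names made up for this illustration; this is not PDFBox
code, only a stand-in for what I believe COSName is doing with nameMap.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an interning class such as COSName (not PDFBox code).
final class InternedName {
    // Static cache: it is reachable for as long as the class stays loaded,
    // so its entries are never garbage collected on their own.
    private static final Map<String, InternedName> NAME_MAP = new ConcurrentHashMap<>();

    private final String name;

    private InternedName(String name) {
        this.name = name;
    }

    // Returns the cached instance for this name, creating it on first use.
    static InternedName of(String name) {
        return NAME_MAP.computeIfAbsent(name, InternedName::new);
    }

    // Number of names currently held by the static cache.
    static int cachedCount() {
        return NAME_MAP.size();
    }

    // Analogue of COSName.clearResources(): drops every cached name.
    static void clearResources() {
        NAME_MAP.clear();
    }

    String getName() {
        return name;
    }

    public static void main(String[] args) {
        // Simulate a long-running indexer: each "document" brings new names.
        for (int doc = 0; doc < 3; doc++) {
            for (int i = 0; i < 1000; i++) {
                InternedName.of("Doc" + doc + "Name" + i);
            }
            System.out.println("after document " + doc + ": "
                    + cachedCount() + " cached names");
        }
        // Without this call the cache only ever grows.
        clearResources();
        System.out.println("after clearResources(): "
                + cachedCount() + " cached names");
    }
}
```

If this is what is going on, the indexer would need to call the real
COSName.clearResources() every so often, for instance between
documents. I have not verified whether that is safe while other
threads are in the middle of parsing.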
Does that sound likely? Possible? Can anyone speak to that? Anyone
have suggested next steps for us, besides restarting our solr indexer
process every couple of hours?