We've been struggling with hangs in the solr process that indexes
incoming PDF documents.  The TL;DR summary is that I'm thinking
PDFBox needs to have COSName.clearResources() called on it if the solr
indexer expects to be able to keep running indefinitely.  Is that
likely?  Is there anybody on this list who is doing PDF extraction in
a long-running process and having it work?

The thread dump of a hung process often shows lots of threads blocked like this:

java.lang.Thread.State: BLOCKED (on object monitor)
   at java.util.Collections$SynchronizedMap.get(Collections.java:1975)
   - waiting to lock <0x000000072551f908> (a java.util.Collections$SynchronizedMap)
   at org.apache.pdfbox.util.PDFOperator.getOperator(PDFOperator.java:68)
   at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:441)
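
From the trace, the lock everyone is waiting on looks like a shared,
class-level operator cache wrapped in Collections.synchronizedMap(), so
every parsing thread funnels through the same monitor.  A boiled-down
sketch of that pattern (names simplified; this is not the actual PDFBox
source, just the shape I think we're hitting):

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class OperatorCache
    {
        // One static map shared by every parsing thread.  Each get()/put()
        // takes the same monitor, which is the lock the BLOCKED threads in
        // the dump above are waiting on.
        private static final Map<String, Object> OPERATORS =
            Collections.synchronizedMap(new HashMap<String, Object>());

        public static Object getOperator(String name)
        {
            Object op = OPERATORS.get(name);   // contended synchronized get()
            if (op == null)
            {
                op = new Object();             // stand-in for a real PDFOperator
                OPERATORS.put(name, op);
            }
            return op;
        }
    }

My guess is that whichever thread holds that lock is crawling because
the heap is nearly full (dump below), so everything else piles up behind
the one monitor and the process looks hung.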

And the heap is almost full:

Heap
   PSYoungGen      total 796416K, used 386330K
    eden space 398208K, 97% used
    from space 398208K, 0% used
    to   space 398208K, 0% used
   PSOldGen
    object space 2389376K, 99% used
   PSPermGen
    object space 53824K, 99% used

Using Eclipse's MAT to look at the heap dump of a hung process shows
that one of the chief memory-leak suspects is PDFBox's COSName class:

    The class "org.apache.pdfbox.cos.COSName", loaded by
    "java.net.FactoryURLClassLoader @ 0x725a1a230", occupies
    151,183,360 (16.64%) bytes.  The memory is accumulated in one
    instance of "java.util.concurrent.ConcurrentHashMap$Segment[]"
    loaded by "<system class loader>".

and the "Shortest Paths To the Accumulation Point" graph for that
looks like this:

Class Name                                                          Shallow Heap   Retained Heap

java.util.concurrent.ConcurrentHashMap$Segment[16]                            80     151,160,680
  segments java.util.concurrent.ConcurrentHashMap                             48     151,160,728
    nameMap class org.apache.pdfbox.cos.COSName                            1,184     151,183,360
      [123] java.lang.Object[2560]                                        10,256      26,004,368
        elementData java.util.Vector                                          32      26,004,400
          classes java.net.FactoryURLClassLoader                              72      26,228,440
            <classloader> class org.apache.pdfbox.cos.COSDocument              8               8
              <class> org.apache.pdfbox.cos.COSDocument                       64       1,703,704
                referent java.lang.ref.Finalizer                              40       1,703,744

And the "Dominator Tree" chart looks like this:

26.69% org.apache.solr.core.SolrCore
16.64% class org.apache.pdfbox.cos.COSName
2.89% java.net.FactoryURLClassLoader

Now the implementation of COSName says this:

     /**
      * Not usually needed except if resources need to be reclaimed in a long
      * running process.
      * Patch provided by fles...@gmail.com
      * incorporated 5/23/08, danielwil...@users.sourceforge.net
      */
     public static synchronized void clearResources()
     {
         // Clear them all
         nameMap.clear();
     }
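
For context on how names end up in that map in the first place: the
class interns every name through a static factory that caches into the
same nameMap, roughly like this (a simplified sketch, not the exact
PDFBox source):

     // Every distinct name seen in any PDF gets cached in the static
     // nameMap for the life of the JVM unless clearResources() is called.
     public static COSName getPDFName(String aName)
     {
         COSName name = (COSName) nameMap.get(aName);
         if (name == null)
         {
             name = new COSName(aName);
             nameMap.put(aName, name);
         }
         return name;
     }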

I *don't* see a call to clearResources anywhere in solr or tika, and I
think that's the problem.  The implementation interns every COSName in
a class-level static map, which never gets emptied and so apparently
keeps growing forever.  I also suspect the fact that the URLClassLoader
is involved in that path to the COSName class is what's filling up the
PermGen space in the heap.
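
If that's right, the only workaround I can think of on our side is to
call clearResources() ourselves between documents (or batches), since
nothing in solr or tika will do it for us.  Something like the sketch
below, where indexPdf() is just a hypothetical stand-in for however we
currently hand the stream to tika:

    import java.io.InputStream;
    import org.apache.pdfbox.cos.COSName;

    public abstract class PdfIndexer
    {
        // Hypothetical: whatever we already do to extract text and
        // send it to solr for one document.
        protected abstract void indexPdf(InputStream pdf) throws Exception;

        public void indexOne(InputStream pdf) throws Exception
        {
            try
            {
                indexPdf(pdf);
            }
            finally
            {
                // Drop the interned names so the class-level nameMap can't
                // grow without bound across documents in a long-running indexer.
                COSName.clearResources();
            }
        }
    }

The caveat is that clearResources() just empties the shared static map,
so we'd probably only call it when no other extraction is in flight
(e.g. between batches) rather than after every single document.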

Does that sound likely?  Possible?  Can anyone speak to that?  Does
anyone have suggested next steps for us, besides restarting our solr
indexer process every couple of hours?
