memory leak in pdfbox--SolrCell needs to call COSName.clearResources?

2012-09-24 Thread Kevin Goess
We've been struggling with solr hangs in the solr process that indexes
incoming PDF documents.  TLDR; summary is that I'm thinking that
PDFBox needs to have COSName.clearResources() called on it if the solr
indexer expects to be able to keep running indefinitely.  Is that
likely?  Is there anybody on this list who is doing PDF extraction in
a long-running process and having it work?

The thread dump of a hung process often shows lots of threads hanging on this:

java.lang.Thread.State: BLOCKED (on object monitor)
   at java.util.Collections$SynchronizedMap.get(Collections.java:1975)
   - waiting to lock 0x00072551f908 (a java.util.Collections$SynchronizedMap)
   at org.apache.pdfbox.util.PDFOperator.getOperator(PDFOperator.java:68)
   at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:441)

And the heap is almost full:

Heap
 PSYoungGen      total 796416K, used 386330K
  eden space 398208K, 97% used
  from space 398208K, 0% used
  to   space 398208K, 0% used
 PSOldGen
  object space 2389376K, 99% used
 PSPermGen
  object space 53824K, 99% used

Using Eclipse MAT (Memory Analyzer) to look at the heap dump of a hung
process shows that one of the chief memory leak suspects is PDFBox's
COSName class:

The class org.apache.pdfbox.cos.COSName, loaded by
java.net.FactoryURLClassLoader @ 0x725a1a230, occupies 151,183,360
(16.64%) bytes. The memory is accumulated in one instance of
java.util.concurrent.ConcurrentHashMap$Segment[] loaded by system
class loader.

and the Shortest Paths To the Accumulation Point graph for that
looks like this:

Class Name                                            Shallow Heap  Retained Heap

java.util.concurrent.ConcurrentHashMap$Segment[16]              80    151,160,680
segments java.util.concurrent.ConcurrentHashMap                 48    151,160,728
nameMap class org.apache.pdfbox.cos.COSName                  1,184    151,183,360
[123] java.lang.Object[2560]                                10,256     26,004,368
elementData java.util.Vector                                    32     26,004,400
classes java.net.FactoryURLClassLoader                          72     26,228,440
classloader class org.apache.pdfbox.cos.COSDocument              8              8
class org.apache.pdfbox.cos.COSDocument                         64      1,703,704
referent java.lang.ref.Finalizer                                40      1,703,744

And the Dominator Tree chart looks like this:

26.69% org.apache.solr.core.SolrCore
16.64% class org.apache.pdfbox.cos.COSName
2.89% java.net.FactoryURLClassLoader

Now the implementation of COSName says this:

 /**
  * Not usually needed except if resources need to be reclaimed in a long
  * running process.
  * Patch provided by fles...@gmail.com
  * incorporated 5/23/08, danielwil...@users.sourceforge.net
  */
 public static synchronized void clearResources()
 {
     // Clear them all
     nameMap.clear();
 }

I *don't* see a call to clearResources anywhere in Solr or Tika, and I
think that's the problem.  The implementation puts all the COSNames in
a class-level static map (nameMap, a ConcurrentHashMap according to the
heap dump), which never gets emptied and apparently keeps growing
forever.  I suspect the fact that the FactoryURLClassLoader is involved
in that graph to the COSName class is what's filling up the PermGen
space in the heap.
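
To make the suspected growth pattern concrete, here is a minimal,
self-contained sketch.  It is not PDFBox code: InternedName, of() and
cacheSize() are hypothetical names made up for illustration, and only
clearResources() mirrors the method quoted above.  It assumes each
document introduces names that were never seen before, which may or may
not match what our PDFs actually do.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a class that interns names in a static map. */
final class InternedName {
    // Class-level cache: every distinct name ever requested stays reachable
    // from here until someone clears the map.
    private static final Map<String, InternedName> NAME_MAP = new ConcurrentHashMap<>();

    private final String name;

    private InternedName(String name) {
        this.name = name;
    }

    /** Returns the cached instance for this name, creating it on first use. */
    static InternedName of(String name) {
        return NAME_MAP.computeIfAbsent(name, InternedName::new);
    }

    static int cacheSize() {
        return NAME_MAP.size();
    }

    /** Same idea as the clearResources() quoted above: drop every cached entry. */
    static void clearResources() {
        NAME_MAP.clear();
    }

    String getName() {
        return name;
    }

    public static void main(String[] args) {
        // Simulate indexing 1000 documents, each with 50 previously unseen names.
        for (int doc = 0; doc < 1000; doc++) {
            for (int i = 0; i < 50; i++) {
                InternedName.of("Font" + doc + "_" + i);
            }
        }
        System.out.println("entries after indexing: " + InternedName.cacheSize());

        InternedName.clearResources();
        System.out.println("entries after clear:    " + InternedName.cacheSize());
    }
}
```

Nothing in the loop ever removes an entry, so the map only shrinks when
clearResources() is called explicitly.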

Does that sound likely? Possible?  Can anyone speak to that? Anyone
have suggested next steps for us, besides restarting our solr indexer
process every couple of hours?


Re: memory leak in pdfbox--SolrCell needs to call COSName.clearResources?

2012-09-24 Thread Chris Hostetter

: We've been struggling with solr hangs in the solr process that indexes
: incoming PDF documents.  TLDR; summary is that I'm thinking that
: PDFBox needs to have COSName.clearResources() called on it if the solr
: indexer expects to be able to keep running indefinitely.  Is that

I don't know much about tika/pdfbox, but based on the details in your
email i think your assessment is correct.

Solr (and SolrCell) doesn't directly know about PDFBox at all -- that's all
handled under the covers by Tika.  So i suspect you'd need to file a Jira
with the Tika project to request that Tika somewhere/somehow call this
COSName.clearResources() method when using PDFBox -- although based on your
description, i'm not sure when/where this would make sense.

A few workarounds i can imagine:

1) if you do a SolrCore RELOAD all of the plugin classes will be
reloaded in a new ClassLoader (assuming you haven't embedded them directly
in the solr.war, or asked your servlet container to load them for you) ...
this would be marginally better than doing a full server restart.

2) if you are comfortable with java code, you could write a small
RequestHandler that did nothing more than call COSName.clearResources() on
each request -- you could then ping it on a regular basis, or register it
as part of a newSearcher QuerySenderListener to ensure it got called
automatically (or implement SolrEventListener directly and you could trigger
it on every commit).

3) heck: with the new ScriptUpdateProcessor in Solr 4.0, you could
write some javascript in your solrconfig.xml that would call this method
as part of the chain's processCommit() method.
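
For the "call it on a regular basis" idea, here is a rough sketch that
uses a plain java.util.concurrent scheduler instead of a Solr
RequestHandler or event listener, so it is a swapped-in technique and
not any of the three options above verbatim.  PeriodicCleaner and
completedRuns() are hypothetical names; the hook passed in would be
COSName::clearResources in the real indexer (that needs PDFBox on the
classpath, so the demo below uses a print statement).  Whether it is
safe to clear the map while other threads are parsing PDFs is
something i have not verified.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: runs a cleanup hook at a fixed delay on a daemon thread. */
final class PeriodicCleaner implements AutoCloseable {
    private final ScheduledExecutorService scheduler;
    private final AtomicInteger runs = new AtomicInteger();

    PeriodicCleaner(Runnable cleanup, long period, TimeUnit unit) {
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "periodic-cleaner");
            t.setDaemon(true); // do not keep the JVM alive just for cleanup
            return t;
        });
        scheduler.scheduleWithFixedDelay(() -> {
            try {
                cleanup.run();
                runs.incrementAndGet();
            } catch (RuntimeException e) {
                // Swallow the failure so one bad run does not cancel later runs.
                System.err.println("cleanup failed: " + e);
            }
        }, period, period, unit);
    }

    /** Number of cleanup runs that finished without throwing. */
    int completedRuns() {
        return runs.get();
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumption: in the indexer this hook would be COSName::clearResources.
        Runnable hook = () -> System.out.println("clearing name cache");
        try (PeriodicCleaner cleaner = new PeriodicCleaner(hook, 100, TimeUnit.MILLISECONDS)) {
            Thread.sleep(350);
            System.out.println("completed runs: " + cleaner.completedRuns());
        }
    }
}
```

A real deployment would use a much longer period (minutes, not
milliseconds) and would close the cleaner when the core shuts down.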

-Hoss