memory leak in PDFBox -- SolrCell needs to call COSName.clearResources?

2012-09-24 Thread Kevin Goess
We've been struggling with hangs in the Solr process that indexes our
incoming PDF documents.  The TL;DR summary is that I think PDFBox needs to
have COSName.clearResources() called on it if the Solr indexer is expected
to keep running indefinitely.  Is that likely?  Is anybody on this list
doing PDF extraction in a long-running process and having it work?

The thread dump of a hung process often shows lots of threads hanging on this:

java.lang.Thread.State: BLOCKED (on object monitor)
   at java.util.Collections$SynchronizedMap.get(Collections.java:1975)
   - waiting to lock 0x00072551f908 (a java.util.Collections$SynchronizedMap)
   at org.apache.pdfbox.util.PDFOperator.getOperator(PDFOperator.java:68)
   at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:441)

And the heap is almost full:

Heap
   PSYoungGen  total 796416K, used 386330K
eden space 398208K, 97% used
from space 398208K, 0% used
to   space 398208K, 0% used
   PSOldGen
object space 2389376K, 99% used
   PSPermGen
object space 53824K, 99% used

Using Eclipse's MAT to look at the heap dump of a hung process shows that
one of the chief memory-leak suspects is PDFBox's COSName class:

The class org.apache.pdfbox.cos.COSName, loaded by
java.net.FactoryURLClassLoader @ 0x725a1a230, occupies 151,183,360
(16.64%) bytes. The memory is accumulated in one instance of
java.util.concurrent.ConcurrentHashMap$Segment[] loaded by system
class loader.

and the Shortest Paths To the Accumulation Point graph for that
looks like this:

Class Name                                             Shallow Heap  Retained Heap

java.util.concurrent.ConcurrentHashMap$Segment[16]               80    151,160,680
segments java.util.concurrent.ConcurrentHashMap                  48    151,160,728
nameMap class org.apache.pdfbox.cos.COSName                   1,184    151,183,360
[123] java.lang.Object[2560]                                  10,256     26,004,368
elementData java.util.Vector                                      32     26,004,400
classes java.net.FactoryURLClassLoader                            72     26,228,440
classloader class org.apache.pdfbox.cos.COSDocument                8              8
class org.apache.pdfbox.cos.COSDocument                           64      1,703,704
referent java.lang.ref.Finalizer                                  40      1,703,744

And the Dominator Tree chart looks like this:

26.69% org.apache.solr.core.SolrCore
16.64% class org.apache.pdfbox.cos.COSName
2.89% java.net.FactoryURLClassLoader

Now the implementation of COSName says this:

 /**
  * Not usually needed except if resources need to be reclaimed in a long
  * running process.
  * Patch provided by fles...@gmail.com
  * incorporated 5/23/08, danielwil...@users.sourceforge.net
  */
 public static synchronized void clearResources()
 {
     // Clear them all
     nameMap.clear();
 }

I *don't* see a call to clearResources anywhere in Solr or Tika, and I
think that's the problem.  The implementation puts every COSName into a
class-level static map, which never gets emptied and apparently keeps
growing forever.  I also suspect that the URLClassLoader's appearance on
that path to the COSName class is what's filling up the PermGen space in
the heap.

Does that sound likely?  Possible?  Can anyone speak to that?  Does anyone
have suggested next steps for us, besides restarting our Solr indexer
process every couple of hours?
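
Just to make the idea concrete, here's roughly the kind of periodic cleanup
I'd expect something in the SolrCell/Tika stack (or anything that could be
hooked into the Solr JVM) to be doing.  This is only a sketch -- the
CosNameCleanup class and the hourly interval are made up -- but
COSName.clearResources() is the real PDFBox method quoted above:

 import java.util.concurrent.Executors;
 import java.util.concurrent.ScheduledExecutorService;
 import java.util.concurrent.TimeUnit;

 import org.apache.pdfbox.cos.COSName;

 public class CosNameCleanup
 {
     public static void start()
     {
         ScheduledExecutorService scheduler =
             Executors.newSingleThreadScheduledExecutor();
         scheduler.scheduleAtFixedRate(new Runnable()
         {
             public void run()
             {
                 // Empties COSName's class-level static nameMap so it
                 // can't keep growing for the life of the process.
                 COSName.clearResources();
             }
         }, 1, 1, TimeUnit.HOURS);
     }
 }

Whether it's actually safe to clear that map while other threads are in the
middle of parsing a PDF is a separate question -- I don't know the PDFBox
internals well enough to say.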


solr hanging, possible memory issues?

2012-09-21 Thread Kevin Goess
We're running Solr 3.4, a fairly out-of-the-box solr/jetty setup, with
-Xms1200m -Xmx3500m.  When we start pushing more than a couple of documents
per second at it (PDFs, which go through SolrCell/Tika/PDFBox), the Java
process hangs, becoming completely unresponsive.

We first thought it was a problem with PDFBox, since the thread dumps show
lots of blocked threads, but now we've also noticed that all the thread
dumps show a similar situation in the heap, where three of the areas are at
99%, even though we don't get any out-of-memory messages.

Heap
   PSYoungGen  total 796416K, used 386330K
eden space 398208K, 97% used
from space 398208K, 0% used
to   space 398208K, 0% used
   PSOldGen
object space 2389376K, 99% used
   PSPermGen
object space 53824K, 99% used

We've also just noticed that after restarting Solr, the PermGen space grows
steadily until it hits 99% and then just stays there.

1) Is that behavior of PermGen normal, growing steadily to 99% and then
staying there?  Apparently we can increase the PermGen space with
-XX:MaxPermSize=128M, but if there's a memory leak that only postpones the
problem (the startup line we're planning to try next is sketched below).

2) If all three of those indicators are pegged at 99%, I would think that
the JVM would throw an out-of-memory error rather than just withdrawing
into its own navel -- is that expected behavior, or is it indicative of
anything else?
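
For what it's worth, here's the sort of startup line we're planning to try
next.  These are all standard HotSpot/Jetty options; the sizes and paths are
just guesses for our box, not a recommendation:

 java -Xms1200m -Xmx3500m \
      -XX:MaxPermSize=128m \
      -XX:+HeapDumpOnOutOfMemoryError \
      -XX:HeapDumpPath=/var/log/solr \
      -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps \
      -Xloggc:/var/log/solr/gc.log \
      -jar start.jar

The heap-dump flags are there so that if it ever does throw OutOfMemoryError
we at least get something we can analyze afterwards.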

Any tips would be greatly appreciated, thanks!

-- 
Kevin M. Goess
Software Engineer
Berkeley Electronic Press
kgo...@bepress.com

510-665-1200 x179
www.bepress.com

bepress: sustainable scholarly publishing


solr hanging regularly, thread dump attached, PDFOperator

2012-08-30 Thread Kevin Goess
If we have more than half a dozen processes posting updates to our Solr
instance (each at about one document per second), we get regular and fairly
reproducible Solr hangs, requiring a kill -9 of the Solr process.  We're
running Solr 3.4, a fairly out-of-the-box solr/jetty setup.

I've got a thread dump and some GC logging information from the latest one.
 I haven't programmed in java for a decade, and any help decoding these
dumps would be hugely appreciated.

When I first was alerted that the process was hanging, this is the count of
what threads were doing what:

33   java.lang.Thread.State: BLOCKED (on object monitor)
 16   java.lang.Thread.State: RUNNABLE
  2   java.lang.Thread.State: TIMED_WAITING (on object monitor)
  1   java.lang.Thread.State: WAITING (on object monitor)
 77   java.lang.Thread.State: WAITING (parking)

A couple minutes later it looked like this:

29  java.lang.Thread.State: BLOCKED (on object monitor)
  8  java.lang.Thread.State: RUNNABLE
  1  java.lang.Thread.State: TIMED_WAITING (on object monitor)
  1  java.lang.Thread.State: TIMED_WAITING (parking)
  1  java.lang.Thread.State: TIMED_WAITING (sleeping)
  3  java.lang.Thread.State: WAITING (on object monitor)
  1  java.lang.Thread.State: WAITING (parking)
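
(Those counts were pulled out of the thread dump with something along these
lines -- not the exact command, just to show where the numbers come from:

 grep 'java.lang.Thread.State' threaddump.txt | sort | uniq -c | sort -rn

where threaddump.txt is the captured dump.)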

All the blocked threads looked like this:

java.lang.Thread.State: BLOCKED (on object monitor)
   at java.util.Collections$SynchronizedMap.get(Collections.java:1975)
   - waiting to lock 0x00072551f908 (a java.util.Collections$SynchronizedMap)
   at org.apache.pdfbox.util.PDFOperator.getOperator(PDFOperator.java:68)
   at org.apache.pdfbox.pdfparser.PDFStreamParser.parseNextToken(PDFStreamParser.java:441)
   at org.apache.pdfbox.pdfparser.PDFStreamParser.access$000(PDFStreamParser.java:46)
   at org.apache.pdfbox.pdfparser.PDFStreamParser$1.tryNext(PDFStreamParser.java:175)
   at org.apache.pdfbox.pdfparser.PDFStreamParser$1.hasNext(PDFStreamParser.java:187)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:266)
   at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:251)
   at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:225)
   at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:442)
   at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:366)
   at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:322)
   at org.apache.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:242)
   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:90)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
   at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:137)
   at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:213)
   ...

and yes, that 0x00072551f908 address is the same for each thread that's
blocked.
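
(To see which thread actually owns that monitor, the trick is to search the
same dump for the "locked" side of that address rather than the "waiting to
lock" side, e.g. something like

 grep -B 20 'locked.*0x00072551f908' threaddump.txt

which should show the one thread holding it while everyone else queues up.)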

In the first case, where there were 77 threads in WAITING (parking), they
all looked like this:

2012-08-30T20:23:55-0700  java.lang.Thread.State: WAITING (parking)
2012-08-30T20:23:55-0700    at sun.misc.Unsafe.park(Native Method)
2012-08-30T20:23:55-0700    - parking to wait for 0x000726089700 (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)
2012-08-30T20:23:55-0700    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
2012-08-30T20:23:55-0700    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
2012-08-30T20:23:55-0700    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:941)
2012-08-30T20:23:55-0700    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1261)
2012-08-30T20:23:55-0700    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:594)
2012-08-30T20:23:55-0700    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:214)
2012-08-30T20:23:55-0700    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:61)
2012-08-30T20:23:55-0700    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:115)
2012-08-30T20:23:55-0700    at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:158)
2012-08-30T20:23:55-0700    at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
2012-08-30T20:23:55-0700    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
...

and yes, they're all waiting for 0x000726089700
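
(If we capture the next dump with

 jstack -l <solr-pid> > threaddump-with-locks.txt

the "Locked ownable synchronizers" section that -l adds for each thread
should show which thread is actually holding that ReentrantReadWriteLock's
write lock -- presumably a commit -- while all the addDoc calls queue up on
the read lock.  I haven't verified that on our setup yet, so treat it as a
guess.)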


A gist of the first thread dump, with most of the repetition 

turning up logging using the web UI, can't get more than INFO

2012-08-24 Thread Kevin Goess
We have a pretty standard out-of-the-box solr/jetty setup.  Using the web
UI at /solr/admin/logging, setting WARNING or SEVERE gives us less logging,
but none of CONFIG, FINE, or FINEST results in any *more* logging than we
get at INFO.

Is there another place to look for something that might be controlling
that?  Maybe a threshold=INFO somewhere?  We've been unable to find
anything.

We're trying to turn up logging because our solr indexing server is hanging
at least once a day and we're trying to find out why.  It becomes
unresponsive and we have to kill -9 it.


Re: turning up logging using the web UI, can't get more than INFO

2012-08-24 Thread Kevin Goess
On Fri, Aug 24, 2012 at 10:23 AM, Ahmet Arslan iori...@yahoo.com wrote:

  We have a pretty standard out-of-the-box solr/jetty setup.  Using the web
  UI at /solr/admin/logging, setting WARNING or SEVERE gives us less
  logging, but none of CONFIG, FINE, or FINEST results in any *more*
  logging than we get at INFO.
 Taken from solr-trunk/solr/example/README.txt

 By default, Solr will log to the console. This can be convenient when
 first getting started, but eventually you will want to log to a file. To
 enable logging, you can just pass a system property to Jetty on startup:

 java -Djava.util.logging.config.file=etc/logging.properties -jar start.jar


Thanks, Ahmet.  We're actually fine capturing console logging to a file
right now.  But are you saying that the FINEST/FINE/CONFIG levels on the
web UI at /solr/admin/logging/ won't work without a
java.util.logging.config.file set up?
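
In case it helps anyone following along, here's the sort of minimal
etc/logging.properties we'd point that flag at.  This is just a guess at a
reasonable starting point -- the package names and levels are examples, not
something we've verified fixes the web-UI behavior:

 # Log everything through the console handler (Jetty captures stdout for us)
 handlers = java.util.logging.ConsoleHandler

 # The handler has its own threshold; if it stays at the default (INFO),
 # FINE/FINER/FINEST records are dropped no matter what the loggers are set to
 java.util.logging.ConsoleHandler.level = FINE

 # Default level for all loggers
 .level = INFO

 # Turn up just the Solr packages
 org.apache.solr.level = FINE

If the ConsoleHandler threshold really is the missing piece, that would
explain why flipping loggers to FINE in the web UI appears to do nothing.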




