Does anybody with knowledge of solr internals know why I'm seeing instances of Lucene42DocValuesProducer when I don't have any fields that are using DocValues? Or am I misunderstanding what this class is for?
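For context, the quoted message below describes routing documents with the same clientId to the same shard via composite routing. A minimal, self-contained sketch of how such a composite id is typically formed (the `buildCompositeId` helper is illustrative, not a SolrJ API; SolrCloud's CompositeIdRouter hashes the portion before the `!` so documents sharing a prefix land on the same shard):

```java
// Sketch of composite-ID routing as described below. Assumption: ids of the
// form "<clientId>!<docId>"; the router hashes the prefix before '!', so all
// documents for one clientId map to the same shard.
public class CompositeIdExample {

    // Illustrative helper, not part of SolrJ.
    static String buildCompositeId(String clientId, String docId) {
        return clientId + "!" + docId;
    }

    public static void main(String[] args) {
        // With SolrJ (not shown), this value would go into the "id" field of a
        // SolrInputDocument before calling client.add(doc).
        String id = buildCompositeId("client42", "doc-7");
        System.out.println(id); // client42!doc-7
    }
}
```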
-Greg

On Mon, Dec 23, 2013 at 12:07 PM, Greg Preston <gpres...@marinsoftware.com> wrote:
> Hello,
>
> I'm loading up our solr cloud with data (from a solrj client) and running into a weird memory issue. I can reliably reproduce the problem.
>
> - Using Solr Cloud 4.4.0 (also reproduced with 4.6.0)
> - 24 solr nodes (one shard each), spread across 3 physical hosts; each host has 256G of memory
> - index and tlogs on ssd
> - Xmx=7G, G1GC
> - Java 1.7.0_25
> - schema and solrconfig.xml attached
>
> I'm using composite routing to route documents with the same clientId to the same shard. After several hours of indexing, I occasionally see an IndexWriter go OOM. I think that's a symptom. When that happens, indexing continues, and that node's tlog starts to grow. When I notice this, I stop indexing and bounce the problem node. That's where it gets interesting.
>
> Upon bouncing, the tlog replays, and then segments merge. Once the merging is complete, the heap is fairly full, and a forced full GC only helps a little. But if I then bounce the node again, the heap usage goes way down, and stays low until the next segment merge. I believe segment merges are also what cause the original OOM.
>
> More details:
>
> Index on disk for this node is ~13G, tlog is ~2.5G.
> See attached mem1.png. This is a jconsole view of the heap during the following:
>
> (Solr cloud node started at the left edge of this graph)
>
> A) One CPU core pegged at 100%.
> Thread dump shows:
> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800 nid=0x7a74 runnable [0x00007f5a41c5f000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.lucene.util.fst.Builder.add(Builder.java:397)
>         at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finishTerm(BlockTreeTermsWriter.java:1000)
>         at org.apache.lucene.codecs.TermsConsumer.merge(TermsConsumer.java:112)
>         at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:72)
>         at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:365)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> B) One CPU core pegged at 100%. Manually triggered GC. Lots of memory freed.
> Thread dump shows:
> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800 nid=0x7a74 runnable [0x00007f5a41c5f000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:144)
>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
>         at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
>         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> C) One CPU core pegged at 100%. Manually triggered GC. No memory freed.
> Thread dump shows:
> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800 nid=0x7a74 runnable [0x00007f5a41c5f000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.lucene.codecs.DocValuesConsumer$1$1.hasNext(DocValuesConsumer.java:127)
>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:108)
>         at org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
>         at org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
>         at org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> D) One CPU core pegged at 100%.
> Thread dump shows:
> "Lucene Merge Thread #0" daemon prio=10 tid=0x00007f5a3c064800 nid=0x7a74 runnable [0x00007f5a41c5f000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsReader.get(CompressingTermVectorsReader.java:322)
>         at org.apache.lucene.index.SegmentReader.getTermVectors(SegmentReader.java:169)
>         at org.apache.lucene.codecs.compressing.CompressingTermVectorsWriter.merge(CompressingTermVectorsWriter.java:789)
>         at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:312)
>         at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:130)
>         at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:3772)
>         at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3376)
>         at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
>         at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
>
> E) CPU usage drops to nominal levels. Thread dump shows no Lucene Merge Thread.
>
> F) Manually triggered full GC. Some memory freed, but much remains.
>
> G) Restarted solr. Very little memory used.
>
> Throughout all of this, there was no indexing or querying happening. In order to try to determine what's using up the memory, I took a heap dump at point (F) and analyzed it in Eclipse MAT (see attached screenshot). This shows 311 instances of Lucene42DocValuesProducer$3, each holding a large byte[]. By attaching a remote debugger and re-running, it looks like there is one of these byte[] for each field in the schema (we have several of the "dim_*" dynamic fields). And as far as I know, I'm not using DocValues at all.
>
> Any hints as to what might be going on here would be greatly appreciated. It takes me about 10 minutes to reproduce this, so I'm willing to try things.
> I don't know enough about the internals of solr's memory usage to proceed much further on my own.
>
> Thank you.
>
> -Greg
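For anyone reproducing the heap behavior described above outside of jconsole: the "manually triggered GC" steps can also be observed programmatically with the standard JDK management beans. A small sketch (generic `java.lang.management` API, not Solr-specific; `MemoryMXBean.gc()` is only a request, like jconsole's "Perform GC" button):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Generic JDK sketch: print heap usage before and after requesting a full GC,
// roughly what the jconsole "Perform GC" button does in the steps above.
public class HeapCheck {
    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();

        MemoryUsage before = mem.getHeapMemoryUsage();
        System.out.printf("used before GC: %d MB%n", before.getUsed() >> 20);

        mem.gc(); // request a full GC; the JVM may or may not honor it fully

        MemoryUsage after = mem.getHeapMemoryUsage();
        System.out.printf("used after GC:  %d MB%n", after.getUsed() >> 20);
    }
}
```

Comparing the before/after numbers across points (B), (C), and (F) above would quantify how much of the heap is actually reclaimable at each stage.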