Lucene OOM

2018-01-10 Thread dawn breaks
Hi, all
  We have a search engine service built with Lucene 4.7. It seems that
Lucene uses too much memory. We have approximately 10 million documents, and
the index size on disk is approximately 750G. My question is: why do the
FST$Arc objects consume so much memory? Please refer to the following jmap
histogram. I hope somebody can give me some suggestions.

 num     #instances         #bytes  class name
----------------------------------------------
   1:       4346283    2294837424  [Lorg.apache.lucene.util.fst.FST$Arc;
   2:      25918804    2023475632  [C
   3:      17450041    1014051416  [B
   4:      25878734     621089616  java.lang.String
   5:      18634803     596313696  java.util.HashMap$Node
   6:      14039862     561594480  java.util.TreeMap$Entry
   7:       4346283     452013432  org.apache.lucene.util.fst.FST
   8:       4522836     424741520  [Ljava.util.HashMap$Node;
   9:       4346283     347702640  org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader
  10:       4683616     337220352  org.apache.lucene.util.fst.FST$Arc
  11:      12947467     310739208  org.apache.lucene.util.BytesRef
  12:        790283     280383040  [J
  13:       4359111     245496264  [Ljava.lang.Object;
  14:       4545337     218176176  java.util.HashMap
  15:       4510384     216498432  org.apache.lucene.index.FieldInfo
  16:       4359066     199713232  [I
  17:       4346283     173851320  org.apache.lucene.util.fst.BytesStore
  18:       4510400     144332800  java.util.Collections$UnmodifiableMap
  19:       4354347     104504328  java.util.ArrayList
  20:       5736589      91785424  java.lang.Integer
  21:        822685      59233320  org.apache.lucene.codecs.lucene45.Lucene45DocValuesProducer$NumericEntry
  22:        428313      13706016  org.apache.lucene.facet.taxonomy.writercache.CollisionMap$Entry
  23:        420547      13457504  org.wltea.analyzer.dic.DictSegment
  24:        177039       5665248  [Lorg.wltea.analyzer.dic.DictSegment;
  25:            20       5112128  [Lorg.apache.lucene.facet.taxonomy.writercache.CollisionMap$Entry;
  26:         42454       2377424  org.apache.lucene.store.RAMInputStream
  27:         50054       2002160  org.apache.lucene.util.packed.Packed64
  28:         44036       1761440  org.apache.lucene.util.packed.DirectPackedReader
  29:         33013       1056416  java.util.concurrent.ConcurrentHashMap$Node
  30:         43957       1054968  org.apache.lucene.codecs.lucene45.Lucene45DocValuesProducer$2




Thanks & Best Regards!
lubin
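
One way to read the histogram above is to divide #bytes by #instances for each class. The following is only a minimal sketch in Python and is not part of the original post. The helper names parse_histo and bytes_per_instance are made up for illustration, and only four of the thirty rows are copied in.

```python
import re

# Four rows copied from the jmap histogram above: rank, instances, bytes, class.
HISTO = """\
   1:       4346283    2294837424  [Lorg.apache.lucene.util.fst.FST$Arc;
   7:       4346283     452013432  org.apache.lucene.util.fst.FST
   9:       4346283     347702640  org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader
  17:       4346283     173851320  org.apache.lucene.util.fst.BytesStore
"""

# One histogram row: "<rank>: <instances> <bytes> <class name>".
ROW = re.compile(r"^\s*(\d+):\s+(\d+)\s+(\d+)\s+(\S+)\s*$")


def parse_histo(text):
    """Return a list of (class_name, instances, total_bytes) tuples."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((match.group(4), int(match.group(2)), int(match.group(3))))
    return rows


def bytes_per_instance(rows):
    """Map class name to the average shallow size of one instance, in bytes."""
    return {name: total / count for name, count, total in rows if count}


if __name__ == "__main__":
    for name, avg in bytes_per_instance(parse_histo(HISTO)).items():
        print(f"{avg:8.1f} B/instance  {name}")
```

For these four rows the averages come out at 528, 104, 80 and 40 bytes per instance, and all four classes share the same instance count (4346283). The large totals therefore reflect the number of instances rather than the size of any single one.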


Re: Lucene OOM

2018-01-11 Thread dawn breaks
nly
> have 10 million documents!
>
> Are those documents huge and have lots of indexed text content, possibly
> OCR/scanned stuff? If this is the case, the term dictionary may get huge
> because of many terms with incorrect spelling.
>
> Please also give us a "ls -lh" of your index directory to make a guess.
>
> Uwe
>
> -
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>


Re: Lucene OOM

2018-01-11 Thread dawn breaks
Hi, Uwe
  Yes, all indexes are running in the same JVM with 14 GiB of heap space, but
the JVM heap usage is up to 95%. I am sure that all
IndexReaders/IndexSearchers have been closed properly.
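
To put the heap figure in context, the byte totals of the FST-related rows in the jmap histogram from the first message can be summed and compared with the 14 GiB heap. This is only a rough sketch in Python, not something posted in the thread. Which classes count as "FST-related" is an assumption made here, and heap_share is a hypothetical helper. A plain jmap -histo can also count unreachable objects that have not been collected yet, so the result may overstate live usage.

```python
GIB = 1024 ** 3

# Total bytes per class, copied from the jmap histogram in the first message.
# Grouping these five classes together is an assumption made for this sketch.
FST_RELATED_BYTES = {
    "[Lorg.apache.lucene.util.fst.FST$Arc;": 2294837424,
    "org.apache.lucene.util.fst.FST": 452013432,
    "org.apache.lucene.codecs.BlockTreeTermsReader$FieldReader": 347702640,
    "org.apache.lucene.util.fst.FST$Arc": 337220352,
    "org.apache.lucene.util.fst.BytesStore": 173851320,
}


def heap_share(byte_counts, heap_bytes):
    """Return (total bytes, fraction of the heap) for the given classes."""
    total = sum(byte_counts.values())
    return total, total / heap_bytes


if __name__ == "__main__":
    total, share = heap_share(FST_RELATED_BYTES, 14 * GIB)
    print(f"{total / GIB:.2f} GiB, {share:.1%} of a 14 GiB heap")
```

With the numbers above this gives about 3.36 GiB, roughly 24% of the heap.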



On 11 January 2018 at 20:46, Uwe Schindler  wrote:

> Hi lubin,
>
> the terms dictionary is using the "tim" and "tip" files. It should be
> approximately in the dimension of the FST.
>
> Do you have all indexes running in the same JVM or is it 10 servers?
> Because then the numbers look correct. If you really want to have such a
> large index on a single machine using a single JVM, you should plan for
> more heap space. I'd start with 12 GiB of heap space to run this index.
>
> A last recommendation: If you update your index during runtime, make sure
> that you correctly close the outdated IndexReaders/IndexSearchers (e.g.
> using SearcherManager), so you don't have orphaned instances of IndexReader
> consuming heap space and disk space, because the files can't be fully
> deleted as long as those are open!
>
> Uwe
>
> -
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
> > -Original Message-
> > From: dawn breaks [mailto:2005dawnbre...@gmail.com]
> > Sent: Thursday, January 11, 2018 10:22 AM
> > To: java-user@lucene.apache.org
> > Subject: Re: Lucene OOM
> >
> > Hi, Uwe
> >   Thanks for your timely reply. Yes, those documents are huge text. We
> > have ten indices, and each of them is approximately 75G on disk.
> > Following is the directory content of one of the indices.
> >
> > Thanks & Best Regards!
> > lubin
> >
> > total 74G
> > -rw-r--r-- 1 root root  100K Jan 10 16:11 _2ncr_4k.del
> > -rw-r--r-- 1 root root  1.1G Aug  4 12:52 _2ncr.fdt
> > -rw-r--r-- 1 root root  468K Aug  4 12:52 _2ncr.fdx
> > -rw-r--r-- 1 root root  636K Aug  4 12:55 _2ncr.fnm
> > -rw-r--r-- 1 root root  398M Aug  4 12:53 _2ncr_Lucene41_0.doc
> > -rw-r--r-- 1 root root  712M Aug  4 12:53 _2ncr_Lucene41_0.pay
> > -rw-r--r-- 1 root root  744M Aug  4 12:53 _2ncr_Lucene41_0.pos
> > -rw-r--r-- 1 root root  129M Aug  4 12:53 _2ncr_Lucene41_0.tim
> > -rw-r--r-- 1 root root  3.1M Aug  4 12:53 _2ncr_Lucene41_0.tip
> > -rw-r--r-- 1 root root  822M Aug  4 12:54 _2ncr_Lucene45_0.dvd
> > -rw-r--r-- 1 root root  210K Aug  4 12:54 _2ncr_Lucene45_0.dvm
> > -rw-r--r-- 1 root root   540 Aug  4 12:55 _2ncr.si
> > -rw-r--r-- 1 root root  1.5G Aug  4 12:55 _2ncr.tvd
> > -rw-r--r-- 1 root root  441K Aug  4 12:55 _2ncr.tvx
> > -rw-r--r-- 1 root root   98K Jan 11 11:43 _555c_5x.del
> > -rw-r--r-- 1 root root  1.1G Aug 25 12:51 _555c.fdt
> > -rw-r--r-- 1 root root  457K Aug 25 12:51 _555c.fdx
> > -rw-r--r-- 1 root root  872K Aug 25 12:54 _555c.fnm
> > -rw-r--r-- 1 root root  389M Aug 25 12:52 _555c_Lucene41_0.doc
> > -rw-r--r-- 1 root root  713M Aug 25 12:52 _555c_Lucene41_0.pay
> > -rw-r--r-- 1 root root  750M Aug 25 12:52 _555c_Lucene41_0.pos
> > -rw-r--r-- 1 root root  136M Aug 25 12:52 _555c_Lucene41_0.tim
> > -rw-r--r-- 1 root root  3.2M Aug 25 12:52 _555c_Lucene41_0.tip
> > -rw-r--r-- 1 root root  1.1G Aug 25 12:53 _555c_Lucene45_0.dvd
> > -rw-r--r-- 1 root root  442K Aug 25 12:53 _555c_Lucene45_0.dvm
> > -rw-r--r-- 1 root root   540 Aug 25 12:54 _555c.si
> > -rw-r--r-- 1 root root  1.4G Aug 25 12:54 _555c.tvd
> > -rw-r--r-- 1 root root  422K Aug 25 12:54 _555c.tvx
> > -rw-r--r-- 1 root root   93K Jan 10 16:38 _790n_5s.del
> > -rw-r--r-- 1 root root  1.1G Sep  6 14:17 _790n.fdt
> > -rw-r--r-- 1 root root  438K Sep  6 14:17 _790n.fdx
> > -rw-r--r-- 1 root root  1.1M Sep  6 14:20 _790n.fnm
> > -rw-r--r-- 1 root root  380M Sep  6 14:18 _790n_Lucene41_0.doc
> > -rw-r--r-- 1 root root  684M Sep  6 14:18 _790n_Lucene41_0.pay
> > -rw-r--r-- 1 root root  746M Sep  6 14:18 _790n_Lucene41_0.pos
> > -rw-r--r-- 1 root root  141M Sep  6 14:18 _790n_Lucene41_0.tim
> > -rw-r--r-- 1 root root  3.5M Sep  6 14:18 _790n_Lucene41_0.tip
> > -rw-r--r-- 1 root root  1.2G Sep  6 14:20 _790n_Lucene45_0.dvd
> > -rw-r--r-- 1 root root  550K Sep  6 14:20 _790n_Lucene45_0.dvm
> > -rw-r--r-- 1 root root   540 Sep  6 14:20 _790n.si
> > -rw-r--r-- 1 root root  1.4G Sep  6 14:20 _790n.tvd
> > -rw-r--r-- 1 root root  412K Sep  6 14:20 _790n.tvx
> > -rw-r--r-- 1 root root   82K Jan 10 16:38 _bv18_8d.del
> > -rw-r--r-- 1 root root  1.1G Oct 10 12:17 _bv18.fdt
> > -rw-r--r-- 1 root root  425K Oct 10 12:17 _bv18.fdx
> > -rw-r--r-- 1 root root  1.4M Oct 10 12:20 _bv18.fnm
> > -rw-r--r-- 1 root root  363M Oct 10 12:18 _bv18_Lucene41_0.doc
&g