RE: solr 3.5 and indexing performance

Agnieszka Kukałowicz Tue, 13 Mar 2012 08:43:01 -0700

Hi,

I ran some more tests with Hunspell on Solr 3.4 and 4.0:


Solr 3.4, full import 489017 documents:

StempelPolishStemFilterFactory - 2908 seconds, 168 docs/sec
HunspellStemFilterFactory - 3922 seconds, 125 docs/sec

Solr 4.0, full import 489017 documents:

StempelPolishStemFilterFactory - 3016 seconds, 162 docs/sec
HunspellStemFilterFactory - 44580 seconds (more than 12 hours), 11 docs/sec

Server specification and Java settings are the same as before.
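
For reference, the docs/sec figures above are simply the document count divided by the import time. A minimal Python sketch of that arithmetic, using only the numbers reported in this message (the names DOCS, timings and docs_per_sec are made up for illustration):

```python
# Reproduce the docs/sec figures from the timings reported above.
DOCS = 489017  # documents per full import (from the message)

# (Solr version, stem filter) -> full-import time in seconds (from the message)
timings = {
    ("3.4", "StempelPolishStemFilterFactory"): 2908,
    ("3.4", "HunspellStemFilterFactory"): 3922,
    ("4.0", "StempelPolishStemFilterFactory"): 3016,
    ("4.0", "HunspellStemFilterFactory"): 44580,
}


def docs_per_sec(docs, seconds):
    """Average indexing throughput over a full import."""
    return docs / seconds


rates = {key: docs_per_sec(DOCS, secs) for key, secs in timings.items()}

for (version, stemmer), rate in rates.items():
    print(f"Solr {version} {stemmer}: {rate:.0f} docs/sec")

# How much longer the Hunspell import takes on 4.0 than on 3.4.
slowdown = (timings[("4.0", "HunspellStemFilterFactory")]
            / timings[("3.4", "HunspellStemFilterFactory")])
print(f"Hunspell import time, 4.0 vs 3.4: {slowdown:.1f}x")
```

The same ratio applied to the Stempel timings stays close to 1, which suggests the slowdown is specific to the Hunspell filter rather than to Solr 4.0 indexing in general.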

Cheers
Agnieszka


> -----Original Message-----
> From: Agnieszka Kukałowicz [mailto:agnieszka.kukalow...@usable.pl]
> Sent: Tuesday, March 13, 2012 10:39 AM
> To: 'solr-user@lucene.apache.org'
> Subject: RE: solr 3.5 and indexing performance
>
> Hi,
>
> Yes, I confirmed that without Hunspell indexing runs at normal speed.
> I ran tests in Solr 4.0 with Hunspell and with the Polish stemmer.
> With StempelPolishStemFilterFactory the speed is normal.
>
> My schema is quite simple. For Hunspell I have one text field that 14
> text fields are copied to:
>
> <field name="text" type="text_pl_hunspell" indexed="true" stored="false" multiValued="true"/>
>
>
> <copyField source="field1" dest="text"/>
> <copyField source="field2" dest="text"/>
> <copyField source="field3" dest="text"/>
> <copyField source="field4" dest="text"/>
> <copyField source="field5" dest="text"/>
> <copyField source="field6" dest="text"/>
> <copyField source="field7" dest="text"/>
> <copyField source="field8" dest="text"/>
> <copyField source="field9" dest="text"/>
> <copyField source="field10" dest="text"/>
> <copyField source="field11" dest="text"/>
> <copyField source="field12" dest="text"/>
> <copyField source="field13" dest="text"/>
> <copyField source="field14" dest="text"/>
>
> The "text_pl_hunspell" configuration:
>
> <fieldType name="text_pl_hunspell" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="dict/stopwords_pl.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.HunspellStemFilterFactory"
>                 dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
>         <!--filter class="solr.KeywordMarkerFilterFactory"
> protected="protwords_pl.txt"/-->
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory"
> synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="dict/stopwords_pl.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.HunspellStemFilterFactory"
>                 dictionary="dict/pl_PL.dic" affix="dict/pl_PL.aff" ignoreCase="true"/>
>         <filter class="solr.KeywordMarkerFilterFactory"
> protected="dict/protwords_pl.txt"/>
>       </analyzer>
>     </fieldType>
>
> I use the Polish dictionary files pl_PL.dic and pl_PL.aff (the files
> stopwords_pl.txt, protwords_pl.txt and synonyms_pl.txt are empty).
> These are the same files I used with version 3.4.
>
> For the Polish stemmer, the only difference is in the definition of the
> text field:
>
> <field name="text" type="text_pl" indexed="true" stored="false" multiValued="true"/>
>
>     <fieldType name="text_pl" class="solr.TextField"
> positionIncrementGap="100">
>       <analyzer type="index">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="dict/stopwords_pl.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StempelPolishStemFilterFactory"/>
>         <filter class="solr.KeywordMarkerFilterFactory"
> protected="dict/protwords_pl.txt"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.StandardTokenizerFactory"/>
>         <filter class="solr.SynonymFilterFactory"
> synonyms="dict/synonyms_pl.txt" ignoreCase="true" expand="true"/>
>         <filter class="solr.StopFilterFactory"
>                 ignoreCase="true"
>                 words="dict/stopwords_pl.txt"
>                 enablePositionIncrements="true"
>                 />
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.StempelPolishStemFilterFactory"/>
>         <filter class="solr.KeywordMarkerFilterFactory"
> protected="dict/protwords_pl.txt"/>
>       </analyzer>
>     </fieldType>
>
> One document has 23 fields:
> - 14 text fields copied to one text field (above) that is only indexed
> - 8 other indexed fields (2 strings, 2 tdates, 3 tint, 1 tfloat)
>
> The size of one document is 3-4 kB.
> So I think this is not a very complicated schema.
>
> My environment is:
> - Linux, RedHat 6.2, kernel 2.6.32
> - 2 physical CPU Xeon 5606 (4 cores each)
> - 32 GB RAM
> - 2 SSD disks in RAID 0
> - java version:
>
> java -version
> java version "1.6.0_26"
> Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
>
> - Java is running with -server -Xms4096M -Xmx4096M (I tried a lot of
> other settings and always got the same result)
> - Solr has the default configuration except for ramBufferSizeMB (set to 128 MB)
> - Solr 4.0 is from the nightly builds (2012-02-21 build).
>
> If you need more information, please let me know.
> I will also try to use a profiler to see what happens.
>
> Agnieszka
>
>
> > -----Original Message-----
> > From: Jan Høydahl [mailto:jan....@cominvent.com]
> > Sent: Tuesday, March 13, 2012 9:47 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: solr 3.5 and indexing performance
> >
> > Hi,
> >
> > Have you confirmed that disabling Hunspell in solrconfig gets you back
> > to normal speed?
> > What Hunspell configuration and dictionaries do you have?
> > Can you share more about your environment and documents?
> > Do you have a chance to run a profiler on your Solr instance? Try e.g.
> > VisualVM and run the profiler to see what part of the code takes up
> > the time:
> > http://docs.oracle.com/javase/6/docs/technotes/tools/share/jvisualvm.html
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> > Solr Training - www.solrtraining.com
> >
> > On 12. mars 2012, at 16:42, Agnieszka Kukałowicz wrote:
> >
> > > Hi guys,
> > >
> > > I have hit the same problem with Hunspell.
> > > Running a few tests on 500,000 documents, I got:
> > >
> > > Hunspell from http://code.google.com/p/lucene-hunspell/ with
> > > version 3.4 - 125 documents per second
> > > Hunspell built from 4.0 trunk - 11 documents per second
> > >
> > > All the tests were made on 8 core CPU with 32 GB RAM and index on
> > > SSD disks.
> > > For Solr 3.5 I tried changing the JVM heap size, ramBufferSizeMB and
> > > mergeFactor, but the indexing speed stayed at about 10-20 documents
> > > per second.
> > >
> > > Is it possible that there is a performance bug in Solr 4.0?
> > > According to the previous post, the problem also exists in version 3.5.
> > >
> > > Best regards
> > > Agnieszka Kukałowicz
> > >
> > >
> > >> -----Original Message-----
> > >> From: mizayah [mailto:miza...@gmail.com]
> > >> Sent: Thursday, February 23, 2012 10:19 AM
> > >> To: solr-user@lucene.apache.org
> > >> Subject: Re: solr 3.5 and indexing performance
> > >>
> > >> OK, I found it.
> > >>
> > >> It's because of Hunspell, which is now included in Solr. Somehow,
> > >> when I use it on my own in 3.4, it is a lot faster than the one
> > >> from 3.5.
> > >>
> > >> I don't know about the differences, but is there any way I can use
> > >> my old Google Hunspell jar?
> > >>
> > >> --
> > >> View this message in context:
> > >> http://lucene.472066.n3.nabble.com/solr-3-5-and-indexing-performance-tp3766653p3769139.html
> > >> Sent from the Solr - User mailing list archive at Nabble.com.
