Dmitry,

If you start to speak about logging, don't forget to mention that JDK logging
is really not performant, even though it is the default in 3.x. Logback is
much faster.

Peyman,
1. shingles have performance implications, i.e. they can cost a lot. Why
aren't term positions and phrase queries enough for you?
2. some time ago there was a similar thread caused by superfluous
shingling, so it's worth double-checking that you don't produce more than
you really need (Captain Obvious speaking)
3. when I have a performance problem, the first thing I do is run a profiler
or sampler
4. The way to look inside Lucene indexing is to enable infoStream; you'll
get a lot of info
5. are all of your CPU cores utilized? If they aren't, index with multiple
threads, it scales. Post several indexing requests in parallel (see the
first sketch after this list). Be aware that DIH doesn't work with multiple
threads yet, SOLR-3011.
6. Some time ago I needed huge throughput and fell into the trivial
producer-consumer trap. The indexing app (a slightly hacked DIH) pulled
data from jdbc, but during that time Solr indexing was idle; then it pushed
the constructed documents to Solr for indexing, but did so synchronously
and sat idle while Solr consumed them. As a result, the overall time was
equal to the sum of the producing and consuming times. So I organized an
async buffer and reduced the time to the maximum of the two (see the second
sketch after this list). Double check that you spend the maximum of
producing and consuming, not their sum. I used perf4j to trace those times.
7. As your data is huge you can try some cluster magic: spread your docs
across two Solr instances and then search them in parallel. SolrShards and
SolrCloud are for you, though I have never done it myself. If you don't want
to search in parallel, you can copy index shards between boxes to have a
full replica on each box, but I haven't heard of that working out of the box.
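
To make point 5 concrete, here is a minimal solrj sketch of batched,
multi-threaded posting with StreamingUpdateSolrServer (3.x). The URL, queue
size, thread count and field names are just placeholders for your own setup,
not a tested recipe.

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {

    public static void main(String[] args) throws Exception {
        // queueSize=100, threadCount=4: add() just enqueues, while 4
        // background threads keep posting to Solr in parallel.
        StreamingUpdateSolrServer server =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);

        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-" + i);             // hypothetical fields
            doc.addField("content", "body of document " + i);
            batch.add(doc);
            if (batch.size() == 1000) {                 // post in batches, not one by one
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            server.add(batch);
        }
        server.blockUntilFinished();                    // let the queue drain
        server.commit();                                // a single commit at the end
    }
}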
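
And for point 6, a rough sketch of the async buffer idea, assuming a
hypothetical fetchNextRow() that maps jdbc rows to documents. The only point
is that producing and consuming overlap, so the wall time is roughly
max(produce, consume) instead of their sum.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PipelinedIndexer {

    // Poison pill telling the consumer that the producer is done.
    private static final SolrInputDocument EOF = new SolrInputDocument();

    public static void index(final SolrServer solr) throws Exception {
        final BlockingQueue<SolrInputDocument> buffer =
                new ArrayBlockingQueue<SolrInputDocument>(1000);

        // Producer: keeps pulling rows from jdbc while the consumer posts.
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    SolrInputDocument doc;
                    while ((doc = fetchNextRow()) != null) {  // hypothetical jdbc reader
                        buffer.put(doc);
                    }
                    buffer.put(EOF);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();

        // Consumer: posts to Solr concurrently with the producer.
        SolrInputDocument doc;
        while ((doc = buffer.take()) != EOF) {
            solr.add(doc);
        }
        solr.commit();
        producer.join();
    }

    // Placeholder for the real jdbc row-to-document mapping.
    private static SolrInputDocument fetchNextRow() {
        return null; // return null when the result set is exhausted
    }
}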

Regards

On Sun, Mar 11, 2012 at 7:27 PM, Dmitry Kan <dmitry....@gmail.com> wrote:

> one approach we have taken is to decrease the solr logging level for
> the posting session, described here (implemented for 1.4, but should
> be easy to port to 3.x):
>
> http://dmitrykan.blogspot.com/2011/01/solr-speed-up-batch-posting.html
>
> On 3/11/12, Yandong Yao <yydz...@gmail.com> wrote:
> > I have similar issues when using DIH,
> > and org.apache.solr.update.DirectUpdateHandler2.addDoc(AddUpdateCommand)
> > consumes most of the time when indexing 10K rows (each row is about 70K):
> >     -  DIH nextRow takes about 10 seconds in total
> >     -  If the index uses a whitespace tokenizer and lower case filter,
> > then the addDoc() method takes about 80 seconds
> >     -  If the index uses a whitespace tokenizer, lower case filter and
> > WDF, then addDoc takes about 112 seconds
> >     -  If the index uses a whitespace tokenizer, lower case filter, WDF
> > and porter stemmer, then addDoc takes about 145 seconds
> >
> > We have more than a million rows in total, and I am wondering whether I
> > am using something wrong or whether there is any way to improve the
> > performance of addDoc()?
> >
> > Thanks very much in advance!
> >
> >
> > Following is the configuration:
> > 1) JVM:  -Xms256M -Xmx1048M -XX:MaxPermSize=512m
> > 2) Solr version 3.5
> > 3) solrconfig.xml  (almost copied from solr's  example/solr directory.)
> >
> >   <indexDefaults>
> >
> >     <useCompoundFile>false</useCompoundFile>
> >
> >     <mergeFactor>10</mergeFactor>
> >     <!-- Sets the amount of RAM that may be used by Lucene indexing
> >          for buffering added documents and deletions before they are
> >          flushed to the Directory.  -->
> >     <ramBufferSizeMB>64</ramBufferSizeMB>
> >     <!-- If both ramBufferSizeMB and maxBufferedDocs is set, then
> >          Lucene will flush based on whichever limit is hit first.
> >       -->
> >     <!-- <maxBufferedDocs>1000</maxBufferedDocs> -->
> >
> >     <maxFieldLength>2147483647</maxFieldLength>
> >     <writeLockTimeout>1000</writeLockTimeout>
> >     <commitLockTimeout>10000</commitLockTimeout>
> >
> >     <lockType>native</lockType>
> >   </indexDefaults>
> >
> > 2012/3/11 Peyman Faratin <pey...@robustlinks.com>
> >
> >> Hi
> >>
> >> I am trying to index 12MM docs faster than is currently happening in
> >> Solr (using solrj). We have identified solr's add method as the
> >> bottleneck (and not commit, which is tuned ok through mergeFactor,
> >> maxRamBufferSize and jvm ram).
> >>
> >> Adding 1000 docs is taking approximately 25 seconds. We are making sure
> >> we add and commit in batches. And we've tried both CommonsHttpSolrServer
> >> and EmbeddedSolrServer (assuming removing http overhead would speed
> >> things up with embedding) but the difference is marginal.
> >>
> >> The docs being indexed are on average 20 fields long, mostly indexed but
> >> none stored. The major size contributors are two fields:
> >>
> >>        - content, and
> >>        - shingledContent (populated using copyField of content).
> >>
> >> The length of the content field is (likely) Gaussian distributed (a few
> >> large docs of 50-80K tokens, but the majority around 2K tokens). We use
> >> shingledContent to support phrase queries and content for unigram
> >> queries (following the advice of the Solr Enterprise Search Server book,
> >> p. 305, section "The Solution: Shingling").
> >>
> >> Clearly the size of the docs is a contributor to the slow adds
> >> (confirmed by removing these 2 fields, which halved the indexing time).
> >> We've also tried compressed=true but that is not working.
> >>
> >> Any guidance on how to support our application logic (without having to
> >> change the schema too much) and speed up indexing (from the current 212
> >> days for 12MM docs) would be much appreciated.
> >>
> >> thank you
> >>
> >> Peyman
> >>
> >>
> >
>
>
> --
> Regards,
>
> Dmitry Kan
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics

<http://www.griddynamics.com>
 <mkhlud...@griddynamics.com>
