Re: Best Indexing Approaches - To max the throughput

Walter Underwood Tue, 06 Oct 2015 10:17:34 -0700

This is at Chegg. One of our indexes is textbooks. These are expensive and 
don’t change very often. It is better to keep yesterday’s index than to drop a 
few important books.


We have occasionally had an error that happens with every book, like a new 
field that is not in the Solr schema. If we ignored errors in that case, we’d 
have an empty index: delete all, add all (failing), commit.

With the fail fast and rollback, we can catch problems before they mess up the 
index.
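
A minimal sketch of that fail-fast flow, in Python. This is not our production 
code. FakeClient and its method names (delete_all, add, commit, rollback) are 
made up for illustration and are not a real Solr client API. The sketch only 
assumes that a rollback discards everything since the last commit.

```python
class FakeClient:
    """In-memory stand-in for an index client (hypothetical, not a Solr API)."""

    def __init__(self, committed, schema_fields):
        self.committed = dict(committed)   # what searchers currently see
        self.pending = dict(committed)     # uncommitted working state
        self.schema_fields = set(schema_fields)

    def delete_all(self):
        self.pending = {}

    def add(self, docs):
        for doc in docs:
            unknown = set(doc) - self.schema_fields
            if unknown:
                raise ValueError(
                    f"unknown field(s) {sorted(unknown)} in doc {doc['id']}")
            self.pending[doc["id"]] = doc

    def commit(self):
        self.committed = dict(self.pending)

    def rollback(self):
        # Discard everything done since the last commit.
        self.pending = dict(self.committed)


def rebuild_index(client, batches):
    """Delete all, add all, commit. Roll back on the first failure."""
    try:
        client.delete_all()
        for batch in batches:
            client.add(batch)          # any exception aborts the rebuild
    except Exception as exc:
        client.rollback()              # yesterday's index stays in place
        print(f"rebuild aborted, previous index kept: {exc}")  # "notify"
        return False
    client.commit()
    return True
```

With a document carrying a field that is not in schema_fields, rebuild_index 
returns False and the committed documents are unchanged.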

Also, to pinpoint isolated problems, if there is an error in a batch, our 
indexer re-submits that batch one document at a time, so we get an accurate 
report of which document was rejected. I wrote that same thing back at 
Netflix, before SolrJ.
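
Roughly, the one-at-a-time fallback looks like the Python sketch below. The 
submit callable and demo_submit are placeholders I made up for whatever sends 
documents to the index. The sketch assumes that re-sending a document that was 
already accepted is harmless (for example, because documents with the same id 
overwrite each other).

```python
def find_rejected(batch, submit):
    """Send a batch; on error, resend one document at a time.

    Returns a list of (document, error message) pairs for rejected documents.
    `submit` is any callable that takes a list of documents and raises on
    failure (a hypothetical stand-in for the real indexing call).
    """
    try:
        submit(batch)
        return []                      # whole batch accepted
    except Exception:
        pass                           # fall through to one-at-a-time mode

    rejected = []
    for doc in batch:
        try:
            submit([doc])
        except Exception as exc:
            rejected.append((doc, str(exc)))
    return rejected


def demo_submit(docs):
    """Toy submit function: rejects any document without a title."""
    for doc in docs:
        if "title" not in doc:
            raise ValueError(f"doc {doc['id']} has no title")
```

Calling find_rejected(batch, demo_submit) on a batch where one document lacks 
a title returns just that document along with its error message.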

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 6, 2015, at 9:49 AM, Alessandro Benedetti <benedetti.ale...@gmail.com> 
> wrote:
> 
> Hi Walter,
> can you explain your use case in more detail?
> You index a batch of e-commerce products (Solr documents), and if one fails
> you want to stop and invalidate the entire batch (using the almost never
> used Solr rollback, or manual deletion?)
> And then log the exception indexing size.
> To then re-index the whole batch of docs?
> 
> In this scenario, the ConcurrentUpdateSolrClient would not be ideal?
> Just curious.
> 
> Cheers
> 
> On 6 October 2015 at 17:29, Walter Underwood <wun...@wunderwood.org> wrote:
> 
>> It depends on the document. In an e-commerce search, you might want to fail
>> immediately and be notified. That is what we do: fail, rollback, and notify.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Oct 6, 2015, at 7:58 AM, Alessandro Benedetti <benedetti.ale...@gmail.com> wrote:
>>> 
>>> mmmmmm one broken document in a batch should not break the entire batch,
>>> right (whatever approach is used)?
>>> Are you referring to the fact that you want to programmatically re-index
>>> the broken docs?
>>> 
>>> Would be interesting to return the id of the broken docs along with the
>>> solr update response!
>>> 
>>> Cheers
>>> 
>>> 
>>> On 6 October 2015 at 15:30, Bill Dueber <b...@dueber.com> wrote:
>>> 
>>>> Just to add...my informal tests show that batching has waaaaay more effect
>>>> than solrj vs json.
>>>> 
>>>> I haven't looked at CUSC in a while; last time I looked it was impossible to
>>>> do anything smart about error handling, so check that out before you get
>>>> too deeply into it. We use a strategy of sending a batch of json documents,
>>>> and if it returns an error, sending each record one at a time until we find
>>>> the bad one and can log something useful.
>>>> 
>>>> 
>>>> 
>>>> On Mon, Oct 5, 2015 at 12:07 PM, Alessandro Benedetti <
>>>> benedetti.ale...@gmail.com> wrote:
>>>> 
>>>>> Thanks Erick,
>>>>> you confirmed my impressions!
>>>>> Thank you very much for the insights, another opinion is welcome :)
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> 2015-10-05 14:55 GMT+01:00 Erick Erickson <erickerick...@gmail.com>:
>>>>> 
>>>>>> SolrJ tends to be faster for several reasons, not the least of which
>>>>>> is that it sends packets to Solr in a more efficient binary format.
>>>>>> 
>>>>>> Batching is critical. I did some rough tests using SolrJ and sending
>>>>>> docs one at a time gave a throughput of < 400 docs/second.
>>>>>> Sending 10 gave 2,300 or so. Sending 100 at a time gave
>>>>>> over 5,300 docs/second. Curiously, 1,000 at a time gave only
>>>>>> marginal improvement over 100. This was with a single thread.
>>>>>> YMMV of course.
>>>>>> 
>>>>>> CloudSolrClient is definitely the better way to go with SolrCloud,
>>>>>> it routes the docs to the correct leader instead of having the
>>>>>> node you send the docs to do the routing.
>>>>>> 
>>>>>> Best,
>>>>>> Erick
>>>>>> 
>>>>>> On Mon, Oct 5, 2015 at 4:57 AM, Alessandro Benedetti
>>>>>> <abenede...@apache.org> wrote:
>>>>>>> I was doing some studies and analysis, and was wondering which approach,
>>>>>>> in your opinion, is the best one for indexing in Solr to reach the best
>>>>>>> throughput possible.
>>>>>>> I know that a lot of factors affect indexing time, so let's only
>>>>>>> focus on the feeding approach.
>>>>>>> Let's isolate different scenarios :
>>>>>>> 
>>>>>>> *Single Solr Infrastructure*
>>>>>>> 
>>>>>>> 1) Xml/Json batch request to /update IndexHandler (xml/json)
>>>>>>> 
>>>>>>> 2) SolrJ ConcurrentUpdateSolrClient ( javabin)
>>>>>>> I was thinking this to be the fastest approach for a multi threaded
>>>>>>> indexing application.
>>>>>>> Posting batch of docs if possible per request.
>>>>>>> 
>>>>>>> *Solr Cloud*
>>>>>>> 
>>>>>>> 1) Xml/Json batch request to /update IndexHandler(xml/json)
>>>>>>> 
>>>>>>> 2) SolrJ ConcurrentUpdateSolrClient ( javabin)
>>>>>>> 
>>>>>>> 3) CloudSolrClient ( javabin)
>>>>>>> it seems to be the best approach according to these improvements [1]
>>>>>>> 
>>>>>>> What are your opinions ?
>>>>>>> 
>>>>>>> A bonus observation should be for using some Map/Reduce big data indexer,
>>>>>>> but let's assume we don't have a big cluster of cpus, but the average
>>>>>>> Indexer server.
>>>>>>> 
>>>>>>> 
>>>>>>> [1]
>>>>>>> https://lucidworks.com/blog/indexing-performance-solr-5-2-now-twice-fast/
>>>>>>> 
>>>>>>> 
>>>>>>> Cheers
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> --------------------------
>>>>>>> 
>>>>>>> Benedetti Alessandro
>>>>>>> Visiting card : http://about.me/alessandro_benedetti
>>>>>>> 
>>>>>>> "Tyger, tyger burning bright
>>>>>>> In the forests of the night,
>>>>>>> What immortal hand or eye
>>>>>>> Could frame thy fearful symmetry?"
>>>>>>> 
>>>>>>> William Blake - Songs of Experience -1794 England
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> --------------------------
>>>>> 
>>>>> Benedetti Alessandro
>>>>> Visiting card - http://about.me/alessandro_benedetti
>>>>> Blog - http://alexbenedetti.blogspot.co.uk
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Bill Dueber
>>>> Library Systems Programmer
>>>> University of Michigan Library
>>>> 
>>> 
>>> 
>>> 
>> 
>> 
> 
> 
