I've seen 1.5 M docs/second. Indexing throughput is gated by two
things:
1> the number of shards. Indexing throughput scales roughly linearly
with the number of shards.
2> the indexing program that pushes data to Solr. Before assuming Solr
is the bottleneck, check how fast your ETL process is producing docs.

This presupposes using SolrJ and CloudSolrClient for the final push to
Solr. CloudSolrClient pre-buckets the updates and sends each shard's
updates directly to that shard's leader, reducing the amount of work
Solr has to do. If you use SolrJ, you can easily test <2> above by
commenting out the single call that pushes the docs to Solr in your
program.
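A minimal sketch of that self-test (class, field, and method names are
mine, and the send is stubbed out; in the real program the push would
be a single CloudSolrClient.add(...) call):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IndexThroughputProbe {

    // Flip to true once you're satisfied the ETL side is fast enough.
    static final boolean SEND_TO_SOLR = false;

    // Builds 'total' synthetic documents in batches of 'batchSize' and
    // returns how many were handed to the (stubbed) sender.
    static long buildAndBatch(int total, int batchSize) {
        List<Map<String, Object>> batch = new ArrayList<>(batchSize);
        long handedOff = 0;
        for (int i = 0; i < total; i++) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("id", Integer.toString(i));
            doc.put("title_s", "doc " + i);
            batch.add(doc);
            if (batch.size() == batchSize) {
                handedOff += send(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            handedOff += send(batch); // final partial batch
        }
        return handedOff;
    }

    static int send(List<Map<String, Object>> batch) {
        if (SEND_TO_SOLR) {
            // Real program: cloudSolrClient.add(collection, docs);
            // This is the single call to comment out for the test.
        }
        return batch.size();
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long docs = buildAndBatch(100_000, 1_000);
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("built %d docs at %.0f docs/sec (ETL only)%n",
                docs, docs / secs);
    }
}
```

If the rate printed with the send disabled isn't well above your
target, Solr was never the bottleneck in the first place.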

Speaking of which, it's definitely best to batch the updates, see:
https://lucidworks.com/2015/10/05/really-batch-updates-solr-2/
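The gist of batching: send documents in groups (hundreds to a few
thousand) rather than one add per document, ideally from several
threads as in the loader described below. A stdlib-only sketch of that
shape (all names are mine; the real sendBatch would be a SolrJ or HTTP
add):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class BatchLoader {

    // Counts documents so the skeleton is runnable as-is; the real
    // loader would push each batch to Solr (or a load balancer) here.
    static final AtomicLong SENT = new AtomicLong();

    static void sendBatch(List<String> docIds) {
        SENT.addAndGet(docIds.size());
    }

    // Splits 'total' doc ids into batches of 'batchSize' and pushes
    // them from 'threads' workers: a simple parallel batch loader.
    static long load(int total, int batchSize, int threads) throws Exception {
        SENT.set(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int from = 0; from < total; from += batchSize) {
            int to = Math.min(from + batchSize, total);
            List<String> batch = new java.util.ArrayList<>();
            for (int i = from; i < to; i++) batch.add(Integer.toString(i));
            pool.submit(() -> sendBatch(batch));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return SENT.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("sent " + load(10_000, 1_000, 8) + " docs");
    }
}
```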

Best,
Erick

On Wed, May 2, 2018 at 10:07 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> We have a similar-sized cluster, 32 nodes with 36 processors and 60 GB RAM
> each (EC2 c4.8xlarge). The collection is 24 million documents with four
> shards. The cluster is Solr 6.6.2. All storage is SSD EBS.
>
> We built a simple batch loader in Java. We get about one million documents
> per minute with 64 threads. We do not use the cloud-smart SolrJ client. We
> just send all the batches to the load balancer and let Solr sort it out.
>
> You are looking for 3 million documents per minute. You will just have to
> test that.
>
> I haven’t tested it, but indexing should speed up linearly with the number
> of shards, because those are indexing in parallel.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On May 2, 2018, at 9:58 AM, Greenhorn Techie <greenhorntec...@gmail.com> wrote:
>>
>> Hi,
>>
>> The current hardware profile for our production cluster is 20 nodes, each
>> with 24 cores and 256GB memory. The data being indexed is very structured,
>> about 30 columns or so, half of which are categorical with a defined list
>> of values. The expected peak indexing throughput is about *50000* documents
>> per second (indexing is expected to run at off-peak hours, so search
>> requests will be minimal during this time), with an average throughput of
>> around *10000* documents per second (normal business hours).
>>
>> Given the hardware profile, is it realistic and practical to achieve the
>> desired throughput? What factors affect indexing performance apart from
>> the above hardware characteristics? I understand that it's very difficult
>> to provide guidance unless a prototype is done, but I am wondering what
>> considerations and dependencies we need to be aware of and whether our
>> throughput expectations are realistic.
>>
>> Thanks
>