Re: Indexing throughput

Walter Underwood Wed, 02 May 2018 10:08:14 -0700

We have a similar-sized cluster: 32 nodes with 36 processors and 60 GB RAM each
(EC2 C4.8xlarge). The collection is 24 million documents with four shards. The
cluster runs Solr 6.6.2. All storage is SSD EBS.


We built a simple batch loader in Java. We get about one million documents per
minute with 64 threads. We do not use the cloud-smart SolrJ client. We just send
all the batches to the load balancer and let Solr sort it out.
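
A loader along those lines might look roughly like the sketch below. It is not
the loader described above, only an illustration of the approach: a fixed pool
of 64 threads posting batches to Solr's JSON update handler through a load
balancer. The URL, the collection name, the batch size of 1000, and the use of
java.net.http.HttpClient (Java 11+) are all assumptions, and the documents are
assumed to be already serialized as JSON objects. Commits are left to whatever
autoCommit settings the collection has.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class BatchLoaderSketch {
    // Hypothetical load balancer address and collection name.
    static final String UPDATE_URL =
            "http://solr-lb.example.com:8983/solr/mycollection/update";
    static final int THREADS = 64;      // thread count mentioned in the post
    static final int BATCH_SIZE = 1000; // assumed; tune by testing

    /** Splits the documents into batches of at most batchSize each. */
    static List<List<String>> partition(List<String> docs, int batchSize) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            batches.add(docs.subList(i, Math.min(i + batchSize, docs.size())));
        }
        return batches;
    }

    /** Joins JSON objects into a JSON array, the body sent to the update handler. */
    static String toJsonArray(List<String> batch) {
        return "[" + String.join(",", batch) + "]";
    }

    /** Posts one batch and returns the HTTP status code. */
    static int postBatch(HttpClient client, String body) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(UPDATE_URL))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
    }

    /** Sends all batches in parallel and fails on the first non-200 response. */
    static void load(List<String> docs) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> batch : partition(docs, BATCH_SIZE)) {
                String body = toJsonArray(batch);
                results.add(pool.submit(() -> postBatch(client, body)));
            }
            for (Future<Integer> result : results) {
                int status = result.get();
                if (status != 200) {
                    throw new IllegalStateException("Solr returned HTTP " + status);
                }
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

The sketch holds all documents in memory and does no retrying, which a real
loader would probably want to change. Thread count and batch size are the two
knobs worth testing first.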

Your peak of 50,000 documents per second works out to 3 million documents per
minute, about three times what we see. You will just have to test that.

I haven’t tested it, but indexing should speed up linearly with the number of
shards, because the shards index in parallel.
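
As a back-of-the-envelope check, the numbers in this thread can be put
together as in the sketch below. It converts the 50,000 documents per second
peak to a per-minute figure and then asks how many shards would be needed if
throughput really did scale linearly from one million documents per minute on
four shards. Linear scaling is the untested assumption here, and the class and
method names are made up for illustration. The two clusters also differ in
hardware and documents, so treat the result as a starting point for testing.

```java
class ThroughputEstimate {
    /** Converts a per-second rate to a per-minute rate. */
    static long docsPerMinute(long docsPerSecond) {
        return docsPerSecond * 60;
    }

    /**
     * Estimates the shard count for a target rate, ASSUMING throughput
     * scales linearly with the number of shards (not verified).
     */
    static int shardsNeeded(long targetPerMinute, long observedPerMinute, int observedShards) {
        double perShard = (double) observedPerMinute / observedShards;
        return (int) Math.ceil(targetPerMinute / perShard);
    }

    public static void main(String[] args) {
        long target = docsPerMinute(50_000);             // peak rate from the question
        int shards = shardsNeeded(target, 1_000_000, 4); // rate and shard count from this reply
        System.out.println(target + " docs/minute, about " + shards + " shards");
    }
}
```

Running it prints "3000000 docs/minute, about 12 shards".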

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 2, 2018, at 9:58 AM, Greenhorn Techie <greenhorntec...@gmail.com> wrote:
> 
> Hi,
> 
> The current hardware profile for our production cluster is 20 nodes, each
> with 24 cores and 256GB memory. The data being indexed is very structured in
> nature and has about 30 columns, half of which are categorical with a
> defined list of values. The expected peak indexing throughput is about
> *50000* documents per second (expected to be done at off-peak hours so that
> search requests will be minimal during this time) and the average
> throughput is around *10000* documents per second (normal business hours).
> 
> Given the hardware profile, is it realistic and practical to achieve the
> desired throughput? What factors affect indexing performance apart from the
> above hardware characteristics? I understand that it's very difficult to
> provide any guidance unless a prototype is done, but I am wondering what
> considerations and dependencies we need to be aware of and whether our
> throughput expectations are realistic.
> 
> Thanks
