Jack, thanks for your reply.

Sorry for the confusion about 4 nodes. What I meant was to use 4 nodes for a 
POC, focusing mainly on handling the high incoming rate over a few days 
rather than on storing data for a full year.

You estimated the required nodes (6,308) and storage (322 TB) based on the 
incoming rate and doc size. I have a few questions regarding them:

1) Is "100 million docs/node" some general capacity guideline for a Solr node?
2) Assuming we can provide 6,308 nodes, can Solr Cloud really scale to that 
level? I found you indicated a "common sense limit" of 64 nodes for Solr 
cluster size in the following mail thread: 
http://find.searchhub.org/document/d823643e65fe2015#84f0c89df2426990
3) If 64 nodes is a size we know Solr Cloud can scale to, does that mean I 
can only be sure that about 1% of the mentioned workload can be handled by 
Solr Cloud? (64 is about 1% of 6,308 nodes.)
4) The above-mentioned "Solr Limitations" mail thread did mention a cluster 
with 512 nodes, but never really verified whether it worked; assuming it 
did, that only means we may be able to handle a little less than 10% of the 
desired workload.
5) Given the above simple deduction, is 2K docs/sec (10% of the mentioned 
incoming rate) the practical limit of Solr Cloud we can assume for our use 
case?
6) If the incoming rate is controlled to around 1K or 2K docs/sec and we 
want to use a Solr cluster with 64 nodes (or more, if that still works), 
what should the collection/shard/core structure be? (See the sketch after 
this list.)
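
For question 6, here is a minimal sketch of the kind of structure I have in 
mind, calling the Collections API over plain HTTP. The sizing (32 shards x 2 
replicas = 64 cores, one per node), the host "solr1:8983", and the config 
name "myconf" are all hypothetical placeholders, not a definitive layout:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CreateEventsCollection {
    public static void main(String[] args) throws Exception {
        // Hypothetical sizing: 32 shards x 2 replicas = 64 cores,
        // one core per node across the 64-node cluster.
        String url = "http://solr1:8983/solr/admin/collections"
                + "?action=CREATE&name=events"
                + "&numShards=32&replicationFactor=2"
                + "&maxShardsPerNode=1&collection.configName=myconf";
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        try (InputStream in = conn.getInputStream()) {
            // Print Solr's status response for the CREATE call.
            System.out.println(new String(in.readAllBytes()));
        }
    }
}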

I am mainly looking for architectural advice on how to structure Solr Cloud 
to handle a high incoming rate of relatively small docs.

Regards.

Shushuai 



On Saturday, March 22, 2014 2:17 PM, Jack Krupansky <j...@basetechnology.com> 
wrote:
  
20K docs/sec = 20,000 * 60 * 60 * 24 = 1,728,000,000 = 1.7 billion docs/day 
* 365 = 630,720,000,000 = 631 billion docs/yr

At 100 million docs/node = 6,308 nodes!

And you think you can do it with 4 nodes?

Oh, and that's before replication!

0.5K/doc * 631 billion docs = 322 TB (taking 0.5K = 512 bytes: 512 bytes * 
630.72 billion docs ≈ 322.9 TB).

-- Jack Krupansky


-----Original Message----- 
From: shushuai zhu
Sent: Saturday, March 22, 2014 11:32 AM
To: solr-user@lucene.apache.org
Subject: Re: Best approach to handle large volume of documents with 
constantly high incoming rate?

Any thoughts? Can Solr Cloud support such a use case with acceptable 
performance?



On Thursday, March 20, 2014 7:51 PM, shushuai zhu <ss...@yahoo.com> wrote:

Hi,

I am looking for some advice on handling a large volume of documents with a 
very high incoming rate. Each document is about 0.5 KB, the incoming rate 
could be more than 20K docs per second, and we want to store about one 
year's documents in Solr for near real-time searching. The goal is to 
achieve acceptable indexing and querying performance.

We will use techniques like soft commits, dedicated indexing servers, etc. 
My main question is how to structure the collection/shard/core to achieve 
these goals. Since the incoming rate is very high, we do not want the 
incoming documents to affect the existing older indexes. One thought is to 
create a "latest" index to hold the incoming documents (say, the latest half 
hour's data, about 36M docs at 20K docs/sec) so that queries on older data 
stay fast, since the old indexes are not touched. There seem to be three 
ways to grow the time dimension, by adding/splitting/creating one of the 
objects listed below every half hour:

collection
shard
core

Which is the best way to grow the time dimension? Are there limitations in 
any of these directions? Or is there a better approach?
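
To make the half-hour rotation idea concrete, here is a rough sketch, 
assuming we grow along the collection dimension: every half hour a fresh, 
time-stamped collection is created via the Collections API and a "latest" 
alias is repointed at it, so writers always target the alias while older 
collections stay untouched. The host, collection names, and shard counts 
are placeholders:

import java.net.HttpURLConnection;
import java.net.URL;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

public class RotateLatestCollection {
    // Fire one Collections API call and report the HTTP status.
    static void call(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        System.out.println(url + " -> HTTP " + conn.getResponseCode());
        conn.disconnect();
    }

    public static void main(String[] args) throws Exception {
        String base = "http://solr1:8983/solr/admin/collections";
        // Time-stamped name for the next half hour's collection.
        String name = "events_" + ZonedDateTime.now(ZoneOffset.UTC)
                .format(DateTimeFormatter.ofPattern("yyyyMMdd_HHmm"));
        // 1) Create the new collection (sizes are placeholders).
        call(base + "?action=CREATE&name=" + name
                + "&numShards=4&replicationFactor=2&collection.configName=myconf");
        // 2) Repoint the write alias; re-issuing CREATEALIAS with the
        //    same name replaces the existing alias.
        call(base + "?action=CREATEALIAS&name=latest&collections=" + name);
    }
}

Queries over older data could then go to the individual time-stamped 
collections, or to an alias spanning several of them (CREATEALIAS accepts a 
comma-separated list of collections).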

As an example, I am thinking about having 4 nodes with the following 
configuration to set up a Solr Cloud:

Memory: 128 GB
Storage: 4 TB

How should the collection/shard/core be set up to deal with this use case?
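
For the soft-commit technique mentioned above, something like the following 
is what I have in mind: indexing through the "latest" alias with 
commitWithin, which Solr treats as a soft commit by default since 4.0, so 
new documents become searchable quickly without frequent hard commits. The 
host, alias, and field names are placeholders:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class IndexViaAlias {
    public static void main(String[] args) throws Exception {
        // commitWithin=10000 asks Solr to make the docs visible within
        // ~10s; since Solr 4.0 this is a soft commit by default.
        URL url = new URL("http://solr1:8983/solr/latest/update?commitWithin=10000");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        // One small JSON doc as a stand-in for the real 0.5 KB documents.
        String docs = "[{\"id\":\"doc-1\",\"ts\":\"2014-03-22T14:00:00Z\",\"body\":\"...\"}]";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(docs.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("update -> HTTP " + conn.getResponseCode());
    }
}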

Thanks in advance.

Shushuai 
