20K docs/sec: 20,000 * 60 * 60 * 24 = 1,728,000,000 = ~1.7 billion docs/day
* 365 = 630,720,000,000 = ~631 billion docs/yr
At 100 million docs/node = 6,308 nodes!
And you think you can do it with 4 nodes?
Oh, and that's before replication!
0.5 KB/doc * 631 billion docs = ~322 TB.
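Jack's back-of-envelope math above checks out; a quick sketch to verify it (all figures taken from the thread: 20K docs/sec, 0.5 KB/doc, 100M docs/node):

```python
# Capacity math from the thread, checked numerically.
DOCS_PER_SEC = 20_000
DOC_SIZE_KB = 0.5          # ~0.5 KB per document
DOCS_PER_NODE = 100_000_000

docs_per_day = DOCS_PER_SEC * 60 * 60 * 24           # 1,728,000,000
docs_per_year = docs_per_day * 365                   # 630,720,000,000
nodes_needed = -(-docs_per_year // DOCS_PER_NODE)    # ceiling division -> 6,308
storage_tb = docs_per_year * DOC_SIZE_KB * 1024 / 1e12  # ~322.9 TB, before replication

print(docs_per_day, docs_per_year, nodes_needed, round(storage_tb, 1))
```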
-- Jack Krupansky
-----Original Message-----
From: shushuai zhu
Sent: Saturday, March 22, 2014 11:32 AM
To: solr-user@lucene.apache.org
Subject: Re: Best approach to handle large volume of documents with
constantly high incoming rate?
Any thoughts? Can Solr Cloud support such use case with acceptable
performance?
On Thursday, March 20, 2014 7:51 PM, shushuai zhu <ss...@yahoo.com> wrote:
Hi,
I am looking for some advice on handling a large volume of documents with a
very high incoming rate. The size of each document is about 0.5 KB, the
incoming rate could be more than 20K per second, and we want to store about
one year's documents in Solr for near real-time searching. The goal is to
achieve acceptable indexing and querying performance.
We will use techniques like soft commit, dedicated indexing servers, etc. My
main question is about how to structure the collection/shard/core to achieve
the goals. Since the incoming rate is very high, we do not want the incoming
documents to affect the existing older indexes. One thought is to create a
latest index to hold the incoming documents (say the latest half hour's data,
about 36M docs), so queries on older data could be faster since the old
indexes are not affected. There seem to be three ways to grow the time
dimension, by adding/splitting/creating one of the objects below every half
hour:
collection
shard
core
Which is the best way to grow the time dimension? Are there any limitations in
that direction? Or is there a better approach?
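The collection-per-window variant could be sketched roughly as below. The naming scheme (a "docs_" prefix plus a half-hour timestamp) is my own assumption for illustration, not a Solr convention; the CREATE call uses the real Solr Collections API parameters `action`, `name`, and `numShards`:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical naming scheme: one collection per half-hour window.
def collection_for(ts: datetime) -> str:
    # Floor the timestamp to its half-hour bucket.
    bucket = ts.replace(minute=(ts.minute // 30) * 30, second=0, microsecond=0)
    return bucket.strftime("docs_%Y%m%d_%H%M")

def create_collection_url(solr_base: str, name: str, shards: int = 1) -> str:
    # Solr Collections API CREATE call; replicationFactor and other
    # parameters omitted for brevity.
    return (f"{solr_base}/admin/collections?action=CREATE"
            f"&name={name}&numShards={shards}")

ts = datetime(2014, 3, 22, 11, 47, tzinfo=timezone.utc)
name = collection_for(ts)
print(name)                                              # docs_20140322_1130
print(create_collection_url("http://localhost:8983/solr", name))
```

A scheduler would call the CREATE URL once per window and route new documents to the current collection; queries over older data would hit only the frozen collections, which is the isolation the paragraph above is after.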
As an example, I am thinking about having 4 nodes with the following
configuration to setup a Solr Cloud:
Memory: 128 GB
Storage: 4 TB
How to set the collection/shard/core to deal with the use case?
Thanks in advance.
Shushuai