Re: Decision on Number of shards and collection

Erick Erickson Wed, 11 Apr 2018 09:08:19 -0700

50M is a ballpark number I use as a place to _start_ getting a handle
on capacity. It's useful solely to answer the "is it bigger than a breadbox
and smaller than a house" question. It's totally meaningless without
testing.


Say I'm talking to a client and we have no data. Some are scared
that their 10M docs will require lots of hardware. Saying "I usually
expect to see 50M docs on a node" gives them
some confidence that it's not going to require a massive hardware
investment and they can go forward with a PoC.

OTOH I have other clients saying "We have 100B documents" and
I have to say "You could be talking 200 nodes" which gives them
incentive to do a PoC to get a hard number.
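
To make the back-of-envelope arithmetic concrete, here is a minimal sketch. The function name and the default of 50M docs per node are only illustrative assumptions, and it ignores replicas entirely. The per-node figure is a guess until a PoC measures it. For instance, 100B docs comes out to 2,000 nodes at 50M docs per node, and to 200 nodes only if each node can hold about 500M.

```python
def estimate_nodes(total_docs: int, docs_per_node: int = 50_000_000) -> int:
    """Ballpark node count for one copy of the index (no replicas).

    docs_per_node is an assumed starting point, not a measured limit.
    """
    if total_docs < 0 or docs_per_node <= 0:
        raise ValueError("total_docs must be >= 0 and docs_per_node > 0")
    # Integer ceiling division; always budget at least one node.
    return max(1, -(-total_docs // docs_per_node))


if __name__ == "__main__":
    for total in (10_000_000, 100_000_000_000):
        print(f"{total:,} docs -> about {estimate_nodes(total):,} node(s)")
```

The only point of the number is to decide whether a PoC is worth starting, not to size the final cluster.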

I do recommend you keep adding (perhaps synthetic) docs to your
node until it tips over. Finding your installation falls over at, say, 50M
docs means you need to start taking action beforehand. OTOH if you
load 150M docs on it and it still functions OK you can breathe a lot
easier...
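
A rough sketch of that "load until it tips over" loop follows. Everything here is hypothetical scaffolding: `index_batch` and `measure_latency_ms` stand in for whatever indexing client and query benchmark you actually use, and "tipping over" is reduced to a latency threshold, although in practice it may show up as OOMs or timeouts instead. `FakeNode` is a toy model so the sketch runs on its own; it says nothing about how a real node behaves.

```python
from typing import Callable, Optional


def find_tipping_point(
    index_batch: Callable[[int], None],
    measure_latency_ms: Callable[[], float],
    batch_size: int,
    max_docs: int,
    latency_limit_ms: float,
) -> Optional[int]:
    """Return the doc count at which latency first exceeds the limit,
    or None if the node is still OK after max_docs have been loaded."""
    loaded = 0
    while loaded < max_docs:
        index_batch(batch_size)          # add (perhaps synthetic) docs
        loaded += batch_size
        if measure_latency_ms() > latency_limit_ms:
            return loaded                # first count where it fell over
    return None


class FakeNode:
    """Toy stand-in for a real node: latency grows with doc count."""

    def __init__(self) -> None:
        self.docs = 0

    def index(self, n: int) -> None:
        self.docs += n

    def latency_ms(self) -> float:
        millions = self.docs / 1_000_000
        return 20.0 + 0.1 * millions ** 2


if __name__ == "__main__":
    node = FakeNode()
    tipped_at = find_tipping_point(
        node.index, node.latency_ms,
        batch_size=10_000_000, max_docs=150_000_000, latency_limit_ms=500.0,
    )
    print("tipped over at:", tipped_at)
```

Against a real installation you would swap the two callables for your own indexing and query code and keep the loop as is.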

Best,
Erick



On Wed, Apr 11, 2018 at 8:55 AM, Abhi Basu <9000r...@gmail.com> wrote:
>  *The BKM I have read so far (trying to find source) says 50 million
> docs/shard performs well. I have found this in my recent tests as well. But
> of course it depends on index structure, etc.*
>
> On Wed, Apr 11, 2018 at 10:37 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 4/11/2018 4:15 AM, neotorand wrote:
>> > I believe heterogeneous data can be indexed to same collection and i can
>> > have multiple shards for the index to be partitioned.So whats the need
>> > of a second collection?. yes when collection size grows i should look
>> > for more collection.what exactly that size is? what KPI drives the
>> > decision of having more collection?Any pointers or links for best
>> > practice.
>>
>> There are no hard rules.  Many factors affect these decisions.
>>
>> https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> Creating multiple collections should be done when there is a logical or
>> business reason for keeping different sets of data separate from each
>> other.  If there's never any need for people to query all the data at
>> once, then it might make sense to use separate collections.  Or you
>> might want to put them together just for convenience, and use data in
>> the index to filter the results to only the information that the user is
>> allowed to access.
>>
>> > when should i go for multiple shards?
>> > yes when shard size grows.Right? whats the size and how do i benchmark.
>>
>> Some indexes function really well with 300 million documents or more per
>> shard.  Other indexes struggle with less than a million per shard.  It's
>> impossible to give you any specific number.  It depends on a bunch of
>> factors.
>>
>> If query rate is very high, then you want to keep the shard count low.
>> Using one shard might not be possible due to index size, but it should
>> be as low as you can make it.  You're also going to want to have a lot
>> of replicas to handle the load.
>>
>> If query rate is extremely low, then sharding the index can actually
>> *improve* performance, because there will be idle CPU capacity that can
>> be used for the subqueries.
>>
>> Thanks,
>> Shawn
>>
>>
>
>
> --
> Abhi Basu
