*The best known method (BKM) I have read so far (still trying to find the
source) says that about 50 million docs per shard performs well. My recent
tests agree with this. But of course it depends on index structure, etc.*
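
As a rough illustration of how that figure turns into a starting shard count, here is a minimal sketch. The function name and the default of 50 million are my own assumptions for illustration, not anything built into Solr, and the result is only a first guess to benchmark against, since (as Shawn says below) the workable number per shard varies widely.

```python
def estimate_shard_count(total_docs, docs_per_shard=50_000_000):
    """Return a starting-point shard count for a collection.

    docs_per_shard is a rule-of-thumb assumption (50 million here),
    not a hard limit; tune it from your own benchmarks.
    """
    if total_docs < 0 or docs_per_shard <= 0:
        raise ValueError("total_docs must be >= 0 and docs_per_shard > 0")
    # Integer ceiling division, with a minimum of one shard.
    return max(1, -(-total_docs // docs_per_shard))


if __name__ == "__main__":
    for docs in (10_000_000, 120_000_000, 1_000_000_000):
        print(docs, "docs ->", estimate_shard_count(docs), "shard(s)")
```

If query rate is high, it is probably worth rounding this down rather than up and adding replicas instead, per the advice quoted below.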

On Wed, Apr 11, 2018 at 10:37 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 4/11/2018 4:15 AM, neotorand wrote:
> > I believe heterogeneous data can be indexed to the same collection, and I
> > can have multiple shards for the index to be partitioned. So what's the
> > need for a second collection? Yes, when collection size grows I should look
> > at more collections. What exactly is that size? What KPI drives the
> > decision to have more collections? Any pointers or links for best practice?
>
> There are no hard rules.  Many factors affect these decisions.
>
> https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Creating multiple collections should be done when there is a logical or
> business reason for keeping different sets of data separate from each
> other.  If there's never any need for people to query all the data at
> once, then it might make sense to use separate collections.  Or you
> might want to put them together just for convenience, and use data in
> the index to filter the results to only the information that the user is
> allowed to access.
>
> > When should I go for multiple shards?
> > Yes, when shard size grows, right? What's the size, and how do I benchmark?
>
> Some indexes function really well with 300 million documents or more per
> shard.  Other indexes struggle with less than a million per shard.  It's
> impossible to give you any specific number.  It depends on a bunch of
> factors.
>
> If query rate is very high, then you want to keep the shard count low.
> Using one shard might not be possible due to index size, but it should
> be as low as you can make it.  You're also going to want to have a lot
> of replicas to handle the load.
>
> If query rate is extremely low, then sharding the index can actually
> *improve* performance, because there will be idle CPU capacity that can
> be used for the subqueries.
>
> Thanks,
> Shawn
>
>


-- 
Abhi Basu
