*The best known method (BKM) I have read so far (I am still trying to find the source) says 50 million docs/shard performs well, and my recent tests agree. Of course, it depends on index structure, etc.*
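
As a rough illustration of how the 50 million docs/shard figure could be turned into a starting shard count, here is a minimal Python sketch. The function name `estimate_shard_count` and the constant `DOCS_PER_SHARD_TARGET` are made up for this example and are not part of Solr. The number is only a first guess to benchmark against, since the right value depends on index structure and query load.

```python
# Assumption: 50 million docs/shard is only the rule of thumb mentioned
# above, not a Solr limit. Replace it with a figure from your own tests.
DOCS_PER_SHARD_TARGET = 50_000_000


def estimate_shard_count(total_docs: int,
                         docs_per_shard: int = DOCS_PER_SHARD_TARGET) -> int:
    """Return the smallest shard count that keeps every shard at or
    below docs_per_shard documents (hypothetical helper)."""
    if total_docs < 0:
        raise ValueError("total_docs must not be negative")
    if docs_per_shard <= 0:
        raise ValueError("docs_per_shard must be positive")
    # Integer ceiling division; an empty index still needs one shard.
    return max(1, -(-total_docs // docs_per_shard))


if __name__ == "__main__":
    for docs in (10_000_000, 120_000_000, 1_000_000_000):
        print(docs, "docs ->", estimate_shard_count(docs), "shard(s)")
```

If the query rate is very high, it may be worth testing with fewer shards than this estimate and adding replicas instead.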

On Wed, Apr 11, 2018 at 10:37 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 4/11/2018 4:15 AM, neotorand wrote:
> > I believe heterogeneous data can be indexed to same collection and i can
> > have multiple shards for the index to be partitioned.So whats the need of a
> > second collection?. yes when collection size grows i should look for more
> > collection.what exactly that size is? what KPI drives the decision of having
> > more collection?Any pointers or links for best practice.
>
> There are no hard rules. Many factors affect these decisions.
>
> https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Creating multiple collections should be done when there is a logical or
> business reason for keeping different sets of data separate from each
> other. If there's never any need for people to query all the data at
> once, then it might make sense to use separate collections. Or you
> might want to put them together just for convenience, and use data in
> the index to filter the results to only the information that the user is
> allowed to access.
>
> > when should i go for multiple shards?
> > yes when shard size grows.Right? whats the size and how do i benchmark.
>
> Some indexes function really well with 300 million documents or more per
> shard. Other indexes struggle with less than a million per shard. It's
> impossible to give you any specific number. It depends on a bunch of
> factors.
>
> If query rate is very high, then you want to keep the shard count low.
> Using one shard might not be possible due to index size, but it should
> be as low as you can make it. You're also going to want to have a lot
> of replicas to handle the load.
>
> If query rate is extremely low, then sharding the index can actually
> *improve* performance, because there will be idle CPU capacity that can
> be used for the subqueries.
>
> Thanks,
> Shawn

--
Abhi Basu