50M is a ballpark number I use as a place to _start_ getting a handle on capacity. It's useful solely to answer the "is it bigger than a breadbox and smaller than a house" question. It's totally meaningless without testing.
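As a back-of-the-envelope sketch of that starting point: the helper below just divides the document count by the per-node ballpark and rounds up. The function name and the 50M default are illustrative assumptions, not anything Solr provides, and the result ignores replicas, document size, and query load, so treat it as the breadbox-vs-house answer only.

```python
def ballpark_nodes(total_docs: int, docs_per_node: int = 50_000_000) -> int:
    """Rough first guess at node count; only testing gives a real number."""
    if total_docs < 0 or docs_per_node <= 0:
        raise ValueError("total_docs must be >= 0 and docs_per_node must be > 0")
    # Integer ceiling division; always report at least one node.
    return max(1, (total_docs + docs_per_node - 1) // docs_per_node)


if __name__ == "__main__":
    print(ballpark_nodes(10_000_000))     # 10M docs fit on a single node
    print(ballpark_nodes(1_000_000_000))  # 1B docs: 20 nodes as a first guess
```

If your own testing shows a node tipping over at some other count, pass that as `docs_per_node` instead of the default.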
Say I'm talking to a client and we have no data. Some are scared that their 10M docs will require lots of hardware. Saying "I usually expect to see 50M docs on a node" gives them some confidence that it's not going to require a massive hardware investment and they can go forward with a PoC. OTOH I have other clients saying "We have 100B documents" and I have to say "You could be talking 2,000 nodes" which gives them incentive to do a PoC to get a hard number.

I do recommend you keep adding (perhaps synthetic) docs to your node until it tips over. Finding your installation falls over at, say, 50M docs means you need to start taking action beforehand. OTOH if you load 150M docs on it and it still functions OK you can breathe a lot easier...

Best,
Erick

On Wed, Apr 11, 2018 at 8:55 AM, Abhi Basu <9000r...@gmail.com> wrote:
> *The BKM I have read so far (trying to find source) says 50 million
> docs/shard performs well. I have found this in my recent tests as well. But
> of course it depends on index structure, etc.*
>
> On Wed, Apr 11, 2018 at 10:37 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 4/11/2018 4:15 AM, neotorand wrote:
>> > I believe heterogeneous data can be indexed to same collection and i can
>> > have multiple shards for the index to be partitioned.So whats the need of a
>> > second collection?. yes when collection size grows i should look for more
>> > collection.what exactly that size is? what KPI drives the decision of having
>> > more collection?Any pointers or links for best practice.
>>
>> There are no hard rules. Many factors affect these decisions.
>>
>> https://lucidworks.com/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>>
>> Creating multiple collections should be done when there is a logical or
>> business reason for keeping different sets of data separate from each
>> other. If there's never any need for people to query all the data at
>> once, then it might make sense to use separate collections. Or you
>> might want to put them together just for convenience, and use data in
>> the index to filter the results to only the information that the user is
>> allowed to access.
>>
>> > when should i go for multiple shards?
>> > yes when shard size grows.Right? whats the size and how do i benchmark.
>>
>> Some indexes function really well with 300 million documents or more per
>> shard. Other indexes struggle with less than a million per shard. It's
>> impossible to give you any specific number. It depends on a bunch of
>> factors.
>>
>> If query rate is very high, then you want to keep the shard count low.
>> Using one shard might not be possible due to index size, but it should
>> be as low as you can make it. You're also going to want to have a lot
>> of replicas to handle the load.
>>
>> If query rate is extremely low, then sharding the index can actually
>> *improve* performance, because there will be idle CPU capacity that can
>> be used for the subqueries.
>>
>> Thanks,
>> Shawn
>>
>
> --
> Abhi Basu