Thanks for the great advice, Erick. I will experiment with your suggestions and see how it goes!
Chris

On Sun, May 7, 2017 at 12:34 AM, Erick Erickson <erickerick...@gmail.com> wrote:

Well, you've been doing your homework ;).

bq: I am a little confused on this statement you made:

> Plus you can't commit individually, a commit on one will _still_ commit on all so you're right back where you started.

Never mind. Autocommit kicks off on a per-replica basis. IOW, when a new doc is indexed to a shard (really, any replica) the timer is started. So if replica 1_1 gets a doc and replica 2_1 doesn't, there is no commit on replica 2_1. My comment was mainly directed at the idea that you might issue commits from the client, which are distributed to all replicas. However, even in that case a replica that has received no updates won't do anything.

About the hybrid approach: I've seen situations where essentially you partition clients along "size" lines. So something like "put clients on a shared single-shard collection as long as the aggregate number of records is < X". The theory is that the update frequency is roughly the same if you have 10 clients with 100K docs each vs. one client with 1M docs, so the pain of opening a new searcher is roughly the same. "X" here is experimentally determined.

Do note that moving from master/slave to SolrCloud will reduce latency. In M/S, the time it takes for an update to become searchable is autocommit + polling interval + autowarm time. Going to SolrCloud removes the "polling interval" from the equation. Not sure how much that helps...

There should be an autowarm statistic in the Solr logs, BTW. Or some messages about "opening searcher (some hex stuff)" and another message about when it's registered as active, along with timestamps. That'll tell you how long it takes to autowarm.

OK, "straw man" strategy for your case: create a collection per tenant. What you want to balance is where the collections are hosted. Host a number of small tenants on the same Solr instance and fewer larger tenants on other hardware. FWIW, I expect to be able to host at least 25M docs per Solr JVM (very hardware dependent of course), although testing is important.
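In SolrJ terms, the straw man is something like the sketch below (untested; the collection names, config set, node names, and ZooKeeper addresses are all placeholders, and the client builder call varies a bit between SolrJ versions):

```java
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class PerTenantCollections {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")   // placeholder ZK ensemble
        .build()) {

      // A small tenant: one shard, co-hosted with other small tenants.
      CollectionAdminRequest.Create small =
          CollectionAdminRequest.createCollection("tenant_acme", "shared_config", 1, 2);
      small.setCreateNodeSet("solr1:8983_solr,solr2:8983_solr"); // shared "small tenant" nodes
      small.process(client);

      // A large tenant: more shards, pinned to beefier dedicated hardware.
      CollectionAdminRequest.Create big =
          CollectionAdminRequest.createCollection("tenant_bigcorp", "shared_config", 4, 2);
      big.setCreateNodeSet("solr7:8983_solr,solr8:8983_solr");
      big.process(client);
    }
  }
}
```

The createNodeSet bit is what lets you decide which tenants share hardware; everything else is ordinary collection creation.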
Under the covers, each Solr instance establishes "watchers" on the collections it hosts. So if a particular Solr hosts replicas for, say, 10 collections, it establishes 10 watchers on the state.json zNodes in ZooKeeper. 300 collections isn't all that much in recent Solr installations. All of that is filtered through how beefy your hardware is, of course.

Startup is an interesting case, but I've put 1,600 replicas on 4 Solr instances on a Mac Pro (400 each). You can configure the number of startup threads if starting up is too painful.

So a cluster with 300 collections isn't really straining things. Some of the literature talks about thousands of collections.

Good luck!
Erick

On Sat, May 6, 2017 at 4:26 PM, Chris Troullis <cptroul...@gmail.com> wrote:

Hi Erick,

Thanks for the reply, I really appreciate it.

To answer your questions, we have a little over 300 tenants and a couple of different collections, the largest of which has ~11 million documents (so not terribly large). We are currently running standard Solr with simple master/slave replication, so all of the documents are in a single Solr core. We are planning to move to SolrCloud for various reasons, and as discussed previously, I am trying to find the best way to distribute the documents to serve a more NRT-focused search case.

I totally get your point on pushing back on NRT requirements, and I have done so for as long as I can. Currently our auto soft commit is set to 1 minute and we are able to achieve great query times with autowarming. Unfortunately, due to the nature of our application, our customers expect any changes they make to be visible almost immediately in search, and we have recently been getting a lot of complaints in this area, leading to an initiative to drive down the time it takes for documents to become visible in search. Which leaves me where I am now: trying to find the right balance between document visibility and reasonable, stable query times.

Regarding autowarming, our autowarming times aren't too crazy. We are warming a max of 100 entries from the filter cache, and it takes around 5-10 seconds to complete on average. I suspect our biggest slowdown during autowarming is the static warming query we have that runs 10+ facets over the entire index. Our searches are very facet intensive; we use the JSON facet API to do some decently complex faceting (block joins, etc.), and for whatever reason, even though we use docValues for all of our facet fields, simply warming the filter cache doesn't seem to prevent a giant drop-off in performance whenever a new searcher is opened. The only way I could find to prevent that drop-off was to run a static warming query on new searcher that runs all of our facets over the whole index. I haven't found a good way of telling how long this takes, as the JMX hooks for monitoring autowarming times don't seem to include static warming queries (that I can tell).

Through experimentation I've found that by sharding my index in SolrCloud, I can pretty much eliminate autowarming entirely and still achieve reasonable query times once the shards reach a small enough size (around 1 million docs per shard). This is great; however, your assumptions as to our tenant size distribution were spot on. Because of this, sharding naturally using the compositeId router with the tenant ID as the key yields an uneven distribution of documents across shards. Basically, whatever unlucky small tenants happen to get stuck on the same shard as a gigantic tenant will suffer in performance because of it. That's why I was exploring the idea of having a tenant per collection or per shard, as a way of isolating tenants from a performance perspective.
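For reference, the compositeId routing I'm describing looks roughly like this (untested SolrJ sketch; the collection name, field names, and ZooKeeper addresses are made up):

```java
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CompositeIdRoutingSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")   // placeholder ZK ensemble
        .build()) {
      client.setDefaultCollection("docs");

      // With the compositeId router, the part before '!' is hashed to pick
      // the shard, so all of a tenant's documents land on the same shard,
      // which is exactly how one gigantic tenant ends up skewing a shard.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "tenant42!order-1001");   // "tenantId!docId"
      doc.addField("tenant_id", "tenant42");
      doc.addField("body_t", "some document text");
      client.add(doc);
      // No explicit commit; we rely on auto soft commit for visibility.
    }
  }
}
```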
I am a little confused on this statement you made:

> Plus you can't commit individually, a commit on one will _still_ commit on all so you're right back where you started.

We don't commit manually at all; we rely on auto soft commit to commit for us. My understanding was that since a shard is basically its own Solr core under the covers, indexing a document to a single shard would only open a new searcher (and thus invalidate caches) on that one shard, and thus separating tenants into their own shards would mean that tenant A (shard A) updating its documents would not require tenant B (shard B) to have all of its caches invalidated. Is this not correct?

I'm also not sure I understand what you are saying regarding the hybrid approach you mentioned. You say to experiment with how many documents are ideal for a collection, but isn't the number of documents per shard the more meaningful number WRT performance? I apologize if I am being dense; maybe I'm not 100% clear on my terminology. My understanding was that a collection is a logical abstraction consisting of multiple shards/replicas, and that the shards/replicas are actual physical Solr cores. So, for example, what is the difference between having 1000 collections with 1 shard each vs. 1 collection with 1000 shards? Both cases will end up with the same number of physical Solr cores, right? Or am I completely off base?

Thanks again,

Chris

On Sat, May 6, 2017 at 10:36 AM, Erick Erickson <erickerick...@gmail.com> wrote:

Well, it's not either/or. And you haven't said how many tenants we're talking about here. Solr startup times for a single _instance_ of Solr when there are thousands of collections can be slow.

But note what I am talking about here: a single Solr on a single node where there are hundreds and hundreds of collections (or replicas for that matter). I know of very large installations with 100s of thousands of _replicas_ that run. Admittedly with a lot of care and feeding...

Sharding a single large collection and using custom routing to push tenants to a single shard will be an administrative problem for you. I'm assuming you have the typical multi-tenant problems: a bunch of tenants have around N docs, some smaller percentage have 3N, and a few have 100N. Now you're having to keep track of how many docs are on each shard, do the routing yourself, etc. Plus you can't commit individually; a commit on one will _still_ commit on all, so you're right back where you started.

I've seen people use a hybrid approach: experiment with how many _documents_ you can have in a collection (however you partition that up) and use the multi-tenant approach. So you have N collections and each collection has a (varying) number of tenants. This also tends to flatten out the update process, on the assumption that your smaller tenants also don't update their data as often.
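Something like this hypothetical SolrJ sketch is all I mean (the tenant-to-bucket mapping, collection names, and ZooKeeper addresses are invented):

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class HybridTenantBuckets {
  public static void main(String[] args) throws Exception {
    // In practice this mapping would live in a config store or database,
    // sized so each "bucket" collection stays under the document count you
    // have found works with minimal autowarming.
    Map<String, String> tenantToCollection = new HashMap<>();
    tenantToCollection.put("tenant_acme", "bucket_small_01");    // many small tenants share
    tenantToCollection.put("tenant_bigcorp", "bucket_large_01"); // a big tenant gets its own

    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")   // placeholder ZK ensemble
        .build()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "order-1001");
      doc.addField("tenant_id", "tenant_acme");

      // Send the update only to the bucket collection that owns this tenant;
      // a soft commit there leaves the other buckets' searchers untouched.
      client.add(tenantToCollection.get("tenant_acme"), doc);
    }
  }
}
```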
However, I really have to question one of your basic statements:

"This works fine with aggressive autowarming, but I have a need to reduce my NRT search latency to seconds as opposed to the minutes it is at now..."

The implication here is that your autowarming takes minutes. Very often people severely overdo the warmup by setting their autowarm counts to 100s or 1000s. This is rarely necessary, especially if you use docValues fields appropriately. Very often much of autowarming is "uninverting" fields (look in your Solr log). Essentially, for any field where you see this, use docValues and loading will be much faster.

You also haven't said how many documents you have in a shard at present. This is actually the metric I use most often to size hardware. I claim you can find a sweet spot where minimal autowarming will give you good enough performance, and that number is what you should design to. Of course YMMV.

Finally: push back really hard on how aggressive NRT support needs to be. Often "requirements" like this are made without much thought, as in "faster is better, let's make it 1 second!" There are situations where that's true, but it comes at a cost. Users may be better served by a predictable system than by one that's fast but unpredictable. "Documents may take up to 5 minutes to appear and searches will usually take less than a second" is nice and concise; I know what to expect. "Documents are searchable in 1 second, but the results may not come back for between 1 and 10 seconds" is much more frustrating.

FWIW,
Erick

On Sat, May 6, 2017 at 5:12 AM, Chris Troullis <cptroul...@gmail.com> wrote:

Hi,

I use Solr to serve multiple tenants, and currently all tenants' data resides in one large collection, and queries have a tenant identifier. This works fine with aggressive autowarming, but I have a need to reduce my NRT search latency to seconds as opposed to the minutes it is at now, which will mean drastically reducing, if not eliminating, my autowarming. As such I am considering splitting my index out by tenant, so that when one tenant modifies their data it doesn't blow away all of the searcher-based caches for all tenants on soft commit.

I have done a lot of research on the subject, and it seems like SolrCloud can have problems handling large numbers of collections. I'm obviously going to have to run some tests to see how it performs, but my main question is this: are there pros and cons to splitting the index into multiple collections vs. having one collection but splitting it into multiple shards? In my case I would have a shard per tenant and use implicit routing to route to that specific shard. As I understand it, a shard is basically its own Lucene index, so I would still be eating that overhead with either approach. What I don't know is whether there are any other overheads involved WRT collections vs. shards, routing, ZooKeeper, etc.

Thanks,

Chris
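P.S. For concreteness, the shard-per-tenant setup I'm picturing would be something like the untested SolrJ sketch below (the collection name, shard names, and ZooKeeper addresses are made up, and I'm assuming SolrJ's createCollectionWithImplicitRouter helper is available in the version we'd use):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class ShardPerTenantSketch {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("zk1:2181,zk2:2181,zk3:2181")   // placeholder ZK ensemble
        .build()) {

      // One collection using the implicit router, with one named shard per tenant.
      CollectionAdminRequest.createCollectionWithImplicitRouter(
          "tenants", "shared_config", "tenant_acme,tenant_bigcorp", 2)
          .process(client);

      // Index into a specific tenant's shard via the _route_ parameter.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "order-1001");
      doc.addField("tenant_id", "tenant_acme");
      UpdateRequest update = new UpdateRequest();
      update.add(doc);
      update.setParam("_route_", "tenant_acme");
      update.process(client, "tenants");

      // Query only that tenant's shard.
      SolrQuery q = new SolrQuery("*:*");
      q.set("_route_", "tenant_acme");
      client.query("tenants", q);
    }
  }
}
```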