Hi Susheel,

No, it's not drastically different. There are other collections with more
fields and more documents that don't have this issue. And the collection is
not sharded: just 1 shard with 2 replicas. Both replicas are similar in
response time.

Thanks,
Chris

On Wed, Jun 13, 2018 at 2:37 PM, Susheel Kumar <susheel2...@gmail.com> wrote:
> Is this collection in any way drastically different from the others in
> terms of schema, number of fields, total documents, etc.? Is it sharded,
> and if so, can you look at which shard is taking more time with
> shards.info=true?
>
> Thnx
> Susheel
>
> On Wed, Jun 13, 2018 at 2:29 PM, Chris Troullis <cptroul...@gmail.com>
> wrote:
>
> > Thanks Erick,
> >
> > Seems to be a mixed bag in terms of tlog size across all of our indexes,
> > but currently the index with the performance issues has 4 tlog files
> > totaling ~200 MB. This still seems high to me since the collections are
> > in sync, and we hard commit every minute, but it's less than the ~8GB it
> > was before we cleaned them up. Spot checking some other indexes shows
> > that some have tlogs >3GB, but none of those indexes are having
> > performance issues (on the same solr node), so I'm not sure it's
> > related. We have 13 collections of various sizes running on our solr
> > cloud cluster, and none of them seem to have this issue except for this
> > one index, which is not our largest index in terms of size on disk or
> > number of documents.
> >
> > As far as the response intervals, just running a default search *:*
> > sorting on our id field so that we get consistent results across
> > environments, and returning 200 results (our max page size in app) with
> > ~20 fields, we see times of ~3.5 seconds in production, compared to ~1
> > second on one of our lower environments with an exact copy of the index.
> > Both have CDCR enabled and have identical clusters.
> >
> > Unfortunately, currently the only instance we are seeing the issue on is
> > production, so we are limited in the tests that we can run. I did
> > confirm in the lower environment that the doc cache is large enough to
> > hold all of the results, and that both the doc and query caches should
> > be serving the results.
> > Obviously in production we have much more indexing going on, but we do
> > utilize autowarming for our caches so our response times are still
> > stable across new searchers.
> >
> > We did move the lower environment to the same ESX host as our production
> > cluster, so that it is getting resources from the same pool (CPU, RAM,
> > etc). The only thing that is different is the disks, but the lower
> > environment is running on slower disks than production. And if it was a
> > disk issue you would think it would be affecting all of the collections,
> > not just this one.
> >
> > It's a mystery!
> >
> > Chris
> >
> > On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > First, nice job of eliminating all the standard stuff!
> > >
> > > About tlogs: Sanity check: They aren't growing again, right? They
> > > should hit a relatively steady state. The tlogs are used as a queueing
> > > mechanism for CDCR to durably store updates until they can
> > > successfully be transmitted to the target. So I'd expect them to hit a
> > > fairly steady number.
> > >
> > > Your lack of CPU/IO spikes is also indicative of something weird:
> > > somehow Solr is just sitting around doing nothing. What intervals are
> > > we talking about here for response? 100ms? 5000ms?
> > >
> > > When you hammer the same query over and over, you should see your
> > > queryResultCache hits increase. If that's the case, Solr is doing no
> > > work at all for the search, just assembling the response packet which,
> > > as you say, should be in the documentCache. This assumes it's big
> > > enough to hold all of the docs that are requested by all the
> > > simultaneous requests. The queryResultCache will be flushed every time
> > > a new searcher is opened. So if you still get your poor response
> > > times, and your queryResultCache hits are increasing, then Solr is
> > > doing pretty much nothing.
> > >
> > > So does this behavior still occur if you aren't adding docs to the
> > > index? If you turn indexing off as a test, that'd be another data
> > > point.
> > >
> > > And, of course, if it's at all possible to just take the CDCR
> > > configuration out of your solrconfig file temporarily that'd nail
> > > whether CDCR is the culprit or whether it's coincidental. You say that
> > > CDCR is the only difference between the environments, but I've
> > > certainly seen situations where it turns out to be a bad disk
> > > controller or something that's _also_ different.
> > >
> > > Now, assuming all that's inconclusive, I'm afraid the next step would
> > > be to throw a profiler at it. Maybe pull some stack traces.
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis <cptroul...@gmail.com>
> > > wrote:
> > > > Thanks Erick. A little more info:
> > > >
> > > > - We do have buffering disabled everywhere, as I had read multiple
> > > >   posts on the mailing list regarding the issue you described.
> > > > - We soft commit (with openSearcher=true) pretty frequently (every 15
> > > >   seconds) as we have some NRT requirements. We hard commit every 60
> > > >   seconds. We never commit manually, only via the autocommit timers.
> > > >   We have been using these settings for a long time and have never
> > > >   had any issues until recently. And all of our other indexes are
> > > >   fine (some larger than this one).
> > > > - We do have documentCache enabled, although it's not very big. But I
> > > >   can literally spam the same query over and over again with no other
> > > >   queries hitting the box, so all the results should be cached.
> > > > - We don't see any CPU/IO spikes when running these queries, our load
> > > >   is pretty much flat on all accounts.
> > > >
> > > > I know it seems odd that CDCR would be the culprit, but it's really
> > > > the only thing we've changed, and we have other environments running
> > > > the exact same setup with no issues, so it is really making us tear
> > > > our hair out. And when we cleaned up the huge tlogs it didn't seem to
> > > > make any difference in the query time (I was originally thinking it
> > > > was somehow searching through the tlogs for documents, and that's why
> > > > it was taking so long to retrieve the results, but I don't know if
> > > > that is actually how it works).
> > > >
> > > > Are you aware of any logger settings we could increase to potentially
> > > > get a better idea of where the time is being spent? I took the
> > > > eventual query response and just hosted it as a static file on the
> > > > same machine via nginx and it downloaded lightning fast (I was trying
> > > > to rule out network as the culprit), so it seems like the time is
> > > > being spent somewhere in solr.
> > > >
> > > > Thanks,
> > > > Chris
> > > >
> > > > On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson <erickerick...@gmail.com>
> > > > wrote:
> > > >
> > > >> Having the tlogs be huge is a red flag. Do you have buffering enabled
> > > >> in CDCR? This was something of a legacy option that's going to be
> > > >> removed, it's been made obsolete by the ability of CDCR to bootstrap
> > > >> the entire index. Buffering should be disabled always.
> > > >>
> > > >> Another reason tlogs can grow is if you have very long times between
> > > >> hard commits. I doubt that's your issue, but just in case.
> > > >>
> > > >> And the final reason tlogs can grow is that the connection between
> > > >> source and target clusters is broken, but that doesn't sound like
> > > >> what you're seeing either since you say the target cluster is
> > > >> keeping up.
> > > >>
> > > >> The process of assembling the response can be long.
> > > >> If you have any stored fields (and not docValues-enabled), Solr will
> > > >> 1> seek the stored data on disk
> > > >> 2> decompress (min 16K blocks)
> > > >> 3> transmit the thing back to your client
> > > >>
> > > >> The decompressed version of the doc will be held in the
> > > >> documentCache configured in solrconfig.xml, so it may or may not be
> > > >> cached in memory. That said, this stuff is all MemMapped and the
> > > >> decompression isn't usually an issue; I'd expect you to see very
> > > >> large CPU spikes and/or I/O contention if that was the case.
> > > >>
> > > >> CDCR shouldn't really be that much of a hit, mostly I/O. Solr will
> > > >> have to look in the tlogs to get you the very most recent copy, so
> > > >> the first thing I'd do is get the tlogs under control.
> > > >>
> > > >> The other possibility (again unrelated to CDCR) is if your spikes
> > > >> are coincident with soft commits or
> > > >> hard-commits-with-openSearcher-true.
> > > >>
> > > >> In all, though, none of the usual suspects seems to make sense here
> > > >> since you say that absent configuring CDCR things seem to run fine.
> > > >> So I'd look at the tlogs and my commit intervals. Once the tlogs are
> > > >> under control then move on to other possibilities if the problem
> > > >> persists...
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >>
> > > >> On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis <cptroul...@gmail.com>
> > > >> wrote:
> > > >> > Hi all,
> > > >> >
> > > >> > Recently we have gone live using CDCR on our 2 node solr cloud
> > > >> > cluster (7.2.1). From a CDCR perspective, everything seems to be
> > > >> > working fine...collections are staying in sync across the cluster,
> > > >> > everything looks good.
> > > >> >
> > > >> > The issue we are seeing is with 1 collection in particular: after
> > > >> > we set up CDCR, we are getting extremely slow response times when
> > > >> > retrieving documents. Debugging the query shows QTime is almost
> > > >> > nothing, but the overall responseTime is like 5x what it should
> > > >> > be. The problem is exacerbated by larger result sizes, i.e.
> > > >> > retrieving 25 results is almost normal, but 200 results is way
> > > >> > slower than normal. I can run the exact same query multiple times
> > > >> > in a row (so everything should be cached), and I still see
> > > >> > response times way higher than another environment that is not
> > > >> > using CDCR. It doesn't seem to matter if CDCR is enabled or
> > > >> > disabled, just that we are using the CdcrUpdateLog. The problem
> > > >> > started happening even before we enabled CDCR.
> > > >> >
> > > >> > In a lower environment we noticed that the transaction logs were
> > > >> > huge (multiple gigs), so we tried stopping solr and deleting the
> > > >> > tlogs then restarting, and that seemed to fix the performance
> > > >> > issue. We tried the same thing in production the other day but it
> > > >> > had no effect, so now I don't know if it was a coincidence or not.
> > > >> >
> > > >> > Things that we have tried:
> > > >> >
> > > >> > - Completely deleting the collection and rebuilding from scratch
> > > >> > - Running the query directly from solr admin to eliminate other
> > > >> >   causes
> > > >> > - Doing a tcpdump on the solr node to eliminate a network issue
> > > >> >
> > > >> > None of these things have yielded any results. It seems very
> > > >> > inconsistent. Some environments we can reproduce it in, others we
> > > >> > can't. Hardware/configuration/network is exactly the same between
> > > >> > all environments.
> > > >> > The only thing we have been able to narrow it down to is that we
> > > >> > are pretty sure it has something to do with CDCR, as the issue
> > > >> > only started when we started using it.
> > > >> >
> > > >> > I'm wondering if any of this sparks any ideas from anyone, or if
> > > >> > people have suggestions as to how I can figure out what is causing
> > > >> > this long query response time? The debug flag on the query seems
> > > >> > more geared towards seeing where time is spent in the actual
> > > >> > query, which is nothing in my case. The time is spent retrieving
> > > >> > the results, which I don't have much information on. I have tried
> > > >> > increasing the log level but nothing jumps out at me in the solr
> > > >> > logs. Is there something I can look for specifically to help debug
> > > >> > this?
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Chris
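
For anyone who wants to put a number on the gap described in this thread
(QTime near zero, total response time of several seconds), a minimal Python
sketch follows. It is not something the posters ran; the host, port, and
collection name are hypothetical placeholders, and the query parameters
simply mirror the test described above (*:* sorted on id, 200 rows). It
times the whole HTTP round trip and subtracts the QTime that Solr reports
in the response header, so what remains is roughly the time spent fetching
stored fields, writing the response, and transferring it.

```python
import json
import time
import urllib.parse
import urllib.request


def retrieval_overhead_ms(wall_ms, solr_response):
    """Return wall-clock time minus the QTime reported by Solr, in ms.

    QTime covers the search itself; the remainder approximates document
    retrieval, response writing, and network transfer.
    """
    qtime_ms = solr_response["responseHeader"]["QTime"]
    return wall_ms - qtime_ms


def timed_select(base_url, collection, params, timeout=30):
    """Run one /select request and return (wall_ms, parsed JSON response)."""
    url = f"{base_url}/{collection}/select?{urllib.parse.urlencode(params)}"
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read()  # include the time to download the full body
    wall_ms = (time.perf_counter() - start) * 1000.0
    return wall_ms, json.loads(body)


if __name__ == "__main__":
    # Hypothetical host and collection name; adjust for your cluster.
    params = {"q": "*:*", "sort": "id asc", "rows": 200, "wt": "json"}
    for attempt in range(5):
        wall_ms, response = timed_select(
            "http://localhost:8983/solr", "mycollection", params
        )
        qtime_ms = response["responseHeader"]["QTime"]
        overhead_ms = retrieval_overhead_ms(wall_ms, response)
        print(f"run {attempt}: wall={wall_ms:.0f}ms "
              f"QTime={qtime_ms}ms overhead={overhead_ms:.0f}ms")
```

Repeating the same request several times, as the loop does, makes it easier
to see whether the overhead stays high even when the caches should already
be warm.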