Re: Suggestions for debugging performance issue

Chris Troullis Wed, 13 Jun 2018 13:41:08 -0700

Hi Susheel,

It's not drastically different, no. There are other collections with more
fields and more documents that don't have this issue. And the collection is
not sharded: just 1 shard with 2 replicas. Both replicas are similar in
response time.


Thanks,
Chris

On Wed, Jun 13, 2018 at 2:37 PM, Susheel Kumar <susheel2...@gmail.com>
wrote:

> Is this collection in any way drastically different from the others in terms
> of schema, number of fields, total documents, etc.? Is it sharded, and if so,
> can you check which shard is taking more time with shards.info=true?
>
> Thnx
> Susheel
>
> On Wed, Jun 13, 2018 at 2:29 PM, Chris Troullis <cptroul...@gmail.com>
> wrote:
>
> > Thanks Erick,
> >
> > Seems to be a mixed bag in terms of tlog size across all of our indexes,
> > but currently the index with the performance issues has 4 tlog files
> > totaling ~200 MB. This still seems high to me since the collections are in
> > sync, and we hard commit every minute, but it's less than the ~8 GB it was
> > before we cleaned them up. Spot checking some other indexes shows some have
> > tlogs >3 GB, but none of those indexes are having performance issues (on
> > the same solr node), so I'm not sure it's related. We have 13 collections
> > of various sizes running on our solr cloud cluster, and none of them seem
> > to have this issue except for this one index, which is not our largest
> > index in terms of size on disk or number of documents.
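
For spot checks like this, a minimal Python sketch that totals the tlog files
for one core. Matching on a "tlog." filename prefix is an assumption about how
the files are named on disk, and the path in the note below is hypothetical:

```python
from pathlib import Path


def tlog_usage(tlog_dir):
    """Return (file_count, total_bytes) for transaction logs in tlog_dir.

    Assumption: transaction log files are regular files whose names start
    with "tlog."; anything else in the directory is ignored.
    """
    files = [
        p for p in Path(tlog_dir).iterdir()
        if p.is_file() and p.name.startswith("tlog.")
    ]
    return len(files), sum(p.stat().st_size for p in files)
```

Calling tlog_usage("/var/solr/data/<core>/tlog") (a made-up path) at intervals
gives numbers that can be compared over time to see whether the logs have
reached a steady state.
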
> >
> > As far as the response intervals, just running a default search *:* sorting
> > on our id field so that we get consistent results across environments, and
> > returning 200 results (our max page size in app) with ~20 fields, we see
> > times of ~3.5 seconds in production, compared to ~1 second on one of our
> > lower environments with an exact copy of the index. Both have CDCR enabled
> > and have identical clusters.
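
One way to put numbers on a comparison like this is to separate the QTime that
Solr reports in the response header from the wall-clock time the client sees.
The sketch below is only illustrative: the function names are made up, and it
assumes a JSON response (wt=json) from a reachable Solr instance.

```python
import json
import time
import urllib.request


def split_timing(wall_ms, response_body):
    """Split wall-clock time into Solr's reported QTime and the remainder."""
    qtime_ms = json.loads(response_body)["responseHeader"]["QTime"]
    return {"qtime_ms": qtime_ms, "other_ms": wall_ms - qtime_ms}


def timed_select(url):
    """Fetch url once and return the timing split (needs a running Solr)."""
    start = time.perf_counter()
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    wall_ms = (time.perf_counter() - start) * 1000.0
    return split_timing(wall_ms, body)
```

For example, timed_select() could be pointed at a placeholder URL such as
http://localhost:8983/solr/mycollection/select?q=*:*&sort=id+asc&rows=200&wt=json
and run with rows=25 and rows=200 in each environment. A large "other_ms" with
a small "qtime_ms" would match the symptom described in this thread.
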
> >
> > Unfortunately, currently the only instance we are seeing the issue on is
> > production, so we are limited in the tests that we can run. I did confirm
> > in the lower environment that the doc cache is large enough to hold all of
> > the results, and that both the doc and query caches should be serving the
> > results. Obviously in production we have much more indexing going on, but
> > we do utilize autowarming for our caches so our response times are still
> > stable across new searchers.
> >
> > We did move the lower environment to the same ESX host as our production
> > cluster, so that it is getting resources from the same pool (CPU, RAM,
> > etc). The only thing that is different is the disks, but the lower
> > environment is running on slower disks than production. And if it was a
> > disk issue you would think it would be affecting all of the collections,
> > not just this one.
> >
> > It's a mystery!
> >
> > Chris
> >
> >
> >
> > On Wed, Jun 13, 2018 at 10:38 AM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> > > First, nice job of eliminating all the standard stuff!
> > >
> > > About tlogs: Sanity check: They aren't growing again, right? They
> > > should hit a relatively steady state. The tlogs are used as a queueing
> > > mechanism for CDCR to durably store updates until they can
> > > successfully be transmitted to the target. So I'd expect them to hit a
> > > fairly steady number.
> > >
> > > Your lack of CPU/IO spikes is also indicative of something weird:
> > > somehow Solr is just sitting around doing nothing. What intervals are we
> > > talking about here for response? 100ms? 5000ms?
> > >
> > > When you hammer the same query over and over, you should see your
> > > queryResultCache hits increase. If that's the case, Solr is doing no
> > > work at all for the search, just assembling the response packet which,
> > > as you say, should be in the documentCache. This assumes it's big
> > > enough to hold all of the docs that are requested by all the
> > > simultaneous requests. The queryResultCache will be flushed
> > > every time a new searcher is opened. So if you still get your poor
> > > response times, and your queryResultCache hits are increasing, then
> > > Solr is doing pretty much nothing.
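
The check described above can be sketched as a comparison of two snapshots of
the cache counters, taken before and after repeating the query. How the
snapshots are fetched is left out, and the "lookups" and "hits" keys are
assumed names for whatever counters your Solr version shows for the
queryResultCache.

```python
def cache_delta(before, after):
    """Return new lookups, new hits and hit ratio between two snapshots.

    A negative or zero lookup delta (for instance because a new searcher
    reset the counters between snapshots) yields a hit_ratio of None.
    """
    lookups = after["lookups"] - before["lookups"]
    hits = after["hits"] - before["hits"]
    ratio = hits / lookups if lookups > 0 else None
    return {"lookups": lookups, "hits": hits, "hit_ratio": ratio}
```

If ten repeats of the same query produce ten new lookups and ten new hits, the
search itself is being served from cache and the time is going elsewhere. With
frequent soft commits the two snapshots need to be taken within the lifetime of
one searcher.
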
> > >
> > > So does this behavior still occur if you aren't adding docs to the
> > > index? If you turn indexing off as a test, that'd be another data
> > > point.
> > >
> > > And, of course, if it's at all possible to just take the CDCR
> > > configuration out of your solrconfig file temporarily that'd nail
> > > whether CDCR is the culprit or whether it's coincidental. You say that
> > > CDCR is the only difference between the environments, but I've
> > > certainly seen situations where it turns out to be a bad disk
> > > controller or something that's _also_ different.
> > >
> > > Now, assuming all that's inconclusive, I'm afraid the next step would
> > > be to throw a profiler at it. Maybe pull some stack traces.
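
A poor man's version of that: take several thread dumps while the slow query is
running (for example with the JDK's jstack tool) and tally the topmost frame of
each thread. The parsing below assumes the usual jstack text layout, a quoted
thread-name line followed by "at ..." frame lines. It is a sketch, not a robust
parser.

```python
from collections import Counter


def top_frames(dump_text):
    """Count the topmost stack frame of each thread in a jstack-style dump."""
    counts = Counter()
    expecting_frame = False
    for line in dump_text.splitlines():
        stripped = line.strip()
        if stripped.startswith('"'):
            # A quoted thread name starts a new thread block.
            expecting_frame = True
        elif expecting_frame and stripped.startswith("at "):
            # Only the first frame after the thread header is counted.
            counts[stripped[3:]] += 1
            expecting_frame = False
    return counts
```

Summing the counters from a handful of dumps and looking at most_common() shows
where threads spend their time most often, which may hint at what the request
is waiting on.
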
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Jun 13, 2018 at 6:15 AM, Chris Troullis <cptroul...@gmail.com>
> > > wrote:
> > > > Thanks Erick. A little more info:
> > > >
> > > > -We do have buffering disabled everywhere, as I had read multiple posts
> > > > on the mailing list regarding the issue you described.
> > > > -We soft commit (with openSearcher=true) pretty frequently (15 seconds)
> > > > as we have some NRT requirements. We hard commit every 60 seconds. We
> > > > never commit manually, only via the autocommit timers. We have been
> > > > using these settings for a long time and have never had any issues
> > > > until recently. And all of our other indexes are fine (some larger than
> > > > this one).
> > > > -We do have documentCache enabled, although it's not very big. But I
> > > > can literally spam the same query over and over again with no other
> > > > queries hitting the box, so all the results should be cached.
> > > > -We don't see any CPU/IO spikes when running these queries, our load is
> > > > pretty much flat on all accounts.
> > > >
> > > > I know it seems odd that CDCR would be the culprit, but it's really the
> > > > only thing we've changed, and we have other environments running the
> > > > exact same setup with no issues, so it is really making us tear our
> > > > hair out. And when we cleaned up the huge tlogs it didn't seem to make
> > > > any difference in the query time (I was originally thinking it was
> > > > somehow searching through the tlogs for documents, and that's why it
> > > > was taking so long to retrieve the results, but I don't know if that is
> > > > actually how it works).
> > > >
> > > > Are you aware of any logger settings we could increase to potentially
> > > > get a better idea of where the time is being spent? I took the eventual
> > > > query response and just hosted it as a static file on the same machine
> > > > via nginx and it downloaded lightning fast (I was trying to rule out
> > > > network as the culprit), so it seems like the time is being spent
> > > > somewhere in solr.
> > > >
> > > > Thanks,
> > > > Chris
> > > >
> > > > On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson <erickerick...@gmail.com>
> > > > wrote:
> > > >
> > > >> Having the tlogs be huge is a red flag. Do you have buffering enabled
> > > >> in CDCR? This was something of a legacy option that's going to be
> > > >> removed, it's been made obsolete by the ability of CDCR to bootstrap
> > > >> the entire index. Buffering should be disabled always.
> > > >>
> > > >> Another reason tlogs can grow is if you have very long times between
> > > >> hard commits. I doubt that's your issue, but just in case.
> > > >>
> > > >> And the final reason tlogs can grow is that the connection between
> > > >> source and target clusters is broken, but that doesn't sound like what
> > > >> you're seeing either since you say the target cluster is keeping up.
> > > >>
> > > >> The process of assembling the response can be long. If you have any
> > > >> stored fields (and not docValues-enabled), Solr will
> > > >> 1> seek the stored data on disk
> > > >> 2> decompress (min 16K blocks)
> > > >> 3> transmit the thing back to your client
> > > >>
> > > >> The decompressed version of the doc will be held in the
> > > >> documentCache configured in solrconfig.xml, so it may or may not
> > > >> be cached in memory. That said, this stuff is all MemMapped and the
> > > >> decompression isn't usually an issue, I'd expect you to see very large
> > > >> CPU spikes and/or I/O contention if that was the case.
> > > >>
> > > >> CDCR shouldn't really be that much of a hit, mostly I/O. Solr will
> > > >> have to look in the tlogs to get you the very most recent copy, so the
> > > >> first place I'd look is keeping the tlogs under control.
> > > >>
> > > >> The other possibility (again unrelated to CDCR) is if your spikes are
> > > >> coincident with soft commits or hard-commits-with-openSearcher-true.
> > > >>
> > > >> In all, though, none of the usual suspects seems to make sense here
> > > >> since you say that absent configuring CDCR things seem to run fine. So
> > > >> I'd look at the tlogs and my commit intervals. Once the tlogs are
> > > >> under control then move on to other possibilities if the problem
> > > >> persists...
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >>
> > > >> On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis <cptroul...@gmail.com>
> > > >> wrote:
> > > >> > Hi all,
> > > >> >
> > > >> > Recently we have gone live using CDCR on our 2 node solr cloud
> > > >> > cluster (7.2.1). From a CDCR perspective, everything seems to be
> > > >> > working fine...collections are staying in sync across the cluster,
> > > >> > everything looks good.
> > > >> >
> > > >> > The issue we are seeing is with 1 collection in particular: after we
> > > >> > set up CDCR, we are getting extremely slow response times when
> > > >> > retrieving documents. Debugging the query shows QTime is almost
> > > >> > nothing, but the overall responseTime is like 5x what it should be.
> > > >> > The problem is exacerbated by larger result sizes, e.g. retrieving 25
> > > >> > results is almost normal, but 200 results is way slower than normal.
> > > >> > I can run the exact same query multiple times in a row (so everything
> > > >> > should be cached), and I still see response times way higher than
> > > >> > another environment that is not using CDCR. It doesn't seem to matter
> > > >> > if CDCR is enabled or disabled, just that we are using the
> > > >> > CdcrUpdateLog. The problem started happening even before we enabled
> > > >> > CDCR.
> > > >> >
> > > >> > In a lower environment we noticed that the transaction logs were huge
> > > >> > (multiple gigs), so we tried stopping solr and deleting the tlogs
> > > >> > then restarting, and that seemed to fix the performance issue. We
> > > >> > tried the same thing in production the other day but it had no
> > > >> > effect, so now I don't know if it was a coincidence or not.
> > > >> >
> > > >> > Things that we have tried:
> > > >> >
> > > >> > -Completely deleting the collection and rebuilding from scratch
> > > >> > -Running the query directly from solr admin to eliminate other causes
> > > >> > -Doing a tcpdump on the solr node to eliminate a network issue
> > > >> > None of these things have yielded any results. It seems very
> > > >> > inconsistent. Some environments we can reproduce it in, others we
> > > >> > can't. Hardware/configuration/network is exactly the same between all
> > > >> > environments. The only thing that we have narrowed it down to is we
> > > >> > are pretty sure it has something to do with CDCR, as the issue only
> > > >> > started when we started using it.
> > > >> >
> > > >> > I'm wondering if any of this sparks any ideas from anyone, or if
> > > >> > people have suggestions as to how I can figure out what is causing
> > > >> > this long query response time? The debug flag on the query seems more
> > > >> > geared towards seeing where time is spent in the actual query, which
> > > >> > is nothing in my case. The time is spent retrieving the results,
> > > >> > which I don't have much information on. I have tried increasing the
> > > >> > log level but nothing jumps out at me in the solr logs. Is there
> > > >> > something I can look for specifically to help debug this?
> > > >> >
> > > >> > Thanks,
> > > >> >
> > > >> > Chris
> > > >>
> > >
> >
>
