Thanks Erick. A little more info:

-We do have buffering disabled everywhere, as I had read multiple posts on
the mailing list regarding the issue you described.
-We soft commit (with openSearcher=true) pretty frequently (every 15
seconds) as we have some NRT requirements. We hard commit every 60 seconds.
We never commit manually, only via the autocommit timers. We have been
using these settings for a long time and never had any issues until
recently, and all of our other indexes are fine (some larger than this one).
-We do have the documentCache enabled, although it's not very big. But I
can literally spam the same query over and over again with no other queries
hitting the box, so all the results should be cached.
-We don't see any CPU/IO spikes when running these queries, our load is
pretty much flat on all accounts.

I know it seems odd that CDCR would be the culprit, but it's really the
only thing we've changed, and we have other environments running the exact
same setup with no issues, so it is really making us tear our hair out. And
when we cleaned up the huge tlogs it didn't seem to make any difference in
the query time (I was originally thinking it was somehow searching through
the tlogs for documents, and that's why it was taking so long to retrieve
the results, but I don't know if that is actually how it works).
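To keep an eye on whether the tlogs are growing back, something like the
following could be run against a core's tlog directory. This is only a
minimal sketch: the function name tlog_usage is made up for illustration, and
the directory path has to be supplied by whoever runs it, since it depends on
the install and the core name.

```python
import os


def tlog_usage(tlog_dir):
    """Return (file_count, total_bytes) for the regular files directly
    inside tlog_dir. Subdirectories are ignored."""
    count = 0
    total = 0
    with os.scandir(tlog_dir) as entries:
        for entry in entries:
            if entry.is_file():
                count += 1
                total += entry.stat().st_size
    return count, total


# Example call (the path is an assumption, adjust to the real core):
#   count, total = tlog_usage("/path/to/core/data/tlog")
#   print(f"{count} tlog files, {total / 1024**2:.1f} MiB")
```

Running it before and after a hard commit on each node would show whether
the logs are actually being rolled over and cleaned up.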

Are you aware of any logger settings we could increase to potentially get a
better idea of where the time is being spent? I took the eventual query
response and hosted it as a static file on the same machine via nginx, and
it downloaded lightning fast (I was trying to rule out the network as the
culprit), so it seems like the time is being spent somewhere in Solr.
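One way to put numbers on the gap between QTime and the total response time
is to time the full request from a client and subtract the QTime that Solr
reports in the response header. The sketch below assumes a JSON response
(wt=json) containing responseHeader.QTime; the function names are
hypothetical and the URL is left to the caller.

```python
import json
import time
import urllib.request


def timing_breakdown(body, wall_ms):
    """Split wall-clock time into Solr's reported QTime and the rest.

    body is the raw JSON response (bytes or str); wall_ms is the
    client-side elapsed time in milliseconds."""
    qtime_ms = json.loads(body)["responseHeader"]["QTime"]
    return {
        "qtime_ms": qtime_ms,
        "wall_ms": wall_ms,
        "outside_qtime_ms": wall_ms - qtime_ms,
        "bytes": len(body),
    }


def timed_query(url):
    """Fetch url, read the whole body, and return the timing breakdown."""
    start = time.perf_counter()
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    wall_ms = (time.perf_counter() - start) * 1000.0
    return timing_breakdown(body, wall_ms)
```

Calling timed_query with the same select URL at rows=25 and rows=200, in
both the slow environment and a healthy one, should show whether the extra
time scales with the number of documents returned rather than with the
search itself.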

Thanks,
Chris

On Tue, Jun 12, 2018 at 2:45 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Having the tlogs be huge is a red flag. Do you have buffering enabled
> in CDCR? This was something of a legacy option that's going to be
> removed; it's been made obsolete by the ability of CDCR to bootstrap
> the entire index. Buffering should always be disabled.
>
> Another reason tlogs can grow is if you have very long times between
> hard commits. I doubt that's your issue, but just in case.
>
> And the final reason tlogs can grow is that the connection between
> source and target clusters is broken, but that doesn't sound like what
> you're seeing either since you say the target cluster is keeping up.
>
> The process of assembling the response can be long. If you have any
> stored fields (and not docValues-enabled), Solr will
> 1> seek the stored data on disk
> 2> decompress (min 16K blocks)
> 3> transmit the thing back to your client
>
> The decompressed version of the doc will be held in the
> documentCache configured in solrconfig.xml, so it may or may not
> be cached in memory. That said, this stuff is all MemMapped and the
> decompression isn't usually an issue; I'd expect you to see very large
> CPU spikes and/or I/O contention if that was the case.
>
> CDCR shouldn't really be that much of a hit, mostly I/O. Solr will
> have to look in the tlogs to get you the very most recent copy, so the
> first place I'd look is keeping the tlogs under control.
>
> The other possibility (again unrelated to CDCR) is if your spikes are
> coincident with soft commits or hard-commits-with-opensearcher-true.
>
> In all, though, none of the usual suspects seems to make sense here
> since you say that absent configuring CDCR things seem to run fine. So
> I'd look at the tlogs and my commit intervals. Once the tlogs are
> under control then move on to other possibilities if the problem
> persists...
>
> Best,
> Erick
>
>
> On Tue, Jun 12, 2018 at 11:06 AM, Chris Troullis <cptroul...@gmail.com>
> wrote:
> > Hi all,
> >
> > Recently we have gone live using CDCR on our 2 node solr cloud cluster
> > (7.2.1). From a CDCR perspective, everything seems to be working
> > fine...collections are staying in sync across the cluster, everything
> > looks good.
> >
> > The issue we are seeing is with 1 collection in particular. After we set
> > up CDCR, we are getting extremely slow response times when retrieving
> > documents. Debugging the query shows QTime is almost nothing, but the
> > overall responseTime is like 5x what it should be. The problem is
> > exacerbated by larger result sizes, e.g. retrieving 25 results is almost
> > normal, but 200 results is way slower than normal. I can run the exact
> > same query multiple times in a row (so everything should be cached), and
> > I still see response times way higher than another environment that is
> > not using CDCR. It doesn't seem to matter if CDCR is enabled or disabled,
> > just that we are using the CdcrUpdateLog. The problem started happening
> > even before we enabled CDCR.
> >
> > In a lower environment we noticed that the transaction logs were huge
> > (multiple gigs), so we tried stopping Solr, deleting the tlogs, and
> > restarting, and that seemed to fix the performance issue. We tried the
> > same thing in production the other day but it had no effect, so now I
> > don't know if it was a coincidence or not.
> >
> > Things that we have tried:
> >
> > -Completely deleting the collection and rebuilding from scratch
> > -Running the query directly from solr admin to eliminate other causes
> > -Doing a tcpdump on the solr node to eliminate a network issue
> >
> > None of these things have yielded any results. It seems very
> > inconsistent. Some environments we can reproduce it in, others we can't.
> > Hardware/configuration/network is exactly the same between all
> > environments. The only thing that we have narrowed it down to is we are
> > pretty sure it has something to do with CDCR, as the issue only started
> > when we started using it.
> >
> > I'm wondering if any of this sparks any ideas from anyone, or if people
> > have suggestions as to how I can figure out what is causing this long
> > query response time? The debug flag on the query seems more geared
> > towards seeing where time is spent in the actual query, which is nothing
> > in my case. The time is spent retrieving the results, which I don't have
> > much information on. I have tried increasing the log level but nothing
> > jumps out at me in the solr logs. Is there something I can look for
> > specifically to help debug this?
> >
> > Thanks,
> >
> > Chris
>
