Regarding batch indexing: When I send batches of 1000 docs to a standalone Solr server, the log file reports "(1000 adds)" in LogUpdateProcessor. But when I send them to the leader of a replicated index, the leader log file reports much smaller numbers, usually "(12 adds)". Why do the batches appear to be broken up?
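For reference, "a batch of 1000 docs" here means a single update request carrying 1000 documents. A rough SolrJ-style equivalent is sketched below; the URL, core name, and field names are placeholders only, not details from this thread.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchAddSketch {
        public static void main(String[] args) throws Exception {
            // Standalone Solr core; URL and field names are placeholders.
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
            for (int i = 0; i < 1000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                doc.addField("title", "example document " + i);
                batch.add(doc);
            }

            server.add(batch);  // one update request carrying all 1000 docs
            server.shutdown();  // no commit here; autoCommit in solrconfig.xml handles that
        }
    }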
Peter

On Fri, Oct 31, 2014 at 10:40 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> NP, just making sure.
>
> I suspect you'll get lots more bang for the buck, and results much more closely matching your expectations, if:
>
> 1> you batch up a bunch of docs at once rather than sending them one at a time. That's probably the easiest thing to try. Sending docs one at a time is something of an anti-pattern. I usually start with batches of 1,000.
>
> And just to check: you're not issuing any commits from the client, right? Performance will be terrible if you issue commits after every doc; that's totally an anti-pattern. Doubly so for optimizes. Since you showed us your solrconfig autocommit settings I'm assuming not, but want to be sure.
>
> 2> use a leader-aware client. I'm totally unfamiliar with Go, so I have no suggestions whatsoever to offer there. But you'll want to batch in this case too.
>
> On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose <ianr...@fullstory.com> wrote:
> > Hi Erick -
> >
> > Thanks for the detailed response, and apologies for my confusing terminology. I should have said "WPS" (writes per second) instead of QPS, but I didn't want to introduce a weird new acronym since QPS is well known. Clearly a bad decision on my part. To clarify: I am doing *only* writes (document adds). Whenever I wrote "QPS" I was referring to writes.
> >
> > It seems clear at this point that I should wrap up the code to do "smart" routing rather than choose Solr nodes randomly, and then see if that changes things. I must admit that although I understand that random node selection will impose a performance hit, theoretically it seems to me that the system should still scale up as you add more nodes (albeit at a lower absolute level of performance than if you used a smart router). Nonetheless, I'm just theorycrafting here, so the better thing to do is just try it experimentally. I hope to have that working today - will report back on my findings.
> >
> > Cheers,
> > - Ian
> >
> > p.s. To clarify why we are rolling our own smart router code: we use Go over here rather than Java. Although if we still get bad performance with our custom Go router, I may try a pure Java load client using CloudSolrServer to eliminate the possibility of bugs in our implementation.
> >
> > On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> >> I'm really confused:
> >>
> >> bq: I am not issuing any queries, only writes (document inserts)
> >>
> >> bq: It's clear that once the load test client has ~40 simulated users
> >>
> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right
> >>
> >> QPS is usually used to mean "Queries Per Second", which is different from the statement that "I am not issuing any queries...". And what does the number of users have to do with inserting documents?
> >>
> >> You also state: "In many cases, CPU on the solr servers is quite low as well"
> >>
> >> So let's talk about indexing first. Indexing should scale nearly linearly as long as:
> >> 1> you are routing your docs to the correct leader, which happens with SolrJ and the CloudSolrServer automatically. Rather than rolling your own, I strongly suggest you try this out.
> >> 2> you have enough clients feeding the cluster to push CPU utilization on them all.
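A rough sketch of the advice above in SolrJ terms: CloudSolrServer as the leader-aware client, 1,000-doc batches, and no commits from the client. The ZooKeeper addresses, collection name, field names, and document counts are placeholders, not details from this thread.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CloudBatchIndexSketch {
        public static void main(String[] args) throws Exception {
            // Leader-aware client: point it at ZooKeeper, not at an individual Solr node.
            CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
            server.setDefaultCollection("mycollection");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
            for (int i = 0; i < 100000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                batch.add(doc);

                // Send 1,000 docs per update request instead of one at a time.
                if (batch.size() == 1000) {
                    server.add(batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }

            // No commits from the client; autoCommit/autoSoftCommit in solrconfig.xml handle it.
            server.shutdown();
        }
    }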
> >> Very often "slow indexing", or in your case "lack of scaling", is a result of document acquisition; in your case, your doc generator is spending all its time waiting for the individual documents to get to Solr and come back.
> >>
> >> bq: "chooses a random solr server for each ADD request (with 1 doc per add request)"
> >>
> >> Probably your culprit right there. Each and every document requires that you cross the network (and forward that doc to the correct leader). So given that you're not seeing high CPU utilization, I suspect that you're not sending enough docs to SolrCloud fast enough to see scaling. You need to batch up multiple docs; I generally send 1,000 docs at a time.
> >>
> >> But even if you do solve this, the inter-node routing will prevent linear scaling. When a doc (or a batch of docs) goes to a random Solr node, here's what happens:
> >> 1> the docs are re-packaged into groups based on which shard they're destined for
> >> 2> the sub-packets are forwarded to the leader for each shard
> >> 3> the responses are gathered back and returned to the client.
> >>
> >> This set of operations will eventually degrade the scaling.
> >>
> >> bq: A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole idea behind sharding.
> >>
> >> If we're talking search requests, the answer is no. Sharding is what you do when your collection no longer fits on a single node. If it _does_ fit on a single node, then you'll usually get better query performance by adding a bunch of replicas to a single shard. When the number of docs on each shard grows large enough that you no longer get good query performance, _then_ you shard. And take the query hit.
> >>
> >> If we're talking about inserts, then see above. I suspect your problem is that you're _not_ "saturating the SolrCloud cluster"; you're sending docs to Solr very inefficiently and waiting on I/O. Batching docs and sending them to the right leader should scale pretty linearly until you start saturating your network.
> >>
> >> Best,
> >> Erick
> >>
> >> On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose <ianr...@fullstory.com> wrote:
> >> > Thanks for the suggestions so far, all.
> >> >
> >> > 1) We are not using SolrJ on the client (not using Java at all), but I am working on writing a "smart" router so that we can always send to the correct node. I am certainly curious to see how that changes things. Nonetheless, even with the overhead of extra routing hops, the observed behavior (no increase in performance with more nodes) doesn't make any sense to me.
> >> >
> >> > 2) Commits: we are using autoCommit with openSearcher=false (maxTime=60000) and autoSoftCommit (maxTime=15000).
> >> >
> >> > 3) Suggestions to batch documents certainly make sense for production code, but in this case I am not really concerned with absolute performance; I just want to see the *relative* performance as we use more Solr nodes. So I don't think batching or not really matters.
> >> >
> >> > 4) "No, that won't affect indexing speed all that much. The way to increase indexing speed is to increase the number of processes or threads that are indexing at the same time. Instead of having one client sending update requests, try five of them."
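Shawn's "try five of them" point, quoted just above, can be sketched in SolrJ as several indexing threads sharing one client. The thread count, ZooKeeper host, collection name, and document counts below are placeholders, not anything measured in this thread.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexSketch {
        public static void main(String[] args) throws Exception {
            // One shared client instance, reused by all indexing threads.
            final CloudSolrServer server = new CloudSolrServer("zk1:2181");
            server.setDefaultCollection("mycollection");

            // "Instead of having one client sending update requests, try five of them."
            ExecutorService pool = Executors.newFixedThreadPool(5);
            for (int t = 0; t < 5; t++) {
                final int threadId = t;
                pool.submit(new Runnable() {
                    public void run() {
                        try {
                            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(1000);
                            for (int i = 0; i < 20000; i++) {
                                SolrInputDocument doc = new SolrInputDocument();
                                doc.addField("id", "thread" + threadId + "-doc" + i);
                                batch.add(doc);
                                // Batch the adds as well, as suggested elsewhere in the thread.
                                if (batch.size() == 1000) {
                                    server.add(batch);
                                    batch.clear();
                                }
                            }
                            if (!batch.isEmpty()) {
                                server.add(batch);
                            }
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            server.shutdown();
        }
    }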
> >> > Can you elaborate on this some? I'm worried I might be misunderstanding something fundamental. A cluster of 3 shards over 3 Solr nodes *should* support a higher QPS than 2 shards over 2 Solr nodes, right? That's the whole idea behind sharding. Regarding your comment to "increase the number of processes or threads", note that for each value of K (number of Solr nodes) I measured with several different numbers of simulated users so that I could find a "saturation point". For example, take a look at my data for K=2:
> >> >
> >> > Num Nodes | Num Users | QPS
> >> > 2         | 1         | 472
> >> > 2         | 5         | 1790
> >> > 2         | 10        | 2290
> >> > 2         | 15        | 2850
> >> > 2         | 20        | 2900
> >> > 2         | 40        | 3210
> >> > 2         | 60        | 3200
> >> > 2         | 80        | 3210
> >> > 2         | 100       | 3180
> >> >
> >> > It's clear that once the load test client has ~40 simulated users, the Solr cluster is saturated. Creating more users just increases the average request latency, such that the total QPS remained (nearly) constant. So I feel pretty confident that a cluster of size 2 *maxes out* at ~3200 qps. The problem is that I am finding roughly this same "max point", no matter how many simulated users the load test client created, for any value of K (> 1).
> >> >
> >> > Cheers,
> >> > - Ian
> >> >
> >> > On Thu, Oct 30, 2014 at 8:01 PM, Erick Erickson <erickerick...@gmail.com> wrote:
> >> >
> >> >> Your indexing client, if written in SolrJ, should use CloudSolrServer, which is, in Matt's terms, "leader aware". It divides up the documents to be indexed into packets where each doc in the packet belongs on the same shard, and then sends the packet to the shard leader. This avoids a lot of re-routing and should scale essentially linearly. You may have to add more clients, though, depending upon how hard the document-generator is working.
> >> >>
> >> >> Also, make sure that you send batches of documents as Shawn suggests; I use 1,000 as a starting point.
> >> >>
> >> >> Best,
> >> >> Erick
> >> >>
> >> >> On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> >> >> > On 10/30/2014 2:56 PM, Ian Rose wrote:
> >> >> >> I think this is true only for actual queries, right? I am not issuing any queries, only writes (document inserts). In the case of writes, increasing the number of shards should increase my throughput (in ops/sec) more or less linearly, right?
> >> >> >
> >> >> > No, that won't affect indexing speed all that much. The way to increase indexing speed is to increase the number of processes or threads that are indexing at the same time. Instead of having one client sending update requests, try five of them. Also, index many documents with each update request. Sending one document at a time is very inefficient.
> >> >> >
> >> >> > You didn't say how you're doing commits, but those need to be as infrequent as you can manage. Ideally, you would use autoCommit with openSearcher=false on an interval of about five minutes, and send an explicit commit (with the default openSearcher=true) after all the indexing is done.
> >> >> >
> >> >> > You may have requirements regarding document visibility that this won't satisfy, but try to avoid doing commits with openSearcher=true (soft commits qualify for this) extremely frequently, like once a second.
> >> >> > Once a minute is much more realistic. Opening a new searcher is an expensive operation, especially if you have cache warming configured.
> >> >> >
> >> >> > Thanks,
> >> >> > Shawn
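A minimal sketch of the commit strategy Shawn describes, assuming a SolrJ client: no commits from the client while indexing, autoCommit (openSearcher=false) and autoSoftCommit in solrconfig.xml covering durability and visibility during the run, then one explicit commit once indexing is finished. The ZooKeeper host and collection name are placeholders.

    import org.apache.solr.client.solrj.impl.CloudSolrServer;

    public class CommitAtEndSketch {
        public static void main(String[] args) throws Exception {
            CloudSolrServer server = new CloudSolrServer("zk1:2181");  // placeholder ZK host
            server.setDefaultCollection("mycollection");               // placeholder collection

            // ... send all update requests here, in batches, without calling commit() ...
            // Durability during the run comes from autoCommit (openSearcher=false) in
            // solrconfig.xml; visibility comes from autoSoftCommit on a generous interval.

            // One explicit commit (default openSearcher=true) after all indexing is done.
            server.commit();
            server.shutdown();
        }
    }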