Re: Ideas for debugging poor SolrCloud scalability

Erick Erickson Fri, 31 Oct 2014 07:41:07 -0700

NP, just making sure.

I suspect you'll get a lot more bang for the buck, and
results that match your expectations much more closely, if:

1> you batch up a bunch of docs at once rather than
sending them one at a time. That's probably the easiest
thing to try. Sending docs one at a time is something of
an anti-pattern. I usually start with batches of 1,000.

And just to check: you're not issuing any commits from the
client, right? Performance will be terrible if you issue a commit
after every doc; that's totally an anti-pattern. Doubly so for
optimizes. Since you showed us your solrconfig autocommit
settings I'm assuming not, but I want to be sure.

2> use a leader-aware client. I'm totally unfamiliar with Go,
so I have no suggestions whatsoever to offer there. But you'll
want to batch in this case too.
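
A minimal sketch of the batching idea in 1>, written in Go with only the
standard library. It is an illustration under stated assumptions, not a
tested client: the Doc type, the chunk and postBatch helpers, and the
update URL taken from the command line are hypothetical names, and it
assumes the collection's update handler accepts a JSON array of documents.
It sends no commit parameter, so commits are left to the autoCommit
settings in solrconfig, and it makes no attempt at leader-aware routing.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
)

// Doc is a hypothetical document shape; real field names depend on the schema.
type Doc map[string]interface{}

// chunk splits docs into consecutive batches of at most size documents.
func chunk(docs []Doc, size int) [][]Doc {
	if size <= 0 {
		size = 1
	}
	var out [][]Doc
	for start := 0; start < len(docs); start += size {
		end := start + size
		if end > len(docs) {
			end = len(docs)
		}
		out = append(out, docs[start:end])
	}
	return out
}

// postBatch sends one batch as a JSON array in a single HTTP request.
// No commit is requested; that is left to autoCommit on the server.
func postBatch(client *http.Client, updateURL string, batch []Doc) error {
	body, err := json.Marshal(batch)
	if err != nil {
		return err
	}
	resp, err := client.Post(updateURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("update failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// Generate some placeholder docs (assumption: "id" is the unique key).
	docs := make([]Doc, 2500)
	for i := range docs {
		docs[i] = Doc{"id": fmt.Sprintf("doc-%d", i)}
	}
	batches := chunk(docs, 1000)
	fmt.Printf("%d docs -> %d update requests\n", len(docs), len(batches))

	if len(os.Args) < 2 {
		return // no update URL given: dry run only
	}
	for _, b := range batches {
		if err := postBatch(http.DefaultClient, os.Args[1], b); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
}
```

Run without arguments it only reports how many requests the batching would
produce; given an update URL as the first argument it posts each batch in
turn. Running several such senders in parallel is the natural next step
once batching is in place.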

On Fri, Oct 31, 2014 at 5:51 AM, Ian Rose <[email protected]> wrote:
> Hi Erick -
>
> Thanks for the detailed response and apologies for my confusing
> terminology.  I should have said "WPS" (writes per second) instead of QPS
> but I didn't want to introduce a weird new acronym since QPS is well
> known.  Clearly a bad decision on my part.  To clarify: I am doing *only*
> writes (document adds).  Whenever I wrote "QPS" I was referring to writes.
>
> It seems clear at this point that I should wrap up the code to do "smart"
> routing rather than choose Solr nodes randomly.  And then see if that
> changes things.  I must admit that although I understand that random node
> selection will impose a performance hit, theoretically it seems to me that
> the system should still scale up as you add more nodes (albeit at a lower
> absolute level of performance than if you used a smart router).
> Nonetheless, I'm just theorycrafting here so the better thing to do is just
> try it experimentally.  I hope to have that working today - will report
> back on my findings.
>
> Cheers,
> - Ian
>
> p.s. To clarify why we are rolling our own smart router code, we use Go
> over here rather than Java.  Although if we still get bad performance with
> our custom Go router I may try a pure Java load client using
> CloudSolrServer to eliminate the possibility of bugs in our implementation.
>
>
> On Fri, Oct 31, 2014 at 1:37 AM, Erick Erickson <[email protected]>
> wrote:
>
>> I'm really confused:
>>
>> bq: I am not issuing any queries, only writes (document inserts)
>>
>> bq: It's clear that once the load test client has ~40 simulated users
>>
>> bq: A cluster of 3 shards over 3 Solr nodes *should* support
>> a higher QPS than 2 shards over 2 Solr nodes, right
>>
>> QPS is usually used to mean "Queries Per Second", which is different from
>> the statement that "I am not issuing any queries....". And what does the
>> number of users have to do with inserting documents?
>>
>> You also state: " In many cases, CPU on the solr servers is quite low as
>> well"
>>
>> So let's talk about indexing first. Indexing should scale nearly
>> linearly as long as
>> 1> you are routing your docs to the correct leader, which happens
>> automatically with SolrJ and CloudSolrServer. Rather than rolling your
>> own, I strongly suggest you try this out.
>> 2> you have enough clients feeding the cluster to push CPU utilization
>> on all the nodes.
>> Very often "slow indexing", or in your case "lack of scaling", is a
>> result of slow document acquisition or, in your case, of your doc
>> generator spending all its time waiting for the individual documents
>> to get to Solr and come back.
>>
>> bq: "chooses a random solr server for each ADD request (with 1 doc per add
>> request)"
>>
>> Probably your culprit right there. Each and every document has to
>> cross the network on its own (and then be forwarded to the correct
>> leader). So given that you're not seeing high CPU utilization, I suspect
>> that you're not sending enough docs to SolrCloud fast enough to see
>> scaling. You need to batch up multiple docs; I generally send 1,000 docs
>> at a time.
>>
>> But even if you do solve this, the inter-node routing will prevent
>> linear scaling. When a doc (or a batch of docs) goes to a random Solr
>> node, here's what happens:
>> 1> the docs are re-packaged into groups based on which shard they're
>> destined for
>> 2> the sub-packets are forwarded to the leader for each shard
>> 3> the responses are gathered back and returned to the client.
>>
>> This set of operations will eventually degrade the scaling.
>>
>> bq:  A cluster of 3 shards over 3 Solr nodes *should* support
>> a higher QPS than 2 shards over 2 Solr nodes, right?  That's the whole idea
>> behind sharding.
>>
>> If we're talking search requests, the answer is no. Sharding is
>> what you do when your collection no longer fits on a single node.
>> If it _does_ fit on a single node, then you'll usually get better query
>> performance by adding a bunch of replicas to a single shard. When
>> the number of docs on each shard grows large enough that you
>> no longer get good query performance, _then_ you shard. And
>> take the query hit.
>>
>> If we're talking about inserts, then see above. I suspect your problem is
>> that you're _not_ "saturating the SolrCloud cluster", you're sending
>> docs to Solr very inefficiently and waiting on I/O. Batching docs and
>> sending them to the right leader should scale pretty linearly until you
>> start saturating your network.
>>
>> Best,
>> Erick
>>
>> On Thu, Oct 30, 2014 at 6:56 PM, Ian Rose <[email protected]> wrote:
>> > Thanks for the suggestions so far, all.
>> >
>> > 1) We are not using SolrJ on the client (not using Java at all) but I am
>> > working on writing a "smart" router so that we can always send to the
>> > correct node.  I am certainly curious to see how that changes things.
>> > Nonetheless even with the overhead of extra routing hops, the observed
>> > behavior (no increase in performance with more nodes) doesn't make any
>> > sense to me.
>> >
>> > 2) Commits: we are using autoCommit with openSearcher=false
>> > (maxTime=60000) and autoSoftCommit (maxTime=15000).
>> >
>> > 3) Suggestions to batch documents certainly make sense for production
>> > code but in this case I am not real concerned with absolute performance;
>> > I just want to see the *relative* performance as we use more Solr nodes.
>> > So I don't think batching or not really matters.
>> >
>> > 4) "No, that won't affect indexing speed all that much.  The way to
>> > increase indexing speed is to increase the number of processes or threads
>> > that are indexing at the same time.  Instead of having one client
>> > sending update requests, try five of them."
>> >
>> > Can you elaborate on this some?  I'm worried I might be misunderstanding
>> > something fundamental.  A cluster of 3 shards over 3 Solr nodes *should*
>> > support a higher QPS than 2 shards over 2 Solr nodes, right?  That's the
>> > whole idea behind sharding.  Regarding your comment of "increase the
>> > number of processes or threads", note that for each value of K (number
>> > of Solr nodes) I measured with several different numbers of simulated
>> > users so that I could find a "saturation point".  For example, take a
>> > look at my data for K=2:
>> >
>> > Num Nodes   Num Users   QPS
>> > 2           1           472
>> > 2           5           1790
>> > 2           10          2290
>> > 2           15          2850
>> > 2           20          2900
>> > 2           40          3210
>> > 2           60          3200
>> > 2           80          3210
>> > 2           100         3180
>> >
>> > It's clear that once the load test client has ~40 simulated users, the
>> > Solr cluster is saturated.  Creating more users just increases the
>> > average request latency, such that the total QPS remained (nearly)
>> > constant.  So I feel pretty confident that a cluster of size 2 *maxes
>> > out* at ~3200 qps.  The problem is that I am finding roughly this same
>> > "max point", no matter how many simulated users the load test client
>> > created, for any value of K (> 1).
>> >
>> > Cheers,
>> > - Ian
>> >
>> >
>> > On Thu, Oct 30, 2014 at 8:01 PM, Erick Erickson <[email protected]>
>> > wrote:
>> >
>> >> Your indexing client, if written in SolrJ, should use CloudSolrServer
>> >> which is, in Matt's terms, "leader aware". It divides up the
>> >> documents to be indexed into packets in which each doc
>> >> belongs on the same shard, and then sends each packet
>> >> to its shard leader. This avoids a lot of re-routing and should
>> >> scale essentially linearly. You may have to add more clients
>> >> though, depending upon how hard the document-generator is
>> >> working.
>> >>
>> >> Also, make sure that you send batches of documents as Shawn
>> >> suggests, I use 1,000 as a starting point.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Oct 30, 2014 at 2:10 PM, Shawn Heisey <[email protected]> wrote:
>> >> > On 10/30/2014 2:56 PM, Ian Rose wrote:
>> >> >> I think this is true only for actual queries, right? I am not issuing
>> >> >> any queries, only writes (document inserts). In the case of writes,
>> >> >> increasing the number of shards should increase my throughput (in
>> >> >> ops/sec) more or less linearly, right?
>> >> >
>> >> > No, that won't affect indexing speed all that much.  The way to
>> >> > increase indexing speed is to increase the number of processes or
>> >> > threads that are indexing at the same time.  Instead of having one
>> >> > client sending update requests, try five of them.  Also, index many
>> >> > documents with each update request.  Sending one document at a time
>> >> > is very inefficient.
>> >> >
>> >> > You didn't say how you're doing commits, but those need to be as
>> >> > infrequent as you can manage.  Ideally, you would use autoCommit with
>> >> > openSearcher=false on an interval of about five minutes, and send an
>> >> > explicit commit (with the default openSearcher=true) after all the
>> >> > indexing is done.
>> >> >
>> >> > You may have requirements regarding document visibility that this
>> >> > won't satisfy, but try to avoid doing commits with openSearcher=true
>> >> > (soft commits qualify for this) extremely frequently, like once a
>> >> > second.  Once a minute is much more realistic.  Opening a new searcher
>> >> > is an expensive operation, especially if you have cache warming
>> >> > configured.
>> >> >
>> >> > Thanks,
>> >> > Shawn
>> >> >
>> >>
>>
