I just backported Michael's fix, to be released in 8.5.2.

On Fri, May 15, 2020 at 6:38 AM Michael Gibney <mich...@michaelgibney.net> wrote:
Hi Wei,

SOLR-14471 has been merged, so this issue should be fixed in 8.6.
Thanks for reporting the problem!

Michael

On Mon, May 11, 2020 at 7:51 PM Wei <weiwan...@gmail.com> wrote:

Thanks Michael! Yes, in each shard I have 10 TLOG replicas, no other type
of replicas, and each TLOG replica is an individual Solr instance on its
own physical machine. In the JIRA you mentioned 'when "last place matches"
== "first place matches" – e.g. when shards.preference specified matches
*all* available replicas'. My setting is
shards.preference=replica.location:local,replica.type:TLOG.
I also tried just shards.preference=replica.location:local and it still
has the issue. Can you explain a bit more?

On Mon, May 11, 2020 at 12:26 PM Michael Gibney <mich...@michaelgibney.net> wrote:

FYI: https://issues.apache.org/jira/browse/SOLR-14471
Wei, assuming you have only TLOG replicas, your "last place" matches
(to which the random fallback ordering would not be applied -- see the
above issue) would be the same as the "first place" matches selected
for executing distributed requests.

On Mon, May 11, 2020 at 1:49 PM Michael Gibney <mich...@michaelgibney.net> wrote:

Wei, probably no need to answer my earlier questions; I think I see the
problem here, and believe it is indeed a bug, introduced in 8.3. Will
file an issue and submit a patch shortly.

Michael

On Mon, May 11, 2020 at 12:49 PM Michael Gibney <mich...@michaelgibney.net> wrote:

Hi Wei,

In considering this problem, I'm stumbling a bit on terminology
(particularly, where you mention "nodes", I think you're referring to
"replicas"?). Could you confirm that you have 10 TLOG replicas per
shard, for each of 6 shards? How many *nodes* (i.e., running Solr
server instances) do you have, and what is the replica placement like
across those nodes? What, if any, non-TLOG replicas do you have per
shard (not that it's necessarily relevant, but just to get a complete
picture of the situation)?

If you're able to without too much trouble, can you determine what the
behavior is like on Solr 8.3? (There were different changes introduced
to potentially relevant code in 8.3 and 8.4, and knowing whether the
behavior you're observing manifests on 8.3 would help narrow down
where to look for an explanation.)

Michael

On Fri, May 8, 2020 at 7:34 PM Wei <weiwan...@gmail.com> wrote:

Update: after I removed the shards.preference parameter from
solrconfig.xml, the issue is gone and internal shard requests are now
balanced. The same parameter works fine with Solr 7.6. I'm still not
sure of the root cause, but I observed a strange coincidence: the nodes
that are most frequently picked for shard requests are the first node
in each shard returned by the CLUSTERSTATUS API. Something seems wrong
with the shuffling of equally ranked nodes when shards.preference is
set. Will report back if I find more.
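For context on the update above: the shards.preference value that Wei
removed was configured as a default in solrconfig.xml. A minimal sketch of
how such a default is typically wired up, assuming it was attached to the
/select search handler (the handler name and exact placement are
assumptions, not stated in the thread):

    <!-- Sketch only: assumes shards.preference was set as a query-time
         default on the /select handler; the actual handler and placement
         in Wei's solrconfig.xml are not stated in the thread. -->
    <requestHandler name="/select" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="shards.preference">replica.location:local,replica.type:TLOG</str>
      </lst>
    </requestHandler>

Removing (or overriding per request) that one <str> line is what restored
balanced internal shard requests in Wei's test; SOLR-14471 fixes the
underlying shuffling bug so that the preference itself can be kept.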
On Mon, Apr 27, 2020 at 5:59 PM Wei <weiwan...@gmail.com> wrote:

Hi Erick,

I am measuring the number of shard requests, and it's for queries only, no
indexing requests. I have an external load balancer and see that each node
receives about an equal number of external queries. However, for the
internal shard queries the distribution is uneven: 6 nodes (one in each
shard, some of them leaders and some non-leaders) get about 80% of the
shard requests, while the other 54 nodes get about 20% of the shard
requests. I checked a few other parameters that are set:

-Dsolr.disable.shardsWhitelist=true
shards.preference=replica.location:local,replica.type:TLOG

Nothing there seems to explain the strange behavior. Any suggestions on how
to debug this?

-Wei

On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <erickerick...@gmail.com> wrote:

Wei:

How are you measuring utilization here? The number of incoming requests
or CPU?

The leader for each shard is certainly handling all of the indexing
requests since they're TLOG replicas, so that's one thing that might be
skewing your measurements.

Best,
Erick

On Apr 27, 2020, at 7:13 PM, Wei <weiwan...@gmail.com> wrote:

Hi everyone,

I have a strange issue after upgrading from 7.6.0 to 8.4.1. My cloud has 6
shards with 10 TLOG replicas in each shard. After the upgrade I noticed
that one of the replicas in each shard is handling most of the distributed
shard requests, so 6 nodes are heavily loaded while the other nodes are
idle. There is no change in the shard handler configuration:

<shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
  <int name="socketTimeout">30000</int>
  <int name="connTimeout">30000</int>
  <int name="maxConnectionsPerHost">500</int>
</shardHandlerFactory>

What could cause the unbalanced internal distributed requests?

Thanks in advance.

Wei
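For reference, the shard handler settings quoted above only tune the HTTP
client that Solr uses for the internal fan-out requests; they do not
influence which replica of each shard gets picked. A lightly annotated
sketch of the same settings (the comments are explanatory additions, not
part of the original configuration):

    <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
      <!-- read timeout for internal shard requests, in milliseconds -->
      <int name="socketTimeout">30000</int>
      <!-- connection timeout for internal shard requests, in milliseconds -->
      <int name="connTimeout">30000</int>
      <!-- cap on concurrent connections to any single host -->
      <int name="maxConnectionsPerHost">500</int>
    </shardHandlerFactory>

Replica selection order is governed separately, e.g. by the
shards.preference parameter discussed earlier in the thread, which is where
this imbalance turned out to originate (SOLR-14471).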