Yes, JVM heap settings.
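
If it helps, the effective heap on each instance can be read back from the metrics
API (group=jvm) rather than trusting the startup scripts. A rough sketch in Python;
the node list is just an example, not the full cluster:

import requests  # assumes the 'requests' package is installed

# Example hosts/ports taken from this thread; substitute your own.
NODES = ["10.156.112.50:10002", "10.156.116.34:10002", "10.156.122.13:10004"]

for node in NODES:
    url = "http://%s/solr/admin/metrics" % node
    # Gauge names below are the 8.x JVM memory metrics; they may vary slightly by version.
    jvm = requests.get(url, params={"group": "jvm"}, timeout=10).json()["metrics"]["solr.jvm"]
    print("%s heap used/max: %s / %s bytes"
          % (node, jvm.get("memory.heap.used"), jvm.get("memory.heap.max")))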

> On May 19, 2020, at 10:59 AM, Wei <weiwan...@gmail.com> wrote:
> 
> Hi Phill,
> 
> What is the RAM config you are referring to, JVM size? How is that related
> to the load balancing, if each node has the same configuration?
> 
> Thanks,
> Wei
> 
> On Mon, May 18, 2020 at 3:07 PM Phill Campbell
> <sirgilli...@yahoo.com.invalid> wrote:
> 
>> In my previous report the instances were configured to use as much RAM as
>> possible. With that configuration it seemed the load was not being balanced.
>> So I reconfigured and redeployed to use 1/4 of the RAM. What a difference
>> for the better!
>> 
>> 10.156.112.50   load average: 13.52, 10.56, 6.46
>> 10.156.116.34   load average: 11.23, 12.35, 9.63
>> 10.156.122.13   load average: 10.29, 12.40, 9.69
>> 
>> Very nice.
>> My test tool records RPS. In the “bad” configuration it was less than
>> 1 RPS. Now it is showing 21 RPS.
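>>
>> (If anyone wants to watch the balance over time instead of eyeballing the
>> raw JSON below, a rough sketch along these lines polls the same
>> requestTimes metric on each node. The host list and interval are
>> assumptions, not my actual test tool.)
>>
>> import time
>> import requests  # assumes the 'requests' package is installed
>>
>> NODES = ["10.156.112.50:10002", "10.156.116.34:10002", "10.156.122.13:10004"]
>> PARAMS = {"group": "core", "prefix": "QUERY./select.requestTimes"}
>>
>> while True:
>>     for node in NODES:
>>         data = requests.get("http://%s/solr/admin/metrics" % node,
>>                             params=PARAMS, timeout=10).json()
>>         for core, metrics in data["metrics"].items():
>>             timer = metrics["QUERY./select.requestTimes"]
>>             print("%s %s count=%d 1minRate=%.2f"
>>                   % (node, core, timer["count"], timer["1minRate"]))
>>     print("---")
>>     time.sleep(60)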
>> 
>> 
>> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> {
>>  "responseHeader":{
>>    "status":0,
>>    "QTime":161},
>>  "metrics":{
>>    "solr.core.BTS.shard1.replica_n2":{
>>      "QUERY./select.requestTimes":{
>>        "count":5723,
>>        "meanRate":6.8163888639859085,
>>        "1minRate":11.557013215119536,
>>        "5minRate":8.760356217628159,
>>        "15minRate":4.707624230995833,
>>        "min_ms":0.131545,
>>        "max_ms":388.710848,
>>        "mean_ms":30.300492048215947,
>>        "median_ms":6.336654,
>>        "stddev_ms":51.527164088667035,
>>        "p75_ms":35.427943,
>>        "p95_ms":140.025957,
>>        "p99_ms":230.533099,
>>        "p999_ms":388.710848}}}}
>> 
>> 
>> 
>> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> {
>>  "responseHeader":{
>>    "status":0,
>>    "QTime":11},
>>  "metrics":{
>>    "solr.core.BTS.shard2.replica_n8":{
>>      "QUERY./select.requestTimes":{
>>        "count":6469,
>>        "meanRate":7.502581801189549,
>>        "1minRate":12.211423085368564,
>>        "5minRate":9.445681397767322,
>>        "15minRate":5.216209798637846,
>>        "min_ms":0.154691,
>>        "max_ms":701.657394,
>>        "mean_ms":34.2734699171445,
>>        "median_ms":5.640378,
>>        "stddev_ms":62.27649205954566,
>>        "p75_ms":39.016371,
>>        "p95_ms":156.997982,
>>        "p99_ms":288.883028,
>>        "p999_ms":538.368031}}}}
>> 
>> 
>> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>> {
>>  "responseHeader":{
>>    "status":0,
>>    "QTime":67},
>>  "metrics":{
>>    "solr.core.BTS.shard3.replica_n16":{
>>      "QUERY./select.requestTimes":{
>>        "count":7109,
>>        "meanRate":7.787524673806184,
>>        "1minRate":11.88519763582083,
>>        "5minRate":9.893315557386755,
>>        "15minRate":5.620178363676527,
>>        "min_ms":0.150887,
>>        "max_ms":472.826462,
>>        "mean_ms":32.184282366621204,
>>        "median_ms":6.977733,
>>        "stddev_ms":55.729908615189196,
>>        "p75_ms":36.655011,
>>        "p95_ms":151.12627,
>>        "p99_ms":251.440162,
>>        "p999_ms":472.826462}}}}
>> 
>> 
>> Compare that to the previous report and you can see the improvement.
>> So, note to self: figure out the sweet spot for RAM usage. Use too much
>> and strange behavior appears. With too much RAM allocated, all the load
>> focused on one box and query times slowed.
>> I did not see any OOM errors during any of this.
>> 
>> Regards
>> 
>> 
>> 
>>> On May 18, 2020, at 3:23 PM, Phill Campbell
>> <sirgilli...@yahoo.com.INVALID> wrote:
>>> 
>>> I have been testing 8.5.2 and it looks like the load has moved but is
>> still on one machine.
>>> 
>>> Setup:
>>> 3 physical machines.
>>> Each machine hosts 8 instances of Solr.
>>> Each instance of Solr hosts one replica.
>>> 
>>> Another way to say it:
>>> Number of shards = 8. Replication factor = 3.
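>>> 
>>> A layout like that is what a Collections API CREATE along these lines
>>> produces; the URL below is illustrative, not necessarily the exact command
>>> used here:
>>> 
>>> http://10.156.112.50:10001/solr/admin/collections?action=CREATE&name=TEST_COLLECTION&numShards=8&replicationFactor=3&maxShardsPerNode=1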
>>> 
>>> Here is the cluster state. You can see that the leaders are well
>> distributed.
>>> 
>>> {"TEST_COLLECTION":{
>>>   "pullReplicas":"0",
>>>   "replicationFactor":"3",
>>>   "shards":{
>>>     "shard1":{
>>>       "range":"80000000-9fffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node3":{
>>>           "core":"TEST_COLLECTION_shard1_replica_n1",
>>>           "base_url":"http://10.156.122.13:10007/solr",
>>>           "node_name":"10.156.122.13:10007_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node5":{
>>>           "core":"TEST_COLLECTION_shard1_replica_n2",
>>>           "base_url":"http://10.156.112.50:10002/solr",
>>>           "node_name":"10.156.112.50:10002_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"},
>>>         "core_node7":{
>>>           "core":"TEST_COLLECTION_shard1_replica_n4",
>>>           "base_url":"http://10.156.112.50:10006/solr",
>>>           "node_name":"10.156.112.50:10006_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"}}},
>>>     "shard2":{
>>>       "range":"a0000000-bfffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node9":{
>>>           "core":"TEST_COLLECTION_shard2_replica_n6",
>>>           "base_url":"http://10.156.112.50:10003/solr",
>>>           "node_name":"10.156.112.50:10003_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node11":{
>>>           "core":"TEST_COLLECTION_shard2_replica_n8",
>>>           "base_url":"http://10.156.122.13:10004/solr",
>>>           "node_name":"10.156.122.13:10004_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"},
>>>         "core_node12":{
>>>           "core":"TEST_COLLECTION_shard2_replica_n10",
>>>           "base_url":"http://10.156.116.34:10008/solr",
>>>           "node_name":"10.156.116.34:10008_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"}}},
>>>     "shard3":{
>>>       "range":"c0000000-dfffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node15":{
>>>           "core":"TEST_COLLECTION_shard3_replica_n13",
>>>           "base_url":"http://10.156.122.13:10008/solr",
>>>           "node_name":"10.156.122.13:10008_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node17":{
>>>           "core":"TEST_COLLECTION_shard3_replica_n14",
>>>           "base_url":"http://10.156.116.34:10005/solr",
>>>           "node_name":"10.156.116.34:10005_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node19":{
>>>           "core":"TEST_COLLECTION_shard3_replica_n16",
>>>           "base_url":"http://10.156.116.34:10002/solr",
>>>           "node_name":"10.156.116.34:10002_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"}}},
>>>     "shard4":{
>>>       "range":"e0000000-ffffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node20":{
>>>           "core":"TEST_COLLECTION_shard4_replica_n18",
>>>           "base_url":"http://10.156.122.13:10001/solr",
>>>           "node_name":"10.156.122.13:10001_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node23":{
>>>           "core":"TEST_COLLECTION_shard4_replica_n21",
>>>           "base_url":"http://10.156.116.34:10004/solr",
>>>           "node_name":"10.156.116.34:10004_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node25":{
>>>           "core":"TEST_COLLECTION_shard4_replica_n22",
>>>           "base_url":"http://10.156.112.50:10001/solr",
>>>           "node_name":"10.156.112.50:10001_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"}}},
>>>     "shard5":{
>>>       "range":"0-1fffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node27":{
>>>           "core":"TEST_COLLECTION_shard5_replica_n24",
>>>           "base_url":"http://10.156.116.34:10007/solr",
>>>           "node_name":"10.156.116.34:10007_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node29":{
>>>           "core":"TEST_COLLECTION_shard5_replica_n26",
>>>           "base_url":"http://10.156.122.13:10006/solr",
>>>           "node_name":"10.156.122.13:10006_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node31":{
>>>           "core":"TEST_COLLECTION_shard5_replica_n28",
>>>           "base_url":"http://10.156.116.34:10006/solr",
>>>           "node_name":"10.156.116.34:10006_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"}}},
>>>     "shard6":{
>>>       "range":"20000000-3fffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node33":{
>>>           "core":"TEST_COLLECTION_shard6_replica_n30",
>>>           "base_url":"http://10.156.122.13:10002/solr",
>>>           "node_name":"10.156.122.13:10002_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"},
>>>         "core_node35":{
>>>           "core":"TEST_COLLECTION_shard6_replica_n32",
>>>           "base_url":"http://10.156.112.50:10008/solr",
>>>           "node_name":"10.156.112.50:10008_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node37":{
>>>           "core":"TEST_COLLECTION_shard6_replica_n34",
>>>           "base_url":"http://10.156.116.34:10003/solr",
>>>           "node_name":"10.156.116.34:10003_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"}}},
>>>     "shard7":{
>>>       "range":"40000000-5fffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node39":{
>>>           "core":"TEST_COLLECTION_shard7_replica_n36",
>>>           "base_url":"http://10.156.122.13:10003/solr",
>>>           "node_name":"10.156.122.13:10003_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"},
>>>         "core_node41":{
>>>           "core":"TEST_COLLECTION_shard7_replica_n38",
>>>           "base_url":"http://10.156.122.13:10005/solr",
>>>           "node_name":"10.156.122.13:10005_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node43":{
>>>           "core":"TEST_COLLECTION_shard7_replica_n40",
>>>           "base_url":"http://10.156.112.50:10004/solr",
>>>           "node_name":"10.156.112.50:10004_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"}}},
>>>     "shard8":{
>>>       "range":"60000000-7fffffff",
>>>       "state":"active",
>>>       "replicas":{
>>>         "core_node45":{
>>>           "core":"TEST_COLLECTION_shard8_replica_n42",
>>>           "base_url":"http://10.156.112.50:10007/solr",
>>>           "node_name":"10.156.112.50:10007_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"},
>>>         "core_node47":{
>>>           "core":"TEST_COLLECTION_shard8_replica_n44",
>>>           "base_url":"http://10.156.112.50:10005/solr",
>>>           "node_name":"10.156.112.50:10005_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false",
>>>           "leader":"true"},
>>>         "core_node48":{
>>>           "core":"TEST_COLLECTION_shard8_replica_n46",
>>>           "base_url":"http://10.156.116.34:10001/solr",
>>>           "node_name":"10.156.116.34:10001_solr",
>>>           "state":"active",
>>>           "type":"NRT",
>>>           "force_set_state":"false"}}}},
>>>   "router":{"name":"compositeId"},
>>>   "maxShardsPerNode":"1",
>>>   "autoAddReplicas":"false",
>>>   "nrtReplicas":"3",
>>>   "tlogReplicas":"0"}}
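>>> 
>>> (Tallying leaders and replicas per machine from a dump like the one above
>>> is easy to script; a rough sketch in Python, assuming the JSON has been
>>> saved locally as state.json with the same shape as the paste:)
>>> 
>>> import json
>>> from collections import Counter
>>> 
>>> with open("state.json") as f:  # assumed local copy of the dump above
>>>     state = json.load(f)
>>> 
>>> leaders, replicas = Counter(), Counter()
>>> for shard in state["TEST_COLLECTION"]["shards"].values():
>>>     for replica in shard["replicas"].values():
>>>         host = replica["node_name"].split(":")[0]  # e.g. 10.156.122.13
>>>         replicas[host] += 1
>>>         if replica.get("leader") == "true":
>>>             leaders[host] += 1
>>> 
>>> for host in sorted(replicas):
>>>     print("%s  replicas=%d  leaders=%d" % (host, replicas[host], leaders[host]))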
>>> 
>>> 
>>> Running TOP on each machine while load tests have been running for 60
>> minutes.
>>> 
>>> 10.156.112.50 load average: 0.08, 0.35, 1.65
>>> 10.156.116.34 load average: 24.71, 24.20, 20.65
>>> 10.156.122.13 load average: 5.37, 3.21, 4.04
>>> 
>>> 
>>> 
>>> Here are the stats from each shard leader.
>>> 
>>> 
>>> http://10.156.112.50:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":2},
>>> "metrics":{
>>>   "solr.core.BTS.shard1.replica_n2":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":805,
>>>       "meanRate":0.4385455794526838,
>>>       "1minRate":0.5110237122383522,
>>>       "5minRate":0.4671091682458005,
>>>       "15minRate":0.4057871940723353,
>>>       "min_ms":0.14047,
>>>       "max_ms":12424.589645,
>>>       "mean_ms":796.2194458711818,
>>>       "median_ms":10.534906,
>>>       "stddev_ms":2567.655224710497,
>>>       "p75_ms":22.893306,
>>>       "p95_ms":8316.33323,
>>>       "p99_ms":12424.589645,
>>>       "p999_ms":12424.589645}}}}
>>> 
>>> 
>>> http://10.156.122.13:10004/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":2},
>>> "metrics":{
>>>   "solr.core.BTS.shard2.replica_n8":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":791,
>>>       "meanRate":0.4244162938316224,
>>>       "1minRate":0.4869749626003825,
>>>       "5minRate":0.45856412657687656,
>>>       "15minRate":0.3948063845907493,
>>>       "min_ms":0.168369,
>>>       "max_ms":11022.763933,
>>>       "mean_ms":2572.0670957974603,
>>>       "median_ms":1490.222885,
>>>       "stddev_ms":2718.1710938804276,
>>>       "p75_ms":4292.490478,
>>>       "p95_ms":8487.18506,
>>>       "p99_ms":8855.936617,
>>>       "p999_ms":9589.218502}}}}
>>> 
>>> 
>>> http://10.156.116.34:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":83},
>>> "metrics":{
>>>   "solr.core.BTS.shard3.replica_n16":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":840,
>>>       "meanRate":0.4335334453288775,
>>>       "1minRate":0.5733683837779382,
>>>       "5minRate":0.4931753679028527,
>>>       "15minRate":0.42241330274699623,
>>>       "min_ms":0.155939,
>>>       "max_ms":18125.516406,
>>>       "mean_ms":7097.942850416767,
>>>       "median_ms":8136.862825,
>>>       "stddev_ms":2382.041897221542,
>>>       "p75_ms":8497.844088,
>>>       "p95_ms":9642.430475,
>>>       "p99_ms":9993.694346,
>>>       "p999_ms":12207.982291}}}}
>>> 
>>> 
>>> http://10.156.112.50:10001/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":3},
>>> "metrics":{
>>>   "solr.core.BTS.shard4.replica_n22":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":873,
>>>       "meanRate":0.43420303985137254,
>>>       "1minRate":0.4284437786865815,
>>>       "5minRate":0.44020640429418745,
>>>       "15minRate":0.40860871277629196,
>>>       "min_ms":0.136658,
>>>       "max_ms":11345.407699,
>>>       "mean_ms":511.28573906464504,
>>>       "median_ms":9.063677,
>>>       "stddev_ms":2038.8104673512248,
>>>       "p75_ms":20.270605,
>>>       "p95_ms":8418.131442,
>>>       "p99_ms":8904.78616,
>>>       "p999_ms":10447.78365}}}}
>>> 
>>> 
>>> http://10.156.116.34:10006/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":4},
>>> "metrics":{
>>>   "solr.core.BTS.shard5.replica_n28":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":863,
>>>       "meanRate":0.4419375762840668,
>>>       "1minRate":0.44487242228317025,
>>>       "5minRate":0.45927613542085916,
>>>       "15minRate":0.41056066296443494,
>>>       "min_ms":0.158855,
>>>       "max_ms":16669.411989,
>>>       "mean_ms":6513.057114006753,
>>>       "median_ms":8033.386692,
>>>       "stddev_ms":3002.7487311308896,
>>>       "p75_ms":8446.147616,
>>>       "p95_ms":9888.641316,
>>>       "p99_ms":13624.11926,
>>>       "p999_ms":13624.11926}}}}
>>> 
>>> 
>>> http://10.156.122.13:10002/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":2},
>>> "metrics":{
>>>   "solr.core.BTS.shard6.replica_n30":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":893,
>>>       "meanRate":0.43301141185981046,
>>>       "1minRate":0.4011485529441132,
>>>       "5minRate":0.447654905093643,
>>>       "15minRate":0.41489193746842407,
>>>       "min_ms":0.161571,
>>>       "max_ms":14716.828978,
>>>       "mean_ms":2932.212133523417,
>>>       "median_ms":1289.686481,
>>>       "stddev_ms":3426.22045100954,
>>>       "p75_ms":6230.031884,
>>>       "p95_ms":8109.408506,
>>>       "p99_ms":12904.515311,
>>>       "p999_ms":12904.515311}}}}
>>> 
>>> 
>>> 
>>> 
>>> http://10.156.122.13:10003/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":16},
>>> "metrics":{
>>>   "solr.core.BTS.shard7.replica_n36":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":962,
>>>       "meanRate":0.46572438680661055,
>>>       "1minRate":0.4974893681625287,
>>>       "5minRate":0.49072296556429784,
>>>       "15minRate":0.44138205926188756,
>>>       "min_ms":0.164803,
>>>       "max_ms":12481.82656,
>>>       "mean_ms":2606.899631183513,
>>>       "median_ms":1457.505387,
>>>       "stddev_ms":3083.297183477969,
>>>       "p75_ms":4072.543679,
>>>       "p95_ms":8562.456178,
>>>       "p99_ms":9351.230895,
>>>       "p999_ms":10430.483813}}}}
>>> 
>>> 
>>> http://10.156.112.50:10005/solr/admin/metrics?group=core&prefix=QUERY./select.requestTimes
>>> {
>>> "responseHeader":{
>>>   "status":0,
>>>   "QTime":3},
>>> "metrics":{
>>>   "solr.core.BTS.shard8.replica_n44":{
>>>     "QUERY./select.requestTimes":{
>>>       "count":904,
>>>       "meanRate":0.4356001115451976,
>>>       "1minRate":0.42906831311171356,
>>>       "5minRate":0.4651312663377039,
>>>       "15minRate":0.41812847342709225,
>>>       "min_ms":0.089738,
>>>       "max_ms":10857.092832,
>>>       "mean_ms":304.52127270799156,
>>>       "median_ms":7.098736,
>>>       "stddev_ms":1544.5378594679773,
>>>       "p75_ms":15.599817,
>>>       "p95_ms":93.818662,
>>>       "p99_ms":8510.757117,
>>>       "p999_ms":9353.844994}}}}
>>> 
>>> I restarted all of the instances on “34” so that there are no leaders on
>>> it. The load partially moved to the other box.
>>> 
>>> 10.156.112.50 load average: 0.00, 0.16, 0.47
>>> 10.156.116.34 load average: 17.00, 16.16, 17.07
>>> 10.156.122.13 load average: 17.86, 17.49, 14.74
>>> 
>>> Box “50” is still doing nothing AND it is the leader of 4 of the 8
>> shards.
>>> Box “13” is the leader of the remaining 4 shards.
>>> Box “34” is not the leader of any shard.
>>> 
>>> I will continue to test; who knows, it may be something I am doing.
>>> Maybe not enough RAM, etc., so I am definitely leaving open the
>>> possibility that I am not well configured for 8.5.
>>> 
>>> Regards
>>> 
>>> 
>>> 
>>> 
>>>> On May 16, 2020, at 5:08 PM, Tomás Fernández Löbbe <
>> tomasflo...@gmail.com> wrote:
>>>> 
>>>> I just backported Michael’s fix to be released in 8.5.2
>>>> 
>>>> On Fri, May 15, 2020 at 6:38 AM Michael Gibney <
>> mich...@michaelgibney.net>
>>>> wrote:
>>>> 
>>>>> Hi Wei,
>>>>> SOLR-14471 has been merged, so this issue should be fixed in 8.6.
>>>>> Thanks for reporting the problem!
>>>>> Michael
>>>>> 
>>>>> On Mon, May 11, 2020 at 7:51 PM Wei <weiwan...@gmail.com> wrote:
>>>>>> 
>>>>>> Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no other
>>>>> type
>>>>>> of replicas, and each Tlog replica is an individual solr instance on
>> its
>>>>>> own physical machine.  In the jira you mentioned 'when "last place
>>>>> matches"
>>>>>> == "first place matches" – e.g. when shards.preference specified
>> matches
>>>>>> *all* available replicas'.   My setting is
>>>>>> shards.preference=replica.location:local,replica.type:TLOG,
>>>>>> I also tried just shards.preference=replica.location:local and it
>> still
>>>>> has
>>>>>> the issue. Can you explain a bit more?
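>>>>>> 
>>>>>> (Side note while debugging: shards.preference is also accepted as a
>>>>>> per-request parameter, so the variants can be compared without touching
>>>>>> solrconfig.xml. A rough sketch; the host, collection and preference
>>>>>> values below are placeholders:)
>>>>>> 
>>>>>> import requests  # assumes the 'requests' package is installed
>>>>>> 
>>>>>> BASE = "http://localhost:8983/solr/COLLECTION/select"  # placeholder endpoint
>>>>>> for pref in [None, "replica.location:local",
>>>>>>              "replica.location:local,replica.type:TLOG"]:
>>>>>>     params = {"q": "*:*", "rows": 0}
>>>>>>     if pref is not None:
>>>>>>         params["shards.preference"] = pref
>>>>>>     rsp = requests.get(BASE, params=params, timeout=30).json()
>>>>>>     # Where the sub-requests actually landed still has to be checked via
>>>>>>     # the per-node metrics, as above; QTime is just a quick sanity check.
>>>>>>     print("shards.preference=%s QTime=%d" % (pref, rsp["responseHeader"]["QTime"]))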
>>>>>> 
>>>>>> On Mon, May 11, 2020 at 12:26 PM Michael Gibney <
>>>>> mich...@michaelgibney.net>
>>>>>> wrote:
>>>>>> 
>>>>>>> FYI: https://issues.apache.org/jira/browse/SOLR-14471
>>>>>>> Wei, assuming you have only TLOG replicas, your "last place" matches
>>>>>>> (to which the random fallback ordering would not be applied -- see
>>>>>>> above issue) would be the same as the "first place" matches selected
>>>>>>> for executing distributed requests.
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, May 11, 2020 at 1:49 PM Michael Gibney
>>>>>>> <mich...@michaelgibney.net> wrote:
>>>>>>>> 
>>>>>>>> Wei, probably no need to answer my earlier questions; I think I see
>>>>>>>> the problem here, and believe it is indeed a bug, introduced in 8.3.
>>>>>>>> Will file an issue and submit a patch shortly.
>>>>>>>> Michael
>>>>>>>> 
>>>>>>>> On Mon, May 11, 2020 at 12:49 PM Michael Gibney
>>>>>>>> <mich...@michaelgibney.net> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Wei,
>>>>>>>>> 
>>>>>>>>> In considering this problem, I'm stumbling a bit on terminology
>>>>>>>>> (particularly, where you mention "nodes", I think you're referring
>>>>> to
>>>>>>>>> "replicas"?). Could you confirm that you have 10 TLOG replicas per
>>>>>>>>> shard, for each of 6 shards? How many *nodes* (i.e., running solr
>>>>>>>>> server instances) do you have, and what is the replica placement
>>>>> like
>>>>>>>>> across those nodes? What, if any, non-TLOG replicas do you have per
>>>>>>>>> shard (not that it's necessarily relevant, but just to get a
>>>>> complete
>>>>>>>>> picture of the situation)?
>>>>>>>>> 
>>>>>>>>> If you're able without too much trouble, can you determine what the
>>>>>>>>> behavior is like on Solr 8.3? (there were different changes
>>>>> introduced
>>>>>>>>> to potentially relevant code in 8.3 and 8.4, and knowing whether
>>>>> the
>>>>>>>>> behavior you're observing manifests on 8.3 would help narrow down
>>>>>>>>> where to look for an explanation).
>>>>>>>>> 
>>>>>>>>> Michael
>>>>>>>>> 
>>>>>>>>> On Fri, May 8, 2020 at 7:34 PM Wei <weiwan...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Update:  after I removed the shards.preference parameter from
>>>>>>>>>> solrconfig.xml, the issue is gone and internal shard requests are
>>>>> now
>>>>>>>>>> balanced. The same parameter works fine with solr 7.6.  Still not
>>>>>>> sure of
>>>>>>>>>> the root cause, but I observed a strange coincidence: the nodes
>>>>> that
>>>>>>> are
>>>>>>>>>> most frequently picked for shard requests are the first node in
>>>>> each
>>>>>>> shard
>>>>>>>>>> returned from the CLUSTERSTATUS api.  Seems something wrong with
>>>>>>> shuffling
>>>>>>>>>> equally compared nodes when shards.preference is set.  Will
>>>>> report
>>>>>>> back if
>>>>>>>>>> I find more.
>>>>>>>>>> 
>>>>>>>>>> On Mon, Apr 27, 2020 at 5:59 PM Wei <weiwan...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi Eric,
>>>>>>>>>>> 
>>>>>>>>>>> I am measuring the number of shard requests, and it's for query
>>>>>>> only, no
>>>>>>>>>>> indexing requests.  I have an external load balancer and see
>>>>> each
>>>>>>> node
>>>>>>>>>>> received about the equal number of external queries. However
>>>>> for
>>>>>>> the
>>>>>>>>>>> internal shard queries,  the distribution is uneven:    6 nodes
>>>>>>> (one in
>>>>>>>>>>> each shard,  some of them are leaders and some are non-leaders
>>>>> )
>>>>>>> get about
>>>>>>>>>>> 80% of the shard requests, the other 54 nodes get about 20% of
>>>>>>> the shard
>>>>>>>>>>> requests.   I checked a few other parameters set:
>>>>>>>>>>> 
>>>>>>>>>>> -Dsolr.disable.shardsWhitelist=true
>>>>>>>>>>> shards.preference=replica.location:local,replica.type:TLOG
>>>>>>>>>>> 
>>>>>>>>>>> Nothing seems to cause the strange behavior.  Any suggestions
>>>>> how
>>>>>>> to
>>>>>>>>>>> debug this?
>>>>>>>>>>> 
>>>>>>>>>>> -Wei
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson <
>>>>>>> erickerick...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Wei:
>>>>>>>>>>>> 
>>>>>>>>>>>> How are you measuring utilization here? The number of incoming
>>>>>>> requests
>>>>>>>>>>>> or CPU?
>>>>>>>>>>>> 
>>>>>>>>>>>> The leaders for each shard are certainly handling all of the
>>>>>>> indexing
>>>>>>>>>>>> requests since they’re TLOG replicas, so that’s one thing that
>>>>>>> might be
>>>>>>>>>>>> skewing your measurements.
>>>>>>>>>>>> 
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Erick
>>>>>>>>>>>> 
>>>>>>>>>>>>> On Apr 27, 2020, at 7:13 PM, Wei <weiwan...@gmail.com>
>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I have a strange issue after upgrade from 7.6.0 to 8.4.1. My
>>>>>>> cloud has 6
>>>>>>>>>>>>> shards with 10 TLOG replicas each shard.  After upgrade I
>>>>>>> noticed that
>>>>>>>>>>>> one
>>>>>>>>>>>>> of the replicas in each shard is handling most of the
>>>>>>> distributed shard
>>>>>>>>>>>>> requests, so 6 nodes are heavily loaded while other nodes
>>>>> are
>>>>>>> idle.
>>>>>>>>>>>> There
>>>>>>>>>>>>> is no change in shard handler configuration:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <shardHandlerFactory name="shardHandlerFactory" class=
>>>>>>>>>>>>> "HttpShardHandlerFactory">
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <int name="socketTimeout">30000</int>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <int name="connTimeout">30000</int>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <int name="maxConnectionsPerHost">500</int>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> </shardHandlerFactory>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What could cause the unbalanced internal distributed
>>>>> requests?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Wei
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>> 
>>>>> 
>>> 
>> 
>> 
