Thank you, Shawn. It looks like it is being applied. This could be some
sort of chain reaction:
1. A drive or server fails.
2. HDFS starts re-replicating blocks, which causes network congestion.
3. Solr 7 can't talk, so it initiates its own replication process,
which causes more network congestion, which causes more replicas to
replicate.
4. Eventually HBase (we run HBase and Solr on the same machines) also
can't talk.
That is my running hypothesis, anyway!
We've made a change to limit how much bandwidth HDFS can use. One issue
we have seen is that replicas fail to replicate and then retry, over
and over. I believe they are hitting a timeout error; is that timeout
adjustable?
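In case it helps anyone else looking at the same symptom: the inter-node
timeouts in solr.xml may be the ones worth raising, though I'm not
certain they govern the recovery fetch specifically. A minimal sketch,
with placeholder values rather than recommendations:

```xml
<!-- solr.xml: raise distributed-request timeouts so slow, congested
     transfers are not aborted and retried endlessly.
     Values below are illustrative only. -->
<solr>
  <solrcloud>
    <!-- how long a forwarded update may take, in ms -->
    <int name="distribUpdateSoTimeout">600000</int>
    <int name="distribUpdateConnTimeout">60000</int>
  </solrcloud>
  <shardHandlerFactory name="shardHandlerFactory"
                       class="HttpShardHandlerFactory">
    <!-- socket read / connect timeouts for inter-node requests, in ms -->
    <int name="socketTimeout">600000</int>
    <int name="connTimeout">60000</int>
  </shardHandlerFactory>
</solr>
```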
-------------
{
"responseHeader":{
"status":0,
"QTime":134,
"params":{
"echoParams":"all",
"indent":"true",
"wt":"json",
"command":"details",
"maxWriteMBPerSec":"75"}},
"details":{
"indexSize":"156.72 GB",
"indexPath":"hdfs://nameservice1:8020/solr7.1.0/UNCLASS/core_node106/data/index/",
"commits":[[
"indexVersion",1528860019189,
"generation",8188,
"filelist",["_10k8.cfe",
"_10k8.cfs",
"_10k8.si",
"_10k8_1.liv",
"_1l1j.cfe",
"_1l1j.cfs",
"_1l1j.si",
"_1l1j_2.liv",
"_289p.cfe",
"_289p.cfs",
"_289p.si",
"_30fj.cfe",
"_30fj.cfs",
"_30fj.si",
"_30fj_8o.liv",
"_3ugu.cfe",
"_3ugu.cfs",
"_3ugu.si",
"_3uno.cfe",
"_3uno.cfs",
"_3uno.si",
"_3x64.cfe",
"_3x64.cfs",
"_3x64.si",
"_3zt7.cfe",
"_3zt7.cfs",
"_3zt7.si",
"_43mm.cfe",
"_43mm.cfs",
"_43mm.si",
"_43mm_o.liv",
"_487a.cfe",
"_487a.cfs",
"_487a.si",
"_4cxd.cfe",
"_4cxd.cfs",
"_4cxd.si",
"_4eux.cfe",
"_4eux.cfs",
"_4eux.si",
"_4jez.cfe",
"_4jez.cfs",
"_4jez.si",
"_4jez_f.liv",
"_4jgn.cfe",
"_4jgn.cfs",
"_4jgn.si",
"_4jgn_d.liv",
"_4jlm.cfe",
"_4jlm.cfs",
"_4jlm.si",
"_4jlm_9.liv",
"_4jm6.cfe",
"_4jm6.cfs",
"_4jm6.si",
"_4jm6_b.liv",
"_4jmr.cfe",
"_4jmr.cfs",
"_4jmr.si",
"_4jmr_2.liv",
"_4jna.cfe",
"_4jna.cfs",
"_4jna.si",
"_4jna_4.liv",
"_4joy.cfe",
"_4joy.cfs",
"_4joy.si",
"_4joy_5.liv",
"_4jpi.cfe",
"_4jpi.cfs",
"_4jpi.si",
"_4jpi_4.liv",
"_4jq2.cfe",
"_4jq2.cfs",
"_4jq2.si",
"_4jq2_4.liv",
"_4jqm.cfe",
"_4jqm.cfs",
"_4jqm.si",
"_4jqm_1.liv",
"_4jqn.cfe",
"_4jqn.cfs",
"_4jqn.si",
"_4jqn_2.liv",
"_4jqo.cfe",
"_4jqo.cfs",
"_4jqo.si",
"_4jqp.cfe",
"_4jqp.cfs",
"_4jqp.si",
"_4jqq.cfe",
"_4jqq.cfs",
"_4jqq.si",
"_4jqq_1.liv",
"_4jqr.cfe",
"_4jqr.cfs",
"_4jqr.si",
"_4jqs.cfe",
"_4jqs.cfs",
"_4jqs.si",
"_4jqt.cfe",
"_4jqt.cfs",
"_4jqt.si",
"_4jqu.cfe",
"_4jqu.cfs",
"_4jqu.si",
"_4jqv.cfe",
"_4jqv.cfs",
"_4jqv.si",
"_4jqv_1.liv",
"_4jqw.cfe",
"_4jqw.cfs",
"_4jqw.si",
"_4jqw_1.liv",
"_4jqx.cfe",
"_4jqx.cfs",
"_4jqx.si",
"_4jqx_1.liv",
"_4jqy.cfe",
"_4jqy.cfs",
"_4jqy.si",
"_4jqy_1.liv",
"_4jqz.cfe",
"_4jqz.cfs",
"_4jqz.si",
"_4jqz_1.liv",
"_4jr0.cfe",
"_4jr0.cfs",
"_4jr0.si",
"_4jr0_1.liv",
"_4jr1.cfe",
"_4jr1.cfs",
"_4jr1.si",
"_4jr2.cfe",
"_4jr2.cfs",
"_4jr2.si",
"_4jr3.cfe",
"_4jr3.cfs",
"_4jr3.si",
"_4jr3_1.liv",
"_4jr4.cfe",
"_4jr4.cfs",
"_4jr4.si",
"_4jr4_1.liv",
"_4jr5.cfe",
"_4jr5.cfs",
"_4jr5.si",
"_4jr6.cfe",
"_4jr6.cfs",
"_4jr6.si",
"_4jr6_1.liv",
"_4jr7.cfe",
"_4jr7.cfs",
"_4jr7.si",
"_4jr8.cfe",
"_4jr8.cfs",
"_4jr8.si",
"_4jr9.cfe",
"_4jr9.cfs",
"_4jr9.si",
"_4jr9_1.liv",
"_4jra.cfe",
"_4jra.cfs",
"_4jra.si",
"_4jra_1.liv",
"_4jrb.cfe",
"_4jrb.cfs",
"_4jrb.si",
"_4jrb_1.liv",
"_4jrc.cfe",
"_4jrc.cfs",
"_4jrc.si",
"_4jrc_1.liv",
"_4jrd.cfe",
"_4jrd.cfs",
"_4jrd.si",
"_4jre.cfe",
"_4jre.cfs",
"_4jre.si",
"_4jrf.cfe",
"_4jrf.cfs",
"_4jrf.si",
"_4jrg.cfe",
"_4jrg.cfs",
"_4jrg.si",
"_4jrh.cfe",
"_4jrh.cfs",
"_4jrh.si",
"_4jri.cfe",
"_4jri.cfs",
"_4jri.si",
"_4jri_1.liv",
"_4jrj.cfe",
"_4jrj.cfs",
"_4jrj.si",
"_4jrk.cfe",
"_4jrk.cfs",
"_4jrk.si",
"_4jrl.cfe",
"_4jrl.cfs",
"_4jrl.si",
"_itc.cfe",
"_itc.cfs",
"_itc.si",
"_itc_2s.liv",
"segments_6bg"]],
[
"indexVersion",1528861822922,
"generation",8189,
"filelist",["_10k8.cfe",
"_10k8.cfs",
"_10k8.si",
"_10k8_1.liv",
"_1l1j.cfe",
"_1l1j.cfs",
"_1l1j.si",
"_1l1j_2.liv",
"_289p.cfe",
"_289p.cfs",
"_289p.si",
"_30fj.cfe",
"_30fj.cfs",
"_30fj.si",
"_30fj_8o.liv",
"_3ugu.cfe",
"_3ugu.cfs",
"_3ugu.si",
"_3uno.cfe",
"_3uno.cfs",
"_3uno.si",
"_3x64.cfe",
"_3x64.cfs",
"_3x64.si",
"_3zt7.cfe",
"_3zt7.cfs",
"_3zt7.si",
"_43mm.cfe",
"_43mm.cfs",
"_43mm.si",
"_43mm_o.liv",
"_487a.cfe",
"_487a.cfs",
"_487a.si",
"_4cxd.cfe",
"_4cxd.cfs",
"_4cxd.si",
"_4eux.cfe",
"_4eux.cfs",
"_4eux.si",
"_4jez.cfe",
"_4jez.cfs",
"_4jez.si",
"_4jez_f.liv",
"_4jgn.cfe",
"_4jgn.cfs",
"_4jgn.si",
"_4jgn_d.liv",
"_4jlm.cfe",
"_4jlm.cfs",
"_4jlm.si",
"_4jlm_9.liv",
"_4jm6.cfe",
"_4jm6.cfs",
"_4jm6.si",
"_4jm6_c.liv",
"_4jmr.cfe",
"_4jmr.cfs",
"_4jmr.si",
"_4jmr_3.liv",
"_4jna.cfe",
"_4jna.cfs",
"_4jna.si",
"_4jna_5.liv",
"_4joy.cfe",
"_4joy.cfs",
"_4joy.si",
"_4joy_6.liv",
"_4jpi.cfe",
"_4jpi.cfs",
"_4jpi.si",
"_4jpi_4.liv",
"_4jq2.cfe",
"_4jq2.cfs",
"_4jq2.si",
"_4jq2_4.liv",
"_4jqm.cfe",
"_4jqm.cfs",
"_4jqm.si",
"_4jqm_1.liv",
"_4jqn.cfe",
"_4jqn.cfs",
"_4jqn.si",
"_4jqn_2.liv",
"_4jqr.cfe",
"_4jqr.cfs",
"_4jqr.si",
"_4jqu.cfe",
"_4jqu.cfs",
"_4jqu.si",
"_4jqv.cfe",
"_4jqv.cfs",
"_4jqv.si",
"_4jqv_1.liv",
"_4jqw.cfe",
"_4jqw.cfs",
"_4jqw.si",
"_4jqw_1.liv",
"_4jqy.cfe",
"_4jqy.cfs",
"_4jqy.si",
"_4jqy_1.liv",
"_4jqz.cfe",
"_4jqz.cfs",
"_4jqz.si",
"_4jqz_1.liv",
"_4jr0.cfe",
"_4jr0.cfs",
"_4jr0.si",
"_4jr0_1.liv",
"_4jr3.cfe",
"_4jr3.cfs",
"_4jr3.si",
"_4jr3_1.liv",
"_4jr6.cfe",
"_4jr6.cfs",
"_4jr6.si",
"_4jr6_2.liv",
"_4jr8.cfe",
"_4jr8.cfs",
"_4jr8.si",
"_4jr9.cfe",
"_4jr9.cfs",
"_4jr9.si",
"_4jr9_1.liv",
"_4jra.cfe",
"_4jra.cfs",
"_4jra.si",
"_4jra_1.liv",
"_4jrb.cfe",
"_4jrb.cfs",
"_4jrb.si",
"_4jrb_1.liv",
"_4jrd.cfe",
"_4jrd.cfs",
"_4jrd.si",
"_4jre.cfe",
"_4jre.cfs",
"_4jre.si",
"_4jrh.cfe",
"_4jrh.cfs",
"_4jrh.si",
"_4jro.cfe",
"_4jro.cfs",
"_4jro.si",
"_4jrp.cfe",
"_4jrp.cfs",
"_4jrp.si",
"_4jrq.cfe",
"_4jrq.cfs",
"_4jrq.si",
"_4jrr.cfe",
"_4jrr.cfs",
"_4jrr.si",
"_4jrr_1.liv",
"_4jrs.cfe",
"_4jrs.cfs",
"_4jrs.si",
"_4jrt.cfe",
"_4jrt.cfs",
"_4jrt.si",
"_4jru.cfe",
"_4jru.cfs",
"_4jru.si",
"_4jrv.cfe",
"_4jrv.cfs",
"_4jrv.si",
"_4jrw.cfe",
"_4jrw.cfs",
"_4jrw.si",
"_4jrx.cfe",
"_4jrx.cfs",
"_4jrx.si",
"_4jry.cfe",
"_4jry.cfs",
"_4jry.si",
"_4jrz.cfe",
"_4jrz.cfs",
"_4jrz.si",
"_4js0.cfe",
"_4js0.cfs",
"_4js0.si",
"_4js1.cfe",
"_4js1.cfs",
"_4js1.si",
"_4js2.cfe",
"_4js2.cfs",
"_4js2.si",
"_4js3.cfe",
"_4js3.cfs",
"_4js3.si",
"_itc.cfe",
"_itc.cfs",
"_itc.si",
"_itc_2s.liv",
"segments_6bh"]]],
"isMaster":"true",
"isSlave":"false",
"indexVersion":1528861822922,
"generation":8189,
"master":{
"replicateAfter":["commit"],
"replicationEnabled":"true",
"replicableVersion":1528861822922,
"replicableGeneration":8189}}}
-----------------
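For what it's worth, the echoed params can be checked programmatically
rather than by eye; a minimal sketch in Python (the JSON is abbreviated
from the response above):

```python
import json

# Abbreviated responseHeader from the details response above.  The check
# is simple: if the handler config was applied, the echoed params include
# the maxWriteMBPerSec we set in solrconfig.xml.
raw = """
{
  "responseHeader": {
    "status": 0,
    "QTime": 134,
    "params": {
      "echoParams": "all",
      "command": "details",
      "maxWriteMBPerSec": "75"
    }
  }
}
"""

params = json.loads(raw)["responseHeader"]["params"]
throttle = params.get("maxWriteMBPerSec")
if throttle is not None:
    print("throttle applied: %s MB/sec" % throttle)
else:
    print("maxWriteMBPerSec not echoed -- handler config not applied")
```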
-Joe
On 6/12/2018 11:48 AM, Shawn Heisey wrote:
On 6/11/2018 9:46 AM, Joe Obernberger wrote:
We are seeing an issue on our Solr Cloud 7.3.1 cluster where
replication starts and pegs network interfaces so aggressively that
other tasks cannot talk. We will see it peg a bonded 2GB interface.
In some cases the replication fails over and over until it finally
succeeds and the replica comes back up. Usually the error is a timeout.
Has anyone seen this? We've tried adjusting the /replication
requestHandler and setting:
<requestHandler name="/replication" class="solr.ReplicationHandler">
<lst name="defaults">
<str name="maxWriteMBPerSec">75</str>
</lst>
</requestHandler>
Here's something I'd like you to try. Open a browser and visit the URL
for the handler with some specific parameters, so we can see if that
config is actually being applied. Substitute the correct host, port,
and collection name:
http://host:port/solr/collection/replication?command=details&echoParams=all&wt=json&indent=true
And provide the full raw JSON response.
On a solr 7.3.0 example, I added your replication handler definition,
and this is the result of visiting a similar URL:
{
"responseHeader":{
"status":0,
"QTime":5,
"params":{
"echoParams":"all",
"indent":"true",
"wt":"json",
"command":"details",
"maxWriteMBPerSec":"75"}},
"details":{
"indexSize":"6.27 KB",
"indexPath":"C:\\Users\\sheisey\\Downloads\\solr-7.3.0\\server\\solr\\foo\\data\\index/",
"commits":[[
"indexVersion",1528213960436,
"generation",4,
"filelist",["_0.fdt",
"_0.fdx",
"_0.fnm",
"_0.si",
"_0_Lucene50_0.doc",
"_0_Lucene50_0.tim",
"_0_Lucene50_0.tip",
"_0_Lucene70_0.dvd",
"_0_Lucene70_0.dvm",
"_1.fdt",
"_1.fdx",
"_1.fnm",
"_1.nvd",
"_1.nvm",
"_1.si",
"_1_Lucene50_0.doc",
"_1_Lucene50_0.pos",
"_1_Lucene50_0.tim",
"_1_Lucene50_0.tip",
"_1_Lucene70_0.dvd",
"_1_Lucene70_0.dvm",
"_2.fdt",
"_2.fdx",
"_2.fnm",
"_2.nvd",
"_2.nvm",
"_2.si",
"_2_Lucene50_0.doc",
"_2_Lucene50_0.pos",
"_2_Lucene50_0.tim",
"_2_Lucene50_0.tip",
"_2_Lucene70_0.dvd",
"_2_Lucene70_0.dvm",
"segments_4"]]],
"isMaster":"true",
"isSlave":"false",
"indexVersion":1528213960436,
"generation":4,
"master":{
"replicateAfter":["commit"],
"replicationEnabled":"true"}}}
The maxWriteMBPerSec parameter can be seen in the response header, so on
this system, it looks like it's working.
Thanks,
Shawn