[jira] [Updated] (SOLR-8914) ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-8914:
---------------------------
    Fix Version/s:     (was: 6.0)
                       master (7.0)

Manually correcting fixVersion per Step #S5 of LUCENE-7271

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.4.1, 5.5, 6.0
>            Reporter: Hoss Man
>            Assignee: Scott Blum
>             Fix For: 5.5.1, 5.6, 6.0.1, 6.1, master (7.0)
>
>         Attachments: SOLR-8914.patch, SOLR-8914.patch, SOLR-8914.patch,
> SOLR-8914.patch, jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
>
>
> Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the
> weekend
> {noformat}
> http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
> Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7
> (refs/remotes/origin/branch_6x)
> Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
> {noformat}
> The failure happened during the static setup of the test, when a
> MiniSolrCloudCluster & several clients are initialized -- before any code
> related to TolerantUpdateProcessor is ever used.
> I can't reproduce this, or really make sense of what i'm (not) seeing here in
> the logs, so i'm filing this jira with my analysis in the hopes that someone
> else can help make sense of it.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (SOLR-8914) ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-8914:
-----------------------------
    Fix Version/s: 5.5.1

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.5, master, 6.0, 5.4.1
>            Reporter: Hoss Man
>            Assignee: Scott Blum
>             Fix For: master, 5.5.1, 6.1, 5.6, 6.0.1
>
>         Attachments: SOLR-8914.patch, SOLR-8914.patch, SOLR-8914.patch,
> SOLR-8914.patch, jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
[jira] [Updated] (SOLR-8914) ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-8914:
-----------------------------
    Fix Version/s: 5.6
                   6.0.1

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.5, master, 6.0, 5.4.1
>            Reporter: Hoss Man
>            Assignee: Scott Blum
>             Fix For: master, 6.1, 5.6, 6.0.1
>
>         Attachments: SOLR-8914.patch, SOLR-8914.patch, SOLR-8914.patch,
> SOLR-8914.patch, jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
[jira] [Updated] (SOLR-8914) ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-8914:
-----------------------------
    Fix Version/s: 6.1
                   master

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.5, master, 6.0, 5.4.1
>            Reporter: Hoss Man
>            Assignee: Scott Blum
>             Fix For: master, 6.1
>
>         Attachments: SOLR-8914.patch, SOLR-8914.patch, SOLR-8914.patch,
> SOLR-8914.patch, jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
[jira] [Updated] (SOLR-8914) ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-8914:
-----------------------------
    Affects Version/s: 6.0
                       master
                       5.5
                       5.4.1

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>    Affects Versions: 5.5, master, 6.0, 5.4.1
>            Reporter: Hoss Man
>            Assignee: Scott Blum
>         Attachments: SOLR-8914.patch, SOLR-8914.patch, SOLR-8914.patch,
> SOLR-8914.patch, jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
[jira] [Updated] (SOLR-8914) ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated SOLR-8914:
------------------------------
       Assignee: Mark Miller  (was: Shalin Shekhar Mangar)
    Description:
Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the weekend
{noformat}
http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 (refs/remotes/origin/branch_6x)
Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
{noformat}
The failure happened during the static setup of the test, when a MiniSolrCloudCluster & several clients are initialized -- before any code related to TolerantUpdateProcessor is ever used.
I can't reproduce this, or really make sense of what i'm (not) seeing here in the logs, so i'm filing this jira with my analysis in the hopes that someone else can help make sense of it.

  was:
Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the weekend
{noformat}
http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 (refs/remotes/origin/branch_6x)
Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
{noformat}
The failure happened during the static setup of the test, when a MiniSolrCloudCluster & several clients are initialized -- before any code related to TolerantUpdateProcessor is ever used.
I can't reproduce this, or really make sense of what i'm (not) seeing here in the logs, so i'm filing this jira with my analysis in the hopes that someone else can help make sense of it.


I'll take it. I set my test beasting script on it, and it comes out solid for me.

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>            Assignee: Mark Miller
>         Attachments: SOLR-8914.patch, SOLR-8914.patch, SOLR-8914.patch,
> SOLR-8914.patch, jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
[jira] [Updated] (SOLR-8914) ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-8914:
---------------------------
    Attachment: SOLR-8914.patch

i've updated the patch to clean up the test a bit -- besides some cosmetic stuff, it now does more iterations of smaller "bursts" with more variability in the number of threads used in each burst (which should increase the odds of it failing, eventually, on different machines regardless of CPU count).

bq. I'm beasting your latest patch too, I'll report anything that comes up. Just to make sure, I should be beasting StressTestLiveNodes, right?

TestStressLiveNodes, but otherwise yes.

It would also be helpful to know if (and how quickly) you can get TestStressLiveNodes to fail on your machine when beasting w/o the rest of the patch (so far i'm the only one that's been able to confirm the bug in practice w/o Scott's patch - hopefully these changes increase those odds)

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: SOLR-8914.patch, SOLR-8914.patch, SOLR-8914.patch,
> SOLR-8914.patch, jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
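The "bursts" idea described in this update can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code of TestStressLiveNodes: the class name BurstStressSketch, the runBursts method and its parameters are all hypothetical. Each burst picks a random thread count, parks the threads on a start gate so they hit the shared code at (nearly) the same moment, and waits for the burst to finish before starting the next one.

```java
import java.util.Random;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical harness; names and parameters are assumptions for illustration.
final class BurstStressSketch {

    // Runs `bursts` rounds; each round releases between 1 and
    // maxThreadsPerBurst threads at once. Returns how many times
    // `action` was scheduled in total.
    static int runBursts(int bursts, int maxThreadsPerBurst, long seed, Runnable action) {
        Random random = new Random(seed);
        ExecutorService pool = Executors.newCachedThreadPool();
        int total = 0;
        try {
            for (int b = 0; b < bursts; b++) {
                int threads = 1 + random.nextInt(maxThreadsPerBurst);
                CountDownLatch startGate = new CountDownLatch(1);
                CountDownLatch done = new CountDownLatch(threads);
                for (int t = 0; t < threads; t++) {
                    pool.execute(() -> {
                        try {
                            startGate.await();   // park until the whole burst is ready
                            action.run();
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        } finally {
                            done.countDown();
                        }
                    });
                }
                startGate.countDown();           // release the burst together
                if (!done.await(30, TimeUnit.SECONDS)) {
                    throw new IllegalStateException("burst " + b + " timed out");
                }
                total += threads;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdownNow();
        }
        return total;
    }
}
```

Varying the thread count per burst, rather than fixing it, is what lets the same test stay contentious on machines with different core counts; the timeout on each burst also turns a hang into a visible failure instead of a stuck run.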
[jira] [Updated] (SOLR-8914) ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-8914:
-----------------------------
    Attachment: SOLR-8914.patch

Updated patch, with a refinement of the channel formulation.

NOTE: I could not get TestStressLiveNodes to fail for me locally, but I believe this should fix the deadlock.

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: SOLR-8914.patch, SOLR-8914.patch, SOLR-8914.patch,
> jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
[jira] [Updated] (SOLR-8914) ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-8914:
---------------------------
    Attachment: SOLR-8914.patch

I wrote up a stress test to demonstrate the bug. I've added it to the patch Scott already worked up & attached.

Scott: Prior to incorporating your changes, hammering on this stress test would fail within the first 20 attempts. But with your changes I'm seeing deadlocks within the first 5 attempts every time i hammer on it...

{noformat}
Found one Java-level deadlock:
=============================
"zkCallback-7-thread-2-processing-n:127.0.0.1:48312_solr":
  waiting to lock monitor 0x7f82d40076b8 (object 0xff3b5b38, a java.lang.Object),
  which is held by "zkCallback-7-thread-1-processing-n:127.0.0.1:48312_solr"
"zkCallback-7-thread-1-processing-n:127.0.0.1:48312_solr":
  waiting to lock monitor 0x7f82d400be38 (object 0xff3b5800, a org.apache.solr.common.cloud.ZkStateReader),
  which is held by "OverseerStateUpdate-95637266046386179-127.0.0.1:48312_solr-n_00"
"OverseerStateUpdate-95637266046386179-127.0.0.1:48312_solr-n_00":
  waiting to lock monitor 0x7f82d40076b8 (object 0xff3b5b38, a java.lang.Object),
  which is held by "zkCallback-7-thread-1-processing-n:127.0.0.1:48312_solr"
{noformat}

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: SOLR-8914.patch, SOLR-8914.patch,
> jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
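The trace in this update shows a cycle between two monitors: one callback thread holds the java.lang.Object lock and wants the ZkStateReader monitor, while the overseer thread holds the ZkStateReader monitor and wants the java.lang.Object lock. One generic remedy for that shape of deadlock is to make every code path take the two monitors in the same order. The sketch below illustrates only that general technique; the class LockOrderSketch and the names readerMonitor, updateLock, callbackPath and overseerPath are hypothetical stand-ins, and nothing here is taken from the attached patches.

```java
// Hypothetical illustration of consistent lock ordering; not Solr code.
final class LockOrderSketch {
    private final Object readerMonitor = new Object(); // stands in for the ZkStateReader monitor
    private final Object updateLock = new Object();    // stands in for the java.lang.Object lock
    private int refreshes = 0;
    private int stateUpdates = 0;

    // Both paths take readerMonitor first, then updateLock. If one path
    // reversed the order, each thread could end up holding one monitor
    // while waiting for the other, which is the cycle shown in the trace.
    void callbackPath() {
        synchronized (readerMonitor) {
            synchronized (updateLock) {
                refreshes++;
            }
        }
    }

    void overseerPath() {
        synchronized (readerMonitor) {
            synchronized (updateLock) {
                stateUpdates++;
            }
        }
    }

    // Runs both paths concurrently and returns {refreshes, stateUpdates}.
    static int[] run(int iterations) {
        LockOrderSketch sketch = new LockOrderSketch();
        Thread callback = new Thread(() -> {
            for (int i = 0; i < iterations; i++) sketch.callbackPath();
        });
        Thread overseer = new Thread(() -> {
            for (int i = 0; i < iterations; i++) sketch.overseerPath();
        });
        callback.start();
        overseer.start();
        try {
            callback.join();
            overseer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new int[] {sketch.refreshes, sketch.stateUpdates};
    }

    public static void main(String[] args) {
        int[] counts = run(10_000);
        System.out.println(counts[0] + " " + counts[1]);
    }
}
```

A fixed acquisition order is only one option; the other common one is to avoid nesting altogether by never calling out to code that takes a second monitor while the first is held.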
[jira] [Updated] (SOLR-8914) ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Scott Blum updated SOLR-8914:
-----------------------------
    Attachment: SOLR-8914.patch

Here's a pretty simple patch that I believe should fix this. I think #3 is probably the safest option here.

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: SOLR-8914.patch,
> jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
[jira] [Updated] (SOLR-8914) ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
[ https://issues.apache.org/jira/browse/SOLR-8914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-8914:
---------------------------
    Description:
Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the weekend
{noformat}
http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 (refs/remotes/origin/branch_6x)
Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
{noformat}
The failure happened during the static setup of the test, when a MiniSolrCloudCluster & several clients are initialized -- before any code related to TolerantUpdateProcessor is ever used.
I can't reproduce this, or really make sense of what i'm (not) seeing here in the logs, so i'm filing this jira with my analysis in the hopes that someone else can help make sense of it.

  was:
Jenkins encountered a failure in TestTolerantUpdateProcessorCloud over the weekend
{noformat}
http://jenkins.thetaphi.de/job/Lucene-Solr-6.x-Solaris/32/consoleText
Checking out Revision c46d7686643e7503304cb35dfe546bce9c6684e7 (refs/remotes/origin/branch_6x)
Using Java: 64bit/jdk1.8.0 -XX:+UseCompressedOops -XX:+UseG1GC
{noformat}
The failure happened during the static setup of the test, when a MiniSolrCloudCluster & several clients are initialized -- before any code related to TolerantUpdateProcessor is ever used.
I can't reproduce this, or really make sense of what i'm (not) seeing here in the logs, so i'm filing this jira with my analysis in the hopes that someone else can help make sense of it.

        Summary: ZkStateReader's refreshLiveNodes(Watcher) is not thread safe  (was: inexplicable "no servers hosting shard: shard2" using MiniSolrCloudCluster)

Updating summary.

I suspect we either need to move the {{zkClient.getChildren(...)}} call inside the existing {{synchronized (getUpdateLock())}} block, or the entire {{refreshLiveNodes(Watcher watcher)}} method needs to synchronize on some new "liveNodesLock".

> ZkStateReader's refreshLiveNodes(Watcher) is not thread safe
> ------------------------------------------------------------
>
>                 Key: SOLR-8914
>                 URL: https://issues.apache.org/jira/browse/SOLR-8914
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Hoss Man
>         Attachments: jenkins.thetaphi.de_Lucene-Solr-6.x-Solaris_32.log.txt,
> live_node_mentions_port56361_with_threadIds.log.txt,
> live_nodes_mentions.log.txt
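The second option suggested above (synchronizing the whole refresh on a dedicated "liveNodesLock") can be sketched as below. This is a minimal illustration under assumptions, not Solr's code and not any of the attached patches: LiveNodesCache is a hypothetical class, and a plain Supplier stands in for the {{zkClient.getChildren(...)}} call. The point it shows is that when the fetch and the publish happen under one lock, a thread that fetched an older child list can no longer overwrite a newer list published by another thread.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical stand-in for a cached live-nodes view; not ZkStateReader.
final class LiveNodesCache {
    private final Object liveNodesLock = new Object();
    private final Supplier<Set<String>> fetchChildren; // assumed stand-in for the ZooKeeper read
    private volatile Set<String> liveNodes = Collections.emptySet();

    LiveNodesCache(Supplier<Set<String>> fetchChildren) {
        this.fetchChildren = fetchChildren;
    }

    // Fetch and publish as one atomic step. If the fetch happened outside
    // the lock, a slow thread could publish a stale snapshot last.
    void refresh() {
        synchronized (liveNodesLock) {
            Set<String> fresh = new HashSet<>(fetchChildren.get());
            liveNodes = Collections.unmodifiableSet(fresh);
        }
    }

    Set<String> getLiveNodes() {
        return liveNodes;
    }

    // Each thread registers one node and then refreshes, loosely imitating
    // a watcher firing after a change. Returns the final cached view.
    static Set<String> demo(int nodeCount) {
        Set<String> registry = ConcurrentHashMap.newKeySet();
        LiveNodesCache cache = new LiveNodesCache(() -> registry);
        Thread[] threads = new Thread[nodeCount];
        for (int i = 0; i < nodeCount; i++) {
            String node = "node" + i;
            threads[i] = new Thread(() -> {
                registry.add(node);
                cache.refresh();
            });
        }
        for (Thread t : threads) t.start();
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return cache.getLiveNodes();
    }

    public static void main(String[] args) {
        System.out.println(demo(50).size());
    }
}
```

Because refreshes are serialized and each one starts after its own registration, whichever refresh runs last sees every registered node, so the cached view ends up complete. The cost of this design is that the lock is held across the remote read, which is worth weighing against the first option before choosing between them.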