[jira] [Updated] (SOLR-12066) Autoscaling move replica can cause core initialization failure on the original JVM

2018-03-28 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-12066:

Attachment: SOLR-12066.patch

> Autoscaling move replica can cause core initialization failure on the 
> original JVM
> --
>
> Key: SOLR-12066
> URL: https://issues.apache.org/jira/browse/SOLR-12066
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling, SolrCloud
>Reporter: Varun Thacker
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-12066.patch, SOLR-12066.patch
>
>
> Initially, when SOLR-12047 was created, it looked like waiting for a state in
> ZK for only 3 seconds was the culprit for cores not loading up.
>  
> But it turns out to be something else. Here are the steps to reproduce the
> problem:
>  
>  - create a 3 node cluster
>  - create a 1 shard X 2 replica collection to use node1 and node2
> ( [http://localhost:8983/solr/admin/collections?action=create&name=test_node_lost&numShards=1&replicationFactor=2&autoAddReplicas=true] )
>  - stop node2: ./bin/solr stop -p 7574
>  - Solr will create a new replica on node3 after 30 seconds because of the 
> ".auto_add_replicas" trigger
>  - At this point state.json has info about replicas being on node1 and node3
>  - Start node2. Bam!
> {code:java}
> java.util.concurrent.ExecutionException: 
> org.apache.solr.common.SolrException: Unable to create core 
> [test_node_lost_shard1_replica_n2]
> ...
> Caused by: org.apache.solr.common.SolrException: Unable to create core 
> [test_node_lost_shard1_replica_n2]
> at 
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1053)
> ...
> Caused by: org.apache.solr.common.SolrException: 
> at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1619)
> at 
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1030)
> ...
> Caused by: org.apache.solr.common.SolrException: coreNodeName core_node4 does 
> not exist in shard shard1: 
> DocCollection(test_node_lost//collections/test_node_lost/state.json/12)={
> ...{code}
>  
> The practical effect of this is not big, since the move replica has already
> put the replica on another JVM. But to the user it's super confusing what's
> happening, and they can never get rid of this error unless they manually
> clean up the data directory on node2 and restart.
>  
> Please note: I chose autoAddReplicas=true to reproduce this, but a user
> could be using a node lost trigger and run into the same issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-12066) Autoscaling move replica can cause core initialization failure on the original JVM

2018-03-27 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-12066:

Attachment: SOLR-12066.patch

> Autoscaling move replica can cause core initialization failure on the 
> original JVM
> --
>
> Key: SOLR-12066
> URL: https://issues.apache.org/jira/browse/SOLR-12066
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling, SolrCloud
>Reporter: Varun Thacker
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-12066.patch






[jira] [Updated] (SOLR-12066) Autoscaling move replica can cause core initialization failure on the original JVM

2018-03-27 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-12066:

Attachment: (was: SOLR-12066)

> Autoscaling move replica can cause core initialization failure on the 
> original JVM
> --
>
> Key: SOLR-12066
> URL: https://issues.apache.org/jira/browse/SOLR-12066
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling, SolrCloud
>Reporter: Varun Thacker
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-12066.patch






[jira] [Updated] (SOLR-12066) Autoscaling move replica can cause core initialization failure on the original JVM

2018-03-27 Thread Cao Manh Dat (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cao Manh Dat updated SOLR-12066:

Attachment: SOLR-12066

> Autoscaling move replica can cause core initialization failure on the 
> original JVM
> --
>
> Key: SOLR-12066
> URL: https://issues.apache.org/jira/browse/SOLR-12066
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling, SolrCloud
>Reporter: Varun Thacker
>Priority: Major
> Fix For: 7.4, master (8.0)
>
> Attachments: SOLR-12066






[jira] [Updated] (SOLR-12066) Autoscaling move replica can cause core initialization failure on the original JVM

2018-03-13 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-12066:
-
Fix Version/s: master (8.0)
   7.4

> Autoscaling move replica can cause core initialization failure on the 
> original JVM
> --
>
> Key: SOLR-12066
> URL: https://issues.apache.org/jira/browse/SOLR-12066
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling, SolrCloud
>Reporter: Varun Thacker
>Priority: Major
> Fix For: 7.4, master (8.0)






[jira] [Updated] (SOLR-12066) Autoscaling move replica can cause core initialization failure on the original JVM

2018-03-13 Thread Shalin Shekhar Mangar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shalin Shekhar Mangar updated SOLR-12066:
-
Component/s: SolrCloud
 AutoScaling

> Autoscaling move replica can cause core initialization failure on the 
> original JVM
> --
>
> Key: SOLR-12066
> URL: https://issues.apache.org/jira/browse/SOLR-12066
> Project: Solr
>  Issue Type: Bug
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: AutoScaling, SolrCloud
>Reporter: Varun Thacker
>Priority: Major






[jira] [Updated] (SOLR-12066) Autoscaling move replica can cause core initialization failure on the original JVM

2018-03-07 Thread Varun Thacker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-12066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Varun Thacker updated SOLR-12066:
-
Description: 
Initially, when SOLR-12047 was created, it looked like waiting for a state in
ZK for only 3 seconds was the culprit for cores not loading up.

But it turns out to be something else. Here are the steps to reproduce the
problem (a scripted version of these steps follows the stack trace):

 
 - create a 3 node cluster
 - create a 1 shard X 2 replica collection to use node1 and node2
( [http://localhost:8983/solr/admin/collections?action=create&name=test_node_lost&numShards=1&replicationFactor=2&autoAddReplicas=true] )
 - stop node2: ./bin/solr stop -p 7574
 - Solr will create a new replica on node3 after 30 seconds because of the 
".auto_add_replicas" trigger
 - At this point state.json has info about replicas being on node1 and node3
 - Start node2. Bam!
{code:java}
java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: 
Unable to create core [test_node_lost_shard1_replica_n2]
...
Caused by: org.apache.solr.common.SolrException: Unable to create core 
[test_node_lost_shard1_replica_n2]
at 
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1053)
...
Caused by: org.apache.solr.common.SolrException: 
at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1619)
at 
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1030)
...
Caused by: org.apache.solr.common.SolrException: coreNodeName core_node4 does 
not exist in shard shard1: 
DocCollection(test_node_lost//collections/test_node_lost/state.json/12)={
...{code}
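
For convenience, here is the same reproduction as a shell sketch. It assumes
the stock "bin/solr start -e cloud" example layout (node1 on port 8983, node2
on 7574, embedded ZooKeeper on 9983) plus a third node already running; adjust
ports and paths for other setups.
{code:bash}
#!/usr/bin/env bash
# Reproduction sketch; assumes the standard "bin/solr start -e cloud" example,
# with node1 on port 8983, node2 on port 7574, embedded ZooKeeper on 9983,
# and a third node (node3) already running in the cluster.

# 1 shard x 2 replicas placed on node1 and node2, with autoAddReplicas enabled
curl "http://localhost:8983/solr/admin/collections?action=create&name=test_node_lost&numShards=1&replicationFactor=2&autoAddReplicas=true"

# Stop node2; after ~30 seconds the .auto_add_replicas trigger creates a
# replacement replica on node3
./bin/solr stop -p 7574
sleep 60

# state.json should now show replicas on node1 and node3 only
curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=test_node_lost"

# Restarting node2 hits the "coreNodeName core_node4 does not exist" failure
./bin/solr start -c -p 7574 -s example/cloud/node2/solr -z localhost:9983
{code}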

 

The practical effect of this is not big, since the move replica has already
put the replica on another JVM. But to the user it's super confusing what's
happening, and they can never get rid of this error unless they manually clean
up the data directory on node2 and restart.
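
Concretely, the manual workaround looks something like this (the core
directory name is taken from the stack trace above; the path assumes the same
cloud example layout):
{code:bash}
# Stop node2, delete the stale core directory left behind by the move,
# then restart; path assumes the "bin/solr start -e cloud" example layout.
./bin/solr stop -p 7574
rm -rf example/cloud/node2/solr/test_node_lost_shard1_replica_n2
./bin/solr start -c -p 7574 -s example/cloud/node2/solr -z localhost:9983
{code}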

 

Please note: I chose autoAddReplicas=true to reproduce this, but a user could
be using a node lost trigger and run into the same issue.
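
For reference, an equivalent explicit node lost trigger could be registered
through the autoscaling API along these lines (the trigger name and waitFor
value here are illustrative, not taken from this issue):
{code:bash}
# Sketch of an explicit nodeLost trigger via the autoscaling API;
# trigger name and waitFor value are illustrative.
curl -X POST -H 'Content-Type: application/json' \
  http://localhost:8983/solr/admin/autoscaling -d '{
  "set-trigger": {
    "name": "node_lost_trigger",
    "event": "nodeLost",
    "waitFor": "30s",
    "enabled": true,
    "actions": [
      {"name": "compute_plan", "class": "solr.ComputePlanAction"},
      {"name": "execute_plan", "class": "solr.ExecutePlanAction"}
    ]
  }
}'
{code}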

  was:
Initially, when SOLR-12047 was created, it looked like waiting for a state in
ZK for only 3 seconds was the culprit for cores not loading up.

But it turns out to be something else. Here are the steps to reproduce the
problem:

 
 - create a 3 node cluster
 - create a 1 shard X 2 replica collection to use node1 and node2
( [http://localhost:8983/solr/admin/collections?action=create&name=test_node_lost&numShards=1&replicationFactor=2&autoAddReplicas=true] )
 - stop node2: ./bin/solr stop -p 7574
 - Solr will create a new replica on node3 after 30 seconds because of the 
".auto_add_replicas" trigger
 - At this point state.json has info about replicas being on node1 and node3
 - Start node2. Bam!
{code:java}
java.util.concurrent.ExecutionException: org.apache.solr.common.SolrException: 
Unable to create core [test_node_lost_shard1_replica_n2]
...
Caused by: org.apache.solr.common.SolrException: Unable to create core 
[test_node_lost_shard1_replica_n2]
at 
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1053)
...
Caused by: org.apache.solr.common.SolrException: 
at org.apache.solr.cloud.ZkController.preRegister(ZkController.java:1619)
at 
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1030)
...
Caused by: org.apache.solr.common.SolrException: coreNodeName core_node4 does 
not exist in shard shard1: 
DocCollection(test_node_lost//collections/test_node_lost/state.json/12)={
...{code}

 

The practical effect of this is not big, since the move replica has already
put the replica on another JVM. But to the user it's super confusing what's
happening, and they can never get rid of this error unless they manually clean
up the data directory on node2 and restart.

