[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-15 Thread Dmitry Sherstobitov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581164#comment-16581164
 ] 

Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/15/18 2:44 PM:
--

The following set of caches leads to the bug in my test :) All these caches are 
unable to change their JMX properties after clients connect/disconnect.

I'm still trying to reduce this list, but for now this is the final set:
 
{code:xml}
<!-- cache configurations omitted -->
{code}

This is the log from the test framework's output:
{code}
Current metric state for cache cache_group_1_028 on node 2: 19
Current metric state for cache cache_group_2_058 on node 2: 32
Current metric state for cache cache_group_5 on node 2: 128
Current metric state for cache cache_group_5 on node 2: 128
Current metric state for cache cache_group_4 on node 2: 512
Current metric state for cache cache_group_4_118 on node 2: 32
Current metric state for cache cache_group_6 on node 2: 64
Current metric state for cache cache_group_2_031 on node 2: 512
Current metric state for cache cache_group_6 on node 2: 64
[17:43:27][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
Current metric state for cache cache_group_2_058 on node 2: 32
Current metric state for cache cache_group_5 on node 2: 128
Current metric state for cache cache_group_5 on node 2: 128
Current metric state for cache cache_group_4 on node 2: 512
Current metric state for cache cache_group_4_118 on node 2: 32
Current metric state for cache cache_group_6 on node 2: 64
Current metric state for cache cache_group_2_031 on node 2: 512
Current metric state for cache cache_group_6 on node 2: 64
{code}
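
For reference, the metric quoted in this log is read over JMX. Below is a minimal 
Java sketch of such a poll; the "Cache groups" MBean group filter and the 
LocalNodeMovingPartitionsCount attribute name are assumptions based on the metric 
named in this thread, not the actual code of the test framework.
{code:java}
import java.lang.management.ManagementFactory;

import javax.management.MBeanServer;
import javax.management.ObjectName;

/** Prints LocalNodeMovingPartitionsCount for every cache-group MBean found in this JVM. */
public class MovingPartitionsPoll {
    public static void main(String[] args) throws Exception {
        MBeanServer srv = ManagementFactory.getPlatformMBeanServer();

        for (ObjectName name : srv.queryNames(null, null)) {
            String grp = name.getKeyProperty("group");

            // Assumption: Ignite registers cache group metrics under group="Cache groups".
            if (grp == null || !grp.replace("\"", "").equals("Cache groups"))
                continue;

            Object moving = srv.getAttribute(name, "LocalNodeMovingPartitionsCount");

            System.out.println("Current metric state for cache " +
                name.getKeyProperty("name") + ": " + moving);
        }
    }
}
{code}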


was (Author: qvad):
The following set of caches leads to the bug in my test :) All these caches are 
unable to change their JMX properties after clients connect/disconnect.

I'm still trying to reduce this list, but for now this is the final set:
 
{code:xml}
<!-- cache configurations omitted -->
{code}

> Re-balancing is cancelled if client node joins
> --
>
> Key: IGNITE-7165
> URL: https://issues.apache.org/jira/browse/IGNITE-7165
> Project: Ignite
>  Issue Type: Bug
>Reporter: Mikhail Cherkasov
>Assignee: Maxim Muzafarov
>Priority: Critical
>  Labels: rebalance
> Fix For: 2.7
>
> Attachments: node-2-jstack.log, node-NO_REBALANCE-7165.log
>
>
> Re-balancing is canceled if client node joins. Re-balancing can take hours 
> and each time when client node joins it starts again:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Added new node to topology: TcpDiscoveryNode 
> [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 
> 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0, 
> /172.31.16.213:0], discPort=0, order=36, intOrder=24, 
> 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-15 Thread Dmitry Sherstobitov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16581164#comment-16581164
 ] 

Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/15/18 2:42 PM:
--

The following set of caches leads to the bug in my test :) All these caches are 
unable to change their JMX properties after clients connect/disconnect.

I'm still trying to reduce this list, but for now this is the final set:
 
{code:xml}
<!-- cache configurations omitted -->
{code}


was (Author: qvad):
The following set of caches leads to the bug in my test :)

I'm still trying to reduce this list, but for now this is the final set:
 
{code:xml}
<!-- cache configurations omitted -->
{code}

> Re-balancing is cancelled if client node joins
> --
>
> Key: IGNITE-7165
> URL: https://issues.apache.org/jira/browse/IGNITE-7165
> Project: Ignite
>  Issue Type: Bug
>Reporter: Mikhail Cherkasov
>Assignee: Maxim Muzafarov
>Priority: Critical
>  Labels: rebalance
> Fix For: 2.7
>
> Attachments: node-2-jstack.log, node-NO_REBALANCE-7165.log
>
>
> Re-balancing is canceled if client node joins. Re-balancing can take hours 
> and each time when client node joins it starts again:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Added new node to topology: TcpDiscoveryNode 
> [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 
> 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0, 
> /172.31.16.213:0], discPort=0, order=36, intOrder=24, 
> lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, 
> isClient=true]
> [15:10:05,701][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Topology snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef, 
> customEvt=null, allowMerge=true]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Finish exchange future [startVer=AffinityTopologyVersion [topVer=36, 
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> err=null]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Finished 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false]
> [15:10:05,703][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion 
> [topVer=36, minorTopVer=0], evt=NODE_JOINED, 
> node=979cf868-1c37-424a-9ad1-12db501f32ef]
> [15:10:08,706][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion 
> [topVer=35, minorTopVer=0]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-15 Thread Dmitry Sherstobitov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580970#comment-16580970
 ] 

Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/15/18 12:15 PM:
---

[~Mmuzaf]

Config:
(UPD: Disabling persistence solves the problem)
{code:java}
// node configuration omitted
{code}
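
Since the UPD above says disabling persistence makes the problem go away, here is a 
minimal sketch (an assumption, not the exact test configuration) of the data-storage 
setting in question: Ignite native persistence enabled on the default data region.
{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

/** Starts a server node with native persistence enabled on the default data region. */
public class PersistentServerStart {
    public static void main(String[] args) {
        DataStorageConfiguration storageCfg = new DataStorageConfiguration()
            .setDefaultDataRegionConfiguration(
                // Per the UPD note, the stuck metric is not observed when this is false.
                new DataRegionConfiguration().setPersistenceEnabled(true));

        IgniteConfiguration cfg = new IgniteConfiguration()
            .setIgniteInstanceName("server")
            .setDataStorageConfiguration(storageCfg);

        Ignite ignite = Ignition.start(cfg);

        // A persistent cluster starts inactive and must be activated before caches are usable.
        ignite.cluster().active(true);
    }
}
{code}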
Test code:
{code:java}
def test_blinking_clients_clean_lfs(self):
    """
    IGNITE-7165
    """
    self.wait_for_running_clients_num(client_num=0, timeout=120)

    self.start_grid()  # start 4 nodes

    for _ in range(0, 10):
        log_print("Iteration %s" % str(_))

        # check that no nodes left the grid because of FailHandler
        self.assert_nodes_alive()

        self.ignite.kill_node(2)
        self._cleanup_lfs(2)
        self.ignite.start_node(2)

        # start Ignition.start() with client config and do nothing, 3 times
        with PiClient(self.ignite, self.get_client_config()):
            pass

        with PiClient(self.ignite, self.get_client_config()):
            pass

        with PiClient(self.ignite, self.get_client_config()):
            pass

        # check the LocalNodeMovingPartitionsCount metric for all cache groups
        # in the cluster and wait until it is 0 for every group
        self.wait_for_finish_rebalance()
{code}
self.start_grid() starts a real grid on distributed servers using the ignite.sh 
scripts. The with PiClient block starts a JVM and runs Ignition.start() with the 
client config (the major difference from the server config is clientMode=true).

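For readers without the test harness, the PiClient block above is roughly equivalent 
to the following Java sketch (my assumption based on the description, not PiClient's 
actual source): a client-mode node joins the cluster, does nothing and leaves.
{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

/** Joins the cluster as a client node, does nothing and leaves, three times in a row. */
public class BlinkingClient {
    public static void main(String[] args) {
        for (int i = 0; i < 3; i++) {
            IgniteConfiguration cfg = new IgniteConfiguration()
                .setIgniteInstanceName("blinking-client-" + i)
                .setClientMode(true); // the only major difference from the server config

            try (Ignite client = Ignition.start(cfg)) {
                // no operations: the client join/leave events alone are enough to hit the issue
            }
        }
    }
}
{code}
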
The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:java}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:xml}
<!-- cache configuration omitted -->
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]


was (Author: qvad):
[~Mmuzaf]

Config:
{code:java}
// node configuration omitted
{code}
Test code:
{code}
def test_blinking_clients_clean_lfs(self):
"""
IGNITE-7165
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
self.start_grid() starts a real grid on distributed servers using the ignite.sh 
scripts. The with PiClient block starts a JVM and runs Ignition.start() with the 
client config (the major difference from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:xml}
<!-- cache configuration omitted -->
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]

> 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-15 Thread Dmitry Sherstobitov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580970#comment-16580970
 ] 

Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/15/18 12:15 PM:
---

[~Mmuzaf]

Config:
(UPD: Disabling persistence solves the problem)
{code:java}
// node configuration omitted
{code}
Test code:
{code:java}
def test_blinking_clients_clean_lfs(self):
"""
IGNITE-7165
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
self.start_grid() starts a real grid on distributed servers using the ignite.sh 
scripts. The with PiClient block starts a JVM and runs Ignition.start() with the 
client config (the major difference from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:java}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:xml}
<!-- cache configuration omitted -->
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]


was (Author: qvad):
[~Mmuzaf]

Config:
(UPD: Disabling persistence solves the problem)
{code:java}
// node configuration omitted
{code}
Test code:
{code:java}
def test_blinking_clients_clean_lfs(self):
"""
IGNITE-7165
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
self.start_grid() starts a real grid on distributed servers using the ignite.sh 
scripts. The with PiClient block starts a JVM and runs Ignition.start() with the 
client config (the major difference from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:java}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:xml}
<!-- cache configuration omitted -->
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-15 Thread Dmitry Sherstobitov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580970#comment-16580970
 ] 

Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/15/18 12:04 PM:
---

[~Mmuzaf]

Config:
{code:java}
// node configuration omitted
{code}
Test code:
{code}
def test_blinking_clients_clean_lfs(self):
"""
IGNITE-7165
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
self.start_grid() starts a real grid on distributed servers using the ignite.sh 
scripts. The with PiClient block starts a JVM and runs Ignition.start() with the 
client config (the major difference from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:xml}
<!-- cache configuration omitted -->
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]


was (Author: qvad):
[~Mmuzaf]

Config:
{code:java}
// node configuration omitted
{code}

Test code:
{code:python}
def test_blinking_clients_clean_lfs(self):
"""
IGNITE-7165
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
Here is the code from our Python test. self.start_grid() starts a real grid on 
distributed servers using the ignite.sh scripts. The with PiClient block starts 
a JVM and runs Ignition.start() with the client config (the major difference 
from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:sh}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:xml}
<!-- cache configuration omitted -->
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]

> Re-balancing is cancelled if client node joins
> 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-15 Thread Dmitry Sherstobitov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580970#comment-16580970
 ] 

Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/15/18 12:03 PM:
---

[~Mmuzaf]

Config:
{code:java}
// node configuration omitted
{code}

Test code:
{code:python}
def test_blinking_clients_clean_lfs(self):
"""
IGNITE-7165
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
Here is the code from our Python test. self.start_grid() starts a real grid on 
distributed servers using the ignite.sh scripts. The with PiClient block starts 
a JVM and runs Ignition.start() with the client config (the major difference 
from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:sh}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:xml}
<!-- cache configuration omitted -->
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]


was (Author: qvad):
{code:java}
 {code}
[~Mmuzaf]
{code:java}
// node configuration omitted
{code}
{code:java}
def test_blinking_clients_clean_lfs(self):
"""
IGNITE-7165
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
Here is the code from our Python test. self.start_grid() starts a real grid on 
distributed servers using the ignite.sh scripts. The with PiClient block starts 
a JVM and runs Ignition.start() with the client config (the major difference 
from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:java}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:java}
// cache configuration omitted
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]

> Re-balancing is 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-15 Thread Dmitry Sherstobitov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580970#comment-16580970
 ] 

Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/15/18 12:02 PM:
---

{code:java}
 {code}
[~Mmuzaf]
{code:java}
// node configuration omitted
{code}
{code:java}
def test_blinking_clients_clean_lfs(self):
"""
IGNITE-7165
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
Here is the code from our Python test. self.start_grid() starts a real grid on 
distributed servers using the ignite.sh scripts. The with PiClient block starts 
a JVM and runs Ignition.start() with the client config (the major difference 
from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:java}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:java}
// cache configuration omitted
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]


was (Author: qvad):
{code:java}
// node configuration omitted
{code}
{code:java}
def test_blinking_clients_clean_lfs(self):
"""
IGNITE-7165
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
Here is the code from our Python test. self.start_grid() starts a real grid on 
distributed servers using the ignite.sh scripts. The with PiClient block starts 
a JVM and runs Ignition.start() with the client config (the major difference 
from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:java}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:java}
// cache configuration omitted
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]

> Re-balancing is cancelled if client node joins
> 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-15 Thread Dmitry Sherstobitov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580970#comment-16580970
 ] 

Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/15/18 12:02 PM:
---

{code:java}
 {code}
[~Mmuzaf]
{code:java}
// node configuration omitted
{code}
{code:java}
def test_blinking_clients_clean_lfs(self):
"""
IGNITE-7165
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
Here is the code from our Python test. self.start_grid() starts a real grid on 
distributed servers using the ignite.sh scripts. The with PiClient block starts 
a JVM and runs Ignition.start() with the client config (the major difference 
from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:java}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:java}
// cache configuration omitted
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]


was (Author: qvad):
{code:java}
 {code}
[~Mmuzaf]
{code:java}
// node configuration omitted
{code}
{code:java}
def test_blinking_clients_clean_lfs(self):
"""
IGNITE-7165
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
Here is the code from our Python test. self.start_grid() starts a real grid on 
distributed servers using the ignite.sh scripts. The with PiClient block starts 
a JVM and runs Ignition.start() with the client config (the major difference 
from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:java}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:java}
// cache configuration omitted
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]

> Re-balancing is 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-15 Thread Dmitry Sherstobitov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580970#comment-16580970
 ] 

Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/15/18 12:01 PM:
---

{code:java}
// node configuration omitted
{code}
{code:java}
def test_blinking_clients_clean_lfs(self):
"""
IGNITE-7165
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
Here is the code from our Python test. self.start_grid() starts a real grid on 
distributed servers using the ignite.sh scripts. The with PiClient block starts 
a JVM and runs Ignition.start() with the client config (the major difference 
from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:java}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:java}
// cache configuration omitted
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]


was (Author: qvad):
{code:java}
// node configuration omitted
{code}
{code:java}
def test_blinking_clients_clean_lfs(self):
"""
IGN-9159 (IGNITE-7165)
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
Here is the code from our Python test. self.start_grid() starts a real grid on 
distributed servers using the ignite.sh scripts. The with PiClient block starts 
a JVM and runs Ignition.start() with the client config (the major difference 
from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:java}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:java}
// cache configuration omitted
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]

> Re-balancing is cancelled if client node joins
> 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-15 Thread Dmitry Sherstobitov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16580970#comment-16580970
 ] 

Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/15/18 11:54 AM:
---

{code:java}
// node configuration omitted
{code}
{code:java}
def test_blinking_clients_clean_lfs(self):
"""
IGN-9159 (IGNITE-7165)
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
Here is the code from our Python test. self.start_grid() starts a real grid on 
distributed servers using the ignite.sh scripts. The with PiClient block starts 
a JVM and runs Ignition.start() with the client config (the major difference 
from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:java}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:java}
<!-- cache configuration omitted -->
{code}
I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]


was (Author: qvad):
{code:java}
def test_blinking_clients_clean_lfs(self):
"""
IGN-9159 (IGNITE-7165)
"""
self.wait_for_running_clients_num(client_num=0, timeout=120)

self.start_grid() # start 4 nodes

for _ in range(0, 10):
log_print("Iteration %s" % str(_))

self.assert_nodes_alive() # check that no nodes left grid because of 
FailHandler

self.ignite.kill_node(2)
self._cleanup_lfs(2)
self.ignite.start_node(2)

# start Ignition.start() with client config and do nothing 3 times
with PiClient(self.ignite, self.get_client_config()):
pass 

with PiClient(self.ignite, self.get_client_config()):
pass

with PiClient(self.ignite, self.get_client_config()):
pass

# check LocalNodeMovingPartitionsCount metric for all cache groups in 
cluster
# wait that for all cache groups this value will be 0
self.wait_for_finish_rebalance(){code}
Here is the code from our Python test. self.start_grid() starts a real grid on 
distributed servers using the ignite.sh scripts. The with PiClient block starts 
a JVM and runs Ignition.start() with the client config (the major difference 
from the server config is clientMode=true).

The log file of this test contains the following information: the metric does 
not change its state within 240 seconds on current master. (I've recently 
checked this on the 15 Aug nightly build.)
{code:java}
Current metric state for cache cache_group_1_028 on node 2: 19
[14:44:58][:568 :617] Wait rebalance to finish 7/240
Current metric state for cache cache_group_1_028 on node 2: 19
[14:45:04][:568 :617] Wait rebalance to finish 13/240
Current metric state for cache cache_group_1_028 on node 2: 19

[14:48:47][:568 :617] Wait rebalance to finish 236/240
Current metric state for cache cache_group_1_028 on node 2: 19{code}
Config of the cache that fails:
{code:java}
// cache configuration omitted
{code}

I'm afraid that this is all the information I can provide for now. I've 
attached the jstack from node 2: [^node-2-jstack.log]

> Re-balancing is cancelled if client node joins
> --
>
> Key: IGNITE-7165
> URL: https://issues.apache.org/jira/browse/IGNITE-7165
>   

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-14 Thread Maxim Muzafarov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579688#comment-16579688
 ] 

Maxim Muzafarov edited comment on IGNITE-7165 at 8/14/18 3:24 PM:
--

[~qvad]

I've checked the logs you provided. Actually, it's a common case for the 
rebalancing procedure, and it completes successfully (according to your logs).
We have a lot of tests covering this, e.g.:
* {{IgniteCacheGroupsTest#testRestartsAndCacheCreateDestroy}} – 10 caches, 10 
nodes (server + client), random put/get operations on the caches.
* 
{{CacheLateAffinityAssignmentTest#testConcurrentStartStaticCachesWithClientNodes}}

So, probably, your issue is about {{LocalNodeMovingPartitionsCount}} metrics 
propagation with client nodes.
I can check it additionally, but it would help me a lot if you provided info 
about your test suite.


was (Author: mmuzaf):
Dmitry,

I've checked the logs you provided. Actually, it's a common case for the 
rebalancing procedure, and it completes successfully (according to your logs).
We have a lot of tests covering this, e.g.:
* {{IgniteCacheGroupsTest#testRestartsAndCacheCreateDestroy}} – 10 caches, 10 
nodes (server + client), random put/get operations on the caches.
* 
{{CacheLateAffinityAssignmentTest#testConcurrentStartStaticCachesWithClientNodes}}

So, probably, your issue is about {{LocalNodeMovingPartitionsCount}} metrics 
propagation with client nodes.
I can check it additionally, but it would help me a lot if you provided info 
about your test suite.

> Re-balancing is cancelled if client node joins
> --
>
> Key: IGNITE-7165
> URL: https://issues.apache.org/jira/browse/IGNITE-7165
> Project: Ignite
>  Issue Type: Bug
>Reporter: Mikhail Cherkasov
>Assignee: Maxim Muzafarov
>Priority: Critical
>  Labels: rebalance
> Fix For: 2.7
>
> Attachments: node-NO_REBALANCE-7165.log
>
>
> Re-balancing is canceled if client node joins. Re-balancing can take hours 
> and each time when client node joins it starts again:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Added new node to topology: TcpDiscoveryNode 
> [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 
> 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0, 
> /172.31.16.213:0], discPort=0, order=36, intOrder=24, 
> lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, 
> isClient=true]
> [15:10:05,701][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Topology snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef, 
> customEvt=null, allowMerge=true]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Finish exchange future [startVer=AffinityTopologyVersion [topVer=36, 
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> err=null]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Finished 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false]
> [15:10:05,703][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion 
> [topVer=36, minorTopVer=0], evt=NODE_JOINED, 
> node=979cf868-1c37-424a-9ad1-12db501f32ef]
> [15:10:08,706][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion 
> [topVer=35, minorTopVer=0]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing scheduled [order=[statementp]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing started [top=null, evt=NODE_JOINED, 
> node=a8be3c14-9add-48c3-b099-3fd304cfdbf4]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=2f6bde48-ffb5-4815-bd32-df4e57dc13e0, partitionsCount=18, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=35d01141-4dce-47dd-adf6-a4f3b2bb9da9, partitionsCount=15, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=b3a8be53-e61f-4023-a906-a265923837ba, 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-14 Thread Maxim Muzafarov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579688#comment-16579688
 ] 

Maxim Muzafarov edited comment on IGNITE-7165 at 8/14/18 11:57 AM:
---

Dmitry,

I've checked the logs you provided. Actually, it's a common case for the 
rebalancing procedure, and it completes successfully (according to your logs).
We have a lot of tests covering this, e.g.:
* {{IgniteCacheGroupsTest#testRestartsAndCacheCreateDestroy}} – 10 caches, 10 
nodes (server + client), random put/get operations on the caches.
* 
{{CacheLateAffinityAssignmentTest#testConcurrentStartStaticCachesWithClientNodes}}

So, probably, your issue is about {{LocalNodeMovingPartitionsCount}} metrics 
propagation with client nodes.
I can check it additionally, but it would help me a lot if you provided info 
about your test suite.


was (Author: mmuzaf):
Dmitry,

I've checked the logs you provided. Actually, it's a common case for the 
rebalancing procedure, and it completes successfully (according to your logs).
We have a lot of tests covering this, e.g.:
* {{IgniteCacheGroupsTest#testRestartsAndCacheCreateDestroy}} – 10 caches, 10 
nodes (server + client), random put/get operations on the caches.
* 
{{CacheLateAffinityAssignmentTest#testConcurrentStartStaticCachesWithClientNodes}}

So, probably, your issue is about {{LocalNodeMovingPartitionsCount}} metrics 
propagation for client nodes.
I can check it additionally, but it would help me a lot if you provided info 
about your test suite.

> Re-balancing is cancelled if client node joins
> --
>
> Key: IGNITE-7165
> URL: https://issues.apache.org/jira/browse/IGNITE-7165
> Project: Ignite
>  Issue Type: Bug
>Reporter: Mikhail Cherkasov
>Assignee: Maxim Muzafarov
>Priority: Critical
>  Labels: rebalance
> Fix For: 2.7
>
> Attachments: node-NO_REBALANCE-7165.log
>
>
> Re-balancing is canceled if client node joins. Re-balancing can take hours 
> and each time when client node joins it starts again:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Added new node to topology: TcpDiscoveryNode 
> [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 
> 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0, 
> /172.31.16.213:0], discPort=0, order=36, intOrder=24, 
> lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, 
> isClient=true]
> [15:10:05,701][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Topology snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef, 
> customEvt=null, allowMerge=true]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Finish exchange future [startVer=AffinityTopologyVersion [topVer=36, 
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> err=null]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Finished 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false]
> [15:10:05,703][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion 
> [topVer=36, minorTopVer=0], evt=NODE_JOINED, 
> node=979cf868-1c37-424a-9ad1-12db501f32ef]
> [15:10:08,706][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion 
> [topVer=35, minorTopVer=0]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing scheduled [order=[statementp]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing started [top=null, evt=NODE_JOINED, 
> node=a8be3c14-9add-48c3-b099-3fd304cfdbf4]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=2f6bde48-ffb5-4815-bd32-df4e57dc13e0, partitionsCount=18, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=35d01141-4dce-47dd-adf6-a4f3b2bb9da9, partitionsCount=15, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=b3a8be53-e61f-4023-a906-a265923837ba, 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-08-14 Thread Dmitry Sherstobitov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579657#comment-16579657
 ] 

Dmitry Sherstobitov edited comment on IGNITE-7165 at 8/14/18 11:22 AM:
---

For now, I have no reproducer in Java.
I've investigated the persistent store in my test and found that the rebalanced 
data is present in storage on the node with the cleared LFS, but the 
LocalNodeMovingPartitionsCount metric is definitely broken after a client node 
joins the cluster. If I remove the client join event after the node is back, 
rebalancing finishes correctly.

Here is an excerpt from my test log (rebalancing didn't finish in 240 seconds, 
while in previous versions it finished in 10-15 seconds):

[13:14:17][:568 :617] Wait rebalance to finish 8/240 Current metric state for 
cache cache_group_3_088 on node 2: 19

[13:18:04][:568 :617] Wait rebalance to finish 235/240 Current metric state for 
cache cache_group_3_088 on node 2: 19

P.S. The test runs on a distributed environment, not on a single machine.


was (Author: qvad):
For now, I have no reproducer in Java.
I've investigated the persistent store in my test and found that the rebalanced 
data is present in storage on the node with the cleared LFS, but the 
LocalNodeMovingPartitionsCount metric is definitely broken after a client node 
joins the cluster. If I remove the client join event after the node is back, 
rebalancing finishes correctly.

Here is an excerpt from my test log (rebalancing didn't finish in 240 seconds, 
while in previous versions it finished in 10-15 seconds):

[13:14:17][:568 :617] Wait rebalance to finish 8/240 Current metric state for 
cache cache_group_3_088 on node 2: 19

[13:18:04][:568 :617] Wait rebalance to finish 235/240 Current metric state for 
cache cache_group_3_088 on node 2: 19

> Re-balancing is cancelled if client node joins
> --
>
> Key: IGNITE-7165
> URL: https://issues.apache.org/jira/browse/IGNITE-7165
> Project: Ignite
>  Issue Type: Bug
>Reporter: Mikhail Cherkasov
>Assignee: Maxim Muzafarov
>Priority: Critical
>  Labels: rebalance
> Fix For: 2.7
>
> Attachments: node-NO_REBALANCE-7165.log
>
>
> Re-balancing is canceled if client node joins. Re-balancing can take hours 
> and each time when client node joins it starts again:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Added new node to topology: TcpDiscoveryNode 
> [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 
> 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0, 
> /172.31.16.213:0], discPort=0, order=36, intOrder=24, 
> lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, 
> isClient=true]
> [15:10:05,701][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Topology snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef, 
> customEvt=null, allowMerge=true]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Finish exchange future [startVer=AffinityTopologyVersion [topVer=36, 
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> err=null]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Finished 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false]
> [15:10:05,703][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion 
> [topVer=36, minorTopVer=0], evt=NODE_JOINED, 
> node=979cf868-1c37-424a-9ad1-12db501f32ef]
> [15:10:08,706][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion 
> [topVer=35, minorTopVer=0]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing scheduled [order=[statementp]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing started [top=null, evt=NODE_JOINED, 
> node=a8be3c14-9add-48c3-b099-3fd304cfdbf4]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=2f6bde48-ffb5-4815-bd32-df4e57dc13e0, partitionsCount=18, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-07-31 Thread Maxim Muzafarov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16563969#comment-16563969
 ] 

Maxim Muzafarov edited comment on IGNITE-7165 at 7/31/18 4:53 PM:
--

*Improvements created*

IGNITE-9149 – Get rid of logging remaining supplier nodes rebalance time
 IGNITE-9119 – Missed dumpDebugInfo for rebalance


was (Author: mmuzaf):
*Improvements created*

IGNITE-9149 – Get rid of logging remaining supplier nodes rebalance time
 IGNITE-9119 – Missed dumpDebugInfo for rebalance

> Re-balancing is cancelled if client node joins
> --
>
> Key: IGNITE-7165
> URL: https://issues.apache.org/jira/browse/IGNITE-7165
> Project: Ignite
>  Issue Type: Bug
>Reporter: Mikhail Cherkasov
>Assignee: Maxim Muzafarov
>Priority: Critical
>  Labels: rebalance
> Fix For: 2.7
>
>
> Re-balancing is canceled if client node joins. Re-balancing can take hours 
> and each time when client node joins it starts again:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Added new node to topology: TcpDiscoveryNode 
> [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 
> 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0, 
> /172.31.16.213:0], discPort=0, order=36, intOrder=24, 
> lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, 
> isClient=true]
> [15:10:05,701][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Topology snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef, 
> customEvt=null, allowMerge=true]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Finish exchange future [startVer=AffinityTopologyVersion [topVer=36, 
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> err=null]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Finished 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false]
> [15:10:05,703][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion 
> [topVer=36, minorTopVer=0], evt=NODE_JOINED, 
> node=979cf868-1c37-424a-9ad1-12db501f32ef]
> [15:10:08,706][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion 
> [topVer=35, minorTopVer=0]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing scheduled [order=[statementp]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing started [top=null, evt=NODE_JOINED, 
> node=a8be3c14-9add-48c3-b099-3fd304cfdbf4]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=2f6bde48-ffb5-4815-bd32-df4e57dc13e0, partitionsCount=18, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=35d01141-4dce-47dd-adf6-a4f3b2bb9da9, partitionsCount=15, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=b3a8be53-e61f-4023-a906-a265923837ba, partitionsCount=15, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=f825cb4e-7dcc-405f-a40d-c1dc1a3ade5a, partitionsCount=12, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=4ae1db91-8b88-4180-a84b-127a303959e9, partitionsCount=11, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,708][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=7c286481-7638-49e4-8c68-fa6aa65d8b76, partitionsCount=18, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> so in clusters with a big amount of data and 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-07-27 Thread Maxim Muzafarov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541426#comment-16541426
 ] 

Maxim Muzafarov edited comment on IGNITE-7165 at 7/27/18 7:34 AM:
--

h5. Changes ready
 * TC: 
 * PR: [#4442|https://github.com/apache/ignite/pull/4442]
 * Upsource: 
[IGNT-CR-699|https://reviews.ignite.apache.org/ignite/review/IGNT-CR-699]

h5. Implementation details
 # _Keep the topology version to rebalance on (now it is not the last topology version)_
 To calculate the affinity assignment difference with the last topology version we should save the version on which the rebalance is currently running.
 # _REPLICATED cache processing_
 Affinity assignment for this type of cache never changes. We don't need to stop rebalance for such a cache each time a new topology version arrives. Rebalance should run only once, except when a node from which partitions of this group are being demanded has {{LEFT}} or {{FAIL}}ed the cluster.
 # _EMPTY assignments handling_
 Whenever the {{generateAssignments}} method determines there is no difference with the current topology version (returns an empty map), no matter how affinity changed, we should return a successful result as fast as possible.
 # _RENTING/EVICTING partitions after PME_
 PME prepares partitions to be {{RENTED}} or {{EVICTED}} if they are not assigned to the local node according to the new affinity calculation. Processing a stale supply message (built on a previous version) can lead to exceptions when getting partitions on the local node in an incorrect state. That's why a stale {{GridDhtPartitionSupplyMessage}} must be ignored by the {{Demander}} (see the sketch after this list).
 # _Supply context map clearing changed_
 Previously, the supply context map was cleared after each topology version change. Since rebalance can now be performed on a version other than the latest, this behavior had to change: the context is cleared only for nodes that left or failed the topology.
 # _{{LEFT}} or {{FAIL}} nodes from cluster (rebalance restart)_
 If the rebalance future demands partitions from nodes which have left the cluster, the rebalance must be restarted.
 # _OWNING → MOVING on coordinator due to obsolete partition update counter_
 Affinity assignment can have no changes while a rebalance is currently running. The coordinator performs PME and, after merging all SingleMessages, marks partitions with an obsolete update sequence to be demanded from remote nodes (by changing the partition state OWNING -> MOVING). We should schedule a new rebalance in this case.
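
A minimal, self-contained sketch of the stale-supply check from item 4 (and of keeping the rebalance topology version from item 1). Class and method names below are illustrative assumptions, not the actual {{GridDhtPartitionDemander}} code:

{code:java}
// Illustrative sketch only: simplified stand-ins for AffinityTopologyVersion and
// the demander; the real change lives in GridDhtPartitionDemander.
public class StaleSupplySketch {
    /** Simplified topology version: major topVer plus minor counter, ordered lexicographically. */
    static class TopVer implements Comparable<TopVer> {
        final long major;
        final int minor;

        TopVer(long major, int minor) {
            this.major = major;
            this.minor = minor;
        }

        @Override public int compareTo(TopVer o) {
            int cmp = Long.compare(major, o.major);
            return cmp != 0 ? cmp : Integer.compare(minor, o.minor);
        }
    }

    /** Topology version the currently running rebalance was started on (item 1). */
    private final TopVer rebalanceTopVer;

    StaleSupplySketch(TopVer rebalanceTopVer) {
        this.rebalanceTopVer = rebalanceTopVer;
    }

    /** A supply message built on an older topology version must be skipped (item 4). */
    boolean stale(TopVer supplyMsgTopVer) {
        return supplyMsgTopVer.compareTo(rebalanceTopVer) < 0;
    }

    public static void main(String[] args) {
        StaleSupplySketch demander = new StaleSupplySketch(new TopVer(36, 0));

        // A message prepared before the last PME (topVer=35) is ignored, so it cannot
        // touch partitions that have already moved to RENTING/EVICTED on this node.
        System.out.println(demander.stale(new TopVer(35, 0))); // true
        System.out.println(demander.stale(new TopVer(36, 0))); // false
    }
}
{code}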


was (Author: mmuzaf):
h5. Changes ready
 * TC: [#2722 (14 Jul 18 
19:46)|https://ci.ignite.apache.org/viewLog.html?buildId=1497012=buildResultsDiv=IgniteTests24Java8_RunAll]
 * PR: [#4097|https://github.com/apache/ignite/pull/4097]
 * Upsource: 
[IGNT-CR-670|https://reviews.ignite.apache.org/ignite/review/IGNT-CR-670]

h5. Implementation details
 # _Keep topology version to demand (now it's not the last topology version)_
 To calculate the affinity assignment difference with the last topology version we should save the version on which the rebalance is currently running. Updating this version from the exchange thread after PME keeps us away from unnecessary processing of stale supply messages.
 # _{{RebalanceFuture.demanded}} to process cache groups independently_
 We have a long chain for starting the rebalance process of cache groups built by the {{addAssignments}} method (e.g. {{ignite-sys-cache -> cacheR -> cacheR3 -> cacheR2}}). If the rebalance has started but the initial demand message for some groups has not been sent yet (e.g. due to long cleaning/evicting of previous groups), it can be easily cancelled and a new rebalance future started.
 # _REPLICATED cache processing_
 Affinity assignment for this type of cache never changes. We don't need to stop rebalance for such a cache each time a new topology version arrives. Rebalance should run only once, except when a node from which partitions of this group are being demanded has {{LEFT}} or {{FAIL}}ed the cluster.
 # _EMPTY assignments handling_
 Whenever the {{generateAssignments}} method determines there is no difference with the current topology version (returns an empty map), no matter how affinity changed, we should return a successful result as fast as possible.
 # _Pending exchanges handling (cancelled assignments)_
 The exchange thread can have pending exchanges in its queue ({{hasPendingExchanges}} method). If such pending exchanges exist, starting a new rebalance routine has no meaning and we should skip the rebalance. In our case these pending exchanges can cause no affinity assignment partition changes, and that's why we do not need to cancel the current rebalance future.
 # _RENTING/EVICTING partitions after PME_
 PME prepares partitions to be {{RENTED}} or {{EVICTED}} if they are not assigned to the local node according to the new affinity calculation. Processing a stale supply message (on previous versions) can lead to exceptions when getting partitions on the local node in an incorrect state. That's why 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-07-14 Thread Maxim Muzafarov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541426#comment-16541426
 ] 

Maxim Muzafarov edited comment on IGNITE-7165 at 7/14/18 7:54 PM:
--

h5. Changes ready
 * TC: [#2722 (14 Jul 18 
19:46)|https://ci.ignite.apache.org/viewLog.html?buildId=1497012=buildResultsDiv=IgniteTests24Java8_RunAll]
 * PR: [#4097|https://github.com/apache/ignite/pull/4097]
 * Upsource: 
[IGNT-CR-670|https://reviews.ignite.apache.org/ignite/review/IGNT-CR-670]

h5. Implementation details
 # _Keep topology version to demand (now it's not the last topology version)_
 To calculate the affinity assignment difference with the last topology version we should save the version on which the rebalance is currently running. Updating this version from the exchange thread after PME keeps us away from unnecessary processing of stale supply messages.
 # _{{RebalanceFuture.demanded}} to process cache groups independently_
 We have a long chain for starting the rebalance process of cache groups built by the {{addAssignments}} method (e.g. {{ignite-sys-cache -> cacheR -> cacheR3 -> cacheR2}}). If the rebalance has started but the initial demand message for some groups has not been sent yet (e.g. due to long cleaning/evicting of previous groups), it can be easily cancelled and a new rebalance future started.
 # _REPLICATED cache processing_
 Affinity assignment for this type of cache never changes. We don't need to stop rebalance for such a cache each time a new topology version arrives. Rebalance should run only once, except when a node from which partitions of this group are being demanded has {{LEFT}} or {{FAIL}}ed the cluster.
 # _EMPTY assignments handling_
 Whenever the {{generateAssignments}} method determines there is no difference with the current topology version (returns an empty map), no matter how affinity changed, we should return a successful result as fast as possible.
 # _Pending exchanges handling (cancelled assignments)_
 The exchange thread can have pending exchanges in its queue ({{hasPendingExchanges}} method). If such pending exchanges exist, starting a new rebalance routine has no meaning and we should skip the rebalance (see the sketch after this list). In our case these pending exchanges can cause no affinity assignment partition changes, and that's why we do not need to cancel the current rebalance future.
 # _RENTING/EVICTING partitions after PME_
 PME prepares partitions to be {{RENTED}} or {{EVICTED}} if they are not assigned to the local node according to the new affinity calculation. Processing a stale supply message (on previous versions) can lead to exceptions when getting partitions on the local node in an incorrect state. That's why a stale {{GridDhtPartitionSupplyMessage}} must be ignored by the {{Demander}}.
 # _Supply context map clearing changed_
 Previously, the supply context map was cleared after each topology version change. Since rebalance can now be performed on a version other than the latest, this behavior had to change: the context is cleared only for nodes that left or failed the topology.
 # _{{LEFT}} or {{FAIL}} nodes from cluster (rebalance restart)_
 If the rebalance future demands partitions from nodes which have left the cluster, the rebalance must be restarted.
 # _OWNING → MOVING on coordinator due to obsolete partition update counter_
 Affinity assignment can have no changes while a rebalance is currently running. The coordinator performs PME and, after merging all SingleMessages, marks partitions with an obsolete update sequence to be demanded from remote nodes (by changing the partition state OWNING -> MOVING). We should schedule a new rebalance in this case.
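
A minimal sketch of the pending-exchange check from item 5: if the exchange worker still has queued exchanges, a new rebalance routine is not started at all, so there is nothing to cancel later. Names are illustrative assumptions, not the actual exchange-manager code:

{code:java}
import java.util.ArrayDeque;
import java.util.Queue;

// Illustrative sketch only: a stand-in for the exchange worker queue consulted
// via hasPendingExchanges() before scheduling a new rebalance routine.
public class PendingExchangeSketch {
    /** Exchange tasks the exchange worker has not processed yet. */
    private final Queue<String> pendingExchanges = new ArrayDeque<>();

    void enqueue(String exchangeTask) {
        pendingExchanges.add(exchangeTask);
    }

    boolean hasPendingExchanges() {
        return !pendingExchanges.isEmpty();
    }

    /** Decide whether to start a rebalance routine for the just-finished exchange. */
    boolean shouldStartRebalance() {
        // A newer exchange will recalculate assignments anyway, so starting
        // (and then cancelling) a rebalance for an already-stale version is pointless.
        return !hasPendingExchanges();
    }

    public static void main(String[] args) {
        PendingExchangeSketch worker = new PendingExchangeSketch();
        System.out.println(worker.shouldStartRebalance()); // true  - queue is empty

        worker.enqueue("NODE_JOINED: client");
        System.out.println(worker.shouldStartRebalance()); // false - wait for the next exchange
    }
}
{code}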


was (Author: mmuzaf):
h5. Changes ready
 * TC: [#2636 (11 Jul 18 
21:20)|https://ci.ignite.apache.org/viewLog.html?buildId=1479780=buildResultsDiv=IgniteTests24Java8_RunAll]
 * PR: [#4097|https://github.com/apache/ignite/pull/4097]
 * Upsource: 
[IGNT-CR-670|https://reviews.ignite.apache.org/ignite/review/IGNT-CR-670]

h5. Implementation details
 # _Keep topology version to demand (now it's not the last topology version)_
 To calculate the affinity assignment difference with the last topology version we should save the version on which the rebalance is currently running. Updating this version from the exchange thread after PME keeps us away from unnecessary processing of stale supply messages.
 # _{{RebalanceFuture.demanded}} to process cache groups independently_
 We have a long chain for starting the rebalance process of cache groups built by the {{addAssignments}} method (e.g. {{ignite-sys-cache -> cacheR -> cacheR3 -> cacheR2}}). If the rebalance has started but the initial demand message for some groups has not been sent yet (e.g. due to long cleaning/evicting of previous groups), it can be easily cancelled and a new rebalance future started.
 # _REPLICATED cache processing_
 Affinity assignment for this type of cache never changes. We don't need to stop rebalance for such a cache each time a new topology version arrives. Rebalance 

[jira] [Comment Edited] (IGNITE-7165) Re-balancing is cancelled if client node joins

2018-07-12 Thread Maxim Muzafarov (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-7165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16541489#comment-16541489
 ] 

Maxim Muzafarov edited comment on IGNITE-7165 at 7/12/18 11:26 AM:
---

I've checked tests (not related to current change):
* IgnitePdsDynamicCacheTest.testRestartAndCreate(fail rate 0,0%)
* 
IgnitePdsCheckpointSimulationWithRealCpDisabledTest.testCheckpointSimulationMultiThreaded
 (fail rate 0,0%) 
* 
GridCachePartitionedDataStructuresFailoverSelfTest.testFairReentrantLockFailsWhenServersLeft
  (fail rate 0,0%) 
* CacheStopAndDestroySelfTest.testClientClose   (fail rate 0,0%)
* CacheStopAndDestroySelfTest.testLocalClose(fail rate 0,0%)
* GridCacheLocalMultithreadedSelfTest.testBasicLocks(fail rate 0,0%) 
* GridCacheLocalMultithreadedSelfTest.testBasicLocks(fail rate 0,0%) 
* IgniteClientReconnectFailoverTest.testReconnectStreamerApi(fail rate 
0,0%) 

 


was (Author: mmuzaf):
Check tests (not related to current change):
* IgnitePdsDynamicCacheTest.testRestartAndCreate(fail rate 0,0%)
* 
IgnitePdsCheckpointSimulationWithRealCpDisabledTest.testCheckpointSimulationMultiThreaded
 (fail rate 0,0%) 
* 
GridCachePartitionedDataStructuresFailoverSelfTest.testFairReentrantLockFailsWhenServersLeft
  (fail rate 0,0%) 
* CacheStopAndDestroySelfTest.testClientClose   (fail rate 0,0%)
* CacheStopAndDestroySelfTest.testLocalClose(fail rate 0,0%)
* GridCacheLocalMultithreadedSelfTest.testBasicLocks(fail rate 0,0%) 
* GridCacheLocalMultithreadedSelfTest.testBasicLocks(fail rate 0,0%) 
* IgniteClientReconnectFailoverTest.testReconnectStreamerApi(fail rate 
0,0%) 

 

> Re-balancing is cancelled if client node joins
> --
>
> Key: IGNITE-7165
> URL: https://issues.apache.org/jira/browse/IGNITE-7165
> Project: Ignite
>  Issue Type: Bug
>Reporter: Mikhail Cherkasov
>Assignee: Maxim Muzafarov
>Priority: Critical
>  Labels: rebalance
> Fix For: 2.7
>
>
> Re-balancing is canceled if client node joins. Re-balancing can take hours 
> and each time when client node joins it starts again:
> [15:10:05,700][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Added new node to topology: TcpDiscoveryNode 
> [id=979cf868-1c37-424a-9ad1-12db501f32ef, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 
> 172.31.16.213], sockAddrs=[/0:0:0:0:0:0:0:1:0, /127.0.0.1:0, 
> /172.31.16.213:0], discPort=0, order=36, intOrder=24, 
> lastExchangeTime=1512907805688, loc=false, ver=2.3.1#20171129-sha1:4b1ec0fe, 
> isClient=true]
> [15:10:05,701][INFO][disco-event-worker-#61%statement_grid%][GridDiscoveryManager]
>  Topology snapshot [ver=36, servers=7, clients=5, CPUs=128, heap=160.0GB]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Started 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false, evt=NODE_JOINED, evtNode=979cf868-1c37-424a-9ad1-12db501f32ef, 
> customEvt=null, allowMerge=true]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionsExchangeFuture]
>  Finish exchange future [startVer=AffinityTopologyVersion [topVer=36, 
> minorTopVer=0], resVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> err=null]
> [15:10:05,702][INFO][exchange-worker-#62%statement_grid%][time] Finished 
> exchange init [topVer=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> crd=false]
> [15:10:05,703][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion 
> [topVer=36, minorTopVer=0], evt=NODE_JOINED, 
> node=979cf868-1c37-424a-9ad1-12db501f32ef]
> [15:10:08,706][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Cancelled rebalancing from all nodes [topology=AffinityTopologyVersion 
> [topVer=35, minorTopVer=0]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing scheduled [order=[statementp]]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridCachePartitionExchangeManager]
>  Rebalancing started [top=null, evt=NODE_JOINED, 
> node=a8be3c14-9add-48c3-b099-3fd304cfdbf4]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=2f6bde48-ffb5-4815-bd32-df4e57dc13e0, partitionsCount=18, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
> [15:10:08,707][INFO][exchange-worker-#62%statement_grid%][GridDhtPartitionDemander]
>  Starting rebalancing [mode=ASYNC, 
> fromNode=35d01141-4dce-47dd-adf6-a4f3b2bb9da9, partitionsCount=15, 
> topology=AffinityTopologyVersion [topVer=36, minorTopVer=0], 
> updateSeq=-1754630006]
>