Hi Kishore,

Thank you for the confirmation. Yes, we solved it along similar lines
(listening for the disconnect event from ZK), and it worked for us.

From the double-assignment point of view, is this expected behavior in
Helix, with users responsible for handling it? Are there any plans to fix
it in the future?

I ask because, from what I have observed, when the network is flapping Helix
does handle it: disconnect() ends up calling reset() on the partition(s). Why
not in this case as well?

void org.apache.helix.manager.zk.ZkHelixConnection.handleStateChanged(KeeperState state) throws Exception

if (isFlapping()) {
  LOG.error("helix-connection: " + this + ", sessionId: " + _sessionId
      + " is flapping. diconnect it. " + " maxDisconnectThreshold: "
      + _maxDisconnectThreshold + " disconnects in " + _flappingTimeWindowMs + "ms");
  disconnect();
}
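
To make the comparison concrete, here is a minimal sketch of how a flapping check of this shape could work: count disconnects inside a sliding time window and report flapping once the count exceeds a threshold. This is not Helix's actual implementation. FlappingDetector and recordDisconnect are hypothetical names; only maxDisconnectThreshold and flappingTimeWindowMs are borrowed from the log message above, and their exact semantics here are an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sliding-window flapping check; not taken from the Helix source. */
final class FlappingDetector {
    private final int maxDisconnectThreshold;   // assumed: disconnects tolerated per window
    private final long flappingTimeWindowMs;    // assumed: window length in milliseconds
    private final Deque<Long> disconnectTimes = new ArrayDeque<>();

    FlappingDetector(int maxDisconnectThreshold, long flappingTimeWindowMs) {
        this.maxDisconnectThreshold = maxDisconnectThreshold;
        this.flappingTimeWindowMs = flappingTimeWindowMs;
    }

    /** Records a disconnect at nowMs and returns true if the connection now counts as flapping. */
    synchronized boolean recordDisconnect(long nowMs) {
        disconnectTimes.addLast(nowMs);
        // Drop disconnects that have fallen out of the time window.
        while (!disconnectTimes.isEmpty()
                && nowMs - disconnectTimes.peekFirst() > flappingTimeWindowMs) {
            disconnectTimes.removeFirst();
        }
        return disconnectTimes.size() > maxDisconnectThreshold;
    }
}
```

The point of the comparison is that a check like this only fires on repeated disconnects; a single long disconnect never crosses the threshold, which matches the case we hit.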



Thanks & Regards,
Subramanian.

Tel: +1 (650) 424 4655

3400 Hillview Ave, Building 4
Palo Alto, CA 94304
www.integral.com

NOTICE: This e-mail message and any attachments, which may contain confidential 
information, are to be viewed solely by the intended recipient of Integral 
Development Corp. For further information, please visit 
http://www.integral.com/about/disclaimer.html.



From: kishore g [mailto:[email protected]]
Sent: Wednesday, January 25, 2017 4:45 PM
To: [email protected]
Cc: [email protected]
Subject: Re: Double assignment, when participant is not able to establish
connection with zookeeper quorum

After a few seconds, the participant N1 gets a disconnect event from ZK. At
this time, schedule a timer task for (30 - X) seconds. 30 is the session
timeout in seconds, and X can vary from 0 to 30 depending on how long you are
OK with P1 being down.

When the timer task kicks in and N1 is still disconnected from the cluster, 
assume that this N1 is no longer the owner of P1.

After 30 seconds, Helix will notice that N1 is network partitioned and will 
assign P1 to N2.
This will ensure that there is no overlap.
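
The timer approach described above can be sketched roughly as follows. This is only an illustration under assumptions, not Helix API: OwnershipGuard, onDisconnected, onReconnected and the relinquish callback are hypothetical names, marginMs plays the role of X, and the timeouts are in milliseconds rather than seconds.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: gives up local ownership before the ZK session can expire. */
final class OwnershipGuard {
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "ownership-guard");
            t.setDaemon(true); // do not keep the JVM alive just for this timer
            return t;
        });
    private final long relinquishDelayMs; // session timeout minus X
    private final Runnable relinquish;    // e.g. stop serving P1 locally
    private ScheduledFuture<?> pending;
    private long epoch;                   // invalidates timers from earlier disconnects
    private boolean owner = true;

    OwnershipGuard(long sessionTimeoutMs, long marginMs, Runnable relinquish) {
        if (marginMs < 0 || marginMs > sessionTimeoutMs) {
            throw new IllegalArgumentException("margin must be within [0, sessionTimeout]");
        }
        this.relinquishDelayMs = sessionTimeoutMs - marginMs;
        this.relinquish = relinquish;
    }

    /** Call when the ZK client reports a disconnect. */
    synchronized void onDisconnected() {
        if (!owner || pending != null) return; // nothing to guard, or timer already running
        final long myEpoch = ++epoch;
        pending = timer.schedule(() -> giveUp(myEpoch), relinquishDelayMs, TimeUnit.MILLISECONDS);
    }

    /** Call when the ZK client reconnects before the timer fires. */
    synchronized void onReconnected() {
        epoch++; // a timer that is about to fire will see a stale epoch and do nothing
        if (pending != null) {
            pending.cancel(false);
            pending = null;
        }
    }

    private synchronized void giveUp(long scheduledEpoch) {
        if (scheduledEpoch != epoch || !owner) return;
        owner = false; // still disconnected: assume N1 no longer owns P1
        pending = null;
        relinquish.run();
    }

    synchronized boolean isOwner() {
        return owner;
    }
}
```

In a real participant, onDisconnected and onReconnected would presumably be driven by the ZooKeeper client's connection-state callback (KeeperState.Disconnected and KeeperState.SyncConnected), and ownership would only be regained through a normal Helix state transition once the node has rejoined the cluster.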

Will that work for you?


On Wed, Jan 25, 2017 at 4:17 PM, Subramanian Raghunathan wrote:
Hi,

Double assignment, when participant is not able to establish connection with
zookeeper quorum

Following is the setup.

Version(s):
    Helix: 0.7.1
    Zookeeper: 3.3.4

- State Model: OnlineOffline
- Controller (leader elected from one of the cluster nodes)
- Single resource with partitions
- Full auto rebalancer
- Zookeeper quorum (3 nodes)

When one participant loses the zookeeper connection (it is not able to connect
to any of the zookeepers; a typical occurrence we faced was a switch failure
on that rack):

  ---->  The partition (P1) for which this participant (say Node N1) is online
         is still maintained on N1.

Meanwhile, since N1 loses its ephemeral node in zookeeper, the rebalancer gets
triggered and reallocates the partition (P1) to another participant node
(say Node N2), which becomes online at time T1.

  ---->  After this, both N1 and N2 are acting as online for the same
         partition (P1).

But as soon as the participant (Node N1) is able to re-establish the
zookeeper connection, at time T2:

  ---->  Reset gets called on the partition in the participant (Node N1).

Double assignment:
The question here: is it expected behavior that both nodes N1 and N2 can
be online for the same partition (P1) between T1 and T2? Any responses
would be appreciated.

Thanks & Regards,
Subramanian.




