Hello Surajeet,

This is a design change in Storm 2.x, made as part of reducing the ZooKeeper dependency. Worker heartbeats are now routed through the supervisor to Nimbus. If the supervisor is down, the worker attempts to send heartbeats directly to Nimbus. In your case, it looks like the change in Nimbus leadership may have taken longer than the heartbeat timeout, so the new Nimbus leader concluded that these workers needed to be rescheduled on other supervisor nodes. The current Storm version lets you fall back to Pacemaker instead of ZooKeeper if you prefer that option, but it requires a Pacemaker setup on the cluster to avoid overloading ZooKeeper.
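
To illustrate the timing issue, here is a minimal Python sketch of the liveness check. It is not Storm's actual code: the `Worker` class, the `workers_considered_dead` function, and all timing values are hypothetical and chosen only for illustration. It shows that a worker which is still running is treated as dead once the gap since its last recorded heartbeat exceeds the timeout, which is what a slow leadership change can cause.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    worker_id: str
    last_heartbeat_s: float  # time of the last heartbeat the leader has on record


def workers_considered_dead(workers, now_s, timeout_s):
    """Return IDs of workers whose last recorded heartbeat is older than the timeout."""
    return [w.worker_id for w in workers if now_s - w.last_heartbeat_s > timeout_s]


# Hypothetical timeline in seconds; every value here is an assumption.
HEARTBEAT_TIMEOUT_S = 30.0
workers = [Worker("node1-worker", last_heartbeat_s=100.0)]

# Fast failover: the new leader checks 10 s after the last heartbeat.
fast = workers_considered_dead(workers, now_s=110.0, timeout_s=HEARTBEAT_TIMEOUT_S)

# Slow failover: the new leader checks 45 s after the last heartbeat,
# so the still-running worker is marked for rescheduling.
slow = workers_considered_dead(workers, now_s=145.0, timeout_s=HEARTBEAT_TIMEOUT_S)

print(fast)
print(slow)
```

In the slow case the original worker process is still alive on its node, so rescheduling leaves two workers for the same topology, which matches what you observed.
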
Please let me know if you have any further questions.

-Kishor

On 2021/05/26 20:22:36, Surajeet Dev <[email protected]> wrote:
> We are currently on Storm 1.2.1 and was in the process of upgrading it to
> Storm 2.2.0
> Observed the below while upgrading it to 2.2.0:
>
> 1) In a storm cluster (4 nodes) with 8 topologies running ( with a mapping
> of 1-1 between worker and topologies), when i bring down nimbus,supervisor
> in one of the node (let's say Node 1, which is not nimbus leader) the
> workers running on that node gets reassigned to other 3, even though it is
> running on that node (Node 1). So i have 2 worker process for the same
> topology running at the same time ( saw the behaviour with or without using
> pacemaker). The worker process does get killed when nimbus and supervisor
> is brought up in Node 1
>
> 2) Observed from worker logs that it sends heartbeat to local supervisor
> and nimbus leader , which with 1.2.1 used to happen using Zookeeper ( i saw
> this behaviour in 2.2.0 with or without using Pacemaker).
> If i bring down nimbus and supervisor on node where nimbus is a leader, it
> reassigns worker processes and in some cases leads to zombie worker
> processess ( is not killed when storm kill is executed)
>
> These above behaviour (reassignment of worker) doesn't happen with Storm
> 1.2.1
>
> Since this is a fundamental design change between 1.x and 2.x , are there
> any documentation which describes it in detail? ( couldn't find from
> Release Notes)
>
