Hi Kishor,

Thanks for the response.
Behaviour is the same running Pacemaker in the Storm cluster. When the
environment is stopped (including Pacemaker) on the node where nimbus is the
leader, it starts reassigning workers even though the workers are running fine.

The issue we are facing is that this behaviour differs from 1.2.1 (and earlier
versions), on which our system was designed. With the same timeout
configurations in 1.2.1, if the Storm environment was brought down on any
node, the workers did not get reassigned.

There is a separate issue after reassignment, where in some cases the
reassignment fails (the supervisor repeatedly tries to launch a worker process
that is already running, eventually resulting in a zombie process). I guess
this is a bug, and it seems to have been fixed in 2.3.0-SNAPSHOT
(https://github.com/apache/storm/commit/07362f566132ff9392f2d320ed6aa4e78509258d).

But I wanted to know whether there is any setting that can fall back to the
pre-2.x mode of communication between nimbus/supervisor/worker. (I had sent
another email earlier, to which you responded, that relates to this same
issue, so I will keep the discussion in this single thread.)

Let me know if you need more info on our cluster setup.

Regards,
Surajeet

On Wed, May 26, 2021 at 5:48 PM Kishor Patil <[email protected]> wrote:

> Hello Surajeet,
>
> This is a design change in Storm 2.x, made as part of reducing the
> ZooKeeper dependency. We route worker heartbeats via the supervisor to
> nimbus; if the supervisor is down, the worker attempts to send heartbeats
> directly to nimbus. In this case, it looks like the change in nimbus
> leadership could have taken longer than the heartbeat timeouts, making the
> new nimbus think these workers required rescheduling on other supervisor
> nodes.
> The current Storm version allows falling back to Pacemaker instead of
> ZooKeeper if you prefer that option, but it requires a Pacemaker setup on
> the cluster to avoid overloading ZooKeeper.
>
> Please let me know if you have any further questions.
>
> -Kishor
>
> On 2021/05/26 20:22:36, Surajeet Dev <[email protected]> wrote:
> > We are currently on Storm 1.2.1 and were in the process of upgrading to
> > Storm 2.2.0. We observed the following while upgrading:
> >
> > 1) In a Storm cluster (4 nodes) with 8 topologies running (with a 1-1
> > mapping between workers and topologies), when I bring down nimbus and
> > the supervisor on one of the nodes (say Node 1, which is not the nimbus
> > leader), the workers running on that node get reassigned to the other 3,
> > even though they are still running on Node 1. So I have 2 worker
> > processes for the same topology running at the same time (I saw this
> > behaviour with or without Pacemaker). The duplicate worker process does
> > get killed when nimbus and the supervisor are brought back up on Node 1.
> >
> > 2) I observed from the worker logs that the worker sends heartbeats to
> > the local supervisor and the nimbus leader, which in 1.2.1 used to
> > happen via ZooKeeper (I saw this behaviour in 2.2.0 with or without
> > Pacemaker).
> > If I bring down nimbus and the supervisor on the node where nimbus is
> > the leader, it reassigns worker processes, and in some cases this leads
> > to zombie worker processes (not killed when storm kill is executed).
> >
> > The above behaviour (reassignment of workers) doesn't happen with Storm
> > 1.2.1.
> >
> > Since this is a fundamental design change between 1.x and 2.x, is there
> > any documentation which describes it in detail? (I couldn't find it in
> > the Release Notes.)
> >
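P.S. For anyone else following this thread, here is a minimal storm.yaml
sketch of the Pacemaker fallback Kishor describes, together with the
heartbeat/timeout settings that govern when nimbus declares workers dead and
reschedules them. The hostnames are placeholders, the timeout values shown
are the Storm defaults, and the storm.cluster.state.store factory string is
the one given in the Storm Pacemaker documentation; please verify all of
these against your own distribution before relying on them.

    # storm.yaml -- on nimbus, supervisor, and worker nodes

    # Store worker heartbeats in Pacemaker instead of ZooKeeper.
    # NOTE: factory string as given in the Storm Pacemaker docs;
    # confirm it matches your Storm version.
    storm.cluster.state.store: "org.apache.storm.pacemaker.pacemaker_state_factory"
    pacemaker.servers: ["pacemaker-host-1", "pacemaker-host-2"]  # placeholder hosts
    pacemaker.port: 6699                                         # default port

    # Timeouts that control worker reassignment (Storm defaults shown).
    task.heartbeat.frequency.secs: 3    # how often executors heartbeat
    nimbus.task.timeout.secs: 30        # nimbus reschedules a worker silent this long
    nimbus.supervisor.timeout.secs: 60  # nimbus considers a supervisor dead after this
    supervisor.worker.timeout.secs: 30  # supervisor restarts a silent worker after this

Raising nimbus.task.timeout.secs may buy headroom while nimbus leadership
changes hands, at the cost of slower detection of genuinely dead workers; it
changes how forgiving the 2.x heartbeat path is, not the path itself.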
