We recently set this property to a large number. I also reduced the number of open slots before deleting the on-disk/ZooKeeper state; I just forgot about that step. So I think that is what is causing this.

From: [email protected] At: 06/06/18 11:07:01To:  Mitchell Rathbun 
(BLOOMBERG/ 731 LEX ) ,  [email protected]
Subject: Re: Nimbus repeatedly crashing to issue with disk/ZooKeeper resources

Yes, the config is supervisor.slots.ports. If you only have one node, I really have no idea why it would think there are so many free slots.
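
For context, supervisor.slots.ports in storm.yaml is a list of ports, and each listed port is one worker slot that the supervisor on that node advertises. A typical entry (these four ports are the stock defaults) looks like:

```yaml
# storm.yaml: each port listed here is one worker slot on this node
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
```

A node with this list advertises 4 slots; listing hundreds of ports means hundreds of slots for the scheduler to walk through.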

- Bobby

On Wed, Jun 6, 2018 at 10:02 AM Mitchell Rathbun (BLOOMBERG/ 731 LEX) 
<[email protected]> wrote:

These slots are controlled by the config property supervisor.slots.ports, 
right? We only have one node per cluster currently (Nimbus, Supervisor, and 
Worker processes all run on the same machine).

From: [email protected] At: 06/06/18 10:58:35
To:  Mitchell Rathbun (BLOOMBERG/ 731 LEX ) 
Cc:  [email protected]

Subject: Re: Nimbus repeatedly crashing due to issue with disk/ZooKeeper resources

In the case of the EvenScheduler it is all of the free slots in the cluster.  
So it is however many slots exist across all of the nodes in the cluster that 
don't have anything scheduled on them.

It should be proportional to the number of nodes in your cluster.
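
To illustrate why the count scales with nodes and ports: a slot is just a (node, port) pair, and it is free when no assignment occupies it. A minimal sketch (hypothetical names, not the actual Storm Cluster/scheduler API):

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FreeSlots {
    // Count (node, port) pairs that no assignment occupies.
    // usedSlots holds occupied slots encoded as "node:port".
    static int countFreeSlots(Map<String, List<Integer>> portsPerNode,
                              Set<String> usedSlots) {
        int free = 0;
        for (Map.Entry<String, List<Integer>> e : portsPerNode.entrySet()) {
            for (int port : e.getValue()) {
                if (!usedSlots.contains(e.getKey() + ":" + port)) {
                    free++;
                }
            }
        }
        return free;
    }
}
```

With one node, four configured ports, and one of them assigned, the scheduler would see 3 free slots; the count grows linearly with both nodes and ports per node.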

- Bobby

On Wed, Jun 6, 2018 at 9:48 AM Mitchell Rathbun (BLOOMBERG/ 731 LEX) 
<[email protected]> wrote:

What determines the number of slots that we want to schedule on Nimbus startup? 
Is it existing worker processes at the time Nimbus is brought up, or is it a 
config property like supervisor.slots.ports?

From: [email protected] At: 06/06/18 10:37:32To:  Mitchell Rathbun (BLOOMBERG/ 
731 LEX ) ,  [email protected]
Subject: Re: Nimbus repeatedly crashing to issue with disk/ZooKeeper resources

The issue is that interleave-all is a recursive function.

https://github.com/apache/storm/blob/e40d213de7067f7d3aa4d4992b81890d8ed6ff31/storm-core/src/clj/org/apache/storm/util.clj#L776-L784

So the depth of the stack trace is the number of slots you want to schedule, 
times 3, because of how the recursion happens.

Sadly, the latest code is the same, just in Java, so it is not times 3, but it 
is still bad.

https://github.com/apache/storm/blob/3e098f12e2b09d4954eeeaaf807e4ff6006a6929/storm-server/src/main/java/org/apache/storm/utils/ServerUtils.java#L113-L130

So if you want to file a JIRA for us to fix this, that would be great.  Even 
better if you could look at making interleaveAll no longer recursive.
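
If someone picks this up, here is one possible shape for a non-recursive version, sketched as standalone Java against java.util lists (an illustration, not a drop-in patch for ServerUtils.interleaveAll): draw one element from each non-exhausted input per round, so the output order matches the recursive definition while the stack depth stays constant.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class InterleaveAll {
    // Round-robin merge: take one element from each input per pass,
    // dropping inputs as they run out. Same ordering as the recursive
    // interleave-all, but with O(1) stack depth.
    static <T> List<T> interleaveAll(List<List<T>> inputs) {
        List<T> out = new ArrayList<>();
        List<Iterator<T>> iters = new ArrayList<>();
        for (List<T> input : inputs) {
            if (!input.isEmpty()) {
                iters.add(input.iterator());
            }
        }
        while (!iters.isEmpty()) {
            Iterator<Iterator<T>> round = iters.iterator();
            while (round.hasNext()) {
                Iterator<T> it = round.next();
                out.add(it.next());
                if (!it.hasNext()) {
                    round.remove(); // this input is exhausted
                }
            }
        }
        return out;
    }
}
```

For inputs [1, 4], [2, 5, 6], [3] this yields [1, 2, 3, 4, 5, 6], the same order the recursive Clojure version produces.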

Thanks,

Bobby

On Tue, Jun 5, 2018 at 10:43 PM Mitchell Rathbun (BLOOMBERG/ 731 LEX) 
<[email protected]> wrote:


From: Mitchell Rathbun (BLOOMBERG/ 731 LEX) At: 06/05/18 23:42:02
To: Mitchell Rathbun (BLOOMBERG/ 731 LEX)
Subject: Nimbus repeatedly crashing due to issue with disk/ZooKeeper resources
Recently, our Nimbus crashed with a stack overflow error, and we are having 
some difficulty determining what the initial cause was. I have attached the 
stack trace to help with the debugging. This same stack trace occurred every 
time I ran Nimbus. I then deleted everything in the directory specified by 
storm.local.dir and removed everything in ZooKeeper under the 
storm.zookeeper.root path. I was then able to successfully run Nimbus. So this 
points to there being an issue with the data/state that Nimbus keeps. Has this 
issue been seen before, and how could the state reach a point that would 
prevent Nimbus from running at all? Is it possible that there was not enough 
disk/zk space, even though the logs don't really point to this being the issue?

