These slots are controlled by the config property supervisor.slots.ports, right? We only have one node per cluster currently (Nimbus, Supervisor, and Worker processes all run on the same machine).
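For context, on a single node the slot count is simply the number of ports listed under supervisor.slots.ports in storm.yaml — one worker slot per port. The values below are the stock Storm defaults, shown only as an illustration, not your actual configuration:

```yaml
# storm.yaml (illustrative): each listed port is one worker slot on this node
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
```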
From: [email protected]
At: 06/06/18 10:58:35
To: Mitchell Rathbun (BLOOMBERG/ 731 LEX)
Cc: [email protected]
Subject: Re: Nimbus repeatedly crashing due to issue with disk/ZooKeeper resources

In the case of the EvenScheduler it is all of the free slots in the cluster — that is, however many slots there are across all of the nodes in the cluster that don't have anything scheduled on them. It should be proportional to the number of nodes in your cluster.

- Bobby

On Wed, Jun 6, 2018 at 9:48 AM Mitchell Rathbun (BLOOMBERG/ 731 LEX) <[email protected]> wrote:

What determines the number of slots that we want to schedule on Nimbus startup? Is it the existing worker processes at the time Nimbus is brought up, or is it a config property like supervisor.slots.ports?

From: [email protected]
At: 06/06/18 10:37:32
To: Mitchell Rathbun (BLOOMBERG/ 731 LEX), [email protected]
Subject: Re: Nimbus repeatedly crashing due to issue with disk/ZooKeeper resources

The issue is that interleave-all is a recursive function:
https://github.com/apache/storm/blob/e40d213de7067f7d3aa4d4992b81890d8ed6ff31/storm-core/src/clj/org/apache/storm/util.clj#L776-L784

So the depth of the stack is the number of slots you want to schedule, * 3 because of how the recursion happens. Sadly the latest code has the same problem; it is now in Java, so it is not * 3, but it is still bad:
https://github.com/apache/storm/blob/3e098f12e2b09d4954eeeaaf807e4ff6006a6929/storm-server/src/main/java/org/apache/storm/utils/ServerUtils.java#L113-L130

So if you want to file a JIRA for us to fix this, that would be great. Even better if you could look at making interleaveAll no longer recursive.
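A non-recursive interleaveAll, as Bobby suggests, could look roughly like the sketch below. This is a hedged illustration, not Storm's actual code: the class name InterleaveAll is made up, and the method round-robins over iterators with a loop so the stack depth stays constant regardless of how many slots are being scheduled.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class InterleaveAll {
    // Iterative interleave-all: take one element from each non-empty
    // list in turn, dropping lists as they are exhausted, until every
    // element has been consumed. No recursion, so no stack growth.
    public static <T> List<T> interleaveAll(List<List<T>> lists) {
        List<T> result = new ArrayList<>();
        List<Iterator<T>> iters = new ArrayList<>();
        for (List<T> l : lists) {
            if (l != null && !l.isEmpty()) {
                iters.add(l.iterator());
            }
        }
        while (!iters.isEmpty()) {
            Iterator<Iterator<T>> outer = iters.iterator();
            while (outer.hasNext()) {
                Iterator<T> it = outer.next();
                if (it.hasNext()) {
                    result.add(it.next());
                } else {
                    outer.remove(); // this list is exhausted; stop visiting it
                }
            }
        }
        return result;
    }
}
```

Matching the Clojure interleave-all semantics, interleaveAll of [1 4], [2 5 6], [3] yields [1 2 3 4 5 6] — longer lists keep contributing after shorter ones run out.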
Thanks,
Bobby

On Tue, Jun 5, 2018 at 10:43 PM Mitchell Rathbun (BLOOMBERG/ 731 LEX) <[email protected]> wrote:

From: Mitchell Rathbun (BLOOMBERG/ 731 LEX)
At: 06/05/18 23:42:02
To: Mitchell Rathbun (BLOOMBERG/ 731 LEX)
Subject: Nimbus repeatedly crashing due to issue with disk/ZooKeeper resources

Recently, our Nimbus crashed with a stack overflow error, and we are having some difficulty determining the initial cause. I have attached the stack trace to help with debugging. The same stack trace occurred every time I ran Nimbus. I then deleted everything in the directory specified by storm.local.dir and removed everything in ZooKeeper under the storm.zookeeper.root path. After that I was able to run Nimbus successfully. So this points to an issue with the data/state that Nimbus keeps. Has this issue been seen before, and how could the state reach a point that would prevent Nimbus from running at all? Is it possible that there was not enough disk/ZooKeeper space, even though the logs don't really point to this being the issue?
