tl;dr: 1. When a worker OOMs, how can I debug it? It’s almost certainly not my code allocating the memory, and jmap doesn’t work. 2. When one worker dies, why do the workers on all the other supervisors kill themselves?
I have a simple test topology that I can push to GitHub if needed, but it basically goes like this:

- on one side, a samples.json file (multiple lines) and a process that injects those lines into a redis list (“in”), trims the list at 1 million entries, and measures the messages being pushed and the ones being dropped
- on the other side, a process that consumes from a second redis (“out”) and measures the messages being received

On the Storm side I have 5 machines. Each machine runs a supervisor, and zookeeper, nimbus, the UI and the two redis servers are spread amongst them. The redis pusher and puller run next to their redis server, to make sure there is no bottleneck there. Each supervisor is configured for a single worker, the idea being to run everything in the same JVM and avoid messages being serialised between processes on the same machine.

The topology is composed of (a stripped-down sketch of the wiring is at the end of this post):

- redisspout (5*1): consumes from the “in” list and emits batches of 64 byte[] items
- jsondecode bolt (5*4): decodes the batches of JSON and emits batches of Maps
- process bolt (5*2): just adds an item to each map (this is where the real logic would eventually go)
- jsonencode bolt (5*4): encodes the maps back into JSON bytes
- redisbolt (5*1): pushes those messages into the second redis

All bolts use localOrShuffle grouping, to keep data on the same machine as much as possible. The spout generates a message id, and every bolt acks the tuples as required.

Usually everything runs fine for several minutes, until one worker randomly dies with an OOM. While it is down, the workers on the other 4 machines start logging “netty reconnecting to <dead machine>”, and after the 30 retries they commit suicide as well. So for about 5-10 minutes the whole cluster struggles to come back up, with nothing being processed.

Problem 1: I’m not allocating any significant memory on my side, so the OOM can only come from the Storm queues filling up. I suspect this because, before I used message ids and acking, Storm would die almost instantly, overwhelmed by all the data I was injecting into it. The problem is that when I run jmap to get a heap dump, jmap simply refuses, and I’m afraid the Clojure runtime is getting in the way. I’ve also set the JVM flags to dump the heap on OOM (the settings I mean are at the end of this post), but no dump ever appears, presumably for the same reason: the JVM tries to create the dump and fails.

Problem 2: when a worker dies (or when I kill it explicitly), I assumed the other workers, together with zookeeper, would do whatever was needed to rebalance the system. It might hang for a while during the rebalance, but it would keep working. What I actually see is that, because every worker connects to every other worker in a mesh, when one dies they all try to reconnect. After 30 attempts they give up, log “async loop died”, and all the workers die. The supervisors relaunch them and eventually the whole thing gets back to work, but that is several minutes of downtime. If the dead worker does manage to relaunch before the others run out of retries, everything is fine, but several times I’ve seen a worker start up and just log “<id> still hasn’t started”.

Any pointer on how to proceed, or am I picking the wrong hammer for my nail? Thanks in advance.
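For reference, this is roughly how the topology is wired. It is a stripped-down sketch against the Storm 0.9.x (backtype.storm) API; the spout and bolt bodies below are placeholders for my real redis/JSON code, kept only to show the parallelism, the localOrShuffle groupings and the acking pattern.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.Map;
import java.util.UUID;

public class RedisTestTopology {

    // Placeholder spout: the real one pops batches of 64 byte[] items from the redis "in" list.
    public static class RedisSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Object batch = fetchBatchFromRedis(); // stand-in for the real pop of 64 items
            if (batch != null) {
                // emit with a message id so Storm tracks and acks the batch end to end
                collector.emit(new Values(batch), UUID.randomUUID().toString());
            }
        }

        private Object fetchBatchFromRedis() {
            return null; // redis code omitted here
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("batch"));
        }
    }

    // Placeholder bolt: BaseBasicBolt anchors emitted tuples and acks the input automatically
    // once execute() returns; the real jsondecode/process/jsonencode/redisbolt follow this shape.
    public static class PassThroughBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getValue(0)));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("batch"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("redisspout", new RedisSpout(), 5);          // 5 machines * 1
        builder.setBolt("jsondecode", new PassThroughBolt(), 20)      // 5 * 4
               .localOrShuffleGrouping("redisspout");
        builder.setBolt("process", new PassThroughBolt(), 10)         // 5 * 2
               .localOrShuffleGrouping("jsondecode");
        builder.setBolt("jsonencode", new PassThroughBolt(), 20)      // 5 * 4
               .localOrShuffleGrouping("process");
        builder.setBolt("redisbolt", new PassThroughBolt(), 5)        // 5 * 1
               .localOrShuffleGrouping("jsonencode");

        Config conf = new Config();
        conf.setNumWorkers(5); // one worker per supervisor
        StormSubmitter.submitTopology("redis-test", conf, builder.createTopology());
    }
}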
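These are the kind of heap-dump flags I meant in problem 1. I pass them to the worker JVMs along these lines (the path is just an example, and the same flags can go into worker.childopts in storm.yaml instead):

// Added to the Config in the sketch above; topology.worker.childopts is appended to the
// worker JVM command line, so an OOM should leave a java_pid<pid>.hprof file in /tmp.
conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
         "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp");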
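And for problem 2, I believe the 30 reconnect attempts come from the netty transport settings below. As far as I can tell these values are the defaults; on our cluster they live in storm.yaml, and I haven’t verified whether they can be overridden per topology, so the snippet is only there to document the keys I mean:

// Netty client retry knobs, shown with their literal key names; 30 retries with a
// backoff capped at max_wait_ms is roughly what I see in the logs before the
// "async loop died" suicide.
conf.put("storm.messaging.netty.max_retries", 30);
conf.put("storm.messaging.netty.min_wait_ms", 100);
conf.put("storm.messaging.netty.max_wait_ms", 1000);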