tl;dr: 1. When a worker OOMs, how can I debug it? It’s almost certainly not my code allocating the memory, and jmap doesn’t work. 2. When one worker dies, why do the workers on all the other supervisors kill themselves?
I have a simple test topology that I can push to GitHub if needed, but it basically goes like this:

- on one side, a samples.json file (multiple lines) and a process that injects those lines into a redis list (“in”), trims the list at 1 million entries, and measures the messages being pushed and the ones being dropped
- on the other side, a process that consumes from a second redis (“out”) and measures the messages being received

On the Storm side I have 5 machines. Each machine runs a supervisor, and zookeeper, nimbus, the UI and the two redis servers are spread amongst them. The redis pusher and puller run next to their redis server, to make sure there is no bottleneck there. Each supervisor is configured for a single worker, the idea being to run everything in the same JVM and avoid messages being serialised between processes on the same machine.

The topology is composed of (a stripped-down sketch of the wiring is at the end of this post):

- redisspout (5*1): consumes from the “in” list and emits batches of 64 byte[] items
- jsondecode bolt (5*4): decodes the batches of JSON and emits batches of Maps
- process bolt (5*2): just adds an item to each map (this is where the real logic would eventually go)
- jsonencode bolt (5*4): encodes the maps back into JSON bytes
- redisbolt (5*1): pushes those messages into the second redis

All bolts use localOrShuffle grouping, to keep data on the same machine as much as possible. The spout generates a message id, and every bolt acks the tuples as required.

Usually everything runs fine for several minutes, until one worker randomly dies with an OOM. While it is down, the workers on the other 4 machines start logging “netty reconnecting to <dead machine>”, and after the 30 retries they commit suicide as well. So for about 5-10 minutes the whole cluster struggles to come back up, with nothing being processed.

Problem 1: I’m not allocating any significant memory on my side, so the OOM can only come from the Storm queues filling up. I suspect this because, before I used message ids and acking, Storm would die almost instantly, overwhelmed by all the data I was injecting into it. The problem is that when I run jmap to get a heap dump, jmap simply refuses, and I’m afraid the Clojure runtime is getting in the way. I’ve also set the JVM flags to dump the heap on OOM (the settings I mean are at the end of this post), but no dump ever appears, presumably for the same reason: the JVM tries to create the dump and fails.

Problem 2: when a worker dies (or when I kill it explicitly), I assumed the other workers, together with zookeeper, would do whatever was needed to rebalance the system. It might hang for a while during the rebalance, but it would keep working. What I actually see is that, because every worker connects to every other worker in a mesh, when one dies they all try to reconnect. After 30 attempts they give up, log “async loop died”, and all the workers die. The supervisors relaunch them and eventually the whole thing gets back to work, but that is several minutes of downtime. If the dead worker does manage to relaunch before the others run out of retries, everything is fine, but several times I’ve seen a worker start up and just log “<id> still hasn’t started”.

Any pointer on how to proceed, or am I picking the wrong hammer for my nail? Thanks in advance.
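For reference, this is roughly how the topology is wired. It is a stripped-down sketch against the Storm 0.9.x (backtype.storm) API; the spout and bolt bodies below are placeholders for my real redis/JSON code, kept only to show the parallelism, the localOrShuffle groupings and the acking pattern.

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

import java.util.Map;
import java.util.UUID;

public class RedisTestTopology {

    // Placeholder spout: the real one pops batches of 64 byte[] items from the redis "in" list.
    public static class RedisSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Object batch = fetchBatchFromRedis(); // stand-in for the real pop of 64 items
            if (batch != null) {
                // emit with a message id so Storm tracks and acks the batch end to end
                collector.emit(new Values(batch), UUID.randomUUID().toString());
            }
        }

        private Object fetchBatchFromRedis() {
            return null; // redis code omitted here
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("batch"));
        }
    }

    // Placeholder bolt: BaseBasicBolt anchors emitted tuples and acks the input automatically
    // once execute() returns; the real jsondecode/process/jsonencode/redisbolt follow this shape.
    public static class PassThroughBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            collector.emit(new Values(input.getValue(0)));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("batch"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("redisspout", new RedisSpout(), 5);          // 5 machines * 1
        builder.setBolt("jsondecode", new PassThroughBolt(), 20)      // 5 * 4
               .localOrShuffleGrouping("redisspout");
        builder.setBolt("process", new PassThroughBolt(), 10)         // 5 * 2
               .localOrShuffleGrouping("jsondecode");
        builder.setBolt("jsonencode", new PassThroughBolt(), 20)      // 5 * 4
               .localOrShuffleGrouping("process");
        builder.setBolt("redisbolt", new PassThroughBolt(), 5)        // 5 * 1
               .localOrShuffleGrouping("jsonencode");

        Config conf = new Config();
        conf.setNumWorkers(5); // one worker per supervisor
        StormSubmitter.submitTopology("redis-test", conf, builder.createTopology());
    }
}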
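These are the kind of heap-dump flags I meant in problem 1. I pass them to the worker JVMs along these lines (the path is just an example, and the same flags can go into worker.childopts in storm.yaml instead):

// Added to the Config in the sketch above; topology.worker.childopts is appended to the
// worker JVM command line, so an OOM should leave a java_pid<pid>.hprof file in /tmp.
conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS,
         "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp");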
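And for problem 2, I believe the 30 reconnect attempts come from the netty transport settings below. As far as I can tell these values are the defaults; on our cluster they live in storm.yaml, and I haven’t verified whether they can be overridden per topology, so the snippet is only there to document the keys I mean:

// Netty client retry knobs, shown with their literal key names; 30 retries with a
// backoff capped at max_wait_ms is roughly what I see in the logs before the
// "async loop died" suicide.
conf.put("storm.messaging.netty.max_retries", 30);
conf.put("storm.messaging.netty.min_wait_ms", 100);
conf.put("storm.messaging.netty.max_wait_ms", 1000);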