If you increase the parallelism ten-fold, tripling the number of workers
is probably not enough. You could either increase the memory per worker
or scale all of your settings (batchSize, maxtasks, numWorkers,
parallelismHint) by the same factor; a rough sketch of both options
follows below. Hope this helps.
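For illustration only, here is a minimal sketch of what that could look like
with the Config API in Storm 0.9.x. The variable names (batchSize, maxtasks,
parallelismHint) are taken from your message; the wrapper class, the scaling
factor and the -Xmx value are just examples, and TOPOLOGY_WORKER_CHILDOPTS is
assumed as the knob for per-worker heap:

    import backtype.storm.Config;

    public class ScalingSketch {                  // hypothetical wrapper, illustration only
        public static Config buildConf() {
            int scale = 10;                       // parallelism increased ten-fold
            int batchSize = 100 * scale;          // was 100 in the working setup
            int maxTasks = 5 * scale;             // was 5
            int numWorkers = 25 * scale;          // was 25
            int parallelismHint = 1 * scale;      // was 1
            // batchSize, maxTasks and parallelismHint would then be passed to your
            // own spout/bolt declarations, as in your current topology code.

            Config conf = new Config();
            conf.setNumWorkers(numWorkers);
            conf.setMaxSpoutPending(5);           // unchanged from the run that works
            // The other option: give each worker more heap than the current 600 MB.
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx1024m");
            return conf;
        }
    }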
Best regards
Thomas
On 1/23/2014 4:20 PM, Jean-Sebastien Vachon wrote:
Hi All,
We've been running storm 0.9.0.1 for some time now in pre-production
and yesterday we started seeing this error in our logs. I've executed
our topology with a reduced number of workers and it is working fine
(5 Spout Pending, batchsize=100, maxtasks=5, numWorkers=25,
parallelismHint=1). However, if I give it more workers and tasks (5
Spout Pending, batchsize=400, maxtasks=40, numWorkers=80,
parallelismHint=10) then the topology seems to hang and starts showing
the error below. This same topology used to run fine with a batchsize of
up to 2000.
java.lang.OutOfMemoryError: GC overhead limit exceeded
    at com.esotericsoftware.kryo.io.Output.toBytes(Output.java:103)
    at backtype.storm.serialization.KryoTupleSerializer.serialize(KryoTupleSerializer.java:28)
    at backtype.storm.daemon.worker$mk_transfer_fn$fn__5686$fn__5690.invoke(worker.clj:108)
    at backtype.storm.util$fast_list_map.invoke(util.clj:801)
    at backtype.storm.daemon.worker$mk_transfer_fn$fn__5686.invoke(worker.clj:108)
    at backtype.storm.daemon.executor$start_batch_transfer__GT_worker_handler_BANG_$fn__3328.invoke(executor.clj:240)
    at backtype.storm.disruptor$clojure_handler$reify__2962.onEvent(disruptor.clj:43)
    at backtype.storm.utils.DisruptorQueue.consumeBatchToCursor(DisruptorQueue.java:87)
    at backtype.storm.utils.DisruptorQueue.consumeBatchWhenAvailable(DisruptorQueue.java:61)
    at backtype.storm.disruptor$consume_batch_when_available.invoke(disruptor.clj:62)
    at backtype.storm.disruptor$consume_loop_STAR_$fn__2975.invoke(disruptor.clj:74)
    at backtype.storm.util$async_loop$fn__444.invoke(util.clj:403)
    at clojure.lang.AFn.run(AFn.java:24)
    at java.lang.Thread.run(Thread.java:662)
Can anyone point me to some configuration options that could help
resolve this issue? Each worker currently has 600 MB of RAM, which
seemed to be sufficient until yesterday.
Buffers are sized using these values:
topology.executor.receive.buffer.size: 16384
topology.executor.send.buffer.size: 16384
topology.receiver.buffer.size: 8
topology.transfer.buffer.size: 32
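(For reference, a minimal sketch of how the same buffer sizes and the
per-worker heap could be set programmatically, assuming the corresponding
constants exist in backtype.storm.Config for 0.9.x; the -Xmx600m flag is an
assumption matching the 600 MB mentioned above, not copied from our cluster
config.)

    import backtype.storm.Config;

    public class CurrentSettingsSketch {          // hypothetical, illustration only
        public static Config currentConf() {
            Config conf = new Config();
            conf.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
            conf.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
            conf.put(Config.TOPOLOGY_RECEIVER_BUFFER_SIZE, 8);
            conf.put(Config.TOPOLOGY_TRANSFER_BUFFER_SIZE, 32);
            // Assumed heap flag matching the 600 MB per worker mentioned above.
            conf.put(Config.TOPOLOGY_WORKER_CHILDOPTS, "-Xmx600m");
            return conf;
        }
    }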
Thanks