Hi,

I have a log-processing topology on Storm 0.9.3, with a Kafka spout and
bolts implemented as ShellBolts via multilang through Pyleus (Python).

Everything runs fine at a normal pace (when there is no big lag to be
consumed). But with an offset lag of more than 100 messages per partition,
the topology stops processing/acking tuples after handling only a few,
around 200 in total.

The processing is very CPU bound, as I need to parse, roll up and aggregate
metrics before sending time-series data to Cassandra. The tuples are also
big, around 0.8 MB each.

The topology has 1 spout and 2 bolts: one bolt that aggregates and another
that just commits. The two bolts are connected by a fields grouping.
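
In Pyleus terms, the topology definition is roughly the following (component
names and the grouping field are simplified placeholders, the Kafka spout
options are omitted, and I'm assuming a shuffle grouping from the spout):

name: log_processing

topology:
    - spout:
        name: kafka-logs
        type: kafka
        options:
            # topic, ZooKeeper hosts, etc. omitted

    - bolt:
        name: aggregate-bolt
        module: bolts.aggregate
        groupings:
            - shuffle_grouping: kafka-logs

    - bolt:
        name: commit-bolt
        module: bolts.commit
        groupings:
            - fields_grouping:
                component: aggregate-bolt
                fields: ["key"]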

My storm.yaml settings:

worker.childopts: "-Xmx1G -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.port=1%ID%
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false -Djava.net.preferIPv4Stack=true"
storm.messaging.netty.max_wait_ms: 3000
storm.messaging.netty.min_wait_ms: 300
storm.messaging.netty.max_retries: 90

storm.messaging.netty.buffer_size: 1048576 #1MB buffer
topology.receiver.buffer.size: 2
topology.transfer.buffer.size: 4
topology.executor.receive.buffer.size: 2
topology.executor.send.buffer.size: 2


I've reduced all the buffer configs because of the size of my tuples, but
that didn't bring any improvement. The topology runs on 2 dedicated servers,
each with 8 cores and 16 GB of RAM. I'm running 12 workers, with a
parallelism hint of 12 for the Kafka spout, 12 for the aggregate bolt, and
12 for the commit bolt.

I've also set max spout pending to 100 and max shell bolt pending to 1000,
but changing these numbers didn't seem to have any real impact.
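For reference, those two settings are (key names as I understand them on
0.9.3):

topology.max.spout.pending: 100
topology.shellbolt.max.pending: 1000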

After a lot of investigation, I've narrowed the problem down to emitting
tuples from the aggregate bolt to the commit bolt. If I simply don't emit
any tuples from the aggregate bolt, everything works fine. I don't know
whether this has anything to do with ShellBolt or with Pyleus itself.
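
To illustrate, the aggregate bolt is essentially the following (heavily
stripped down, with placeholder names for the fields and the roll-up logic);
it is the self.emit() call below that seems to make the difference:

from pyleus.storm import SimpleBolt


def roll_up(metrics):
    # placeholder for the actual CPU-heavy parse/roll-up/aggregation logic
    return metrics


class AggregateBolt(SimpleBolt):
    # fields of the emitted tuple; "key" is what the fields grouping
    # to the commit bolt is based on (names are placeholders)
    OUTPUT_FIELDS = ['key', 'aggregated']

    def process_tuple(self, tup):
        key, metrics = tup.values
        aggregated = roll_up(metrics)
        # If I comment out this emit, the topology drains the lag fine;
        # with it, processing stalls after ~200 tuples.
        # SimpleBolt handles anchoring/acking automatically, as I understand it.
        self.emit((key, aggregated))


if __name__ == '__main__':
    AggregateBolt().run()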

This screenshot
<https://s3-sa-east-1.amazonaws.com/uploads-br.hipchat.com/114883/858008/SfP6b6YsX4k8jHo/screen.jpg>
shows the topology running during a 5-minute window with a lag of around
150 messages on each Kafka partition (the topic has 12 partitions).

Any thoughts on why this is happening?

-- 
hooray!

--
Victor Godoy Poluceno
