Hello everybody,

We at Forter have been advocates of Storm for a while now; we use it in production for a low-latency decision-as-a-service engine. We use Python multilang bolts a lot, and have noticed some performance issues when passing large objects into them. Basically, what I'm seeing in high-load environments is that it takes ~50-60ms to pass a 512kb tuple into a shell-bolt (one way). We're very latency-driven here at Forter, and we're at a point where we need to do better than that. I played around with the code and ran a few benchmarks. I'd like to share my experience and conclusions here: maybe it will help someone looking at the same issue in the future, or I'll get some input and ideas from your experience. I made the benchmark code available at https://github.com/forter/storm-python-shellbolt-benchmark.
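For context, the multilang protocol frames each message on the pipe as JSON followed by a line containing just "end", so the bolt has to scan and parse the whole payload for every tuple. A rough sketch of that read path (not the actual storm.py code, just an illustration of where the per-tuple cost sits):

```python
import sys
import json

def read_message(stream=sys.stdin):
    """Read one multilang message: JSON lines terminated by a line 'end'."""
    lines = []
    while True:
        line = stream.readline()
        if not line:
            return None  # EOF
        line = line.rstrip("\n")
        if line == "end":
            break
        lines.append(line)
    # With ~512kb tuples, this json.loads call on the full payload is
    # where most of the per-tuple time goes.
    return json.loads("\n".join(lines))
```

So the I/O itself is cheap; the per-message JSON decode is the part that scales with tuple size.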
Granted, 512kb is big, but still: 50ms works out to a rate of only ~10mb/s. And that's roughly the rate for smaller payloads too. That's extremely low considering that piping into another process can easily sustain data rates of 1GB/s+ with a true streaming protocol.

I wrote a Java app that simulates a Storm bolt and transfers data into the bolt at the maximum rate over stdout. All tests were conducted on my machine, so consider them ballpark estimates:

- The maximum rate I could squeeze out of the default Python bolt implementation is ~55mb/s with ~512kb tuple input & output (output directed to /dev/null; the object involved was always flat).
- Reading huge buffers into a Python process, with no line separation or parsing, I can get a Python process to read 2GB/s (all tests were done with the Java app, so the write/flush pattern on the writing end was the same).
- Reading standard input line by line in Python (just reading, still no parsing) gives a throughput of 350mb/s. That seems like a big drop just for line separation, considering that every second line written is 512kb. And yet, this is 6x faster than with the bolt logic included. Since serialization & deserialization are the main thing the bolt does, it's clear that serialization is bottlenecking the Python process and preventing I/O from being processed.
- I tried the Node.js bolt implementation too, but got the same result (55mb/s). That was surprising; I expected Node to be at least more efficient than Python at parsing JSON. BTW, removing the JSONification at the ends (emits) of both bolts and retrying brought Node.js up to 80mb/s and Python up to 115mb/s.
- Then I benchmarked a msgpack implementation. I managed to get 400mb/s for the same data with no output serialization (lack of time): ~4x faster than Python without output serialization. Since the size passed in bytes was in the same ballpark as the JSON representation, the increase in performance can only be attributed to msgpack's efficiency over JSON.
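The bulk-read vs. line-by-line gap can be explored without Storm at all. A rough pure-stdlib sketch (timing against an in-memory buffer rather than a real pipe, so absolute numbers will differ from the benchmark repo; the payload shape here is a hypothetical stand-in for the Java writer):

```python
import io
import time

def mb_per_s(read_fn, data):
    """Time read_fn draining an in-memory copy of `data`; return MB/s."""
    buf = io.BytesIO(data)
    start = time.perf_counter()
    read_fn(buf)
    elapsed = time.perf_counter() - start
    return len(data) / elapsed / 1e6

def read_bulk(stream, chunk=1 << 20):
    # Big fixed-size reads: no per-byte work at all.
    while stream.read(chunk):
        pass

def read_lines(stream):
    # Line iteration: the copies are the same, but every byte is
    # scanned for '\n' to find line boundaries.
    for _ in stream:
        pass

# ~50MB of 512kb "tuples", one per line.
payload = (b"x" * (512 * 1024) + b"\n") * 100
```

Comparing `mb_per_s(read_bulk, payload)` against `mb_per_s(read_lines, payload)` shows the same shape of result: line framing costs something, but nowhere near enough to explain dropping from hundreds of mb/s to 55mb/s once parsing is added.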
My conclusion from all this is that effectively single-threaded languages are a bad choice for this kind of interprocess communication. If Python/Node had a way of utilizing more CPU cores, I could have used it to decrease latency by parallelizing the read-deserialize / logic / serialize-write steps. That would serve the high-throughput use case very well, but if latency is what we're concerned with, it wouldn't be a huge improvement either: serialization ops are the bottleneck, and our only hope is to make them more efficient.

Did anybody else encounter the same issue with multilang bolts? Has anyone got a creative solution for Python bolts that get heavy inputs?

Here's how to use the benchmark tool:

Passthrough Python, reads by line without any logic:
java -jar target/shell-bolt-benchmark-1.0-SNAPSHOT.jar json | pv | python passthrough.py > /dev/null

Default Storm implementation, with JSON parsing and all:
java -jar target/shell-bolt-benchmark-1.0-SNAPSHOT.jar json | pv | python storm.py > /dev/null

Msgpack implementation, no output:
java -jar target/shell-bolt-benchmark-1.0-SNAPSHOT.jar msgpack | pv | python storm_msgpack.py > /dev/null

- Re'em Bensimhon
