Hello,

I have some questions about external workers and the multi-lang protocol. We have a bunch of existing C code for running processing steps over binary data and I'm looking to see how feasible it is to hook it into Storm.


(1) Is it possible to handle binary data with multi-lang? Or is there existing support for hooking C into Storm?

The multi-lang protocol is JSON, which implies either base64-encoding everything or passing around a URL pointing at where the binary data is stored.
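To make the base64 option concrete, here is a minimal sketch (not pystorm's API; the helper names are my own) of wrapping a binary chunk in a multilang-style "emit" message. Note the size cost: base64 inflates data by roughly a third, so a 50MB chunk becomes ~67MB of JSON on the wire.

```python
import base64
import json

# Hypothetical helpers: base64-encode a binary chunk so it can travel
# inside a JSON multilang "emit" message.  Each multilang message is a
# JSON body followed by a line containing only "end".
def encode_emit(chunk: bytes) -> str:
    payload = base64.b64encode(chunk).decode("ascii")
    msg = {"command": "emit", "tuple": [payload]}
    return json.dumps(msg) + "\nend\n"

def decode_field(field: str) -> bytes:
    return base64.b64decode(field)

data = b"\x00\x01\xfe\xffraw bytes"
wire = encode_emit(data)
body = json.loads(wire.rsplit("\nend\n", 1)[0])
assert decode_field(body["tuple"][0]) == data
```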

But looking at the source I see that topology.multilang.serializer is pluggable, so perhaps it's possible to make a version using (e.g.) MsgPack? Ah yes:
https://github.com/pystorm/pystorm/issues/5

So maybe there's a C library comparable to pystorm? Or I can use this serializer to talk msgpack to a spawned C process?
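For what it's worth, a custom serializer along those lines could avoid base64 entirely by using binary framing. A toy sketch of length-prefixed framing (pystorm's msgpack serializer works in this spirit; here the payload is just raw bytes so the example stays dependency-free):

```python
import struct

# Sketch of the framing a custom topology.multilang.serializer could
# use instead of JSON lines: a 4-byte big-endian length prefix
# followed by the raw payload (msgpack-encoded in pystorm's case).
def frame(payload: bytes) -> bytes:
    return struct.pack(">I", len(payload)) + payload

def unframe(buf: bytes):
    (n,) = struct.unpack(">I", buf[:4])
    return buf[4:4 + n], buf[4 + n:]

blob = b"\x00binary\xff"
msg, rest = unframe(frame(blob) + b"trailing")
assert msg == blob and rest == b"trailing"
```

The same framing is trivial to read and write from C with `fread`/`fwrite`, which is part of the appeal.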


(2) Is there a practical maximum size to a tuple? In some cases we have chunks of around 50MB to pass from step to step. Is it reasonable to pass these directly? Or should they be written into some intermediate store like an NFS server?


(3) http://storm.apache.org/documentation/Multilang-protocol.html
"The shell bolt protocol is asynchronous. You will receive tuples on STDIN as soon as they are available"

So just to be clear: it's fine for me to write a multi-threaded external process which handles multiple overlapping requests?

Furthermore: if all the threads are busy, can I simply stop reading from stdin and let the sender block until I'm ready to receive more tuples?
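To illustrate what I have in mind, a sketch of the consumer side: a dedicated reader parses multilang messages (JSON body, then a line "end") into a bounded queue for worker threads. When all workers are busy the queue fills, `put()` blocks, stdin stops being drained, and the OS pipe buffer eventually makes the writing side block. Whether Storm's ShellBolt tolerates that pause indefinitely (e.g. its heartbeating) is exactly what I'm asking; this only shows the mechanism.

```python
import io
import json
import queue

# Reader loop for multilang framing: accumulate lines until "end",
# then parse the JSON body and hand it off.  With a bounded queue,
# put() blocking is what stops us reading from stdin -- crude but
# effective flow control on the consumer side.
def read_messages(stream, inbox: queue.Queue):
    body = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            inbox.put(json.loads("\n".join(body)))  # blocks when full
            body = []
        else:
            body.append(line)

inbox = queue.Queue(maxsize=16)
read_messages(io.StringIO('{"command": "next"}\nend\n{"id": "t1"}\nend\n'), inbox)
assert inbox.qsize() == 2
```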


I also have some general questions about the Storm architecture.

(4) http://storm.apache.org/documentation/Concepts.html

"Shuffle grouping: Tuples are randomly distributed across the bolt's tasks in a way such that each bolt is guaranteed to get an equal number of tuples."

Suppose the bolt's tasks are split across two servers, one of which is slower than the other. Does this mean that the slower server will be 100% utilised while the faster server has idle periods? Or is there some flow-control mechanism which kicks in and gives a larger share to the faster server?

Specifically I am thinking of:
- A heterogeneous cluster, where some servers are older and slower than others
- A cluster where one server happens to be busier than another (e.g. it is also working on a different topology)

Through googling I found topology.max.spout.pending, so I see there is an overall control of the number of in-flight (unacked) tuples, except for unreliable spouts:
http://stackoverflow.com/questions/24413088/storm-max-spout-pending

But other than that, will the shuffle grouping deal them out as fast as possible into the downstream bolts?
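My mental model of what that setting does, as a toy (this is not Storm's API, just the behaviour as I understand it from the docs and that SO answer): the spout simply isn't asked for more tuples while the un-acked window is full, and emission resumes as acks arrive. It only applies to reliable spouts that emit tuples with ids.

```python
# Toy model of topology.max.spout.pending: nextTuple is not called
# while max_pending un-acked tuple ids are in flight; acks (or fails)
# reopen the window.  Unreliable spouts bypass this entirely.
class ThrottledSpout:
    def __init__(self, max_pending: int):
        self.max_pending = max_pending
        self.pending = set()
        self._next = 0

    def try_emit(self):
        if len(self.pending) >= self.max_pending:
            return None              # throttled: wait for acks
        self._next += 1
        self.pending.add(self._next)
        return self._next

    def ack(self, tup_id):
        self.pending.discard(tup_id)

s = ThrottledSpout(max_pending=2)
assert s.try_emit() == 1 and s.try_emit() == 2
assert s.try_emit() is None          # window full
s.ack(1)
assert s.try_emit() == 3             # ack reopened the window
```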


(5) http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html

This says that a single thread (executor) can run multiple task instances of the same component.

How does that work? That is, if those multiple tasks are in the same thread, how do they run concurrently? Or if they can't run concurrently, what is the benefit of having multiple tasks in a thread instead of just one task?
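My current guess, sketched below (hypothetical classes, not Storm code): the tasks in one executor don't run concurrently at all; the executor dispatches tuples to its task instances serially. The benefit would then be that the task count is fixed for the topology's lifetime while the executor count can be changed with `storm rebalance`, so extra tasks are headroom for scaling the thread count up later. Is that right?

```python
# Sketch of one executor (thread) owning several task instances:
# tuples arrive addressed to a task id and are dispatched one at a
# time, in order -- the tasks never run concurrently within the
# executor.
class WordCountTask:                 # hypothetical component
    def __init__(self):
        self.seen = 0

    def execute(self, tup):
        self.seen += 1

def executor_loop(tasks, addressed):
    for task_id, tup in addressed:   # serial dispatch
        tasks[task_id].execute(tup)

tasks = {3: WordCountTask(), 4: WordCountTask()}
executor_loop(tasks, [(3, "a"), (4, "b"), (3, "c")])
assert tasks[3].seen == 2 and tasks[4].seen == 1
```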


(6) How does Storm distribute tasks over workers and servers? For example, suppose spout A connects to bolt B. I have two servers, and I run a topology with 2 workers, 4 tasks of A and 4 tasks of B. Will I get 4A on one server and 4B on the other, or 2A+2B on both, or something else?
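My guess, from reading about the default EvenScheduler, is that executors are dealt out round-robin across worker slots, which for this example would give a mix (roughly 2A+2B per worker) rather than all A on one server. A toy version of that dealing, to check my understanding (the real scheduler works on sorted executors and slots, so this is a simplification):

```python
# Toy round-robin placement in the spirit of Storm's default
# EvenScheduler: 4 A tasks and 4 B tasks dealt across 2 workers end
# up mixed, roughly 2A+2B per worker.
def assign(executors, workers):
    placement = {w: [] for w in workers}
    for i, e in enumerate(executors):
        placement[workers[i % len(workers)]].append(e)
    return placement

out = assign(["A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4"],
             ["worker-1", "worker-2"])
assert out["worker-1"] == ["A1", "A3", "B1", "B3"]
assert out["worker-2"] == ["A2", "A4", "B2", "B4"]
```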


Many thanks,

Brian Candler.
