Hello,
I have some questions about external workers and the multi-lang
protocol. We have a bunch of existing C code for running processing
steps over binary data and I'm looking to see how feasible it is to hook
it into Storm.
(1) Is it possible to handle binary data with multi-lang? Or is there
existing support for hooking C into Storm?
The multi-lang protocol is JSON, so that implies either base64-encoding
everything or passing round a URL to where the binary data is stored.
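To make that concrete, this is roughly what I imagine doing if I stay with JSON. The helper is my own invention, but the message layout follows the multilang docs:

```python
import base64
import json

def emit_binary(chunk, stream="default"):
    # My own workaround, not anything Storm provides: base64 the binary
    # payload so it can travel inside a JSON multilang "emit" message.
    msg = {
        "command": "emit",
        "stream": stream,
        "tuple": [base64.b64encode(chunk).decode("ascii")],
    }
    # Each multilang message is a JSON object followed by "end" on its own line.
    return json.dumps(msg) + "\nend\n"
```

For a 50MB chunk that is roughly 67MB of base64 text per hop, which is why I'd rather avoid this.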
But looking at the source I see that topology.multilang.serializer is
pluggable, so perhaps it's possible to make a version using (e.g.)
MessagePack? Ah yes:
https://github.com/pystorm/pystorm/issues/5
So maybe there's a C library comparable to pystorm? Or I can use this
serializer to talk msgpack to a spawned C process?
(2) Is there a practical maximum size to a tuple? In some cases we have
chunks of around 50MB to pass from step to step. Is it reasonable to
pass these directly? Or should they be written into some intermediate
store like an NFS server?
(3) http://storm.apache.org/documentation/Multilang-protocol.html
"The shell bolt protocol is asynchronous. You will receive tuples on
STDIN as soon as they are available"
So just to be clear: it's fine for me to write a multi-threaded external
process which handles multiple overlapping requests?
Furthermore: if all the threads are busy, can I simply stop reading from
stdin and let the sender block until I'm ready to receive more tuples?
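To make (3) concrete, here is the shape of external process I have in mind (a sketch, all names mine; I'm treating one line of stdin as one unit of work for simplicity, whereas real multilang messages span multiple lines):

```python
import queue
import threading

def run(infile, handle, workers=4, depth=8):
    # Bounded queue: when it is full, q.put() blocks, the reader loop
    # stops consuming stdin, the pipe buffer fills, and (I assume) the
    # ShellBolt on the Storm side blocks on write -- that blocking is
    # the flow control I am asking about.
    q = queue.Queue(maxsize=depth)

    def worker():
        while True:
            item = q.get()
            if item is None:        # shutdown marker
                return
            handle(item)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for line in infile:             # the only thread touching stdin
        q.put(line.rstrip("\n"))
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
```

The real version would be `run(sys.stdin, handler)`; the point is that not reading stdin is my only back-pressure lever.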
I also have some general questions about the Storm architecture.
(4) http://storm.apache.org/documentation/Concepts.html
" Shuffle grouping: Tuples are randomly distributed across the bolt's
tasks in a way such that each bolt is guaranteed to get an equal number
of tuples."
Suppose the bolt's tasks are split across two servers, one of which is
slower than the other. Does this mean that the slower server will be
100% utilised while the faster server has idle periods? Or is there
some flow-control mechanism which kicks in and gives a larger share to
the faster server?
Specifically I am thinking of:
- A heterogeneous cluster, where some servers are older and slower than
others
- A cluster where one server happens to be busier than another (e.g. it
is also working on a different topology)
Through googling I found topology.max.spout.pending, so I see there is
an overall control of the number of in-flight (unacked) tuples, except
for unreliable spouts:
http://stackoverflow.com/questions/24413088/storm-max-spout-pending
But other than that, will the shuffle grouping deal tuples out as fast
as possible to the downstream bolts?
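For reference, this is the setting I mean, as I understand it would appear in a topology's conf; the value 1000 is just a placeholder:

```yaml
# My understanding: caps the number of un-acked tuples in flight, per spout task.
topology.max.spout.pending: 1000
```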
(5)
http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
This says that a single thread (executor) can run multiple task
instances of the same component.
How does that work? That is, if those multiple tasks are in the same
thread, how do they run concurrently? Or if they can't run concurrently,
what is the benefit of having multiple tasks in a thread instead of just
one task?
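My naive mental model of (5) is that the executor thread just multiplexes its tasks serially, along the lines of (pure guesswork, all names mine):

```python
# Guesswork at what "multiple tasks per executor" might mean: one thread
# looping, handing each incoming tuple to whichever of its task instances
# the tuple was routed to -- serial, one tuple at a time, never parallel.
def executor_loop(tasks, incoming):
    """tasks: dict task_id -> bolt instance; incoming: (task_id, tuple) pairs."""
    for task_id, tup in incoming:
        tasks[task_id].execute(tup)
```

If that's right, the tasks never actually run concurrently -- so is the benefit just that the (fixed) task count can later be rebalanced across more executors?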
(6) How does Storm distribute tasks over workers and servers? For
example, suppose spout A connects to bolt B. I have two servers, and I
run a topology with 2 workers, 4 tasks of A and 4 tasks of B. Will I get
4A on one server and 4B on the other, or 2A+2B on both, or something else?
Many thanks,
Brian Candler.