Hello,
I have some questions about external workers and the multi-lang
protocol. We have a bunch of existing C code for running processing
steps over binary data and I'm looking to see how feasible it is to hook
it into Storm.
(1) Is it possible to handle binary data with multi-lang? Or is there
existing support for hooking C into Storm?
The multi-lang protocol is JSON, so that implies either base64-encoding
everything or passing round a URL to where the binary data is stored.
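To make that concrete, this is roughly what I imagine doing if I stay with JSON. The helper is my own invention, but the message layout follows the multilang docs:

```python
import base64
import json

def emit_binary(chunk, stream="default"):
    # My own workaround, not anything Storm provides: base64 the binary
    # payload so it can travel inside a JSON multilang "emit" message.
    msg = {
        "command": "emit",
        "stream": stream,
        "tuple": [base64.b64encode(chunk).decode("ascii")],
    }
    # Each multilang message is a JSON object followed by "end" on its own line.
    return json.dumps(msg) + "\nend\n"
```

For a 50MB chunk that is roughly 67MB of base64 text per hop, which is why I'd rather avoid this.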
But looking at the source I see that topology.multilang.serializer is
pluggable, so perhaps it's possible to make a version using (e.g.)
MessagePack? Ah yes:
https://github.com/pystorm/pystorm/issues/5
So maybe there's a C library comparable to pystorm? Or I can use this
serializer to talk msgpack to a spawned C process?
(2) Is there a practical maximum size to a tuple? In some cases we have
chunks of around 50MB to pass from step to step. Is it reasonable to
pass these directly? Or should they be written into some intermediate
store like an NFS server?
(3) http://storm.apache.org/documentation/Multilang-protocol.html
"The shell bolt protocol is asynchronous. You will receive tuples on
STDIN as soon as they are available"
So just to be clear: it's fine for me to write a multi-threaded external
process which handles multiple overlapping requests?
Furthermore: if all the threads are busy, can I simply stop reading from
stdin and let the sender block until I'm ready to receive more tuples?
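To make (3) concrete, here is the shape of external process I have in mind (a sketch, all names mine; I'm treating one line of stdin as one unit of work for simplicity, whereas real multilang messages span multiple lines):

```python
import queue
import threading

def run(infile, handle, workers=4, depth=8):
    # Bounded queue: when it is full, q.put() blocks, the reader loop
    # stops consuming stdin, the pipe buffer fills, and (I assume) the
    # ShellBolt on the Storm side blocks on write -- that blocking is
    # the flow control I am asking about.
    q = queue.Queue(maxsize=depth)

    def worker():
        while True:
            item = q.get()
            if item is None:        # shutdown marker
                return
            handle(item)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for line in infile:             # the only thread touching stdin
        q.put(line.rstrip("\n"))
    for _ in threads:
        q.put(None)
    for t in threads:
        t.join()
```

The real version would be `run(sys.stdin, handler)`; the point is that not reading stdin is my only back-pressure lever.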
I also have some general questions about the Storm architecture.
(4) http://storm.apache.org/documentation/Concepts.html
" Shuffle grouping: Tuples are randomly distributed across the bolt's
tasks in a way such that each bolt is guaranteed to get an equal number
of tuples."
Suppose the bolt's tasks are split across two servers, one of which is
slower than the other. Does this mean that the slower server will be
100% utilised while the faster server has idle periods? Or is there
some flow-control mechanism which kicks in and gives a larger share to
the faster server?
Specifically I am thinking of:
- A heterogeneous cluster, where some servers are older and slower than
others
- A cluster where one server happens to be busier than another (e.g. it
is also working on a different topology)
Through googling I found topology.max.spout.pending, so I see there is
an overall control of the number of in-flight (unacked) tuples, except
for unreliable spouts:
http://stackoverflow.com/questions/24413088/storm-max-spout-pending
But other than that, will the shuffle grouping deal tuples out as fast
as possible to the downstream bolts?
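For reference, this is the setting I mean, as I understand it would appear in a topology's conf; the value 1000 is just a placeholder:

```yaml
# My understanding: caps the number of un-acked tuples in flight, per spout task.
topology.max.spout.pending: 1000
```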
(5)
http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html
This says that a single thread (executor) can run multiple task
instances of the same component.
How does that work? That is, if those multiple tasks are in the same
thread, how do they run concurrently? Or if they can't run concurrently,
what is the benefit of having multiple tasks in a thread instead of just
one task?
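My naive mental model of (5) is that the executor thread just multiplexes its tasks serially, along the lines of (pure guesswork, all names mine):

```python
# Guesswork at what "multiple tasks per executor" might mean: one thread
# looping, handing each incoming tuple to whichever of its task instances
# the tuple was routed to -- serial, one tuple at a time, never parallel.
def executor_loop(tasks, incoming):
    """tasks: dict task_id -> bolt instance; incoming: (task_id, tuple) pairs."""
    for task_id, tup in incoming:
        tasks[task_id].execute(tup)
```

If that's right, the tasks never actually run concurrently -- so is the benefit just that the (fixed) task count can later be rebalanced across more executors?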
(6) How does Storm distribute tasks over workers and servers? For
example, suppose spout A connects to bolt B. I have two servers, and I
run a topology with 2 workers, 4 tasks of A and 4 tasks of B. Will I get
4A on one server and 4B on the other, or 2A+2B on both, or something else?
Many thanks,
Brian Candler.