Hi all,

First of all, thanks to Nathan and all the contributors for putting out such a
great framework! I am learning a lot just from reading the discussion
threads.

I am building a topology that contains one spout followed by a chain of
bolts (e.g. S -> A -> B, where S is the spout and A, B are bolts).

When S emits a tuple, the next bolt A will buffer the tuple in a DFS,
compute some aggregated values once it has received a sufficient amount of
data, and then emit the aggregation results to the next bolt B.

Here comes my question: is there a recommended way to store the
intermediate results emitted by a bolt, so that when a machine crashes the
results can be replayed to the downstream bolts (i.e., bolt B)?

One possible solution would be to not keep any intermediate results at
all, but rely on Storm's ack framework, so that the raw data is replayed
from spout S when a crash happens.

However, this approach might not be appropriate in my case: it might take
a pretty long time (a couple of hours) before bolt A has received all the
required data and emits the aggregated results, so it would be very
expensive for the ack framework to keep tracking that many tuples for that
long.

An alternative solution could be *making bolt A also a spout* and keeping
the emitted data in a queue on the DFS. When a result has been acked, bolt A
removes it from the queue.
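The ack-and-remove bookkeeping I have in mind would look roughly like this (plain Java standard library; the class and method names are my own invention). In a real implementation the queue would be persisted to the DFS, and next/ack/fail would correspond to the spout-side nextTuple/ack/fail callbacks:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Sketch of the "bolt A as spout" idea: aggregated results go into a
// durable queue (in memory here; on the DFS in practice) and are only
// forgotten once downstream processing acks them. On failure the
// result is put back at the front so it gets replayed.
class PendingResultQueue<T> {
    private final Deque<T> queue = new ArrayDeque<>();
    private final Map<Long, T> inFlight = new HashMap<>();
    private long nextId = 0;

    /** Bolt A enqueues each aggregated result after computing it. */
    void offer(T result) {
        queue.addLast(result);
    }

    /** Analogous to nextTuple(): emit the next result, tracked by id. */
    Map.Entry<Long, T> next() {
        T result = queue.pollFirst();
        if (result == null) {
            return null;                    // nothing pending
        }
        long id = nextId++;
        inFlight.put(id, result);
        return Map.entry(id, result);
    }

    /** Ack: the result was processed downstream; forget it for good. */
    void ack(long id) {
        inFlight.remove(id);
    }

    /** Fail: put the result back so it is replayed. */
    void fail(long id) {
        T result = inFlight.remove(id);
        if (result != null) {
            queue.addFirst(result);
        }
    }
}
```

The point of the design is that only the (few) aggregated results are tracked for replay, not the hours' worth of raw tuples behind them.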

I am wondering whether it is reasonable to make a task both a bolt and a
spout at the same time, or whether there is a better approach.

Thank you!

--
Cheng-Kang Hsieh
UCLA Computer Science PhD Student
M: (310) 990-4297
A: 3770 Keystone Ave. Apt 402,
     Los Angeles, CA 90034
