Storm fault tolerance benchmark

Dominik Safaric Sat, 13 Aug 2016 08:02:39 -0700

A few months ago, I started investigating, as part of an empirical research 
project, several stream processing engines, including but not limited to Storm.


As the benchmark should extend beyond performance metrics such as throughput 
and latency, I have focused on fault tolerance as well, in particular the rate 
of data items lost due to various faults. In this context, I have the following 
set of questions.

The Nimbus daemon is considered a single point of failure (SPOF), meaning that 
while Nimbus is down, the user cannot submit new topologies, and existing 
topologies cannot be activated, deactivated, or rebalanced. Regarding a failed 
Nimbus daemon, my questions are: 

If Nimbus has not yet been restarted and, during that period, a Worker's 
Supervisor fails while a bolt running under it is receiving high-volume data, 
will the tuples be lost? 

If a Worker goes down, the Supervisor restarts it on a different port. If no 
port is available, it is restarted on a different Worker. But assume two 
Workers exist: one goes down, and its computation is restarted on the other 
Worker, which has already reached the peak of its allocated resources. Will 
data items be dropped in this case because that Worker may not have sufficient 
resources for the underlying Executors? 

Thirdly, what other fault tolerance scenarios might result in data items being 
lost? 
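
To make the metric concrete, here is a minimal sketch of how the rate of data 
items lost could be computed. It assumes every data item carries a unique ID 
that is logged once at the source and again at the sink; the function name and 
both ID collections are hypothetical and not part of any Storm API. Duplicates 
at the sink (for example from replays) are ignored by comparing sets.

```python
def loss_rate(emitted_ids, received_ids):
    """Return the fraction of emitted items that never reached the sink.

    emitted_ids:  IDs logged at the source (hypothetical input).
    received_ids: IDs logged at the sink; may contain duplicates.
    """
    emitted = set(emitted_ids)
    if not emitted:
        raise ValueError("no items were emitted")
    # Items present at the source but never observed at the sink.
    lost = emitted - set(received_ids)
    return len(lost) / len(emitted)


if __name__ == "__main__":
    # Ten items emitted; items 8 and 9 never arrive, item 7 arrives twice.
    print(loss_rate(range(10), [0, 1, 2, 3, 4, 5, 6, 7, 7]))
```

Running this once per injected fault would give one loss-rate figure per 
scenario, which is what I would like to compare across engines.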

Thanks a lot in advance!
