Re: Storm fault tolerance benchmark

Dominik Safaric Sat, 13 Aug 2016 11:50:09 -0700

Hi Andrew,

> Dominik -- re: your first part on the Nimbus as a SPOF, you may want to look 
> into Storm 1.0, which now has a highly-available (HA) nimbus as an option. 
> It's described in technical detail here 
> <http://storm.apache.org/releases/current/nimbus-ha-design.html>.



I’ve read the initial design proposal prior to release only.  

> Re: your second part, if I understand it correctly, what you're really asking 
> about is processing capacity. Most production Storm clusters are 
> over-provisioned a bit,


Indeed Storm clusters are over-provisioned due to the homogeneity assumption.

Besides the mentioned and asked in the initial posting, based on your 
experience are there any other significant fault tolerance limitations of Storm 
that could cause data being dropped? Besides back-pressure and its fast-fail 
strategy for example.  

Dominik

> On 13 Aug 2016, at 19:47, Andrew Montalenti <[email protected]> wrote:
> 
> Dominik -- re: your first part on the Nimbus as a SPOF, you may want to look 
> into Storm 1.0, which now has a highly-available (HA) nimbus as an option. 
> It's described in technical detail here 
> <http://storm.apache.org/releases/current/nimbus-ha-design.html>. As a result 
> of this, there is probably going to be a difference in your analysis for 
> 0.9.x vs 1.x / 2.x
> 
> Re: your second part, if I understand it correctly, what you're really asking 
> about is processing capacity. Most production Storm clusters are 
> over-provisioned a bit, not just to handle load spikes, but also to support 
> failover for Storm worker nodes. You're probably right that if your Storm 
> cluster were at exactly capacity and one worker died, thus causing the 
> cluster to not have enough resources to keep up with the stream, then you 
> might start to hit backpressure limits and drop tuples to the floor (fail 
> them at the spout with timeouts and such). But this is similar to most other 
> distributed systems that need to have "extra resources available" to handle 
> partial node or network failure correctly.
> 
> Does that answer your question?
> 
> --
> Andrew Montalenti | CTO, Parse.ly
> 
> On Sat, Aug 13, 2016 at 11:02 AM, Dominik Safaric <[email protected] 
> <mailto:[email protected]>> wrote:
> A few months ago, I've started investigating part of an empirical research 
> several stream processing engines, including but not limited to Storm. 
> 
> As the benchmark should extend the scope further from performance metrics 
> such as throughput and latency, I've focused onto fault tolerance as well. In 
> particular, the rate of data items lost due to various faults. In this 
> context, I have the following set of questions.
> 
> The Nimbus daemon is considered as a single point of failure (SPOF). Meaning 
> that having the Nimbus down, the use is unable to submit new parts of a 
> topology, parts of existing topologies cannot be activated/deactivated nor 
> rebalanced. In regard to a failed Nimbus daemon, the questions of mine are: 
> 
> If the Nimbus has not been restarted yet, while during that period a Worker’s 
> Supervisor fails and the bolt supervising it receives high volume throughput 
> data, will the data tuples get lost? 
> If a Worker gets down, the Supervisor shall restart it on a different port. 
> If no port is available, it will be restarted onto a different Worker. But, 
> assuming two Workers exist - one gets down, and the computation is being 
> restarted on another Worker already reaching a peak in the resources 
> allocated? Will in this case data items be dropped because the Worker might 
> not have sufficiently enough resources for the underlying Executors? 
> 
> Thirdly, what other fault tolerance scenarios might result to data items 
> being lost? 
> 
> Thanks a lot in advance! 
>

Re: Storm fault tolerance benchmark

Reply via email to