Re: how to implement and deploy robust streaming apps

2016-03-08 Thread Xinh Huynh
If you would like an overview of Spark Streaming and fault tolerance, these
slides are great (Slides 24+ focus on fault tolerance; Slide 52 is on
resilience to traffic spikes):
http://www.lightbend.com/blog/four-things-to-know-about-reliable-spark-streaming-typesafe-databricks

This recent Spark Summit talk is all about backpressure and dynamic
scaling:
https://spark-summit.org/east-2016/events/building-robust-scalable-and-adaptive-applications-on-spark-streaming/

From the Spark docs, backpressure works by placing a limit on the receiving
rate, and this limit is adjusted dynamically based on processing times. If
there is a burst and the data source generates events at a higher rate,
those extra events will get backed up in the data source. So, how much
buffering is available in the data source? For instance, Kafka can use HDFS
as a huge buffer, with capacity to buffer traffic spikes. Spark itself
doesn't handle the buffering of unprocessed events, so in some cases, Kafka
(or some other storage) is placed between the data source and Spark to
provide a buffer.
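
For reference, here is a minimal sketch of enabling backpressure in Scala
(the app name, batch interval, and maxRate value are placeholder choices;
the two settings themselves are standard Spark Streaming configuration,
available since Spark 1.5):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Enable dynamic backpressure: Spark adjusts the receiving rate
    // based on observed batch processing and scheduling delays.
    val conf = new SparkConf()
      .setAppName("BackpressureSketch")  // placeholder name
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional hard ceiling (records/sec per receiver) on top of
      // the dynamically computed rate.
      .set("spark.streaming.receiver.maxRate", "10000")

    val ssc = new StreamingContext(conf, Seconds(10))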

Xinh


On Mon, Mar 7, 2016 at 2:10 PM, Andy Davidson wrote:

> One of the challenges we need to prepare for with streaming apps is bursty
> data. Typically we need to estimate our worst-case data load and make sure
> we have enough capacity.
>
>
> It is not obvious what the best practices are with Spark Streaming.
>
>
>- we have implemented checkpointing as described in the programming guide
>- we use the standalone cluster manager and spark-submit
>- we use the mgmt console to kill drivers when needed
>- we plan to enable the write-ahead log and set
> spark.streaming.backpressure.enabled to true
>- our application runs a single unreliable receiver
>   - we run multiple instances configured to partition the input
>
>
> As long as our processing time is less than our window duration, everything
> is fine.
>
> In the streaming systems I have worked on in the past, we scaled out by
> using load balancers and proxy farms to create buffering capacity. It's not
> clear how to scale out Spark.
>
> In our limited testing it seems like we have a single app configured to
> receive a predefined portion of the data. Once it is started we cannot add
> additional resources. Adding cores and memory does not seem to increase our
> capacity.
>
>
> Kind regards
>
> Andy
>
>
>


how to implement and deploy robust streaming apps

2016-03-07 Thread Andy Davidson
One of the challenges we need to prepare for with streaming apps is bursty
data. Typically we need to estimate our worst-case data load and make sure
we have enough capacity.


It is not obvious what the best practices are with Spark Streaming.

* we have implemented checkpointing as described in the programming guide
* we use the standalone cluster manager and spark-submit
* we use the mgmt console to kill drivers when needed
* we plan to enable the write-ahead log and set
spark.streaming.backpressure.enabled to true (see the sketch after this list)
* our application runs a single unreliable receiver
  * we run multiple instances configured to partition the input
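
For concreteness, a minimal sketch of how these pieces fit together in Scala
(the checkpoint path, app name, and batch interval are placeholders):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-streaming-app"  // placeholder

    def createContext(): StreamingContext = {
      val conf = new SparkConf()
        .setAppName("RobustStreamingApp")  // placeholder
        .set("spark.streaming.backpressure.enabled", "true")
        // Write-ahead log: persists received data so it can be
        // replayed after a driver failure.
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)
      // ... define the DStream graph here ...
      ssc
    }

    // Recover from an existing checkpoint if present, else build fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()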

As long as our processing time is less than our window duration, everything
is fine.

In the streaming systems I have worked on in the past, we scaled out by using
load balancers and proxy farms to create buffering capacity. It's not clear
how to scale out Spark.

In our limited testing it seems like we have a single app configured to
receive a predefined portion of the data. Once it is started we cannot add
additional resources. Adding cores and memory does not seem to increase our
capacity.
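
For what it's worth, the usual way to scale the receiving side is to run
several receivers and union their streams; each receiver occupies one core,
so adding cores only helps if the receiver count (or the downstream
partition count) grows too. A rough sketch, meant to live inside
createContext() from the sketch above (the hosts, receiver count, and
partition count are hypothetical):

    // Hypothetical: four receivers, each reading one partition of the input.
    val numReceivers = 4
    val streams = (1 to numReceivers).map { i =>
      ssc.socketTextStream(s"ingest-host-$i", 9999)  // placeholder sources
    }
    // Union into a single DStream, then repartition so processing can
    // use more cores than there are receivers.
    val unioned = ssc.union(streams)
    val repartitioned = unioned.repartition(16)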


Kind regards

Andy