These are both really good posts: you should try to get them into the
documentation.
With anything implementing dynamic behavior, there are some fun problems:
(a) detecting the delays in the workflow. There are some good ideas here
(b) deciding where to address it. That means you need to monitor the
If I want to restart my consumers into an updated cluster topology after
the cluster has been expanded or contracted, would I need to call stop() on
them, then call start() on them, or would I need to instantiate and start
new context objects (new JavaStreamingContext(...)) ? I'm thinking of
Let me try to add some clarity to the different thought directions going
on in this thread.
1. HOW TO DETECT THE NEED FOR MORE CLUSTER RESOURCES?
If there are no rate limits set up, the most reliable way to detect
whether the current Spark cluster is insufficient to handle the data
Yes, Tathagata, thank you.
For #1, the 'need detection', one idea we're entertaining is timestamping
the messages coming into the Kafka topics. The consumers would check the
interval between the time they get the message and that message origination
timestamp. As Kafka topics start to fill up
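The timestamp-lag idea above can be sketched in a few lines. This is a plain-Python illustration of the logic, not a Kafka or Spark API: the function names, the averaging window, and the threshold are all assumptions for illustration.

```python
import time

# Hypothetical helper: compute consumer lag from a message's origination
# timestamp (the thread's idea of timestamping messages as they enter
# the Kafka topic).
def message_lag_seconds(origination_ts, now=None):
    """Seconds between when a message entered the topic and now."""
    if now is None:
        now = time.time()
    return max(0.0, now - origination_ts)

def needs_more_capacity(recent_lags, threshold_s):
    """Flag the cluster as under-provisioned when the average lag over a
    recent window of messages exceeds a configurable threshold."""
    if not recent_lags:
        return False
    return sum(recent_lags) / len(recent_lags) > threshold_s
```

A growing average lag means the consumers are falling behind the producers, which is exactly the "need detection" signal being discussed.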
At which point would I call cache()? I just want the runtime to spill to
disk when necessary, without me having to know when it's necessary.
On Thu, Jun 4, 2015 at 9:42 AM, Cody Koeninger c...@koeninger.org wrote:
direct stream isn't a receiver, it isn't required to cache data anywhere
set the storage policy for the DStream RDDs to MEMORY_AND_DISK - it
appears the storage level can be specified in the createStream methods but
not createDirectStream...
On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov evo.efti...@isecc.com wrote:
You can also try Dynamic Resource Allocation
direct stream isn't a receiver, it isn't required to cache data anywhere
unless you want it to.
If you want it, just call cache.
On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
set the storage policy for the DStream RDDs to MEMORY_AND_DISK - it
appears the
Great.
You should monitor vital performance / job clogging stats of the Spark
Streaming Runtime not “kafka topics” -- anything specific you were thinking
of?
On Wed, Jun 3, 2015 at 11:49 AM, Evo Eftimov evo.efti...@isecc.com wrote:
Makes sense especially if you have a cloud with “infinite”
If we have a hand-off between the older consumer and the newer consumer, I
wonder if we need to manually manage the offsets in Kafka so as not to miss
some messages as the hand-off is happening.
Or if we let the new consumer run for a bit then let the old consumer know
the 'new guy is in town'
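The hand-off concern above can be sketched as bookkeeping: the old consumer commits its final position before the new one starts reading from that position. The `OffsetStore` class stands in for wherever committed offsets actually live (ZooKeeper or Kafka, per the rest of the thread); the class and method names here are illustrative, not a real Kafka client API.

```python
# Minimal sketch of an offset hand-off between an old and a new consumer.
class OffsetStore:
    """Stand-in for the external store holding committed offsets."""
    def __init__(self):
        self._committed = {}  # (topic, partition) -> next offset to read

    def commit(self, topic, partition, offset):
        self._committed[(topic, partition)] = offset

    def position(self, topic, partition):
        return self._committed.get((topic, partition), 0)

def hand_off(store, topic, partition, old_consumer_position):
    """Old consumer drains what it has and commits its position; only
    then does the new consumer read the committed position and start
    there -- so no message is skipped during the hand-off."""
    store.commit(topic, partition, old_consumer_position)
    return store.position(topic, partition)  # new consumer starts here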
Would it be possible to implement Spark autoscaling somewhat along these
lines? --
1. If we sense that a new machine is needed, by watching the data load in
Kafka topic(s), then
2. Provision a new machine via a Provisioner interface (e.g. talk to AWS
and get a machine);
3. Create a shadow/mirror
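The numbered steps above can be sketched against a hypothetical Provisioner interface. Everything here is an assumption for illustration: the interface, the backlog metric, and the high-water threshold; a real implementation would back `add_machine` with an AWS (or similar) API call.

```python
# Step 2's Provisioner interface, as a hypothetical abstract hook.
class Provisioner:
    """Acquire/release machines; concrete subclasses talk to a cloud API."""
    def add_machine(self):
        raise NotImplementedError
    def remove_machine(self):
        raise NotImplementedError

class FakeProvisioner(Provisioner):
    """In-memory stand-in for testing the decision logic."""
    def __init__(self):
        self.machines = 0
    def add_machine(self):
        self.machines += 1
    def remove_machine(self):
        self.machines -= 1

def autoscale_step(topic_backlog, high_water, provisioner):
    """Step 1: watch the data load in the Kafka topic(s);
    step 2: provision a new machine when the backlog is too high."""
    if topic_backlog > high_water:
        provisioner.add_machine()
        return True
    return False
```

Step 3 (bringing the new node into the running cluster) is the part the rest of the thread shows to be the hard one, since it requires restarting or re-pointing the streaming context.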
Makes sense especially if you have a cloud with “infinite” resources / nodes
which allows you to double, triple etc in the background/parallel the resources
of the currently running cluster
I was thinking more about the scenario where you have e.g. 100 boxes and want
to / can add e.g. 20
I think what we'd want to do is track the ingestion rate in the consumer(s)
via Spark's aggregation functions and such. If we're at a critical level
(load too high / load too low) then we issue a request into our
Provisioning Component to add/remove machines. Once it comes back with an
OK, each
Which would imply that if there was a load manager type of service, it
could signal to the driver(s) that they need to acquiesce, i.e. process
what's at hand and terminate. Then bring up a new machine, then restart
the driver(s)... Same deal with removing machines from the cluster. Send a
signal
Hi all,
As the author of the dynamic allocation feature I can offer a few insights
here.
Gerard's explanation was both correct and concise: dynamic allocation is
not intended to be used in Spark streaming at the moment (1.4 or before).
This is because of two things:
(1) Number of receivers is
Hi,
tl;dr At the moment (with a BIG disclaimer *) elastic scaling of spark
streaming processes is not supported.
*Longer version.*
I assume that you are talking about Spark Streaming, as the discussion is
about handling Kafka streaming data.
Then you have two things to consider: the Streaming
Thank you, Gerard.
We're looking at the receiver-less setup with Kafka Spark streaming so I'm
not sure how to apply your comments to that case (not that we have to use
receiver-less but it seems to offer some advantages over the
receiver-based).
As far as the number of Kafka receivers is fixed
@DG; The key metrics should be
- Scheduling delay – its ideal state is to remain constant over time
and ideally be less than the time of the microbatch window
- The average job processing time should remain less than the
micro-batch window
- Number of Lost Jobs
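The key-metrics checklist above can be folded into one health check. The field names are illustrative, not a real Spark API; the actual values would come from Spark's StreamingListener / metrics system.

```python
# Health check over the thread's criteria: scheduling delay and average
# job processing time both below the micro-batch window, and no lost jobs.
def streaming_healthy(metrics, batch_window_s):
    return (metrics["scheduling_delay_s"] < batch_window_s
            and metrics["avg_processing_time_s"] < batch_window_s
            and metrics["lost_jobs"] == 0)
```

A scheduling delay that grows batch over batch, rather than holding constant, is the classic sign that the cluster can no longer keep up.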
Thanks, Evo. Per the last part of your comment, it sounds like we will
need to implement a job manager which will be in control of starting the
jobs, monitoring the status of the Kafka topic(s), shutting jobs down and
marking them as ones to relaunch, scaling the cluster up/down by
You can always spin up new boxes in the background and bring them into the cluster
fold when fully operational, and time that with job relaunch and param change
Kafka offsets are managed automatically for you by the Kafka clients, which keep
them in ZooKeeper - don't worry about that as long as you
Evo, good points.
On the dynamic resource allocation, I'm surmising this only works within a
particular cluster setup. So it improves the usage of current cluster
resources but it doesn't make the cluster itself elastic. At least, that's
my understanding.
Memory + disk would be good and
Probably you should ALWAYS keep the RDD storage policy at MEMORY_AND_DISK – it
will be your insurance policy against sys crashes due to memory leaks. As long as
there is free RAM, Spark Streaming (Spark) will NOT resort to disk – and of
course resorting to disk from time to time (i.e. when there is no
bq. detect the presence of a new node and start utilizing it
My understanding is that Spark is concerned with managing executors.
Whether request for an executor is fulfilled on an existing node or a new
node is up to the underlying cluster manager (YARN e.g.).
Assuming the cluster is single
Did you check https://drive.google.com/file/d/0B7tmGAdbfMI2OXl6azYySk5iTGM/edit
and
http://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation
Not sure if the spark kafka receiver emits metrics on the lag, check this link
out
Hi,
I'm trying to understand if there are design patterns for autoscaling Spark
(add/remove slave machines to the cluster) based on the throughput.
Assuming we can throttle Spark consumers, the respective Kafka topics we
stream data from would start growing. What are some of the ways to