If I want to restart my consumers against an updated cluster topology after
the cluster has been expanded or contracted, would I need to call stop() on
them and then start() on them again, or would I need to instantiate and start
new context objects (new JavaStreamingContext(...))? I'm thinking of
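For reference, a stopped StreamingContext cannot be started again, so the pattern would be the latter: stop the old context and build a fresh one. A rough Java sketch (the app name and batch interval are made up):

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;

    // Stop the old context gracefully so in-flight batches finish, and tear down
    // its SparkContext since the executors will move to the new topology.
    oldJssc.stop(true /* stopSparkContext */, true /* stopGracefully */);

    // A stopped StreamingContext cannot be start()-ed again, so create a new one
    // and re-wire the same DStream logic against it.
    SparkConf conf = new SparkConf().setAppName("my-streaming-app");  // assumed app name
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(2));
    // ... recreate the Kafka stream and transformations here ...
    jssc.start();
    jssc.awaitTermination();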
Let me try to add some clarity to the different directions of thought going
on in this thread.
1. HOW TO DETECT THE NEED FOR MORE CLUSTER RESOURCES?
If no rate limits are set up, the most reliable way to detect
whether the current Spark cluster is insufficient to handle the data
Yes, Tathagata, thank you.
For #1, the 'need detection', one idea we're entertaining is timestamping
the messages coming into the Kafka topics. The consumers would check the
interval between the time they receive a message and that message's origination
timestamp. As Kafka topics start to fill up
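A rough Java sketch of that check, assuming a direct stream named messages whose values carry the origination time as a leading epoch-millis field (the "ts|payload" layout is made up):

    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;

    // messages is the JavaPairInputDStream<String, String> from createDirectStream;
    // each value is assumed to look like "<originTimestampMillis>|<payload>".
    JavaDStream<Long> lagMillis = messages.map(record ->
        System.currentTimeMillis() - Long.parseLong(record._2().split("\\|", 2)[0]));

    // Worst-case lag per batch; if this keeps growing, the consumers are falling behind.
    JavaDStream<Long> maxLagPerBatch = lagMillis.reduce((a, b) -> Math.max(a, b));
    maxLagPerBatch.print();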
At which point would I call cache()? I just want the runtime to spill to
disk when necessary, without me having to know when that is.
On Thu, Jun 4, 2015 at 9:42 AM, Cody Koeninger c...@koeninger.org wrote:
The direct stream isn't a receiver; it isn't required to cache data anywhere
set the storage policy for the DStream RDDs to MEMORY_AND_DISK - it
appears the storage level can be specified in the createStream methods but
not createDirectStream...
On Thu, May 28, 2015 at 9:05 AM, Evo Eftimov evo.efti...@isecc.com wrote:
You can also try Dynamic Resource Allocation
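For reference, the knobs for that are roughly as follows (the executor counts are made up, and note Andrew's point later in the thread that dynamic allocation isn't really intended for streaming yet):

    import org.apache.spark.SparkConf;

    // Dynamic allocation needs the external shuffle service on the workers;
    // the min/max bounds below are placeholders.
    SparkConf conf = new SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "2")
        .set("spark.dynamicAllocation.maxExecutors", "20");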
The direct stream isn't a receiver; it isn't required to cache data anywhere
unless you want it to.
If you want it, just call cache().
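A minimal Java sketch of that against the direct stream (the broker address and topic name are made up):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;
    import kafka.serializer.StringDecoder;
    import org.apache.spark.api.java.StorageLevels;
    import org.apache.spark.streaming.api.java.JavaPairInputDStream;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    Map<String, String> kafkaParams = new HashMap<>();
    kafkaParams.put("metadata.broker.list", "broker1:9092");        // assumed broker
    Set<String> topics = new HashSet<>(Arrays.asList("mytopic"));   // assumed topic

    JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
        jssc, String.class, String.class, StringDecoder.class, StringDecoder.class,
        kafkaParams, topics);

    // createDirectStream takes no storage level because nothing is cached by default;
    // persist() opts in, and MEMORY_AND_DISK spills to disk only when memory runs short.
    messages.persist(StorageLevels.MEMORY_AND_DISK);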
On Thu, Jun 4, 2015 at 8:20 AM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:
set the storage policy for the DStream RDDs to MEMORY_AND_DISK - it
appears the
Great.
You should monitor vital performance / job clogging stats of the Spark
Streaming Runtime, not “kafka topics” -- anything specific you were thinking
of?
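One option: the batch-level stats the streaming UI shows (scheduling delay, processing time) can also be watched programmatically with a StreamingListener. A rough Java sketch against the 1.4 API (the 5-second threshold is made up, and the exact set of callbacks to override depends on your Spark version):

    import org.apache.spark.streaming.scheduler.*;

    // A steadily growing scheduling delay means batches queue up faster than they finish.
    class DelayMonitor implements StreamingListener {
        @Override public void onBatchCompleted(StreamingListenerBatchCompleted batch) {
            scala.Option<Object> delay = batch.batchInfo().schedulingDelay();
            if (delay.isDefined() && (Long) delay.get() > 5000L) {   // threshold is made up
                System.err.println("Falling behind: scheduling delay " + delay.get() + " ms");
            }
        }
        // The remaining callbacks are not needed for this check.
        @Override public void onBatchSubmitted(StreamingListenerBatchSubmitted b) {}
        @Override public void onBatchStarted(StreamingListenerBatchStarted b) {}
        @Override public void onReceiverStarted(StreamingListenerReceiverStarted r) {}
        @Override public void onReceiverError(StreamingListenerReceiverError r) {}
        @Override public void onReceiverStopped(StreamingListenerReceiverStopped r) {}
    }

    jssc.ssc().addStreamingListener(new DelayMonitor());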
On Wed, Jun 3, 2015 at 11:49 AM, Evo Eftimov evo.efti...@isecc.com wrote:
Makes sense especially if you have a cloud with “infinite”
If we have a hand-off between the older consumer and the newer consumer, I
wonder if we need to manually manage the offsets in Kafka so as not to miss
some messages as the hand-off is happening.
Or if we let the new consumer run for a bit then let the old consumer know
the 'new guy is in town'
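With the direct stream this is doable, since each batch exposes the exact Kafka offsets it covers. A rough Java sketch (saveOffset is a hypothetical helper; where the offsets go, e.g. ZooKeeper or a DB, is up to you):

    import org.apache.spark.streaming.kafka.HasOffsetRanges;
    import org.apache.spark.streaming.kafka.OffsetRange;

    // Must be called on the stream returned by createDirectStream, before any
    // transformations, so the underlying RDD still carries its offset ranges.
    messages.foreachRDD(rdd -> {
        OffsetRange[] ranges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
        for (OffsetRange r : ranges) {
            // Persist the high-water mark per partition somewhere durable.
            saveOffset(r.topic(), r.partition(), r.untilOffset());   // hypothetical helper
        }
        return null;   // Spark 1.4's Java foreachRDD expects a Void return
    });

The new consumer could then feed those stored offsets into the createDirectStream overload that takes fromOffsets, so the hand-off neither skips nor double-processes messages.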
Would it be possible to implement Spark autoscaling somewhat along these
lines? --
1. If we sense that a new machine is needed, by watching the data load in
Kafka topic(s), then
2. Provision a new machine via a Provisioner interface (e.g. talk to AWS
and get a machine; a rough sketch of such an interface follows below);
3. Create a shadow/mirror
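The Provisioner piece from point 2 might be nothing more than a small seam that cloud-specific code plugs into. Purely hypothetical, not an existing Spark or AWS API:

    // Hypothetical abstraction over the cloud provider; an AWS-backed implementation
    // would call EC2 under the hood, a bare-metal one would do something else.
    public interface Provisioner {
        /** Ask for a new worker; returns its host name once it is reachable. */
        String provisionWorker();

        /** Hand a worker back to the provider. */
        void decommissionWorker(String host);
    }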
Makes sense, especially if you have a cloud with “infinite” resources / nodes,
which allows you to double, triple, etc. the resources of the currently running
cluster in the background / in parallel
I was thinking more about the scenario where you have e.g. 100 boxes and want
to / can add e.g. 20
I think what we'd want to do is track the ingestion rate in the consumer(s)
via Spark's aggregation functions and such. If we're at a critical level
(load too high / load too low) then we issue a request into our
Provisioning Component to add/remove machines. Once it comes back with an
OK, each
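Concretely, that tracking could be as simple as counting records per batch on the driver. A rough sketch (provisioningClient, the thresholds, and BATCH_INTERVAL_SECONDS are all hypothetical):

    // Runs on the driver once per batch; rdd.count() is the only distributed part.
    messages.foreachRDD(rdd -> {
        double recordsPerSec = rdd.count() / (double) BATCH_INTERVAL_SECONDS;
        if (recordsPerSec > SCALE_UP_THRESHOLD) {            // thresholds are made up
            provisioningClient.requestAdditionalNode();       // hypothetical client
        } else if (recordsPerSec < SCALE_DOWN_THRESHOLD) {
            provisioningClient.requestNodeRemoval();
        }
        return null;   // Spark 1.4's Java foreachRDD expects a Void return
    });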
Which would imply that if there was a load manager type of service, it
could signal to the driver(s) that they need to acquiesce, i.e. process
what's at hand and terminate. Then bring up a new machine, then restart
the driver(s)... Same deal with removing machines from the cluster. Send a
signal
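One way to wire up that "acquiesce" signal on the driver side, using a marker file purely as an example of the external signal:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    jssc.start();
    // Poll every 10 seconds until the context terminates; when the external signal
    // appears, stop gracefully so the batches already received are fully processed.
    while (!jssc.awaitTerminationOrTimeout(10000)) {
        if (Files.exists(Paths.get("/tmp/stop-streaming"))) {   // assumed signal mechanism
            jssc.stop(true /* stopSparkContext */, true /* stopGracefully */);
        }
    }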
Hi all,
As the author of the dynamic allocation feature I can offer a few insights
here.
Gerard's explanation was both correct and concise: dynamic allocation is
not intended to be used in Spark streaming at the moment (1.4 or before).
This is because of two things:
(1) Number of receivers is
Evo, good points.
On the dynamic resource allocation, I'm surmising this only works within a
particular cluster setup. So it improves the usage of current cluster
resources but it doesn't make the cluster itself elastic. At least, that's
my understanding.
Memory + disk would be good and
Probably you should ALWAYS keep the RDD storage policy at MEMORY_AND_DISK – it
will be your insurance policy against system crashes due to memory leaks. As long
as there is free RAM, Spark Streaming (Spark) will NOT resort to disk – and of
course resorting to disk from time to time (i.e. when there is no