Yes, it can recover on a different node. It uses a write-ahead log, checkpoints
the offsets of both ingress and egress (e.g. using ZooKeeper and/or Kafka), and
relies on the streaming engine's deterministic operations, replaying a certain
range of data based on the checkpointed ingress offset (at
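The recovery idea above can be sketched in a few lines of plain Scala. This is an illustrative toy, not any engine's real API: `process`, `recover`, and the log/offset names are all hypothetical. The point is that, given a durable write-ahead log and a checkpointed offset, a deterministic operation replayed on any node reproduces exactly the lost output.

```scala
// Hypothetical sketch of replay-based recovery (names are illustrative,
// not a real streaming-engine API).
object ReplaySketch {
  // Deterministic per-record operation: same input always yields same output.
  def process(record: Int): Int = record * 2

  // On failure, a fresh node replays the durable log from the last
  // checkpointed ingress offset; determinism guarantees identical results.
  def recover(writeAheadLog: Vector[Int], checkpointedOffset: Int): Vector[Int] =
    writeAheadLog.drop(checkpointedOffset).map(process)

  def main(args: Array[String]): Unit = {
    val ingressLog = Vector(1, 2, 3, 4, 5) // durable write-ahead log
    val checkpointedOffset = 2             // last offset recorded (e.g. in ZooKeeper)
    println(recover(ingressLog, checkpointedOffset)) // Vector(6, 8, 10)
  }
}
```

Because `process` is pure, it does not matter which node performs the replay, which is what makes recovery on a *different* node possible.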
Unfortunately, there is not an obvious way to do this. I am guessing that
you want to partition your stream such that the same keys always go to the
same executor, right?
You could do it by writing a custom RDD. See ShuffledRDD.
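To make the "same keys always go to the same executor" idea concrete, here is a self-contained sketch (no Spark dependency) of the rule Spark's HashPartitioner applies: a key's partition is its `hashCode` modulo the partition count, forced non-negative. The `PartitionSketch` object and its names are hypothetical, written only to show why equal keys deterministically land in the same partition.

```scala
// Self-contained sketch of hash partitioning: equal keys always map to the
// same partition index, because the mapping is a pure function of the key.
object PartitionSketch {
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod // force a non-negative index
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq("user-1", "user-2", "user-1")
    // "user-1" appears twice and gets the same partition both times.
    println(keys.map(k => k -> partitionFor(k, 4)))
  }
}
```

Note that this only fixes the key-to-partition mapping; which physical executor hosts a given partition is a separate scheduling decision, which is why the thread above turns to ShuffledRDD and custom RDDs.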
If RDDs from the same DStream are not guaranteed to run on the same worker, then
the question becomes: is it possible to specify an unlimited duration in ssc to
get a continuous stream (as opposed to a discretized one)?
Say we have a per node streaming engine (built-in checkpoint and recovery);
we'd like to
What happens when a whole node running your "per node streaming engine
(built-in checkpoint and recovery)" fails? Can its checkpoint and recovery
mechanism handle whole-node failure? Can you recover from the checkpoint on
a different node?
Spark and Spark Streaming were designed with the idea