I understand that the DB writes are happening from the workers unless you
collect. My confusion is that you believe the workers recompute on recovery
("nodes computations which get redone upon recovery"). My understanding is
that checkpointing dumps the RDD to disk and then cuts the RDD lineage. So
I thought that on driver restart you'd get a set of new executor processes,
but they would read the last known state of the RDD from the HDFS
checkpoint. Am I off here?
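
For reference, this is roughly the recovery pattern I have in mind (just a
sketch; the app name, checkpoint path and 10-second batch interval are
made-up values):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // illustrative path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("checkpoint-recovery-sketch")
      val ssc = new StreamingContext(conf, Seconds(10))  // 10s batch interval
      ssc.checkpoint(checkpointDir)  // enable checkpointing to HDFS
      // ... define sources, transformations and output operations here ...
      ssc
    }

    // First run: builds a fresh context via createContext(). After a driver
    // crash and restart: rebuilds the context and its pending batches from
    // the checkpoint data in checkpointDir instead of calling createContext().
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()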

So the only situation I can imagine where you end up recomputing is if
you're checkpointing at a larger interval than your batch interval (i.e.
the RDD on disk does not reflect its last pre-crash state)?
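
Something like this is what I'm picturing, combined with the getOrCreate
pattern above (again just a sketch: the socket source, the 10s batch / 50s
checkpoint intervals and the println standing in for the Cassandra write
are all made up):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream ops

    val conf = new SparkConf().setAppName("checkpoint-interval-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))  // 10s batches
    ssc.checkpoint("hdfs:///tmp/streaming-checkpoint")

    val lines = ssc.socketTextStream("localhost", 9999)  // illustrative source

    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1L))
      .updateStateByKey[Long]((values: Seq[Long], state: Option[Long]) =>
        Some(values.sum + state.getOrElse(0L)))

    // Checkpoint the stateful RDDs only every 50s: on recovery, the batches
    // since the last completed checkpoint are recomputed from lineage.
    counts.checkpoint(Seconds(50))

    counts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // Worker-side output action: this is what would run again if the
        // batch gets recomputed after a driver restart.
        partition.foreach(record => println(record))  // stand-in for the DB write
      }
    }

    ssc.start()
    ssc.awaitTermination()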


On Thu, Aug 28, 2014 at 1:32 PM, RodrigoB <rodrigo.boav...@aspect.com>
wrote:

> Hi Yana,
>
> The fact is that the DB writing is happening at the node level, not
> centrally at the driver. One of the benefits of Spark's distributed
> nature is that it enables IO distribution as well. For example, it is
> much faster to have the nodes write to Cassandra than to collect
> everything at the driver and send the writes from there.
>
> The problem is the node computations that get redone upon recovery. If
> these lambda functions send events to other systems, those events would
> get resent upon re-computation, causing overall system instability.
>
> Hope this helps you understand the problem.
>
> tnks,
> Rod
>
>
>
