The boundary is a bit flexible. In terms of the observed effective state of the DStream, the direct stream's semantics are exactly-once. In terms of what external systems observe (such as message emission), Spark Streaming's semantics are at-least-once.
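For example, here is a minimal sketch of that boundary, assuming the spark-streaming-kafka-0-10 direct stream API (the broker address, topic, and group id are made-up placeholders):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._

    val ssc = new StreamingContext(
      new SparkConf().setAppName("direct-stream-demo"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",           // assumed broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "demo-group",
      "enable.auto.commit" -> (false: java.lang.Boolean)) // manage offsets ourselves

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](List("events"), kafkaParams))

    stream.foreachRDD { rdd =>
      // Internally, a failed batch is recomputed from the same Kafka offset
      // range, so the DStream's effective state is exactly-once.
      rdd.foreachPartition { records =>
        records.foreach { r =>
          // Externally, this side effect runs again on recomputation, so a
          // downstream system can observe the same record more than once
          // (at-least-once) unless the write itself is idempotent.
          println(s"${r.topic}/${r.partition}/${r.offset}: ${r.value}")
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()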
Regards,
Piotr

On Mon, Dec 5, 2016 at 8:59 AM, Michal Šenkýř <bina...@gmail.com> wrote:
> Hello John,
>
> 1. If a task completes the operation, it will notify the driver. The
> driver may not receive the message due to the network and still think
> the task is running. Then the child stage won't be scheduled?
>
> Spark's fault tolerance policy is: if there is a problem processing a
> task, or an executor is lost, run the task (and any dependent tasks)
> again. Spark attempts to minimize the number of tasks it has to
> recompute, so usually only a small part of the data is recomputed.
>
> So in your case, the driver simply schedules the task on another
> executor and continues to the next stage when it receives the data.
>
> 2. How does Spark guarantee that the downstream task receives the
> shuffle data completely? In fact, I can't find a checksum for blocks in
> Spark. For example, the upstream task may shuffle 100 MB of data, but
> the downstream task may receive only 99 MB due to the network. Can
> Spark verify that the data was received completely based on size?
>
> Spark uses compression with checksumming for shuffle data, so it should
> know when the data is corrupt and initiate a recomputation.
>
> As for your question in the subject:
> All of this means that Spark supports at-least-once processing. There
> is no way that I know of to ensure exactly-once. You can try to
> minimize more-than-once situations by updating your offsets as soon as
> possible, but that does not eliminate the problem entirely.
>
> Hope this helps,
>
> Michal Senkyr
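A minimal sketch of that last suggestion (committing Kafka offsets immediately after each batch's output), again assuming the spark-streaming-kafka-0-10 direct stream; the println sink is a stand-in for a real output:

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.spark.streaming.dstream.InputDStream
    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    def writeThenCommit(stream: InputDStream[ConsumerRecord[String, String]]): Unit =
      stream.foreachRDD { rdd =>
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        // Write the batch out first: if the job dies between the write and
        // the commit, the batch is replayed (more-than-once), never lost.
        rdd.foreachPartition(_.foreach(r => println(r.value))) // stand-in sink
        // Commit to Kafka right after the output succeeds, shrinking the
        // window in which a failure causes reprocessing.
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }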