To achieve exactly-once semantics with foreachBatch in Structured Streaming, you must have checkpointing enabled. On any exception or failure the Structured Streaming job is restarted and the same batchId is reprocessed (this holds for any data source). To avoid duplicate output, you need an external system that records which batchIds have already been written, so you can deduplicate on replay.
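As a rough sketch of that dedup pattern (reusing the streamingDF from your snippet; the checkpoint path, output paths, and the alreadyCommitted/markCommitted helpers are placeholders for whatever external store you use, not a real API):

    import org.apache.spark.sql.DataFrame

    streamingDF.writeStream
      .option("checkpointLocation", "/checkpoints/my-query")  // required for recovery
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        // Option 1: make the write idempotent by keying the output on batchId,
        // so a replayed batch overwrites the same directory instead of appending twice.
        batchDF.write
          .mode("overwrite")
          .parquet(s"/output/location1/batch_id=$batchId")

        // Option 2: skip batches that an external store has already marked committed.
        // alreadyCommitted/markCommitted are hypothetical helpers backed by e.g. a
        // database table keyed on (queryId, batchId).
        if (!alreadyCommitted(batchId)) {
          batchDF.write.mode("append").parquet("/output/location2")
          markCommitted(batchId)
        }
      }
      .start()

Either way the point is the same: the batchId is stable across restarts, so the sink (or an external store) can use it to turn the at-least-once replay into an effectively exactly-once write.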
On Wed, Oct 14, 2020 at 12:11 AM German Schiavon <gschiavonsp...@gmail.com> wrote:
> Hi!
>
> In the documentation it says:
>
> - By default, foreachBatch provides only at-least-once write guarantees.
>   However, you can use the batchId provided to the function as way to
>   deduplicate the output and get an exactly-once guarantee.
>
> Taking the example snippet:
>
> streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
>   batchDF.persist()
>   batchDF.write.format(...).save(...)  // location 1
>   batchDF.write.format(...).save(...)  // location 2
>   batchDF.unpersist()
> }
>
> Let's assume I'm reading from Kafka; that means that by default batchDF
> may or may not have duplicates?
>
> Thanks!