Exactly-Once delivery with Structured Streaming and Kafka

2019-01-31 Thread William Briggs
I noticed that Spark 2.4.0 implemented support for reading only committed messages in Kafka, and was excited. Are there currently any plans to update the Kafka output sink to support exactly-once delivery? Thanks, Will
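
A minimal sketch of the reader side, assuming the Spark 2.4 Structured Streaming Kafka source; options prefixed with "kafka." are passed through to the underlying consumer, and the broker and topic names here are placeholders:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("ReadCommittedSketch").getOrCreate()

    // isolation.level=read_committed makes the source skip records from
    // aborted Kafka transactions and read only committed messages.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
      .option("subscribe", "topic1")                     // placeholder topic
      .option("kafka.isolation.level", "read_committed")
      .load()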

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread Tathagata Das
In either case, an end-to-end exactly-once guarantee can be ensured only if the output sink is updated transactionally. The engine has to re-execute data on failure. An exactly-once guarantee means that the external storage is updated as if each data record was computed exactly once. That's why…
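
A minimal sketch of the idempotent flavor of such a sink, assuming a hypothetical results table keyed by a unique record id and PostgreSQL upsert syntax; a re-executed batch overwrites the same rows instead of appending duplicates, so the storage ends up as if each record were processed once:

    import java.sql.DriverManager

    // Hypothetical table: CREATE TABLE results (k VARCHAR PRIMARY KEY, v VARCHAR)
    def upsertPartition(jdbcUrl: String, records: Iterator[(String, String)]): Unit = {
      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        val stmt = conn.prepareStatement(
          "INSERT INTO results (k, v) VALUES (?, ?) " +
          "ON CONFLICT (k) DO UPDATE SET v = EXCLUDED.v") // PostgreSQL upsert
        records.foreach { case (k, v) =>
          stmt.setString(1, k)
          stmt.setString(2, v)
          stmt.executeUpdate()
        }
      } finally conn.close()
    }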

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread ALunar Beach
Thanks TD. In pre-Structured Streaming, an exactly-once guarantee on input is not provided, is it? On Tue, Jun 6, 2017 at 4:30 AM, Tathagata Das wrote: > This is the expected behavior. There are some confusing corner cases. > If you are starting to play with Spark Streaming, I highly recommend…

Re: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-06 Thread Tathagata Das
…668452 ms, 149668455 ms > 17/06/05 13:42:31 INFO JobScheduler: Added jobs for time 149668428 ms > 17/06/05 13:42:31 INFO JobScheduler: Starting job streaming job 149668428 ms.0 from job set of time 149668428 ms…

Fwd: Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-05 Thread anbucheeralan
…1001560.n3.nabble.com/Fwd-Spark-Streaming-Checkpoint-and-Exactly-Once-Guarantee-on-Kafka-Direct-Stream-tp28743.html

Spark Streaming Checkpoint and Exactly Once Guarantee on Kafka Direct Stream

2017-06-05 Thread ALunar Beach
I am using Spark Streaming checkpointing and a Kafka direct stream. It uses a 30 sec batch duration, and the job normally succeeds in 15-20 sec. If the Spark application fails after a successful completion (149668428 ms in the log below) and restarts, it duplicates the last batch again. Is…

Re: Exactly-once semantics with Kafka CanCommitOffsets.commitAsync?

2017-04-28 Thread Cody Koeninger
…semantics in the case of failure depend on how and when offsets are stored. Spark output operations are at-least-once. So if you want the equivalent of exactly-once semantics, you must either store offsets after an idempotent output, or store offsets in an atomic transaction alongside output." If…
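
A minimal sketch of the second option (offsets stored in an atomic transaction alongside output), assuming a kafka-0-10 direct stream named stream, a placeholder jdbcUrl, and a hypothetical kafka_offsets table; because results and offsets commit together, a batch is either fully recorded or fully replayed:

    import java.sql.DriverManager
    import org.apache.spark.streaming.kafka010.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val results = rdd.map(r => (r.key, r.value)).collect() // keep batches small

      val conn = DriverManager.getConnection(jdbcUrl)
      try {
        conn.setAutoCommit(false)
        // ... insert `results` into the output table on this connection ...
        val stmt = conn.prepareStatement(
          "UPDATE kafka_offsets SET until_offset = ? WHERE topic = ? AND part = ?")
        offsetRanges.foreach { o =>
          stmt.setLong(1, o.untilOffset)
          stmt.setString(2, o.topic)
          stmt.setInt(3, o.partition)
          stmt.executeUpdate()
        }
        conn.commit() // output and offsets become visible atomically
      } finally conn.close()
    }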

Re: Exactly-once semantics with Kafka CanCommitOffsets.commitAsync?

2017-04-28 Thread David Rosenstrauch
…15 -> 100]); OffsetRange(topic: 'Datalake', partition: 1, range: [15 -> 100]); OffsetRange(topic: 'Datalake', partition: 2, range: [14 -> 100]); OffsetRange(topic: 'Datalake', partition: 0, range: [250959 -> 251044])…

Re: Exactly-once semantics with Kafka CanCommitOffsets.commitAsync?

2017-04-28 Thread Cody Koeninger
> …partition: 0, range: [250959 -> 251044]) Notice that in partition 0, for example, the 3 messages with offsets 250959 through 250961 are being processed twice - once by the old job, and once by the new. I would have expected that in the new run…

Exactly-once semantics with Kafka CanCommitOffsets.commitAsync?

2017-04-28 Thread David Rosenstrauch
…partition: 0, range: [250959 -> 251044]) Notice that in partition 0, for example, the 3 messages with offsets 250959 through 250961 are being processed twice - once by the old job, and once by the new. I would have expected that in the new run, the offset range for partition 0 would have been 250962 ->…
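
For reference, a minimal sketch of the commitAsync pattern under discussion, assuming a kafka-0-10 direct stream named stream; the commit is asynchronous and at-least-once, which is consistent with the small replay window observed above:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... write this batch somewhere, ideally idempotently ...
      // Offsets are committed some time after the batch completes, so a
      // crash in between causes those records to be replayed on restart.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }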

Re: Spark streaming to kafka exactly once

2017-03-23 Thread Maurin Lenglart
Ok, thanks for your answers. On 3/22/17, 1:34 PM, "Cody Koeninger" wrote: If you're talking about reading the same message multiple times in a failure situation, see https://github.com/koeninger/kafka-exactly-once If you're talking about producing…

Re: Spark streaming to kafka exactly once

2017-03-22 Thread Cody Koeninger
If you're talking about reading the same message multiple times in a failure situation, see https://github.com/koeninger/kafka-exactly-once If you're talking about producing the same message multiple times in a failure situation, keep an eye on https://cwiki.apache.org/confluence/dis…

Re: Spark streaming to kafka exactly once

2017-03-22 Thread Matt Deaver
You have to handle de-duplication upstream or downstream. It might technically be possible to handle this in Spark, but you'll probably have a better time handling duplicates in the service that reads from Kafka. On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart wrote: > Hi, we are trying to build…

Spark streaming to kafka exactly once

2017-03-22 Thread Maurin Lenglart
Hi, we are trying to build a Spark Streaming solution that subscribes to and pushes to Kafka, but we are running into a problem with duplicate events. Right now, I am doing a "forEachRdd", looping over the messages of each partition, and sending those messages to Kafka. Is there any good way of solving that…
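
A minimal sketch of that loop, assuming a DStream[String] named dstream and placeholder broker/topic names; note it is still at-least-once, because a retried task re-sends its whole partition, which matches the duplicates described here:

    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    dstream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One producer per partition, not per record.
        val props = new Properties()
        props.put("bootstrap.servers", "broker1:9092") // placeholder broker
        props.put("key.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer",
          "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        try {
          records.foreach { msg =>
            producer.send(new ProducerRecord[String, String]("topic2", msg))
          }
        } finally producer.close() // close() flushes pending sends
      }
    }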

Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Cody Koeninger
Have you read / watched the materials linked from https://github.com/koeninger/kafka-exactly-once On Mon, Dec 5, 2016 at 4:17 AM, Jörn Franke wrote: > You need to do the bookkeeping of what has been processed yourself. This > may mean roughly the following (of course the devil is in the details…

Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Jörn Franke
…itself in ZooKeeper). Once you start a processing job, check in ZooKeeper whether it has been processed; if not, remove all staging data; if yes, terminate. As I said, the details depend on your job and require some careful thinking, but exactly-once can be achieved with Spark (and potentially ZooKeeper or…

Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-05 Thread Piotr Smoliński
The boundary is a bit flexible. In terms of the observed effective state of the DStream, the direct stream semantics are exactly-once. In terms of external system observations (like message emission), Spark Streaming semantics are at-least-once. Regards, Piotr On Mon, Dec 5, 2016 at 8:59 AM, Michal Šenkýř…

Re: Can spark support exactly once based kafka ? Due to these following question?

2016-12-04 Thread Michal Šenkýř
…processing. There is no way that I know of to ensure exactly-once. You can try to minimize more-than-once situations by updating your offsets as soon as possible, but that does not eliminate the problem entirely. Hope this helps, Michal Senkyr

Can spark support exactly once based kafka ? Due to these following question?

2016-12-04 Thread John Fang
1. If a task completes its operation, it will notify the driver. The driver may not receive the message due to the network and still think the task is running. Then the child stage won't be scheduled? 2. How does Spark guarantee that the downstream task receives the shuffle data completely? In fact, I…

Re: Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Cody Koeninger
…that case, no? (Make sure to expire items from the map.) On Fri, Oct 21, 2016 at 3:32 AM, Erwan ALLAIN wrote: > Hi, I'm currently implementing an exactly-once mechanism based on the following…

Re: Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Erwan ALLAIN
…items from the map) On Fri, Oct 21, 2016 at 3:32 AM, Erwan ALLAIN wrote: > Hi, I'm currently implementing an exactly-once mechanism based on the following example: https://github.com/koeninger/kafka-exactly-once…

Re: Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Cody Koeninger
…in the database? Just restart from the offsets in the database. I think your solution of a map of batch time to offset ranges would work fine in that case, no? (Make sure to expire items from the map.) On Fri, Oct 21, 2016 at 3:32 AM, Erwan ALLAIN wrote: > Hi, I'm currently…
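
A minimal sketch of such a restart, assuming an existing StreamingContext ssc, topics and kafkaParams values already defined, and a hypothetical loadOffsetsFromDb helper that returns the offsets saved alongside the output:

    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    // Next offset to read for each partition, as recorded by the last
    // successful output transaction (hypothetical helper).
    val fromOffsets: Map[TopicPartition, Long] = loadOffsetsFromDb()

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams, fromOffsets)
    )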

Kafka Direct Stream: Offset Managed Manually (Exactly Once)

2016-10-21 Thread Erwan ALLAIN
Hi, I'm currently implementing an exactly-once mechanism based on the following example: https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala The pseudo code is as follows: dstream.transform (store offsets in a variable on the driver…

Re: it seem like the exactly once feature not work on spark1.4

2015-07-17 Thread Tathagata Das
…pushing out the data > Spark Streaming + Kafka only provides an exactly-once guarantee on steps 1 & 2; > we need to ensure exactly-once on step 3 ourselves. > More details: see http://spark.apache.org/docs/latest/streaming-programming-guide.html…

Re: it seem like the exactly once feature not work on spark1.4

2015-07-17 Thread JoneZhang
I see now. There are three steps in Spark Streaming + Kafka data processing: 1. Receiving the data 2. Transforming the data 3. Pushing out the data. Spark Streaming + Kafka only provides an exactly-once guarantee on steps 1 & 2; we need to ensure exactly-once on step 3 ourselves. More details: see http://spark.apache.org/docs/latest/streaming-programming-guide.html
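
One possible sketch of making step 3 tolerate replays, assuming the output goes to a placeholder topic "topic2": derive each message's key from the source topic/partition/offset so a replayed batch produces identical messages and the downstream consumer can drop duplicates by key:

    import org.apache.kafka.clients.consumer.ConsumerRecord
    import org.apache.kafka.clients.producer.ProducerRecord

    // A replayed record maps to the same key, so downstream readers of
    // "topic2" can deduplicate by remembering the keys they have seen.
    def toOutput(rec: ConsumerRecord[String, String]): ProducerRecord[String, String] = {
      val dedupKey = s"${rec.topic}-${rec.partition}-${rec.offset}"
      new ProducerRecord[String, String]("topic2", dedupKey, rec.value)
    }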

Re: kafka pipeline exactly once semantics

2014-11-30 Thread Tobias Pfeiffer
Josh, On Sun, Nov 30, 2014 at 10:17 PM, Josh J wrote: > I would like to set up a Kafka pipeline whereby I write my data to a single topic 1, then I continue to process using Spark Streaming and write the transformed results to topic2, and finally I read the results from topic 2. Not really…

kafka pipeline exactly once semantics

2014-11-30 Thread Josh J
…streaming so that I can maintain exactly-once semantics when writing to topic 2? Thanks, Josh

exactly once

2014-04-04 Thread Bharath Bhushan
Does Spark in general assure exactly-once semantics? What happens to those guarantees in the presence of updateStateByKey operations -- are they also assured to be exactly-once? Thanks, manku.timma at outlook dot com

Re: is collect exactly-once?

2014-03-17 Thread Matei Zaharia
Yup, it only returns each value once. Matei On Mar 17, 2014, at 1:14 PM, Adrian Mocanu wrote: > Hi, quick question here: I know that .foreach is not idempotent. I am wondering if collect() is idempotent? Meaning that once I've collect()-ed, if a Spark node crashes, I can't get the same values…

is collect exactly-once?

2014-03-17 Thread Adrian Mocanu
Hi, quick question here: I know that .foreach is not idempotent. I am wondering if collect() is idempotent? Meaning that once I've collect()-ed, if a Spark node crashes, I can't get the same values from the stream ever again. Thanks, -Adrian