I noticed that Spark 2.4.0 implemented support for reading only committed
messages in Kafka, and was excited. Are there currently any plans to update
the Kafka output sink to support exactly-once delivery?
Thanks,
Will
In either case, an end-to-end exactly-once guarantee can only be ensured
if the output sink is updated transactionally. The engine has to re-execute
data on failure. An exactly-once guarantee means that the external storage is
updated as if each data record was computed exactly once. That's wh
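TD's point above can be made concrete with a tiny sketch: if the external store is updated with an idempotent upsert keyed by a record id, re-executing a batch after a failure leaves the store in the same state as a single execution. Everything here (the record ids, the dict standing in for the store) is illustrative, not from any real job:

```python
# Sketch: an idempotent sink. Re-running the same batch after a failure
# leaves the store unchanged, which is what "updated as if each record
# was computed exactly once" requires.
def write_batch(store, batch):
    for record_id, value in batch:
        store[record_id] = value  # upsert keyed by record id: replay is a no-op

store = {}
batch = [("r1", 10), ("r2", 20)]
write_batch(store, batch)
write_batch(store, batch)  # simulated re-execution after a failure
```

Because the write is an upsert on a stable key, the second (replayed) execution changes nothing, so the observed state is exactly-once even though processing was at-least-once.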
Thanks TD.
In pre-Structured Streaming, the exactly-once guarantee on the input is not
ensured, is it?
On Tue, Jun 6, 2017 at 4:30 AM, Tathagata Das
wrote:
> This is the expected behavior. There are some confusing corner cases.
> If you are starting to play with Spark Streaming, I highly rec
668452 ms,
> 149668455 ms
> 17/06/05 13:42:31 INFO JobScheduler: Added jobs for time 149668428 ms
> 17/06/05 13:42:31 INFO JobScheduler: Starting job streaming job
> 149668428 ms.0 from job set of time 149668428 ms
>
>
>
> -----
1001560.n3.nabble.com/Fwd-Spark-Streaming-Checkpoint-and-Exactly-Once-Guarantee-on-Kafka-Direct-Stream-tp28743.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
I am using Spark Streaming checkpointing and the Kafka direct stream.
It uses a 30 sec batch duration and normally the job is successful in 15-20
sec.
If the Spark application fails after the successful completion
(149668428 ms in the log below) and restarts, it duplicates the last
batch again.
Is
semantics in the case of failure depend on how and when
offsets are stored. Spark output operations are at-least-once. So if
you want the equivalent of exactly-once semantics, you must either
store offsets after an idempotent output, or store offsets in an
atomic transaction alongside output.
"
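A minimal sketch of the second option from the quote above, storing the offsets in the same atomic transaction as the output. SQLite stands in for the external store, and all table and column names are made up for illustration:

```python
import sqlite3

# Sketch of "store offsets in an atomic transaction alongside output".
# Table/column names are illustrative, not from any real job.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (k TEXT PRIMARY KEY, v INTEGER)")
conn.execute("CREATE TABLE offsets (topic TEXT, part INTEGER, off INTEGER, "
             "PRIMARY KEY (topic, part))")

def commit_batch(records, topic, partition, last_offset):
    # One transaction: output rows and the new offset commit or roll back
    # together, so a restart always resumes from a consistent point.
    with conn:
        conn.executemany("INSERT OR REPLACE INTO results VALUES (?, ?)", records)
        conn.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?, ?)",
                     (topic, partition, last_offset))

commit_batch([("a", 1), ("b", 2)], "Datalake", 0, 251044)
stored = conn.execute(
    "SELECT off FROM offsets WHERE topic='Datalake' AND part=0").fetchone()[0]
```

On restart, the job reads the stored offset and resumes from there; because offsets and results were committed atomically, a replayed batch overwrites the same rows instead of duplicating them.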
If
15 ->
> > 100]);OffsetRange(topic: 'Datalake', partition: 1, range: [15 ->
> > 100]);OffsetRange(topic: 'Datalake', partition: 2, range: [14 ->
> > 100]);OffsetRange(topic: 'Datalake', partition: 0, range: [250959 ->
> > 251044])
Notice that in partition 0, for example, the 3 messages with offsets 250959
through 250961 are being processed twice - once by the old job, and once by
the new. I would have expected that in the new run, the offset range for
partition 0 would have been 250962 ->
Ok,
Thanks for your answers
On 3/22/17, 1:34 PM, "Cody Koeninger" wrote:
If you're talking about reading the same message multiple times in a
failure situation, see
https://github.com/koeninger/kafka-exactly-once
If you're talking about producing the same message multiple times in a
failure situation, keep an eye on
https://cwiki.apache.org/confluence/dis
You have to handle de-duplication upstream or downstream. It might
technically be possible to handle this in Spark but you'll probably have a
better time handling duplicates in the service that reads from Kafka.
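Downstream de-duplication can key on (topic, partition, offset), which uniquely identifies a Kafka message. A sketch with illustrative message tuples (a real service would bound or expire the seen-set rather than keep it forever):

```python
# Sketch: de-duplication in the service that reads from Kafka, keyed by
# (topic, partition, offset). Message tuples are illustrative.
def dedupe(messages):
    seen = set()
    out = []
    for topic, partition, offset, payload in messages:
        key = (topic, partition, offset)
        if key in seen:
            continue  # replayed message: drop it
        seen.add(key)
        out.append(payload)
    return out

# The third message replays offset 1 after a simulated failure.
msgs = [("t", 0, 1, "a"), ("t", 0, 2, "b"), ("t", 0, 1, "a")]
```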
On Wed, Mar 22, 2017 at 1:49 PM, Maurin Lenglart
wrote:
> Hi,
> we are trying to bui
Hi,
we are trying to build a Spark Streaming solution that subscribes to and
pushes to Kafka.
But we are running into the problem of duplicate events.
Right now, I am doing a foreachRDD, looping over the messages of each
partition, and sending those messages to Kafka.
Is there any good way of solving tha
Have you read / watched the materials linked from
https://github.com/koeninger/kafka-exactly-once
On Mon, Dec 5, 2016 at 4:17 AM, Jörn Franke wrote:
> You need to do the bookkeeping of what has been processed yourself. This
> may mean roughly the following (of course the devil is in the d
itself in
zookeeper).
Once you start a processing job, check in ZooKeeper whether it has already been
processed; if not, remove any stale staging data and proceed; if yes, terminate.
As I said the details depend on your job and require some careful thinking, but
exactly once can be achieved with Spark (and potentially zookeeper or
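The bookkeeping described above can be sketched as follows, with an in-memory set standing in for ZooKeeper (a real job would keep these markers in a znode or a database row; the batch ids and helper names are illustrative):

```python
# Sketch of per-batch bookkeeping: check whether a batch was already
# processed, clean up stale staging output if not, and record completion.
processed = set()   # batch ids already completed (stands in for ZooKeeper)
staging = {}        # batch id -> output awaiting commit

def run_batch(batch_id, compute):
    if batch_id in processed:
        return "skipped"             # already done: terminate
    staging.pop(batch_id, None)      # not done: discard stale staging data
    staging[batch_id] = compute()    # (re)do the work
    processed.add(batch_id)          # mark complete only after the work succeeds
    return "done"

first = run_batch(42, lambda: [1, 2, 3])
second = run_batch(42, lambda: [1, 2, 3])  # simulated restart after success
```

The ordering matters: the completion marker is written last, so a crash mid-batch leaves the batch unmarked and it is redone from clean staging on restart.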
The boundary is a bit flexible. In terms of the DStream's observed effective
state, the direct stream's semantics are exactly-once.
In terms of external system observations (like message emission), Spark
Streaming's semantics are at-least-once.
Regards,
Piotr
On Mon, Dec 5, 2016 at 8:59 AM, Michal Šenkýř
ocessing. There is
no way that I know of to ensure exactly-once. You can try to minimize
more-than-once situations by updating your offsets as soon as possible
but that does not eliminate the problem entirely.
Hope this helps,
Michal Senkyr
1. If a task completes the operation, it will notify the driver. The driver may not
receive the message due to the network and think the task is still running.
Then the child stage won't be scheduled?
2. How does Spark guarantee that the downstream task can receive the shuffle data
completely? In fact, I
in the database? Just restart from the offsets in the database. I
think your solution of map of batchtime to offset ranges would work
fine in that case, no? (make sure to expire items from the map)
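The batchtime-to-offset-ranges map with expiry that Cody suggests might look roughly like this (the size limit and tuple layout are illustrative assumptions):

```python
from collections import OrderedDict

# Sketch: a bounded map from batch time to the offset ranges of that batch,
# expiring the oldest entries so the map cannot grow without bound.
MAX_ENTRIES = 3  # illustrative limit

batch_offsets = OrderedDict()

def record_batch(batch_time, offset_ranges):
    batch_offsets[batch_time] = offset_ranges
    while len(batch_offsets) > MAX_ENTRIES:
        batch_offsets.popitem(last=False)  # expire the oldest batch

# (topic, partition, fromOffset, untilOffset) tuples are illustrative.
for t in (100, 130, 160, 190):
    record_batch(t, [("Datalake", 0, t, t + 30)])
```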
On Fri, Oct 21, 2016 at 3:32 AM, Erwan ALLAIN wrote:
> Hi,
>
> I'm currently
Hi,
I'm currently implementing an exactly once mechanism based on the following
example:
https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/TransactionalPerBatch.scala
the pseudo code is as follows:
dstream.transform (store offset in a variable on driver
> 3. Pushing out the data
>
> Spark Streaming + Kafka only provides an exactly-once guarantee on steps 1 & 2.
> We need to ensure exactly-once on step 3 ourselves.
>
> More details see base on
> http://spark.apache.org/docs/latest/streaming-programming-guide.html
I see now.
There are three steps in Spark Streaming + Kafka data processing:
1. Receiving the data
2. Transforming the data
3. Pushing out the data
Spark Streaming + Kafka only provides an exactly-once guarantee on steps 1 & 2.
We need to ensure exactly-once on step 3 ourselves.
More details see
Josh,
On Sun, Nov 30, 2014 at 10:17 PM, Josh J wrote:
>
> I would like to setup a Kafka pipeline whereby I write my data to a single
> topic 1, then I continue to process using spark streaming and write the
> transformed results to topic2, and finally I read the results from topic 2.
>
Not reall
streaming so that I can maintain exactly-once
semantics when writing to topic 2?
Thanks,
Josh
Does Spark in general assure exactly-once semantics? What happens to
those guarantees in the presence of updateStateByKey operations; are
they also assured to be exactly-once?
Thanks
manku.timma at outlook dot com
Yup, it only returns each value once.
Matei
On Mar 17, 2014, at 1:14 PM, Adrian Mocanu wrote:
Hi
Quick question here,
I know that .foreach is not idempotent. I am wondering if collect() is
idempotent, meaning that once I've collect()-ed, if a Spark node crashes, I can't
get the same values from the stream ever again.
Thanks
-Adrian