I can try answering the question even if I am not Sanjeet ;)
There isn't a simple way to do this. In fact, the ideal way would be to
create a new InputDStream (just like FileInputDStream
<https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala>)
that creates Hadoop RDDs as SQS messages are received.
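
For illustration, here is a rough, untested sketch that uses the simpler
Receiver API rather than a full InputDStream. It assumes the AWS Java SDK
is on the classpath and that each SQS message body is the S3 path of a new
file; the queue URL is a placeholder. Deleting the message is deliberately
left to the processing side, so unprocessed messages get redelivered
(at-least-once).

import com.amazonaws.services.sqs.AmazonSQSClient
import com.amazonaws.services.sqs.model.ReceiveMessageRequest
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import scala.collection.JavaConversions._

// Emits (s3Path, receiptHandle) pairs into the stream.
class SQSReceiver(queueUrl: String)
  extends Receiver[(String, String)](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart() {
    new Thread("SQS message poller") {
      override def run() { receive() }
    }.start()
  }

  def onStop() { }  // nothing to clean up; receive() checks isStopped()

  private def receive() {
    // Uses the default credential chain (env vars, instance profile, etc.)
    val sqs = new AmazonSQSClient()
    while (!isStopped()) {
      val request = new ReceiveMessageRequest(queueUrl)
        .withMaxNumberOfMessages(10)
        .withWaitTimeSeconds(20)  // long poll instead of busy-waiting
      for (m <- sqs.receiveMessage(request).getMessages) {
        store((m.getBody, m.getReceiptHandle))
      }
    }
  }
}

// Usage: val messages = ssc.receiverStream(new SQSReceiver(queueUrl))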

But stepping back, I want to understand why you want to integrate with
Spark Streaming at all. If you already have a working system that runs
Spark jobs when SQS sends a message about new files, then why use Spark
Streaming? What is lacking in that implementation? Based on that, we can
decide whether it is worth the effort of implementing a new input stream.

TD


On Tue, Aug 5, 2014 at 12:45 AM, lalit1303 <la...@sigmoidanalytics.com>
wrote:

> Hi Sanjeet,
>
> I have been using Spark Streaming to process files present in S3 and
> HDFS.
> I am also using SQS messages for the same purpose as you, i.e. as pointers
> to S3 files.
> As of now, I have a separate SQS job which receives messages from the SQS
> queue and fetches the corresponding files from S3.
> Now, I want to integrate the SQS receiver with Spark Streaming, so that
> my Spark Streaming job would listen for new SQS messages and proceed
> accordingly.
> I was wondering if you have found a solution to this. Please let me know
> if you have!
>
> In your approach above, you can achieve #4 in the following way:
> When you pass a foreachRDD function to be applied to each RDD of the
> DStream, you can pass along information about the SQS message (like the
> receipt handle needed to delete the message) associated with that
> particular file.
> After success or failure in processing, you can then delete the SQS
> message accordingly.
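>
> A rough sketch of what I mean (untested; assumes the DStream "messages"
> carries (s3Path, receiptHandle) pairs, and that sqs and queueUrl are
> defined on the driver; processFile is a placeholder for your processing):
>
> messages.foreachRDD { rdd =>
>   rdd.collect().foreach { case (s3Path, receiptHandle) =>
>     try {
>       processFile(s3Path)  // your actual S3 file processing
>       // Delete only after success, so failed messages are redelivered
>       // by SQS after the visibility timeout (at-least-once).
>       sqs.deleteMessage(queueUrl, receiptHandle)
>     } catch {
>       case e: Exception => e.printStackTrace()  // leave message undeleted
>     }
>   }
> }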
>
>
> Thanks
> --Lalit
>
>
>
> -----
> Lalit Yadav
> la...@sigmoidanalytics.com
