Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Khaled Hammouda
Great! That's what I gathered from the thread titled "Serial batching with
Spark Streaming", but thanks for confirming this again.



Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Tathagata Das
Yes, the RDD of batch t+1 will be processed only after the RDD of batch t
has been processed, unless there are errors where the batch completely
fails to get processed, in which case the point is moot. Just reinforcing
the concept further.
Additional information: this holds only in the default configuration. You
may find references elsewhere on the mailing list to an undocumented
configuration called "spark.streaming.concurrentJobs". Setting it to more
than 1 to get more concurrency (between output ops) *breaks* the above
guarantee.
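
For illustration, a minimal sketch of where that setting lives; the
configuration key is the real one named above, while the app name and
batch interval are made-up placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("OrderedStreamingApp")  // placeholder app name
      // Default is 1: streaming jobs run one at a time, so batch t+1 starts
      // only after batch t. Values > 1 allow concurrent jobs and break that.
      .set("spark.streaming.concurrentJobs", "1")
    val ssc = new StreamingContext(conf, Seconds(10))  // example batch interval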

TD



Re: Are Spark Streaming RDDs always processed in order?

2015-07-04 Thread Michal Čizmazia
I had a similar inquiry, copied below.

I was also looking into making an SQS Receiver reliable:
http://stackoverflow.com/questions/30809975/reliable-sqs-receiver-for-spark-streaming

Hope this helps.
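
As background, a "reliable" receiver is one that hands records to Spark
with store() (a blocking call that returns only after the records are
safely stored) and acks or deletes them from the source only afterwards. A
skeleton sketch, with the queue polling and the ack left as hypothetical
placeholders:

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver
    import scala.collection.mutable.ArrayBuffer

    class QueueReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      def onStart(): Unit = {
        new Thread("queue-receiver") {
          override def run(): Unit = receive()
        }.start()
      }

      def onStop(): Unit = {}  // the receiving thread exits via isStopped()

      private def receive(): Unit = {
        while (!isStopped()) {
          val batch = ArrayBuffer("event-1", "event-2")  // placeholder poll
          // store(buffer) returns only after the records are stored inside
          // Spark, so it is safe to ack/delete from the queue after this.
          store(batch)
          // hypothetical: ack/delete the polled messages here
        }
      }
    }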

-- Forwarded message --
From: Tathagata Das 
Date: 20 June 2015 at 17:21
Subject: Re: Serial batching with Spark Streaming
To: Michal Čizmazia 
Cc: Binh Nguyen Van , user 


No, it does not. By default, batch X+1 is started only after everything
related to batch X, including any retries, has finished.
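
As a side note, the per-task retry count involved here is the standard
spark.task.maxFailures setting; a one-line sketch with its usual default:

    import org.apache.spark.SparkConf

    // A task may be retried up to this many times before its stage, and
    // with it the batch's job, fails (default: 4).
    val conf = new SparkConf().set("spark.task.maxFailures", "4")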

Yes, one RDD per batch per DStream. However, that RDD can itself be a
union of multiple RDDs (e.g. the RDDs generated by a windowed DStream, or
a unioned DStream).
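
A minimal sketch of that point, using a made-up socket source and
illustrative intervals:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("WindowSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source

    // A windowed DStream still yields exactly one RDD per batch interval,
    // but that RDD is a union of the RDDs from the batches in the window.
    val windowed = lines.window(Seconds(30), Seconds(10))
    windowed.foreachRDD { rdd =>
      println(s"one RDD this batch: ${rdd.partitions.length} partitions")
    }

    ssc.start()
    ssc.awaitTermination()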

TD

On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia  wrote:
Thanks Tathagata!

I will use *foreachRDD*/*foreachPartition*() instead of *transform*() then.
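
A minimal sketch of that pattern, reusing the `lines` stream from the
sketch above and a stand-in for the real side effects:

    // foreachRDD runs on the driver once per batch; foreachPartition runs
    // on the executors, where per-partition connections and acks belong.
    lines.foreachRDD { rdd =>
      rdd.foreachPartition { events =>
        events.foreach(event => println(event))  // stand-in side effect
        // a real sink would open a connection per partition and ack here
      }
    }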

Does the default scheduler initiate the execution of *batch X+1* after
*batch X* even if tasks for *batch X* need to be *retried due to
failures*? If not, could you please suggest workarounds and point me to
the code?

One more thing was not 100% clear to me from the documentation: Is there
exactly *1 RDD* published *per batch interval* in a DStream?




Re: Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread Raghavendra Pandey
I don't think you can expect any ordering guarantee except for the records
within one partition.
On Jul 4, 2015 7:43 AM, "khaledh"  wrote:

> I'm writing a Spark Streaming application that uses RabbitMQ to consume
> events. One RabbitMQ feature I intend to make use of is bulk acking:
> instead of acking messages one-by-one, you ack only the last event in a
> batch, and that acks the entire batch.
>
> Before I commit to doing so, I'd like to know if Spark Streaming always
> processes RDDs in the same order they arrive in, i.e. if RDD1 arrives
> before
> RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is
> finished?
>
> This is crucial to the ack logic, since if RDD2 can potentially be
> processed while RDD1 is still being processed, then acking the last event
> in RDD2 would also ack all events in RDD1, even though they may not have
> been completely processed yet.
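
For reference, the bulk ack described in the quoted question maps to the
RabbitMQ Java client's basicAck with multiple = true. A sketch assuming
that client, with placeholder connection details and a hypothetical
delivery tag:

    import com.rabbitmq.client.ConnectionFactory

    val factory = new ConnectionFactory()
    factory.setHost("localhost")  // placeholder broker host
    val channel = factory.newConnection().createChannel()

    val lastDeliveryTag = 42L  // hypothetical tag of a batch's last event
    // multiple = true acks every unacknowledged delivery up to and
    // including lastDeliveryTag, which is why out-of-order batch processing
    // would silently ack events from earlier, unfinished batches.
    channel.basicAck(lastDeliveryTag, true)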