Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Khaled Hammouda
Great! That's what I gathered from the thread titled "Serial batching with
Spark Streaming", but thanks for confirming this again.

On 6 July 2015 at 15:31, Tathagata Das  wrote:

> Yes, the RDD of batch t+1 will be processed only after the RDD of batch t
> has been processed (unless there are errors where a batch completely fails
> to get processed, in which case the point is moot). Just reinforcing the
> concept further.
> Additional information: This is true in the default configuration. You may
> find references to an undocumented hidden configuration called
> "spark.streaming.concurrentJobs" elsewhere in the mailing list. Setting
> that to more than 1 to get more concurrency (between output ops) *breaks*
> the above guarantee.
>
> TD
>
> On Sat, Jul 4, 2015 at 6:53 AM, Michal Čizmazia  wrote:
>
>> I had a similar inquiry, copied below.
>>
>> I was also looking into making an SQS Receiver reliable:
>>
>> http://stackoverflow.com/questions/30809975/reliable-sqs-receiver-for-spark-streaming
>>
>> Hope this helps.
>>
>> -- Forwarded message --
>> From: Tathagata Das 
>> Date: 20 June 2015 at 17:21
>> Subject: Re: Serial batching with Spark Streaming
>> To: Michal Čizmazia 
>> Cc: Binh Nguyen Van , user 
>>
>>
>> No, it does not. By default, batch X+1 will be started only after all the
>> retries etc. related to batch X are done.
>>
>> Yes, one RDD per batch per DStream. However, the RDD could be a union of
>> multiple RDDs (e.g. RDDs generated by windowed DStream, or unioned
>> DStream).
>>
>> TD
>>
>> On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia 
>> wrote:
>> Thanks Tathagata!
>>
>> I will use *foreachRDD*/*foreachPartition*() instead of *transform*()
>> then.
>>
>> Does the default scheduler initiate the execution of *batch X+1*
>> after *batch X* even if tasks for *batch X* need to be *retried
>> due to failures*? If not, please could you suggest workarounds and point
>> me to the code?
>>
>> One more thing was not 100% clear to me from the documentation: Is there
>> exactly *one RDD* published *per batch interval* in a DStream?
>>
>>
>> On 3 July 2015 at 22:12, khaledh  wrote:
>>
>>> I'm writing a Spark Streaming application that uses RabbitMQ to consume
>>> events. One feature of RabbitMQ that I intend to make use of is bulk
>>> ack of messages, i.e. no need to ack one-by-one; acking only the last
>>> event in a batch acks the entire batch.
>>>
>>> Before I commit to doing so, I'd like to know if Spark Streaming always
>>> processes RDDs in the same order they arrive in, i.e. if RDD1 arrives
>>> before RDD2, is it true that RDD2 will never be scheduled/processed
>>> before RDD1 is finished?
>>>
>>> This is crucial to the ack logic, since if RDD2 can potentially be
>>> processed while RDD1 is still being processed, then acking the last
>>> event in RDD2 would also ack all events in RDD1, even though they may
>>> not have been completely processed yet.


Re: Are Spark Streaming RDDs always processed in order?

2015-07-06 Thread Tathagata Das
Yes, the RDD of batch t+1 will be processed only after the RDD of batch t
has been processed (unless there are errors where a batch completely fails
to get processed, in which case the point is moot). Just reinforcing the
concept further.
Additional information: This is true in the default configuration. You may
find references to an undocumented hidden configuration called
"spark.streaming.concurrentJobs" elsewhere in the mailing list. Setting
that to more than 1 to get more concurrency (between output ops) *breaks*
the above guarantee.
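
To make the default concrete, here is a minimal sketch (the app name and
batch interval are just assumed values):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Leaving spark.streaming.concurrentJobs at its default of 1 runs the
    // jobs of one batch at a time, which is what gives the in-order
    // guarantee described above.
    val conf = new SparkConf().setAppName("ordered-streaming")
    // conf.set("spark.streaming.concurrentJobs", "2") // would break the guarantee
    val ssc = new StreamingContext(conf, Seconds(10))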

TD

On Sat, Jul 4, 2015 at 6:53 AM, Michal Čizmazia  wrote:

> I had a similar inquiry, copied below.
>
> I was also looking into making an SQS Receiver reliable:
>
> http://stackoverflow.com/questions/30809975/reliable-sqs-receiver-for-spark-streaming
>
> Hope this helps.
>
> -- Forwarded message --
> From: Tathagata Das 
> Date: 20 June 2015 at 17:21
> Subject: Re: Serial batching with Spark Streaming
> To: Michal Čizmazia 
> Cc: Binh Nguyen Van , user 
>
>
> No, it does not. By default, batch X+1 will be started only after all the
> retries etc. related to batch X are done.
>
> Yes, one RDD per batch per DStream. However, the RDD could be a union of
> multiple RDDs (e.g. RDDs generated by windowed DStream, or unioned
> DStream).
>
> TD
>
> On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia 
> wrote:
> Thanks Tathagata!
>
> I will use *foreachRDD*/*foreachPartition*() instead of *transform*() then.
>
> Does the default scheduler initiate the execution of *batch X+1*
> after *batch X* even if tasks for *batch X* need to be *retried
> due to failures*? If not, please could you suggest workarounds and point
> me to the code?
>
> One more thing was not 100% clear to me from the documentation: Is there
> exactly *one RDD* published *per batch interval* in a DStream?
>
>
> On 3 July 2015 at 22:12, khaledh  wrote:
>
>> I'm writing a Spark Streaming application that uses RabbitMQ to consume
>> events. One feature of RabbitMQ that I intend to make use of is bulk
>> ack of messages, i.e. no need to ack one-by-one; acking only the last
>> event in a batch acks the entire batch.
>>
>> Before I commit to doing so, I'd like to know if Spark Streaming always
>> processes RDDs in the same order they arrive in, i.e. if RDD1 arrives
>> before RDD2, is it true that RDD2 will never be scheduled/processed
>> before RDD1 is finished?
>>
>> This is crucial to the ack logic, since if RDD2 can potentially be
>> processed while RDD1 is still being processed, then acking the last
>> event in RDD2 would also ack all events in RDD1, even though they may
>> not have been completely processed yet.


Re: Are Spark Streaming RDDs always processed in order?

2015-07-04 Thread Michal Čizmazia
I had a similar inquiry, copied below.

I was also looking into making an SQS Receiver reliable:
http://stackoverflow.com/questions/30809975/reliable-sqs-receiver-for-spark-streaming
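
The pattern discussed there boils down to roughly this sketch (SQS
specifics omitted; fetchBatch and ackBatch are hypothetical helpers):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    // Sketch of a reliable receiver: store(iterator) blocks until the
    // block has been reliably stored, so the source is acked only after
    // storage succeeded.
    class SqsLikeReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {
      def onStart(): Unit = new Thread("sqs-like-receiver") {
        override def run(): Unit = {
          while (!isStopped()) {
            val batch = fetchBatch() // hypothetical: fetch a batch of messages
            store(batch.iterator)    // blocks until stored reliably
            ackBatch(batch)          // hypothetical: delete/ack the messages at the source
          }
        }
      }.start()
      def onStop(): Unit = ()
      private def fetchBatch(): Seq[String] = Seq.empty   // placeholder
      private def ackBatch(batch: Seq[String]): Unit = () // placeholder
    }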

Hope this helps.

-- Forwarded message --
From: Tathagata Das 
Date: 20 June 2015 at 17:21
Subject: Re: Serial batching with Spark Streaming
To: Michal Čizmazia 
Cc: Binh Nguyen Van , user 


No, it does not. By default, batch X+1 will be started only after all the
retries etc. related to batch X are done.

Yes, one RDD per batch per DStream. However, the RDD could be a union of
multiple RDDs (e.g. RDDs generated by windowed DStream, or unioned
DStream).
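
For example, a sketch (assuming a StreamingContext ssc and a socket source;
the window sizes are arbitrary):

    // Each DStream produces exactly one RDD per batch interval; a windowed
    // DStream's single RDD per batch is internally a union of the parent
    // DStream's batch RDDs that fall within the window.
    val lines = ssc.socketTextStream("localhost", 9999)
    val windowed = lines.window(Seconds(30), Seconds(10)) // window = 30s, slide = 10s
    windowed.foreachRDD { rdd =>
      println(s"one RDD for this window, ${rdd.partitions.size} partitions")
    }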

TD

On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia  wrote:
Thanks Tathagata!

I will use *foreachRDD*/*foreachPartition*() instead of *transform*() then.
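
Roughly along these lines (a sketch; stream is my input DStream and
sendToSink is a placeholder for my actual output logic):

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // side effects belong in an output operation like this rather
        // than in transform(), whose closure may be re-evaluated
        partition.foreach(record => sendToSink(record)) // sendToSink is hypothetical
      }
    }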

Does the default scheduler initiate the execution of *batch X+1* after
*batch X* even if tasks for *batch X* need to be *retried due to
failures*? If not, please could you suggest workarounds and point me to the
code?

One more thing was not 100% clear to me from the documentation: Is there
exactly *one RDD* published *per batch interval* in a DStream?


On 3 July 2015 at 22:12, khaledh  wrote:

> I'm writing a Spark Streaming application that uses RabbitMQ to consume
> events. One feature of RabbitMQ that I intend to make use of is bulk
> ack of messages, i.e. no need to ack one-by-one; acking only the last
> event in a batch acks the entire batch.
>
> Before I commit to doing so, I'd like to know if Spark Streaming always
> processes RDDs in the same order they arrive in, i.e. if RDD1 arrives
> before RDD2, is it true that RDD2 will never be scheduled/processed
> before RDD1 is finished?
>
> This is crucial to the ack logic, since if RDD2 can potentially be
> processed while RDD1 is still being processed, then acking the last
> event in RDD2 would also ack all events in RDD1, even though they may
> not have been completely processed yet.


Re: Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread Raghavendra Pandey
I don't think you can expect any ordering guarantee except for the records
within one partition.

On Jul 4, 2015 7:43 AM, "khaledh"  wrote:

> I'm writing a Spark Streaming application that uses RabbitMQ to consume
> events. One feature of RabbitMQ that I intend to make use of is bulk
> ack of messages, i.e. no need to ack one-by-one; acking only the last
> event in a batch acks the entire batch.
>
> Before I commit to doing so, I'd like to know if Spark Streaming always
> processes RDDs in the same order they arrive in, i.e. if RDD1 arrives
> before RDD2, is it true that RDD2 will never be scheduled/processed
> before RDD1 is finished?
>
> This is crucial to the ack logic, since if RDD2 can potentially be
> processed while RDD1 is still being processed, then acking the last
> event in RDD2 would also ack all events in RDD1, even though they may
> not have been completely processed yet.


Are Spark Streaming RDDs always processed in order?

2015-07-03 Thread khaledh
I'm writing a Spark Streaming application that uses RabbitMQ to consume
events. One feature of RabbitMQ that I intend to make use of is bulk ack of
messages, i.e. no need to ack one-by-one; acking only the last event in a
batch acks the entire batch.

Before I commit to doing so, I'd like to know if Spark Streaming always
processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before
RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is
finished?

This is crucial to the ack logic, since if RDD2 can potentially be processed
while RDD1 is still being processed, then acking the last event in RDD2
would also ack all events in RDD1, even though they may not have been
completely processed yet.
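
To make the intent concrete, the pattern I have in mind looks roughly like
this (a sketch; processBatch is a placeholder, and it assumes each record
carries its RabbitMQ delivery tag and that a driver-side channel exists):

    stream.foreachRDD { rdd =>
      processBatch(rdd)                          // placeholder for the real work
      val lastTag = rdd.map(_.deliveryTag).max() // highest delivery tag in this batch
      // multiple = true acks everything up to and including lastTag, which
      // is only safe if earlier batches are guaranteed to have finished
      channel.basicAck(lastTag, true)
    }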


