Re: Are Spark Streaming RDDs always processed in order?
Great! That's what I gathered from the thread titled "Serial batching with
Spark Streaming", but thanks for confirming this again.

On 6 July 2015 at 15:31, Tathagata Das wrote:
> Yes, RDD of batch t+1 will be processed only after RDD of batch t has
> been processed. Unless there are errors where the batch completely fails
> to get processed, in which case the point is moot. Just reinforcing the
> concept further.
>
> Additional information: This is true in the default configuration. You
> may find references to an undocumented hidden configuration called
> "spark.streaming.concurrentJobs" elsewhere in the mailing list. Setting
> that to more than 1 to get more concurrency (between output ops) *breaks*
> the above guarantee.
>
> TD
Re: Are Spark Streaming RDDs always processed in order?
Yes, RDD of batch t+1 will be processed only after RDD of batch t has been
processed. Unless there are errors where the batch completely fails to get
processed, in which case the point is moot. Just reinforcing the concept
further.

Additional information: This is true in the default configuration. You may
find references to an undocumented hidden configuration called
"spark.streaming.concurrentJobs" elsewhere in the mailing list. Setting that
to more than 1 to get more concurrency (between output ops) *breaks* the
above guarantee.

TD

On Sat, Jul 4, 2015 at 6:53 AM, Michal Čizmazia wrote:
> I had a similar inquiry, copied below.
>
> I was also looking into making an SQS Receiver reliable:
> http://stackoverflow.com/questions/30809975/reliable-sqs-receiver-for-spark-streaming
>
> Hope this helps.
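For concreteness, a minimal sketch of the configuration in question. The
property name is the one discussed above; the application name and batch
interval are purely illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Default behaviour: one batch's jobs run at a time, strictly in order.
    val conf = new SparkConf().setAppName("serial-batching-example")

    // The undocumented knob discussed above. Left commented out on purpose:
    // setting it above 1 lets output operations of different batches run
    // concurrently, which *breaks* the in-order guarantee.
    // conf.set("spark.streaming.concurrentJobs", "2")

    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batches (illustrative)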
Re: Are Spark Streaming RDDs always processed in order?
I had a similar inquiry, copied below.

I was also looking into making an SQS Receiver reliable:
http://stackoverflow.com/questions/30809975/reliable-sqs-receiver-for-spark-streaming

Hope this helps.

-- Forwarded message --
From: Tathagata Das
Date: 20 June 2015 at 17:21
Subject: Re: Serial batching with Spark Streaming
To: Michal Čizmazia
Cc: Binh Nguyen Van, user

No it does not. By default, only after all the retries etc. related to
batch X are done will batch X+1 be started.

Yes, one RDD per batch per DStream. However, the RDD could be a union of
multiple RDDs (e.g. RDDs generated by a windowed DStream, or a unioned
DStream).

TD

On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia wrote:
> Thanks Tathagata!
>
> I will use *foreachRDD*/*foreachPartition*() instead of *transform*()
> then.
>
> Does the default scheduler initiate the execution of batch X+1 after
> batch X even if tasks for batch X need to be *retried due to failures*?
> If not, could you please suggest workarounds and point me to the code?
>
> One more thing was not 100% clear to me from the documentation: Is there
> exactly *one RDD* published *per batch interval* in a DStream?
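For reference, a minimal sketch of the *foreachRDD*/*foreachPartition*
pattern mentioned above. The `events` stream and the `sendToSink` helper are
hypothetical stand-ins for the real stream and output logic:

    import org.apache.spark.streaming.dstream.DStream

    // Sketch only: `events` stands in for a real DStream[String], and
    // `sendToSink` for the actual output/ack logic.
    def writeOut(events: DStream[String], sendToSink: String => Unit): Unit =
      events.foreachRDD { rdd =>
        // Runs on the driver once per batch; with the default scheduler the
        // batches arrive here strictly in order, as discussed above.
        rdd.foreachPartition { records =>
          // Runs on the executors; per-partition setup (connections, etc.)
          // belongs here, before iterating over the records.
          records.foreach(sendToSink)
        }
      }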
Re: Are Spark Streaming RDDs always processed in order?
I don't think you can expect any order guarantee except for the records
within one partition.

On Jul 4, 2015 7:43 AM, "khaledh" wrote:
> I'm writing a Spark Streaming application that uses RabbitMQ to consume
> events. One feature of RabbitMQ that I intend to make use of is bulk ack
> of messages, i.e. no need to ack one-by-one, but only ack the last event
> in a batch and that would ack the entire batch.
>
> Before I commit to doing so, I'd like to know if Spark Streaming always
> processes RDDs in the same order they arrive in, i.e. if RDD1 arrives
> before RDD2, is it true that RDD2 will never be scheduled/processed
> before RDD1 is finished?
>
> This is crucial to the ack logic, since if RDD2 can potentially be
> processed while RDD1 is still being processed, then if I ack the last
> event in RDD2, that would also ack all events in RDD1, even though they
> may not have been completely processed yet.
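To make the ordering concern concrete, a sketch of the bulk-ack call in
question, assuming the standard RabbitMQ Java client (receiver wiring and
delivery-tag tracking omitted):

    import com.rabbitmq.client.Channel

    // RabbitMQ bulk ack: basicAck with multiple = true acknowledges the
    // given delivery tag *and* every earlier unacknowledged tag on the
    // channel. Hence the ordering concern: acking the last tag of batch t+1
    // would implicitly ack all of batch t, finished or not.
    def ackBatch(channel: Channel, lastDeliveryTag: Long): Unit =
      channel.basicAck(lastDeliveryTag, true) // true = "multiple"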