Re: Are Spark Streaming RDDs always processed in order?
Great! That's what I gathered from the thread titled "Serial batching with Spark Streaming", but thanks for confirming it again.

On 6 July 2015 at 15:31, Tathagata Das wrote:
> [earlier messages in the thread quoted in full; trimmed here]
Re: Are Spark Streaming RDDs always processed in order?
Yes, the RDD of batch t+1 will be processed only after the RDD of batch t has been processed, unless there are errors where the batch completely fails to get processed, in which case the point is moot. Just reinforcing the concept further.

Additional information: this is true in the default configuration. You may find references elsewhere in the mailing list to an undocumented hidden configuration called "spark.streaming.concurrentJobs". Setting it to more than 1 to get more concurrency (between output ops) *breaks* the above guarantee.

TD

On Sat, Jul 4, 2015 at 6:53 AM, Michal Čizmazia wrote:
> [earlier messages in the thread quoted in full; trimmed here]
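To make the default behaviour concrete, here is a toy model in plain Python (a sketch of the scheduling semantics, not Spark code): think of "spark.streaming.concurrentJobs" as the number of slots in the pool that runs output jobs. With the default single slot, the job for batch t+1 cannot start until the job for batch t has finished.

```python
# Toy model (not Spark internals) of serial batch scheduling.
from concurrent.futures import ThreadPoolExecutor

def run_batches(batch_ids, concurrent_jobs=1):
    """Run one output job per batch on a pool of `concurrent_jobs` slots;
    return the order in which batches finished."""
    finished = []
    def output_job(batch_id):
        finished.append(batch_id)  # stand-in for the real output operation
    with ThreadPoolExecutor(max_workers=concurrent_jobs) as pool:
        for b in batch_ids:
            pool.submit(output_job, b)
    return finished

# With the default single job slot, batches finish in arrival order.
print(run_batches([0, 1, 2, 3]))  # -> [0, 1, 2, 3]
```

Raising `concurrent_jobs` above 1 lets jobs for later batches overlap with earlier ones, which is the analogue of why flipping the Spark setting breaks the ordering guarantee.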
Re: Are Spark Streaming RDDs always processed in order?
I had a similar inquiry, copied below. I was also looking into making an SQS Receiver reliable:

http://stackoverflow.com/questions/30809975/reliable-sqs-receiver-for-spark-streaming

Hope this helps.

-- Forwarded message --
From: Tathagata Das
Date: 20 June 2015 at 17:21
Subject: Re: Serial batching with Spark Streaming
To: Michal Čizmazia
Cc: Binh Nguyen Van, user

No, it does not. By default, only after all the retries etc. related to batch X are done will batch X+1 be started.

Yes, one RDD per batch per DStream. However, the RDD could be a union of multiple RDDs (e.g. RDDs generated by a windowed DStream, or a unioned DStream).

TD

On Fri, Jun 19, 2015 at 3:16 PM, Michal Čizmazia wrote:
Thanks Tathagata!

I will use *foreachRDD*/*foreachPartition*() instead of *transform*() then.

Does the default scheduler initiate the execution of *batch X+1* after *batch X* even if tasks for *batch X* need to be *retried due to failures*? If not, could you please suggest workarounds and point me to the code?

One more thing was not 100% clear to me from the documentation: Is there exactly *1 RDD* published *per batch interval* in a DStream?

On 3 July 2015 at 22:12, khaledh wrote:
> [original question quoted in full; trimmed here]
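The retry behaviour described in the forwarded message can be sketched in plain Python (a toy model, not Spark code; `process_all`, `flaky`, and the retry limit are made-up names for illustration): the next batch only starts once the current batch has either succeeded or exhausted its retries.

```python
def process_all(batch_ids, process, max_retries=3):
    """Run batches strictly in order; batch X is retried up to
    `max_retries` times before batch X+1 is allowed to start."""
    results = []
    for b in batch_ids:
        for attempt in range(max_retries + 1):
            try:
                results.append(process(b))
                break
            except RuntimeError:
                if attempt == max_retries:
                    results.append(None)  # batch abandoned after retries
    return results

# A flaky output operation that fails on its first attempt at batch 1:
attempts = {}
def flaky(batch_id):
    attempts[batch_id] = attempts.get(batch_id, 0) + 1
    if batch_id == 1 and attempts[batch_id] == 1:
        raise RuntimeError("simulated task failure")
    return batch_id

print(process_all([0, 1, 2], flaky))  # -> [0, 1, 2]
```

Batch 2 is not attempted until batch 1's retry has completed, so results stay in batch order even in the presence of failures.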
Re: Are Spark Streaming RDDs always processed in order?
I don't think you can expect any ordering guarantee except for the records within one partition.

On Jul 4, 2015 7:43 AM, "khaledh" wrote:
> [original question quoted in full; trimmed here]
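A plain-Python sketch of that weaker, per-partition guarantee (the partition contents and the random interleaving are invented for illustration): tasks for different partitions may make progress in any order, but each task consumes its own partition sequentially.

```python
import random

def interleave_partitions(partitions, seed=42):
    """Simulate partition tasks finishing in arbitrary order while each
    task reads its own partition sequentially."""
    rng = random.Random(seed)
    cursors = [0] * len(partitions)
    seen = []
    while any(c < len(p) for c, p in zip(cursors, partitions)):
        ready = [i for i, p in enumerate(partitions) if cursors[i] < len(p)]
        i = rng.choice(ready)  # an arbitrary partition's task makes progress
        seen.append(partitions[i][cursors[i]])
        cursors[i] += 1
    return seen

parts = [["a1", "a2", "a3"], ["b1", "b2"]]
order = interleave_partitions(parts)
# Records from different partitions may interleave, but the relative
# order within each partition is preserved:
assert [r for r in order if r in parts[0]] == parts[0]
assert [r for r in order if r in parts[1]] == parts[1]
```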
Are Spark Streaming RDDs always processed in order?
I'm writing a Spark Streaming application that uses RabbitMQ to consume events. One feature of RabbitMQ that I intend to make use of is bulk ack of messages, i.e. no need to ack one-by-one, but only ack the last event in a batch and that would ack the entire batch.

Before I commit to doing so, I'd like to know if Spark Streaming always processes RDDs in the same order they arrive in, i.e. if RDD1 arrives before RDD2, is it true that RDD2 will never be scheduled/processed before RDD1 is finished?

This is crucial to the ack logic, since if RDD2 can potentially be processed while RDD1 is still being processed, then if I ack the last event in RDD2 that would also ack all events in RDD1, even though they may not have been completely processed yet.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Are-Spark-Streaming-RDDs-always-processed-in-order-tp23616.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
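The hazard described in the last paragraph can be sketched with a toy model of RabbitMQ's bulk acknowledgement (the tag numbers and the `bulk_ack` helper are illustrative; a real consumer would use `basic_ack` with `multiple=True`):

```python
def bulk_ack(unacked, up_to_tag):
    """RabbitMQ-style ack with multiple=True: acking delivery tag N
    acknowledges every outstanding delivery with tag <= N."""
    acked = {t for t in unacked if t <= up_to_tag}
    return unacked - acked, acked

# Suppose batch 1 carries delivery tags 1-3 and batch 2 carries tags 4-6.
unacked = {1, 2, 3, 4, 5, 6}
# If batch 2 could finish first and we acked its last tag...
unacked, acked = bulk_ack(unacked, up_to_tag=6)
# ...batch 1's messages would be acknowledged too, before being processed:
print(acked)    # -> {1, 2, 3, 4, 5, 6}
print(unacked)  # -> set()
```

Which is exactly why the ack-the-last-event scheme is only safe if batch t+1 is never processed before batch t has completed.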