Re: Structured Streaming: multiple sinks
My apologies, Chris. Somehow I had not received the first email by the OP, and hence read our answers to the OP as cryptic questions. :/ I found the full thread on Nabble. I agree with your analysis of the OP's question 1.

On Fri, Aug 25, 2017 at 12:48 AM, Chris Bowden wrote:

> Tathagata, thanks for filling in context for other readers on 2a and 2b; I
> summarized too much in hindsight.
>
> Regarding the OP's first question, I was hinting that it is quite natural to
> chain processes via Kafka. If you are already interested in writing processed
> data to Kafka, why add complexity to a job by having it commit processed data
> to both Kafka and S3, versus simply moving the processed data from Kafka out
> to S3 as needed? Perhaps the OP's thread got lost in context based on how I
> responded.
>
> Streaming queries are currently bound to a single sink, so multiplexing the
> write across sinks within one streaming query isn't possible AFAIK. Arguably
> you can reuse the "processed data" DAG by starting multiple sinks against it,
> though you will effectively process the data twice on different "schedules",
> since each sink will have its own instance of StreamExecution,
> TriggerExecutor, etc. If you *really* wanted to do one pass of the data and
> process the exact same block of data per micro-batch, you could implement it
> via foreach or a custom sink which writes to both Kafka and S3, but I
> wouldn't recommend it. As stated above, it is quite natural to chain
> processes via Kafka.
>
> [remainder of quoted thread trimmed]
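The "reuse the processed-data DAG by starting multiple sinks against it" approach discussed above can be sketched roughly as follows. This is a hypothetical outline, not a tested job: broker address, topic names, S3 paths, and the transformation are placeholders. Note also that a built-in Kafka sink only shipped in Spark 2.2, so on the OP's 2.1.1 the Kafka-writing query would itself need foreach or a custom sink.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("multi-sink-sketch").getOrCreate()

// One "processed data" DataFrame, defined once.
val processed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")  // placeholder
  .option("subscribe", "events")                     // placeholder
  .load()
  .selectExpr("CAST(value AS STRING) AS value")      // ...plus the real transformations

// Sink 1: S3. Each start() creates its own StreamExecution / TriggerExecutor,
// so each query re-reads the source and recomputes the DAG on its own schedule.
val toS3 = processed.writeStream
  .format("parquet")
  .option("path", "s3a://bucket/processed/")                     // placeholder
  .option("checkpointLocation", "s3a://bucket/ckpt/s3-sink/")    // placeholder
  .start()

// Sink 2: Kafka (requires Spark 2.2+ for the built-in Kafka sink).
val toKafka = processed.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "processed-events")                           // placeholder
  .option("checkpointLocation", "s3a://bucket/ckpt/kafka-sink/") // placeholder
  .start()

spark.streams.awaitAnyTermination()
```

Each query keeps its own checkpoint location, which is what makes the two "schedules" independent, and also why the same input ends up being processed twice.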
Re: Structured Streaming: multiple sinks
Tathagata, thanks for filling in context for other readers on 2a and 2b; I summarized too much in hindsight.

Regarding the OP's first question, I was hinting that it is quite natural to chain processes via Kafka. If you are already interested in writing processed data to Kafka, why add complexity to a job by having it commit processed data to both Kafka and S3, versus simply moving the processed data from Kafka out to S3 as needed? Perhaps the OP's thread got lost in context based on how I responded.

> 1) We are consuming from kafka using structured streaming and writing the
> processed data set to s3. We also want to write the processed data to kafka
> moving forward, is it possible to do it from the same streaming query ?
> (spark version 2.1.1)

Streaming queries are currently bound to a single sink, so multiplexing the write across sinks within one streaming query isn't possible AFAIK. Arguably you can reuse the "processed data" DAG by starting multiple sinks against it, though you will effectively process the data twice on different "schedules", since each sink will have its own instance of StreamExecution, TriggerExecutor, etc. If you *really* wanted to do one pass of the data and process the exact same block of data per micro-batch, you could implement it via foreach or a custom sink which writes to both Kafka and S3, but I wouldn't recommend it. As stated above, it is quite natural to chain processes via Kafka.

On Thu, Aug 24, 2017 at 11:03 PM, Tathagata Das wrote:

> [quoted reply trimmed]
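For completeness, the single-pass alternative mentioned above (one query whose foreach writer commits to both Kafka and S3) might look something like the sketch below. This is hypothetical: the producer configuration, topic name, and the S3 upload helper are placeholders, and per the discussion above this design is not recommended.

```scala
import org.apache.spark.sql.{ForeachWriter, Row}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// One writer instance is opened per partition per trigger.
class KafkaAndS3Writer extends ForeachWriter[Row] {
  private var producer: KafkaProducer[String, String] = _
  private val buffered = scala.collection.mutable.ArrayBuffer.empty[String]

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new java.util.Properties()
    props.put("bootstrap.servers", "broker:9092")  // placeholder
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true  // process this partition
  }

  override def process(row: Row): Unit = {
    val value = row.getString(0)
    producer.send(new ProducerRecord("processed-events", value))  // placeholder topic
    buffered += value  // accumulate rows for one S3 object per partition
  }

  override def close(errorOrNull: Throwable): Unit = {
    // if (errorOrNull == null) uploadPartitionToS3(buffered)  // hypothetical helper
    if (producer != null) producer.close()
  }
}

// processed.writeStream.foreach(new KafkaAndS3Writer).start()
```

Note there are no cross-system transaction guarantees here: on retry of a micro-batch, the Kafka and S3 writes can diverge, which is one reason the thread steers toward chaining through Kafka instead.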
Re: Structured Streaming: multiple sinks
Responses inline.

On Thu, Aug 24, 2017 at 7:16 PM, cbowden wrote:

> 1. would it not be more natural to write processed to kafka and sink
> processed from kafka to s3?

I am sorry, I don't fully understand this question. Could you please elaborate further, as in, what is more natural than what?

> 2a. addBatch is the time Sink#addBatch took as measured by StreamExecution.

Yes. This essentially includes the time taken to compute the output and finish writing the output to the sink. (To give some context for other readers: this refers to the different time durations reported through StreamingQuery.lastProgress.)

> 2b. getBatch is the time Source#getBatch took as measured by
> StreamExecution.

Yes, it is the time taken by the source to prepare the DataFrame that has the new data to be processed in the trigger. Usually this is low, but it's not guaranteed to be, as some sources may require complicated tracking and bookkeeping to prepare the DataFrame.

> 3. triggerExecution is effectively end-to-end processing time for the
> micro-batch, note all other durations sum closely to triggerExecution, there
> is a little slippage based on book-keeping activities in StreamExecution.

Yes. Precisely.
Re: Structured Streaming: multiple sinks
1. Would it not be more natural to write processed data to kafka, and sink processed data from kafka to s3?

2a. addBatch is the time Sink#addBatch took, as measured by StreamExecution.

2b. getBatch is the time Source#getBatch took, as measured by StreamExecution.

3. triggerExecution is effectively end-to-end processing time for the micro-batch. Note that all other durations sum closely to triggerExecution; there is a little slippage based on book-keeping activities in StreamExecution.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-multiple-sinks-tp29056p29105.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Structured Streaming: multiple sinks
1) We are consuming from kafka using structured streaming and writing the processed data set to s3. We also want to write the processed data to kafka moving forward; is it possible to do it from the same streaming query? (spark version 2.1.1)

2) In the logs, I see the streaming query progress output, and I have a sample duration JSON from the log. Can someone please provide more clarity on what the difference is between *addBatch* and *getBatch*?

3) triggerExecution: is it the time taken to both process the fetched data and write to the sink?

"durationMs" : {
  "addBatch" : 2263426,
  "getBatch" : 12,
  "getOffset" : 273,
  "queryPlanning" : 13,
  "triggerExecution" : 2264288,
  "walCommit" : 552
},

regards
aravias

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Structured-Streaming-multiple-sinks-tp29056.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
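Putting the answers in this thread together with the OP's numbers: the per-phase durations (getOffset, getBatch, queryPlanning, addBatch, walCommit) sum to just under triggerExecution, with the small remainder being StreamExecution bookkeeping, and addBatch (compute plus write to the sink) dominates this particular trigger. A small plain-Scala check using the values from the JSON above:

```scala
// Per-phase durations (ms) copied from the progress JSON above.
val durations = Map(
  "addBatch"      -> 2263426L,
  "getBatch"      -> 12L,
  "getOffset"     -> 273L,
  "queryPlanning" -> 13L,
  "walCommit"     -> 552L
)
val triggerExecution = 2264288L

// The phases sum closely to triggerExecution; the difference is
// the bookkeeping "slippage" mentioned in the thread.
val phaseSum = durations.values.sum           // 2264276 ms
val slippage = triggerExecution - phaseSum    // 12 ms

// addBatch (computing the output and writing it to the sink) dominates.
val addBatchShare = durations("addBatch").toDouble / triggerExecution
println(f"phases sum to $phaseSum ms, slippage $slippage ms, addBatch is ${addBatchShare * 100}%.2f%% of the trigger")
```

So for this query, nearly all of the ~38-minute trigger is spent in addBatch, i.e. computing and writing the output to s3, which is consistent with the definitions given earlier in the thread.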