Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner
Thanks Lukasz. It was really helpful to understand.

Regards,
Chandan

On Thu, May 17, 2018 at 8:26 PM, Lukasz Cwik wrote:
> On Wed, May 16, 2018 at 10:46 PM chandan prakash <chandanbaran...@gmail.com> wrote:
>> Thanks Ismaël. Your answers were quite useful for a novice user.
>> I guess this answer will help many like me.
>>
>> *Regarding your answer to point 2:*
>> *"Checkpointing is supported, Kafka offset management (if I understand
>> what you mean) is managed by the KafkaIO connector + the runner"*
>>
>> Beam provides IO connectors like KafkaIO, etc.
>>
>> 1. Does it mean that Beam code deals with fetching data from the source
>>    and writing to the sink (after processing is done by the chosen runner)?
>> 2. If the answer to 1 is YES, then Beam is responsible for data transfer
>>    between source/sink and the processing machines. In that case, the
>>    underlying runner like Spark is only responsible for processing from
>>    one PCollection to another PCollection, not for data transfer from/to
>>    the source/sink.
>> 3. If the answer to 1 is NO, then Beam only provides an abstraction for
>>    IO as well, but internally both data transfer and processing are done
>>    by the runner like Spark?
>>
>> Can you please tell me which assumption is right, point 2 or point 3?
>> Feel free to add anything else which I might have missed.
>
> Yes, Apache Beam code is responsible for reading from sources and writing
> to sinks (typically sinks are a set of PTransforms). Sources are slightly
> more complicated because a runner interacts with a user-written source
> through interfaces like BoundedSource and UnboundedSource to support
> progress reporting, splitting, checkpointing, etc. This means that all
> runners (if they abide by the source contracts) support all user-written
> sources. The Apache Beam sources are "user"-written sources that implement
> either the BoundedSource or UnboundedSource interface.
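To make the source contract above concrete: the interface below is a simplified, hypothetical stand-in for Beam's UnboundedSource.UnboundedReader (the method names advance/getCurrent/getCheckpointMark mirror the real reader, but everything else here is a toy, not the Beam API). The point is that the *runner* owns the read loop, while the *user's connector* only implements the reader:

```java
import java.util.ArrayList;
import java.util.List;

public class RunnerReadLoop {

    // Simplified stand-in for what a user-written unbounded reader exposes.
    public interface Reader<T> {
        boolean advance();          // try to fetch the next record
        T getCurrent();             // record fetched by the last advance()
        Object getCheckpointMark(); // state the runner persists for recovery
    }

    // A toy reader over a fixed list, standing in for e.g. a Kafka reader.
    public static class ListReader implements Reader<String> {
        private final List<String> data;
        private int pos = -1;

        public ListReader(List<String> data) { this.data = data; }

        public boolean advance() { return ++pos < data.size(); }
        public String getCurrent() { return data.get(pos); }
        public Object getCheckpointMark() { return pos; } // the "offset"
    }

    // The part the RUNNER owns: drive the reader, hand each record to the
    // downstream PTransforms, and grab checkpoint marks along the way.
    public static List<String> runnerDrive(Reader<String> reader) {
        List<String> consumed = new ArrayList<>();
        while (reader.advance()) {
            consumed.add(reader.getCurrent());
            Object mark = reader.getCheckpointMark(); // would be persisted
        }
        return consumed;
    }

    public static void main(String[] args) {
        System.out.println(runnerDrive(new ListReader(List.of("a", "b", "c"))));
    }
}
```

Because the runner only talks to the reader through this narrow interface, any runner that honors the contract can execute any user-written source.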
>> Also, it will be very useful if, in the following sample MinimalWordCount
>> program, it can be clearly pointed out which part is executed by Beam and
>> which part is executed by the runner:
>>
>> PipelineOptions options = PipelineOptionsFactory.create();
>>
>> Pipeline p = Pipeline.create(options);
>>
>> p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))
>>
>>     .apply(FlatMapElements
>>         .into(TypeDescriptors.strings())
>>         .via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))
>>
>>     .apply(Filter.by((String word) -> !word.isEmpty()))
>>
>>     .apply(Count.perElement())
>>
>>     .apply(MapElements
>>         .into(TypeDescriptors.strings())
>>         .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " +
>>             wordCount.getValue()))
>>
>>     .apply(TextIO.write().to("chandan"));
>>
>> p.run().waitUntilFinish();
>
> As discussed above, runners are required to fulfill the BoundedSource
> contract to power TextIO. Runners are also responsible for making sure
> elements produced by a PTransform are able to be consumed by the next
> PTransform (e.g. the words that are output by the filter function are
> able to be consumed by the count). And finally, runners are responsible
> for implementing GroupByKey (the Count.perElement transform is composed
> of a GroupByKey followed by a combiner). Runners are also responsible for
> the lifecycle of the job (e.g. distributing your application to a cluster
> of machines and managing execution across that cluster).
>
>> Thanks & Regards,
>> Chandan
>>
>> On Wed, May 16, 2018 at 8:40 PM, Ismaël Mejía wrote:
>>> Hello,
>>>
>>> Answers to the questions inline:
>>>
>>> > 1. Are there any limitations in terms of implementations,
>>> > functionalities or performance if we want to run streaming on Beam
>>> > with the Spark runner vs streaming on Spark-Streaming directly?
>>>
>>> At this moment the Spark runner does not support some parts of the Beam
>>> model in streaming mode, e.g. side inputs and the state/timer API.
>>> Comparing this with pure Spark Streaming is not easy given the semantic
>>> differences of Beam.
>>>
>>> > 2. Spark features like checkpointing, Kafka offset management, how
>>> > are they supported in Apache Beam? Do we need to do some extra work
>>> > for them?
>>>
>>> Checkpointing is supported. Kafka offset management (if I understand
>>> what you mean) is managed by the KafkaIO connector + the runner, so
>>> this should be ok.
>>>
>>> > 3. With Spark 2.x Structured Streaming, if we want to switch across
>>> > different modes like from micro-batching to continuous streaming
>>> > mode, how can it be done while using Beam?
>>>
>>> To do this the Spark runner needs to translate the Beam pipeline using
>>> the Structured Streaming API, which is not the case today. It uses the
>>> RDD-based API, but we expect to tackle this in the not-so-far future.
>>> However, even if we did, Spark continuous mode is quite limited at this
>>> moment in time because it does not support aggregation functions.
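Lukasz's point that Count.perElement is a GroupByKey followed by a combiner can be sketched with plain Java collections standing in for PCollections (this shows the logical model only; the method names here are illustrative, not the Beam API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountPerElement {

    // "GroupByKey" step: gather all occurrences of each element.
    // This is the part the RUNNER must implement (shuffle/grouping).
    public static Map<String, List<Integer>> groupByKey(List<String> words) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String w : words) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // "Combine" step: reduce each group's values to a single count.
    public static Map<String, Long> combine(Map<String, List<Integer>> grouped) {
        Map<String, Long> counts = new HashMap<>();
        grouped.forEach((word, ones) -> counts.put(word, (long) ones.size()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = List.of("to", "be", "or", "not", "to", "be");
        // Count.perElement == GroupByKey then Combine:
        Map<String, Long> counts = combine(groupByKey(words));
        System.out.println(counts.get("to")); // 2
    }
}
```

In the real pipeline, the split/filter lambdas are user code Beam hands to the runner to invoke element by element, while the grouping between them is entirely the runner's job.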
Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner
On Wed, May 16, 2018 at 10:46 PM chandan prakash wrote:
> Thanks Ismaël. Your answers were quite useful for a novice user.
> I guess this answer will help many like me.
>
> *Regarding your answer to point 2:*
> *"Checkpointing is supported, Kafka offset management (if I understand
> what you mean) is managed by the KafkaIO connector + the runner"*
>
> Beam provides IO connectors like KafkaIO, etc.
>
> 1. Does it mean that Beam code deals with fetching data from the source
>    and writing to the sink (after processing is done by the chosen runner)?
> 2. If the answer to 1 is YES, then Beam is responsible for data transfer
>    between source/sink and the processing machines. In that case, the
>    underlying runner like Spark is only responsible for processing from
>    one PCollection to another PCollection, not for data transfer from/to
>    the source/sink.
> 3. If the answer to 1 is NO, then Beam only provides an abstraction for
>    IO as well, but internally both data transfer and processing are done
>    by the runner like Spark?
>
> Can you please tell me which assumption is right, point 2 or point 3?
> Feel free to add anything else which I might have missed.

Yes, Apache Beam code is responsible for reading from sources and writing
to sinks (typically sinks are a set of PTransforms). Sources are slightly
more complicated because a runner interacts with a user-written source
through interfaces like BoundedSource and UnboundedSource to support
progress reporting, splitting, checkpointing, etc. This means that all
runners (if they abide by the source contracts) support all user-written
sources. The Apache Beam sources are "user"-written sources that implement
either the BoundedSource or UnboundedSource interface.
> Also, it will be very useful if, in the following sample MinimalWordCount
> program, it can be clearly pointed out which part is executed by Beam and
> which part is executed by the runner:
>
> PipelineOptions options = PipelineOptionsFactory.create();
>
> Pipeline p = Pipeline.create(options);
>
> p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))
>
>     .apply(FlatMapElements
>         .into(TypeDescriptors.strings())
>         .via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))
>
>     .apply(Filter.by((String word) -> !word.isEmpty()))
>
>     .apply(Count.perElement())
>
>     .apply(MapElements
>         .into(TypeDescriptors.strings())
>         .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " +
>             wordCount.getValue()))
>
>     .apply(TextIO.write().to("chandan"));
>
> p.run().waitUntilFinish();

As discussed above, runners are required to fulfill the BoundedSource
contract to power TextIO. Runners are also responsible for making sure
elements produced by a PTransform are able to be consumed by the next
PTransform (e.g. the words that are output by the filter function are able
to be consumed by the count). And finally, runners are responsible for
implementing GroupByKey (the Count.perElement transform is composed of a
GroupByKey followed by a combiner). Runners are also responsible for the
lifecycle of the job (e.g. distributing your application to a cluster of
machines and managing execution across that cluster).

> Thanks & Regards,
> Chandan
>
> On Wed, May 16, 2018 at 8:40 PM, Ismaël Mejía wrote:
>> Hello,
>>
>> Answers to the questions inline:
>>
>> > 1. Are there any limitations in terms of implementations,
>> > functionalities or performance if we want to run streaming on Beam
>> > with the Spark runner vs streaming on Spark-Streaming directly?
>>
>> At this moment the Spark runner does not support some parts of the Beam
>> model in streaming mode, e.g. side inputs and the state/timer API.
>> Comparing this with pure Spark Streaming is not easy given the semantic
>> differences of Beam.
>>
>> > 2. Spark features like checkpointing, Kafka offset management, how are
>> > they supported in Apache Beam? Do we need to do some extra work for
>> > them?
>>
>> Checkpointing is supported. Kafka offset management (if I understand
>> what you mean) is managed by the KafkaIO connector + the runner, so this
>> should be ok.
>>
>> > 3. With Spark 2.x Structured Streaming, if we want to switch across
>> > different modes like from micro-batching to continuous streaming mode,
>> > how can it be done while using Beam?
>>
>> To do this the Spark runner needs to translate the Beam pipeline using
>> the Structured Streaming API, which is not the case today. It uses the
>> RDD-based API, but we expect to tackle this in the not-so-far future.
>> However, even if we did, Spark continuous mode is quite limited at this
>> moment in time because it does not support aggregation functions.
>>
>> https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing
>>
>> Don't hesitate to give Beam and the Spark runner a try and reach out to
>> us if you have questions or find any issues.
>>
>> Regards
Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner
Thanks Ismaël. Your answers were quite useful for a novice user.
I guess this answer will help many like me.

*Regarding your answer to point 2:*
*"Checkpointing is supported, Kafka offset management (if I understand
what you mean) is managed by the KafkaIO connector + the runner"*

Beam provides IO connectors like KafkaIO, etc.

1. Does it mean that Beam code deals with fetching data from the source
   and writing to the sink (after processing is done by the chosen runner)?
2. If the answer to 1 is YES, then Beam is responsible for data transfer
   between source/sink and the processing machines. In that case, the
   underlying runner like Spark is only responsible for processing from
   one PCollection to another PCollection, not for data transfer from/to
   the source/sink.
3. If the answer to 1 is NO, then Beam only provides an abstraction for IO
   as well, but internally both data transfer and processing are done by
   the runner like Spark?

Can you please tell me which assumption is right, point 2 or point 3?
Feel free to add anything else which I might have missed.

Also, it will be very useful if, in the following sample MinimalWordCount
program, it can be clearly pointed out which part is executed by Beam and
which part is executed by the runner:

PipelineOptions options = PipelineOptionsFactory.create();

Pipeline p = Pipeline.create(options);

p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))

    .apply(FlatMapElements
        .into(TypeDescriptors.strings())
        .via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))

    .apply(Filter.by((String word) -> !word.isEmpty()))

    .apply(Count.perElement())

    .apply(MapElements
        .into(TypeDescriptors.strings())
        .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": " +
            wordCount.getValue()))

    .apply(TextIO.write().to("chandan"));

p.run().waitUntilFinish();

Thanks & Regards,
Chandan

On Wed, May 16, 2018 at 8:40 PM, Ismaël Mejía wrote:
> Hello,
>
> Answers to the questions inline:
>
> > 1. Are there any limitations in terms of implementations,
> > functionalities or performance if we want to run streaming on Beam with
> > the Spark runner vs streaming on Spark-Streaming directly?
>
> At this moment the Spark runner does not support some parts of the Beam
> model in streaming mode, e.g. side inputs and the state/timer API.
> Comparing this with pure Spark Streaming is not easy given the semantic
> differences of Beam.
>
> > 2. Spark features like checkpointing, Kafka offset management, how are
> > they supported in Apache Beam? Do we need to do some extra work for
> > them?
>
> Checkpointing is supported. Kafka offset management (if I understand what
> you mean) is managed by the KafkaIO connector + the runner, so this
> should be ok.
>
> > 3. With Spark 2.x Structured Streaming, if we want to switch across
> > different modes like from micro-batching to continuous streaming mode,
> > how can it be done while using Beam?
>
> To do this the Spark runner needs to translate the Beam pipeline using
> the Structured Streaming API, which is not the case today. It uses the
> RDD-based API, but we expect to tackle this in the not-so-far future.
> However, even if we did, Spark continuous mode is quite limited at this
> moment in time because it does not support aggregation functions.
>
> https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing
>
> Don't hesitate to give Beam and the Spark runner a try and reach out to
> us if you have questions or find any issues.
>
> Regards,
> Ismaël
>
> On Tue, May 15, 2018 at 2:22 PM chandan prakash wrote:
>> Also,
>> 3. With Spark 2.x Structured Streaming, if we want to switch across
>> different modes like from micro-batching to continuous streaming mode,
>> how can it be done while using Beam?
>>
>> These are some of the initial questions which I am not able to
>> understand currently.
>>
>> Regards,
>> Chandan
>>
>> On Tue, May 15, 2018 at 5:45 PM, chandan prakash
>> <chandanbaran...@gmail.com> wrote:
>>> Hi Everyone,
>>> I have just started exploring and understanding Apache Beam for a new
>>> project in my firm.
>>> In particular, we have to decide whether to implement our product over
>>> Spark Streaming (as Spark batch is already in our ecosystem) or use
>>> Beam over the Spark runner to have the future liberty of changing the
>>> underlying runner.
>>>
>>> A couple of questions, after going through the Beam docs and examples:
>>>
>>> Are there any limitations in terms of implementations, functionalities
>>> or performance if we want to run streaming on Beam with the Spark
>>> runner vs streaming on Spark-Streaming directly?
>>>
>>> Spark features like checkpointing, Kafka offset management, how are
>>> they supported in Apache Beam? Do we need to do some extra work for
>>> them?
>>>
>>> Any answer or link to a similar discussion will be really appreciated.
>>> Thanks in advance.
>>>
>>> Regards,
>>> --
>>> Chandan Prakash
>>
>> --
>> Chandan Prakash
Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner
Hello,

Answers to the questions inline:

> 1. Are there any limitations in terms of implementations, functionalities
> or performance if we want to run streaming on Beam with the Spark runner
> vs streaming on Spark-Streaming directly?

At this moment the Spark runner does not support some parts of the Beam
model in streaming mode, e.g. side inputs and the state/timer API.
Comparing this with pure Spark Streaming is not easy given the semantic
differences of Beam.

> 2. Spark features like checkpointing, Kafka offset management, how are
> they supported in Apache Beam? Do we need to do some extra work for them?

Checkpointing is supported. Kafka offset management (if I understand what
you mean) is managed by the KafkaIO connector + the runner, so this should
be ok.

> 3. With Spark 2.x Structured Streaming, if we want to switch across
> different modes like from micro-batching to continuous streaming mode,
> how can it be done while using Beam?

To do this the Spark runner needs to translate the Beam pipeline using the
Structured Streaming API, which is not the case today. It uses the
RDD-based API, but we expect to tackle this in the not-so-far future.
However, even if we did, Spark continuous mode is quite limited at this
moment in time because it does not support aggregation functions.

https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing

Don't hesitate to give Beam and the Spark runner a try and reach out to us
if you have questions or find any issues.

Regards,
Ismaël

On Tue, May 15, 2018 at 2:22 PM chandan prakash wrote:
> Also,
> 3. With Spark 2.x Structured Streaming, if we want to switch across
> different modes like from micro-batching to continuous streaming mode,
> how can it be done while using Beam?
>
> These are some of the initial questions which I am not able to understand
> currently.
>
> Regards,
> Chandan
>
> On Tue, May 15, 2018 at 5:45 PM, chandan prakash
> <chandanbaran...@gmail.com> wrote:
>> Hi Everyone,
>> I have just started exploring and understanding Apache Beam for a new
>> project in my firm.
>> In particular, we have to decide whether to implement our product over
>> Spark Streaming (as Spark batch is already in our ecosystem) or use Beam
>> over the Spark runner to have the future liberty of changing the
>> underlying runner.
>>
>> A couple of questions, after going through the Beam docs and examples:
>>
>> Are there any limitations in terms of implementations, functionalities
>> or performance if we want to run streaming on Beam with the Spark runner
>> vs streaming on Spark-Streaming directly?
>>
>> Spark features like checkpointing, Kafka offset management, how are they
>> supported in Apache Beam? Do we need to do some extra work for them?
>>
>> Any answer or link to a similar discussion will be really appreciated.
>> Thanks in advance.
>>
>> Regards,
>> --
>> Chandan Prakash
>
> --
> Chandan Prakash
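To picture Ismaël's checkpointing answer: the connector reports a checkpoint mark (for KafkaIO, essentially the consumed offsets), the runner persists it, and on restart the reader resumes from the stored position. The sketch below is a toy model of that idea, not KafkaIO's or the Spark runner's actual code; every name in it is illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetCheckpointSketch {

    // Read from a partition-like log starting at an offset, recording a
    // checkpoint after each consumed record (savedOffset stands in for
    // durable checkpoint storage managed by the runner).
    public static List<String> readFrom(List<String> log, int startOffset,
                                        int[] savedOffset) {
        List<String> out = new ArrayList<>();
        for (int i = startOffset; i < log.size(); i++) {
            out.add(log.get(i));
            savedOffset[0] = i + 1; // "checkpoint" the next offset to read
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> log = List.of("m0", "m1", "m2", "m3");
        int[] checkpoint = {0};

        // First run consumes the first two records, then "crashes".
        readFrom(log.subList(0, 2), checkpoint[0], checkpoint);

        // The restarted run resumes from the persisted offset, so no record
        // is reprocessed.
        List<String> resumed = readFrom(log, checkpoint[0], checkpoint);
        System.out.println(resumed); // [m2, m3]
    }
}
```

This is why no extra application code is needed for offset management: the connector defines what a checkpoint mark is, and the runner decides when to persist and restore it.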
Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner
Also,
3. With Spark 2.x Structured Streaming, if we want to switch across
different modes like from micro-batching to continuous streaming mode, how
can it be done while using Beam?

These are some of the initial questions which I am not able to understand
currently.

Regards,
Chandan

On Tue, May 15, 2018 at 5:45 PM, chandan prakash wrote:
> Hi Everyone,
> I have just started exploring and understanding Apache Beam for a new
> project in my firm.
> In particular, we have to decide whether to implement our product over
> Spark Streaming (as Spark batch is already in our ecosystem) or use Beam
> over the Spark runner to have the future liberty of changing the
> underlying runner.
>
> A couple of questions, after going through the Beam docs and examples:
>
> 1. Are there any limitations in terms of implementations, functionalities
>    or performance if we want to run streaming on Beam with the Spark
>    runner vs streaming on Spark-Streaming directly?
>
> 2. Spark features like checkpointing, Kafka offset management, how are
>    they supported in Apache Beam? Do we need to do some extra work for
>    them?
>
> Any answer or link to a similar discussion will be really appreciated.
> Thanks in advance.
>
> Regards,
> --
> Chandan Prakash

--
Chandan Prakash
Normal Spark Streaming vs Streaming on Beam with Spark Runner
Hi Everyone,
I have just started exploring and understanding Apache Beam for a new
project in my firm.
In particular, we have to decide whether to implement our product over
Spark Streaming (as Spark batch is already in our ecosystem) or use Beam
over the Spark runner to have the future liberty of changing the
underlying runner.

A couple of questions, after going through the Beam docs and examples:

1. Are there any limitations in terms of implementations, functionalities
   or performance if we want to run streaming on Beam with the Spark
   runner vs streaming on Spark-Streaming directly?

2. Spark features like checkpointing, Kafka offset management, how are
   they supported in Apache Beam? Do we need to do some extra work for
   them?

Any answer or link to a similar discussion will be really appreciated.
Thanks in advance.

Regards,
--
Chandan Prakash