Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner

2018-05-18 Thread chandan prakash
Thanks, Lukasz. That was really helpful for my understanding.

Regards,
Chandan

On Thu, May 17, 2018 at 8:26 PM, Lukasz Cwik  wrote:

>
>
> On Wed, May 16, 2018 at 10:46 PM chandan prakash <
> chandanbaran...@gmail.com> wrote:
>
>> Thanks Ismaël.
>> Your answers were quite useful for a novice user.
>> I guess this answer will help many like me.
>>
>> *Regarding your answer to point 2 :*
>>
>>
>> *"Checkpointing is supported, Kafka offset management (if I understand
>> what you mean) is managed by the KafkaIO connector + the runner"*
>>
>> Beam provides IO connectors like KafkaIO, etc.
>>
>>1. Does it mean that Beam code deals with fetching data from the source
>>and writing to the sink (after processing is done by the chosen runner)?
>>2. If the answer to 1 is YES, then Beam is responsible for data
>>transfer between the source/sink and the processing machines. In that case,
>>the underlying runner like Spark is only responsible for processing from one
>>PCollection to another PCollection, not for data transfer from/to the
>>source/sink.
>>3. If the answer to 1 is NO, then Beam only provides an abstraction
>>for IO as well, but internally both data transfer and processing are done by
>>the runner like Spark?
>>
>> Can you please tell me which assumption is right, point 2 or point 3?
>> Feel free to add anything else which I might have missed.
>>
>> Yes, Apache Beam code is responsible for reading from sources and writing
> to sinks (typically sinks are a set of PTransforms). Sources are slightly
> more complicated because a runner interacts with a user written source
> through interfaces like BoundedSource and UnboundedSource to support
> progress reporting, splitting, checkpointing, etc. This means that all
> runners (if they abide by the source contracts) support all user written
> sources. The Apache Beam sources are "user" written sources that implement
> either the BoundedSource or UnboundedSource interface.
>
>
>> Also,
>> It will be very useful if in the following sample MinimalWordCount
>> program, it can be clearly pointed out which part is executed by Beam and
>> which part is executed by runner :
>>
>> PipelineOptions options = PipelineOptionsFactory.create();
>>
>> Pipeline p = Pipeline.create(options);
>>
>> p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))
>>
>> .apply(FlatMapElements
>> .into(TypeDescriptors.strings())
>> .via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))
>>
>> .apply(Filter.by((String word) -> !word.isEmpty()))
>>
>> .apply(Count.perElement())
>>
>> .apply(MapElements
>> .into(TypeDescriptors.strings())
>> .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": "
>> + wordCount.getValue()))
>>
>> .apply(TextIO.write().to("chandan"));
>>
>> p.run().waitUntilFinish();
>>
>>
>> As discussed above, runners are required to fulfill the BoundedSource
> contract to power TextIO. Runners are also responsible for making sure
> elements produced by a PTransform are able to be consumed by the next
> PTransform (e.g. words that are sent to the filter function that are output
> by the filter function are able to be consumed by the count). And finally,
> runners are responsible for implementing GroupByKey (the Count.perElement
> transform is composed of a GroupByKey followed by a combiner). Runners are
> also responsible for the lifecycle of the job, (e.g. distributing your
> application to a cluster of machines and managing execution across that
> cluster).
>
>
>>
>> Thanks & Regards,
>> Chandan
>>
>> On Wed, May 16, 2018 at 8:40 PM, Ismaël Mejía  wrote:
>>
>>> Hello,
>>>
>>> Answers to the questions inline:
>>>
>>> > 1. Are there any limitations in terms of implementations,
>>> functionalities
>>> or performance if we want to run streaming on Beam with Spark runner vs
>>> streaming on Spark-Streaming directly ?
>>>
>>> At this moment the Spark runner does not support some parts of the Beam
>>> model in
>>> streaming mode, e.g. side inputs and state/timer API. Comparing this with
>>> pure
>>> spark streaming is not easy given the semantic differences of Beam.
>>>
>>> > 2. Spark features like checkpointing, kafka offset management, how are
>>> they supported in Apache Beam? Do we need to do some extra work for them?
>>>
>>> Checkpointing is supported, Kafka offset management (if I understand what
>>> you
>>> mean) is managed by the KafkaIO connector + the runner, so this should be
>>> ok.
>>>
>>> > 3. with spark 2.x structured streaming , if we want to switch across
>>> different modes like from micro-batching to continuous streaming mode,
>>> how
>>> it can be done while using Beam?
>>>
>>> To do this the Spark runner would need to translate the Beam Pipeline
>>> using the Structured Streaming API, which is not the case today. It uses
>>> the RDD-based API, but we expect to tackle this in the not-so-far future.
>>> However, even if we did, Spark's continuous mode is quite limited at this
>>> moment in time because it does not support aggregation functions.

Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner

2018-05-17 Thread Lukasz Cwik
On Wed, May 16, 2018 at 10:46 PM chandan prakash 
wrote:

> Thanks Ismaël.
> Your answers were quite useful for a novice user.
> I guess this answer will help many like me.
>
> *Regarding your answer to point 2 :*
>
>
> *"Checkpointing is supported, Kafka offset management (if I understand
> what you mean) is managed by the KafkaIO connector + the runner"*
>
> Beam provides IO connectors like KafkaIO, etc.
>
>1. Does it mean that Beam code deals with fetching data from the source
>and writing to the sink (after processing is done by the chosen runner)?
>2. If the answer to 1 is YES, then Beam is responsible for data
>transfer between the source/sink and the processing machines. In that case,
>the underlying runner like Spark is only responsible for processing from one
>PCollection to another PCollection, not for data transfer from/to the
>source/sink.
>3. If the answer to 1 is NO, then Beam only provides an abstraction
>for IO as well, but internally both data transfer and processing are done by
>the runner like Spark?
>
> Can you please tell me which assumption is right, point 2 or point 3?
> Feel free to add anything else which I might have missed.
>
> Yes, Apache Beam code is responsible for reading from sources and writing
to sinks (typically sinks are a set of PTransforms). Sources are slightly
more complicated because a runner interacts with a user written source
through interfaces like BoundedSource and UnboundedSource to support
progress reporting, splitting, checkpointing, etc. This means that all
runners (if they abide by the source contracts) support all user written
sources. The Apache Beam sources are "user" written sources that implement
either the BoundedSource or UnboundedSource interface.
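To make that contract concrete, here is a deliberately simplified, hypothetical sketch in plain Java. It is not the real Beam BoundedSource API (the actual class has a richer surface, including size estimation and coders); it only shows the shape of the interaction: the runner asks the source to split itself, then reads each split, possibly on different workers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for a BoundedSource-style contract.
interface ToyBoundedSource<T> {
    // The runner calls this to parallelize the work across machines.
    List<ToyBoundedSource<T>> split(int desiredSplits);

    // The runner calls this on each split independently.
    List<T> readAll();
}

class ListSource implements ToyBoundedSource<String> {
    private final List<String> data;

    ListSource(List<String> data) { this.data = data; }

    @Override
    public List<ToyBoundedSource<String>> split(int desiredSplits) {
        List<ToyBoundedSource<String>> splits = new ArrayList<>();
        // Ceiling division so every element lands in some split.
        int chunk = Math.max(1, (data.size() + desiredSplits - 1) / desiredSplits);
        for (int i = 0; i < data.size(); i += chunk) {
            splits.add(new ListSource(data.subList(i, Math.min(i + chunk, data.size()))));
        }
        return splits;
    }

    @Override
    public List<String> readAll() { return data; }
}

public class SourceContractSketch {
    public static void main(String[] args) {
        ToyBoundedSource<String> source =
            new ListSource(Arrays.asList("to", "be", "or", "not", "to", "be"));
        // The "runner" splits the source and reads each split independently.
        List<String> collected = new ArrayList<>();
        for (ToyBoundedSource<String> split : source.split(3)) {
            collected.addAll(split.readAll());
        }
        System.out.println(collected);
    }
}
```

Because the runner only ever talks to the interface, any source implementing the contract works on any compliant runner, which is the property described above.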


> Also,
> It will be very useful if in the following sample MinimalWordCount
> program, it can be clearly pointed out which part is executed by Beam and
> which part is executed by runner :
>
> PipelineOptions options = PipelineOptionsFactory.create();
>
> Pipeline p = Pipeline.create(options);
>
> p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))
>
> .apply(FlatMapElements
> .into(TypeDescriptors.strings())
> .via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))
>
> .apply(Filter.by((String word) -> !word.isEmpty()))
>
> .apply(Count.perElement())
>
> .apply(MapElements
> .into(TypeDescriptors.strings())
> .via((KV<String, Long> wordCount) -> wordCount.getKey() + ": "
> + wordCount.getValue()))
>
> .apply(TextIO.write().to("chandan"));
>
> p.run().waitUntilFinish();
>
>
> As discussed above, runners are required to fulfill the BoundedSource
contract to power TextIO. Runners are also responsible for making sure
elements produced by a PTransform are able to be consumed by the next
PTransform (e.g. words that are sent to the filter function that are output
by the filter function are able to be consumed by the count). And finally,
runners are responsible for implementing GroupByKey (the Count.perElement
transform is composed of a GroupByKey followed by a combiner). Runners are
also responsible for the lifecycle of the job, (e.g. distributing your
application to a cluster of machines and managing execution across that
cluster).
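As a rough illustration of what the runner has to compute for Count.perElement, here is a single-machine sketch in plain Java streams: the grouping collector plays the role of GroupByKey and the counting collector plays the role of the combiner. A real runner implements this with a distributed shuffle, not an in-memory map.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Conceptual model only: the semantics Count.perElement asks a runner to
// provide, i.e. GroupByKey followed by a combiner that collapses each
// group to its size.
public class CountPerElementSketch {
    static Map<String, Long> countPerElement(List<String> words) {
        return words.stream()
            // groupingBy ~ GroupByKey: co-locate all occurrences of a word.
            // counting ~ combiner: reduce each group to a single count.
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be");
        System.out.println(countPerElement(words)); // e.g. {not=1, be=2, or=1, to=2}
    }
}
```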


>
> Thanks & Regards,
> Chandan
>
> On Wed, May 16, 2018 at 8:40 PM, Ismaël Mejía  wrote:
>
>> Hello,
>>
>> Answers to the questions inline:
>>
>> > 1. Are there any limitations in terms of implementations,
>> functionalities
>> or performance if we want to run streaming on Beam with Spark runner vs
>> streaming on Spark-Streaming directly ?
>>
>> At this moment the Spark runner does not support some parts of the Beam
>> model in
>> streaming mode, e.g. side inputs and state/timer API. Comparing this with
>> pure
>> spark streaming is not easy given the semantic differences of Beam.
>>
>> > 2. Spark features like checkpointing, kafka offset management, how are
>> they supported in Apache Beam? Do we need to do some extra work for them?
>>
>> Checkpointing is supported, Kafka offset management (if I understand what
>> you
>> mean) is managed by the KafkaIO connector + the runner, so this should be
>> ok.
>>
>> > 3. with spark 2.x structured streaming , if we want to switch across
>> different modes like from micro-batching to continuous streaming mode, how
>> it can be done while using Beam?
>>
>> To do this the Spark runner would need to translate the Beam Pipeline
>> using the Structured Streaming API, which is not the case today. It uses
>> the RDD-based API, but we expect to tackle this in the not-so-far future.
>> However, even if we did, Spark's continuous mode is quite limited at this
>> moment in time because it does not support aggregation functions.
>>
>>
>> https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing
>>
>> Don't hesitate to give Beam and the Spark runner a try, and reach out to
>> us if you have questions or find any issues.
>>
>> Regards

Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner

2018-05-16 Thread chandan prakash
Thanks Ismaël.
Your answers were quite useful for a novice user.
I guess this answer will help many like me.

*Regarding your answer to point 2 :*


*"Checkpointing is supported, Kafka offset management (if I understand
what you mean) is managed by the KafkaIO connector + the runner"*

Beam provides IO connectors like KafkaIO, etc.

   1. Does it mean that Beam code deals with fetching data from the source
   and writing to the sink (after processing is done by the chosen runner)?
   2. If the answer to 1 is YES, then Beam is responsible for data
   transfer between the source/sink and the processing machines. In that
   case, the underlying runner like Spark is only responsible for processing
   from one PCollection to another PCollection, not for data transfer
   from/to the source/sink.
   3. If the answer to 1 is NO, then Beam only provides an abstraction for
   IO as well, but internally both data transfer and processing are done by
   the runner like Spark?

Can you please tell me which assumption is right, point 2 or point 3?
Feel free to add anything else which I might have missed.

Also,
It will be very useful if in the following sample MinimalWordCount program,
it can be clearly pointed out which part is executed by Beam and which part
is executed by runner :

PipelineOptions options = PipelineOptionsFactory.create();

Pipeline p = Pipeline.create(options);

p.apply(TextIO.read().from("gs://apache-beam-samples/shakespeare/*"))

.apply(FlatMapElements
.into(TypeDescriptors.strings())
.via((String word) -> Arrays.asList(word.split("[^\\p{L}]+"))))

.apply(Filter.by((String word) -> !word.isEmpty()))

.apply(Count.perElement())

.apply(MapElements
.into(TypeDescriptors.strings())
.via((KV<String, Long> wordCount) -> wordCount.getKey() + ": "
+ wordCount.getValue()))

.apply(TextIO.write().to("chandan"));

p.run().waitUntilFinish();



Thanks & Regards,
Chandan

On Wed, May 16, 2018 at 8:40 PM, Ismaël Mejía  wrote:

> Hello,
>
> Answers to the questions inline:
>
> > 1. Are there any limitations in terms of implementations, functionalities
> or performance if we want to run streaming on Beam with Spark runner vs
> streaming on Spark-Streaming directly ?
>
> At this moment the Spark runner does not support some parts of the Beam
> model in
> streaming mode, e.g. side inputs and state/timer API. Comparing this with
> pure
> spark streaming is not easy given the semantic differences of Beam.
>
> > 2. Spark features like checkpointing, kafka offset management, how are
> they supported in Apache Beam? Do we need to do some extra work for them?
>
> Checkpointing is supported, Kafka offset management (if I understand what
> you
> mean) is managed by the KafkaIO connector + the runner, so this should be
> ok.
>
> > 3. with spark 2.x structured streaming , if we want to switch across
> different modes like from micro-batching to continuous streaming mode, how
> it can be done while using Beam?
>
> To do this the Spark runner needs to translate the Beam Pipeline using the
> Structured Streaming API which is not the case today. It uses the RDD based
> API
> but we expect to tackle this in the not so far future.  However even if we
> did
> Spark continuous mode is quite limited at this moment in time because it
> does
> not support aggregation functions.
>
> https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing
>
> Don't hesitate to give a try to Beam and the Spark runner and refer us if
> you
> have questions or find any issues.
>
> Regards,
> Ismaël
>
> On Tue, May 15, 2018 at 2:22 PM chandan prakash  >
> wrote:
>
> > Also,
>
> > 3. with spark 2.x structured streaming , if we want to switch across
> different modes like from micro-batching to continuous streaming mode, how
> it can be done while using Beam?
>
> > These are some of the initial questions which I am not able to understand
> currently.
>
>
> > Regards,
> > Chandan
>
> > On Tue, May 15, 2018 at 5:45 PM, chandan prakash <
> chandanbaran...@gmail.com> wrote:
>
> >> Hi Everyone,
> >> I have just started exploring and understanding Apache Beam for new
> project in my firm.
> >> In particular, we have to take decision whether to implement our product
> over spark streaming (as spark batch is already in our eco system) or
> should we use Beam over spark runner to have future liberty of changing
> underline runner.
>
> >> Couple of questions, after going through beam docs and examples, I have
> is:
>
> >> Are there any limitations in terms of implementations, functionalities
> or performance if we want to run streaming on Beam with Spark runner vs
> streaming on Spark-Streaming directly ?
>
> >> Spark features like checkpointing, kafka offset management, how are they
> supported in Apache Beam? Do we need to do some extra work for them?
>
>
> >> Any answer or link to like wise discussion will be really appreciable.
> >> Thanks in advance.
>
> >> Regards,
> >> --
> >> Chandan Prakash
>
>
>
>
> > --
> > Chandan Prakash

Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner

2018-05-16 Thread Ismaël Mejía
Hello,

Answers to the questions inline:

> 1. Are there any limitations in terms of implementations, functionalities
or performance if we want to run streaming on Beam with Spark runner vs
streaming on Spark-Streaming directly ?

At this moment the Spark runner does not support some parts of the Beam
model in
streaming mode, e.g. side inputs and state/timer API. Comparing this with
pure
spark streaming is not easy given the semantic differences of Beam.

> 2. Spark features like checkpointing, kafka offset management, how are
they supported in Apache Beam? Do we need to do some extra work for them?

Checkpointing is supported, Kafka offset management (if I understand what
you
mean) is managed by the KafkaIO connector + the runner, so this should be
ok.
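A rough mental model of that division of labor (a hypothetical plain-Java sketch, not KafkaIO's actual code; the real connector uses a KafkaCheckpointMark with per-partition offsets): the connector tracks what it has consumed and exposes it as a checkpoint mark, while the runner durably stores that mark and hands it back when restoring the reader after a failure. That is why no extra offset-management work is needed on the user's side.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of how an unbounded reader can delegate offset
// management to the runner via a checkpoint mark.
public class OffsetCheckpointSketch {
    static class CheckpointMark {
        final long nextOffset;
        CheckpointMark(long nextOffset) { this.nextOffset = nextOffset; }
    }

    static class ToyReader {
        private final List<String> log; // stands in for a Kafka partition
        private long offset;

        ToyReader(List<String> log, CheckpointMark restored) {
            this.log = log;
            this.offset = restored == null ? 0 : restored.nextOffset;
        }

        String readNext() { return log.get((int) offset++); }

        // The runner calls this and durably stores the result.
        CheckpointMark checkpoint() { return new CheckpointMark(offset); }
    }

    public static void main(String[] args) {
        List<String> partition = Arrays.asList("m0", "m1", "m2", "m3");

        // First run: read two records, then the runner checkpoints.
        ToyReader reader = new ToyReader(partition, null);
        reader.readNext();
        reader.readNext();
        CheckpointMark saved = reader.checkpoint();

        // After a failure, the runner restores the reader from the mark,
        // so reading resumes where the checkpoint left off.
        ToyReader resumed = new ToyReader(partition, saved);
        System.out.println(resumed.readNext()); // resumes at m2
    }
}
```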

> 3. with spark 2.x structured streaming , if we want to switch across
different modes like from micro-batching to continuous streaming mode, how
it can be done while using Beam?

To do this the Spark runner would need to translate the Beam Pipeline
using the Structured Streaming API, which is not the case today. It uses
the RDD-based API, but we expect to tackle this in the not-so-far future.
However, even if we did, Spark's continuous mode is quite limited at this
moment in time because it does not support aggregation functions.

https://spark.apache.org/docs/2.3.0/structured-streaming-programming-guide.html#continuous-processing

Don't hesitate to give Beam and the Spark runner a try, and reach out to us
if you have questions or find any issues.

Regards,
Ismaël

On Tue, May 15, 2018 at 2:22 PM chandan prakash 
wrote:

> Also,

> 3. with spark 2.x structured streaming , if we want to switch across
different modes like from micro-batching to continuous streaming mode, how
it can be done while using Beam?

> These are some of the initial questions which I am not able to understand
currently.


> Regards,
> Chandan

> On Tue, May 15, 2018 at 5:45 PM, chandan prakash <
chandanbaran...@gmail.com> wrote:

>> Hi Everyone,
>> I have just started exploring and understanding Apache Beam for new
project in my firm.
>> In particular, we have to take decision whether to implement our product
over spark streaming (as spark batch is already in our eco system) or
should we use Beam over spark runner to have future liberty of changing
underline runner.

>> Couple of questions, after going through beam docs and examples, I have
is:

>> Are there any limitations in terms of implementations, functionalities
or performance if we want to run streaming on Beam with Spark runner vs
streaming on Spark-Streaming directly ?

>> Spark features like checkpointing, kafka offset management, how are they
supported in Apache Beam? Do we need to do some extra work for them?


>> Any answer or link to like wise discussion will be really appreciable.
>> Thanks in advance.

>> Regards,
>> --
>> Chandan Prakash




> --
> Chandan Prakash


Re: Normal Spark Streaming vs Streaming on Beam with Spark Runner

2018-05-15 Thread chandan prakash
Also,

3. With spark 2.x structured streaming, if we want to switch across
different modes, like from micro-batching to continuous streaming mode, how
can it be done while using Beam?

These are some of the initial questions that I have not been able to
figure out yet.


Regards,
Chandan

On Tue, May 15, 2018 at 5:45 PM, chandan prakash 
wrote:

> Hi Everyone,
> I have just started exploring and understanding Apache Beam for new
> project in my firm.
> In particular, we have to take decision whether to implement our product
> over spark streaming (as spark batch is already in our eco system) or
> should we use Beam over spark runner to have future liberty of changing
> underline runner.
>
> Couple of questions, after going through beam docs and examples, I have is:
>
>
>1. Are there any limitations in terms of implementations,
>functionalities or performance if we want to run streaming on Beam with
>Spark runner vs streaming on Spark-Streaming directly ?
>
>2. Spark features like checkpointing, kafka offset management, how are
>they supported in Apache Beam? Do we need to do some extra work for them?
>
>
> Any answer or link to like wise discussion will be really appreciable.
> Thanks in advance.
>
> Regards,
> --
> Chandan Prakash
>
>


-- 
Chandan Prakash


Normal Spark Streaming vs Streaming on Beam with Spark Runner

2018-05-15 Thread chandan prakash
Hi Everyone,
I have just started exploring and understanding Apache Beam for new project
in my firm.
In particular, we have to decide whether to implement our product over
spark streaming (as spark batch is already in our ecosystem) or whether we
should use Beam over the spark runner to have the future liberty of
changing the underlying runner.

A couple of questions I have, after going through the beam docs and examples:


   1. Are there any limitations in terms of implementations,
   functionalities, or performance if we want to run streaming on Beam with
   the Spark runner vs streaming on Spark-Streaming directly?

   2. Spark features like checkpointing and kafka offset management: how are
   they supported in Apache Beam? Do we need to do some extra work for them?


Any answer or link to a similar discussion would be really appreciated.
Thanks in advance.

Regards,
-- 
Chandan Prakash