Re: Designing an existing pipeline in Beam

2020-06-23 Thread Luke Cwik
You can apply the same DoFn / Transform instance multiple times in the graph or you can follow regular development practices where the common code is factored into a method and two different DoFn's invoke it. On Tue, Jun 23, 2020 at 4:28 PM Praveen K Viswanathan < harish.prav...@gmail.com> wrote:

Webinar: Distributed Processing for Machine Learning Production Pipelines with Apache Beam

2020-06-23 Thread Aizhamal Nurmamat kyzy
Hi Beamsters! We have scheduled the last webinar on Beam Learning Month! Robert Crowe and Reza Rokni from Google will share how Beam is used to process large scale data for ML pipelines. If the subject is interesting to you, please register via this link:

Re: Designing an existing pipeline in Beam

2020-06-23 Thread Praveen K Viswanathan
Hi Luke - Thanks for the explanation. The limitation due to directed graph processing and the option of external storage clears most of the questions I had with respect to designing this pipeline. I do have one more scenario to clarify on this thread. If I had a certain piece of logic that I had

Re: Designing an existing pipeline in Beam

2020-06-23 Thread Luke Cwik
Beam is really about parallelizing the processing. Using a single DoFn that does everything is fine as long as the DoFn can process elements in parallel (e.g. upstream source produces lots of elements). Composing multiple DoFns is great for re-use and testing but it isn't strictly necessary. Also,

Re: KafkaIO Exactly once vs At least Once

2020-06-23 Thread Eleanore Jin
Hi Alexey, Thanks a lot for the information! I will give it a try. Regarding the checkpoint intervals, I think the Flink community suggested something between 3-5 minutes, I am not sure yet if the checkpoint interval can be in milliseconds? Currently, our beam pipeline is stateless, there is no

Re: Continuous Read pipeline

2020-06-23 Thread Chamikara Jayalath
On Fri, Jun 12, 2020 at 12:52 AM TAREK ALSALEH wrote: > Hi, > > I am using the Python SDK with Dataflow as my runner. I am looking at > implementing a streaming pipeline that will continuously monitor a GCS > bucket for incoming files and depending on the regex of the file, launch a > set of

Re: KafkaIO Exactly once vs At least Once

2020-06-23 Thread Alexey Romanenko
> On 23 Jun 2020, at 07:49, Eleanore Jin wrote: > > the KafkaIO EOS-sink you referred is: KafkaExactlyOnceSink? If so, the way I > understand to use is KafkaIO.write().withEOS(numPartitionsForSinkTopic, > "mySlinkGroupId"), reading from your response, do I need additionally > configure

Re: Flink/Portable Runner error on AWS EMR

2020-06-23 Thread Jesse Lord
Hi Max, The error message shows up in the flink web UI, so I think it must be reaching the cluster. For the portable runner the error is listed in the docker container output as well but I assume that is just receiving the error message from the flink cluster. Thanks, Jesse On 6/23/20,

Re: Flink/Portable Runner error on AWS EMR

2020-06-23 Thread Maximilian Michels
Hey Jesse, Could you share the context of the error? Where does it occur? In the client code or on the cluster? Cheers, Max On 22.06.20 18:01, Jesse Lord wrote: > I am trying to run the wordcount quickstart example on a flink cluster > on AWS EMR. Beam version 2.22, Flink 1.10. > >   > > I

Re: Error restoring Flink checkpoint

2020-06-23 Thread Maximilian Michels
> Yes, I agree that serializing coders into the checkpoint creates problems. > I'm wondering whether it is possible to serialize the coder URN + args > instead. I wish that was possible but Beam does not have a dedicated interface to snapshot serializers. It is not possible to restore

Re: Error restoring Flink checkpoint

2020-06-23 Thread Ivan San Jose
I don't really know, my knowledge about Beam source code is not so deep, but I managed to modify AvroCoder in order to store a string containing class name (and class parameters in case it was a parametrized class) instead of references to Class. But, as I said, AvroCoder is using AVRO's

Re: Error restoring Flink checkpoint

2020-06-23 Thread Reuven Lax
Yes, I agree that serializing coders into the checkpoint creates problems. I'm wondering whether it is possible to serialize the coder URN + args instead. On Mon, Jun 22, 2020 at 11:00 PM Ivan San Jose wrote: > Hi again, just replying here in case this could be useful for someone > as using

Re: Error restoring Flink checkpoint

2020-06-23 Thread Ivan San Jose
Hi again, just replying here in case this could be useful for someone as using Flink checkpoints on Beam is not realiable at all right now... Even I removed class references to the serialized object in AvroCoder, finally I couldn't make AvroCoder work as it is inferring schema using ReflectData