Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-04-21 Thread Eleanore Jin
Hi Maxi, I assume this will impact the Exactly Once Semantics that beam provided as in the KafkaExactlyOnceSink, the processElement method is also annotated with @RequiresStableInput? Thanks a lot! Eleanore On Tue, Apr 21, 2020 at 12:58 AM Maximilian Michels wrote: > Hi Stephen, > > Thanks

Re: Running NexMark Tests

2020-04-21 Thread Kenneth Knowles
We should always want to shut down sources on final watermark. All incoming data should be dropped anyhow. Kenn On Tue, Apr 21, 2020 at 1:34 PM Luke Cwik wrote: > +dev > > When would we not want --shutdownSourcesOnFinalWatermark=true ? > > On Tue, Apr 21, 2020 at 1:22 PM Ismaël Mejía wrote: >

Re: Recommended Reading for Apache Beam

2020-04-21 Thread Joshua Bassett
Thanks for the recommendation, Rion. I bought a copy yesterday. On Wed, 22 Apr 2020, at 1:45 PM, Kenneth Knowles wrote: > I believe Streaming Systems is the most Beam-oriented book available. > > Kenn > > On Mon, Apr 20, 2020 at 3:07 PM Rion Williams wrote: >> Hi all, >> >> I posed this

Re: Recommended Reading for Apache Beam

2020-04-21 Thread Kenneth Knowles
I believe Streaming Systems is the most Beam-oriented book available. Kenn On Mon, Apr 20, 2020 at 3:07 PM Rion Williams wrote: > Hi all, > > I posed this question over on the Apache Slack Community however didn't > get much of a response so I thought I'd reach out here. I've been looking >

Re: Kafka IO: value of expansion_service

2020-04-21 Thread Chamikara Jayalath
On Tue, Apr 21, 2020 at 12:43 PM Piotr Filipiuk wrote: > Hi, > > I would like to know whether it is possible to run a streaming pipeline > that reads from (or writes to) Kafka using DirectRunner? If so, what should > the expansion_service point to: >

Re: Running NexMark Tests

2020-04-21 Thread Sruthi Sree Kumar
Thank you. It worked with the argument '--shutdownSourcesOnFinalWatermark=true' I will open PR to update the documentation. Regards, Sruthi On 2020/04/21 20:22:17, Ismaël Mejía wrote: > You need to instruct the Flink runner to shutdown the the source > otherwise it will stay waiting. > You can

Re: Running NexMark Tests

2020-04-21 Thread Luke Cwik
+dev When would we not want --shutdownSourcesOnFinalWatermark=true ? On Tue, Apr 21, 2020 at 1:22 PM Ismaël Mejía wrote: > You need to instruct the Flink runner to shutdown the the source > otherwise it will stay waiting. > You can this by adding the extra >

Re: Running NexMark Tests

2020-04-21 Thread Ismaël Mejía
You need to instruct the Flink runner to shutdown the the source otherwise it will stay waiting. You can this by adding the extra argument`--shutdownSourcesOnFinalWatermark=true` And if that works and you want to open a PR to update our documentation that would be greatly appreciated. Regards,

Running NexMark Tests

2020-04-21 Thread Sruthi Sree Kumar
Hello, I am trying to run nexmark queries using flink runner streaming. Followed the documentation and used the command ./gradlew :sdks:java:testing:nexmark:run \ -Pnexmark.runner=":runners:flink:1.10" \ -Pnexmark.args=" --runner=FlinkRunner --suite=SMOKE

Kafka IO: value of expansion_service

2020-04-21 Thread Piotr Filipiuk
Hi, I would like to know whether it is possible to run a streaming pipeline that reads from (or writes to) Kafka using DirectRunner? If so, what should the expansion_service point to: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/external/kafka.py#L90? Also, when using

Re: Large public Beam projects?

2020-04-21 Thread Tim Robertson
My apologies, I missed the link: [1] https://github.com/gbif/pipelines On Tue, Apr 21, 2020 at 5:58 PM Tim Robertson wrote: > Hi Jordan > > I don't know if we qualify as a large Beam project but at GBIF.org we > bring together datasets from 1600+ institutions documenting 1,4B > observations of

Re: Large public Beam projects?

2020-04-21 Thread Tim Robertson
Hi Jordan I don't know if we qualify as a large Beam project but at GBIF.org we bring together datasets from 1600+ institutions documenting 1,4B observations of species (museum data, citizen science, environmental reports etc). As far as Beam goes though, we aren't using the most advanced

Re: Distributed Tracing in Apache Beam

2020-04-21 Thread Kenneth Knowles
+dev I don't have a ton of time to dig in to this, but I wanted to say that this is very cool and just drop a couple pointers (which you may already know about) like Explaining Outputs in Modern Data Analytics [1] which was covered by The Morning Paper [2]. This just happens to be something I

Re: Large public Beam projects?

2020-04-21 Thread Jeff Klukas
Mozilla hosts the code for our data ingestion system publicly on GitHub. A good chunk of that architecture consists of Beam pipelines running on Dataflow. See: https://github.com/mozilla/gcp-ingestion/tree/master/ingestion-beam and rendered usage documentation at:

Re: Copying tar.gz libraries to apache-beam workers

2020-04-21 Thread OrielResearch Eila Arich-Landkof
Thank you. I should have realized this flag. Stay safe Best, Eila On Mon, Apr 20, 2020 at 12:42 PM Luke Cwik wrote: > It looks like anaconda assumes that the directory doesn't exist before > installation. I would recommend using a new directory such as > ["/opt/userowned/anaconda.sh",

Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-04-21 Thread Maximilian Michels
Hi Stephen, Thanks for reporting the issue! David, good catch! I think we have to resort to only using a single state cell for buffering on checkpoints, instead of using a new one for every checkpoint. I was under the assumption that, if the state cell was cleared, it would not be checkpointed

Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-04-21 Thread David Morávek
Hi Stephen, nice catch and awesome report! ;) This definitely needs a proper fix. I've created a new JIRA to track the issue and will try to resolve it soon as this seems critical to me. https://issues.apache.org/jira/browse/BEAM-9794 Thanks, D. On Mon, Apr 20, 2020 at 10:41 PM Stephen Patel