[Question] enable end to end Kafka Exactly once processing

2020-03-02 Thread Jin Yi
Hi experts, My application uses Apache Beam with Flink as the runner. My source and sink are Kafka topics, and I am using the KafkaIO connector provided by Apache Beam to consume and publish. I am reading through Beam's java doc:
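A hedged sketch of how such a pipeline is typically wired up with Beam's KafkaIO (the topic names, bootstrap servers, and sink group id are placeholders; `withEOS` additionally requires a runner with compatible checkpoint replay, such as Flink or Dataflow):

```java
Pipeline p = Pipeline.create(options);

p.apply(KafkaIO.<String, String>read()
        .withBootstrapServers("kafka:9092")        // placeholder
        .withTopic("input-topic")                  // placeholder
        .withKeyDeserializer(StringDeserializer.class)
        .withValueDeserializer(StringDeserializer.class)
        // only read messages from committed Kafka transactions
        .withReadCommitted()
        // commit offsets back to Kafka when a checkpoint finalizes
        .commitOffsetsInFinalize())
 .apply(Values.create())
 .apply(KafkaIO.<Void, String>write()
        .withBootstrapServers("kafka:9092")        // placeholder
        .withTopic("output-topic")                 // placeholder
        .withValueSerializer(StringSerializer.class)
        // transactional exactly-once sink
        .withEOS(1, "eos-sink-group")              // placeholder group id
        .values());

p.run();
```

`withReadCommitted()` makes the consumer skip records from aborted transactions, while `withEOS(...)` enables KafkaIO's transactional exactly-once sink; together they cover the read and write sides of the end-to-end guarantee.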

Re: GCS numShards doubt

2020-03-02 Thread Kenneth Knowles
For bounded data, each bundle becomes a file: https://github.com/apache/beam/blob/da9e17288e8473925674a4691d9e86252e67d7d7/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java#L356 Kenn On Mon, Mar 2, 2020 at 6:18 PM Kyle Weaver wrote: > As Luke and Robert indicated, unsetting

Re: GCS numShards doubt

2020-03-02 Thread Kyle Weaver
As Luke and Robert indicated, unsetting numShards _may_ cause the runner to optimize it automatically. For example, the Flink [1] and Dataflow [2] runners override numShards. However, in the Spark runner, I don't see any such override. So I have two questions: 1. Does the Spark runner override
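As a sketch of the difference being discussed (the bucket path and transform names here are hypothetical):

```java
// Leaving numShards unset lets the runner choose the sharding,
// which some runners (e.g. Flink, Dataflow) may optimize.
lines.apply("RunnerChosenSharding",
    TextIO.write().to("gs://my-bucket/output/part"));

// Setting numShards forces a fixed number of output files,
// at the cost of an extra redistribution of the data.
lines.apply("FixedSharding",
    TextIO.write().to("gs://my-bucket/output/part")
                  .withNumShards(10));
```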

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-03-02 Thread Maximilian Michels
Hi Tobias, You are right, for this use case, where Kafka commits the offset and you do not use any other stateful sources/operators, you will get exactly-once. Should you have stateful operators, that may no longer hold true. Only a checkpoint constructs a consistent view of the pipeline.
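A minimal sketch of enabling checkpointing through the Flink runner's pipeline options (the 60-second interval is an arbitrary example value):

```java
// Without checkpoints, Flink has no consistent snapshot of the
// pipeline to restore from, so stateful operators lose their
// exactly-once guarantee on failure.
FlinkPipelineOptions options =
    PipelineOptionsFactory.as(FlinkPipelineOptions.class);
options.setRunner(FlinkRunner.class);
options.setCheckpointingInterval(60_000L); // checkpoint every 60s
```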

Re: Splittable DoFn and Dataflow, "conflicting bucketing functions"

2020-03-02 Thread Luke Cwik
SplittableDoFn has experimental support within Dataflow, so the way you may be using it could be correct but unsupported. Can you provide snippets/details of your splittable DoFn implementation? On Mon, Mar 2, 2020 at 11:50 AM Kjetil Halvorsen < kjetil.halvor...@cognite.com> wrote: > > Hi, > > I
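For reference, a minimal splittable DoFn of the kind being asked about might look like the following sketch (the range-emitting logic is hypothetical; it only illustrates the restriction and tracker annotations the runner inspects):

```java
// Emits every integer in a [from, to) range, claiming each
// offset so the runner may split the remaining restriction.
@DoFn.BoundedPerElement
static class EmitRange extends DoFn<long[], Long> {

  @GetInitialRestriction
  public OffsetRange getInitialRestriction(@Element long[] range) {
    return new OffsetRange(range[0], range[1]);
  }

  @ProcessElement
  public void process(ProcessContext c,
                      RestrictionTracker<OffsetRange, Long> tracker) {
    for (long i = tracker.currentRestriction().getFrom();
         tracker.tryClaim(i); i++) {
      c.output(i);
    }
  }
}
```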

Splittable DoFn and Dataflow, "conflicting bucketing functions"

2020-03-02 Thread Kjetil Halvorsen
Hi, I am looking for pointers to a Dataflow runner error message: "Workflow failed. Causes: Step s22 has conflicting bucketing functions." This happens at the very startup of the job execution, and I am unable to find any pointer as to where in the code/job definition the origin of the conflict

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-03-02 Thread Maximilian Michels
Hi Tobi, That makes sense to me. My argument was coming from having "exactly-once" semantics for a pipeline. In this regard, the stop functionality does not help. But I think having the option to gracefully shut down a pipeline is beneficial for other use cases like the ones you described.

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-03-02 Thread Kaymak, Tobias
Good morning Max, and thanks for clarifying! I generated the 2.19.0 JAR in the second test via the default demo code from Beam. There were no further adjustments on my side, but as I can see there are some open points in JIRA for 1.9.2, so for now I think we can focus on 1.9.1 as a target.