Re: Issues with python's external ReadFromPubSub

2020-10-30 Thread Maximilian Michels
We used to run legacy sources using the old-style Read translation. Changing it to SDF might have broken ReadFromPubSub. Could you check in the Flink jobs whether it uses the SDF code or the Read translation? For Read you should be seeing the UnboundedSourceWrapper. Looking at the code, there

Re: flink runner 1.10 checkpoint timeout issue

2020-09-18 Thread Maximilian Michels
This type of stack trace occurs when the downstream operator is blocked for some reason. Flink maintains a finite number of network buffers for each network channel. If the receiving downstream operator does not process incoming network buffers, the upstream operator blocks. This is also
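The blocking behavior described here can be sketched with a bounded queue standing in for Flink's finite pool of network buffers. This is an illustrative model only; the names and sizes below are invented, not Flink internals:

```python
import queue
import threading

# A bounded queue stands in for the finite pool of network buffers between
# an upstream and a downstream operator. (Illustrative only.)
buffers = queue.Queue(maxsize=3)

blocked = threading.Event()

def upstream():
    for record in range(5):
        try:
            # Once the pool is full, put() blocks -- this is the stalled
            # upstream operator producing the stack trace described above.
            buffers.put(record, timeout=0.1)
        except queue.Full:
            blocked.set()  # upstream is now back-pressured
            return

producer = threading.Thread(target=upstream)
producer.start()
producer.join()

# The downstream never consumed anything, so the upstream blocked after
# filling all 3 "network buffers".
print(blocked.is_set(), buffers.qsize())  # True 3
```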

Re: FlinkRunner Graphite metrics

2020-09-15 Thread Maximilian Michels
Which metrics specifically do you mean? Beam metrics (e.g. backlogBytes) or Flink metrics (e.g. numRecordsOut)? Quick look at the code (ReaderInvocationUtil) reveals that the Beam metrics should be reported correctly. The Flink native metrics are always reported, independently of the type of

Re: Design rationale behind copying via serializing in flink runner

2020-09-07 Thread Maximilian Michels
Hey Teodor, Copying is the default behavior. This is tunable via the pipeline option 'objectReuse', i.e. 'objectReuse=true'. The option is disabled by default because users may not be aware of object reuse and recycle objects in their process functions which will have unexpected side
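The hazard that motivates copying by default can be shown with a plain-Python sketch (no Beam/Flink code; the record structure is invented for illustration):

```python
# Sketch of the object-reuse hazard: with objectReuse=true the runner may
# hand the *same* record instance to the user function for every element.
# A user who stores ("recycles") the reference sees it mutate later.
record = {"value": None}  # one instance, reused by the "runner"
stored = []

def process(element):
    # The user function keeps a reference instead of copying -- unsafe
    # when the runner reuses objects.
    stored.append(element)

for v in [1, 2, 3]:
    record["value"] = v   # runner overwrites the reused instance
    process(record)

# All stored references point at the same object: only the last value survives.
print([r["value"] for r in stored])  # [3, 3, 3]
```

Copying each element (the default) avoids this surprise at the cost of extra serialization work.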

Re: Clearing states and timers in a Stateful Fn with Global Windows

2020-09-04 Thread Maximilian Michels
Thanks for reporting Gökhan! Please keep us updated. We'll likely merge the patch by the end of the week. -Max On 03.09.20 08:40, Gökhan Imral wrote: Thanks for the quick response. I tried with a fix applied build and can see that memory is much more stable. Gokhan On 2 Sep 2020, at 12:51

Re: Staged PIP package mysteriously ungzipped, non-installable inside the worker

2020-08-11 Thread Maximilian Michels
Looks like you ran into a bug. You could just run your program without specifying any arguments, since running with Python's FnApiRunner should be enough. Alternatively, how about trying to run the same pipeline with the FlinkRunner? Use: --runner=FlinkRunner and do not specify an endpoint.

Re: Program and registration for Beam Digital Summit

2020-07-29 Thread Maximilian Michels
Thanks Pedro! Great to see the program! This is going to be an exciting event. Forwarding to the dev mailing list, in case people didn't see this here. -Max On 29.07.20 20:25, Pedro Galvan wrote: Hello! Just a quick message to let everybody know that we have published the program for the

Re: KafkaUnboundedReader

2020-07-29 Thread Maximilian Michels
Hi Dinh, The check only de-duplicates in case the consumer processes the same offset multiple times. It ensures the offset is always increasing. If this has been fixed in Kafka, which the comment assumes, the condition will never be true. Which Kafka version are you using? -Max On
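The check described here, keeping only records whose offset is strictly greater than the last one consumed, can be sketched as follows (hypothetical record shape; this is not the KafkaUnboundedReader code):

```python
# Drop any record whose offset is not strictly greater than the last one
# consumed, so re-delivered offsets are skipped and the consumed offset
# only ever increases.
def dedupe_by_offset(records, last_offset=-1):
    out = []
    for offset, value in records:
        if offset <= last_offset:
            continue  # duplicate or replayed offset -> skip
        last_offset = offset
        out.append((offset, value))
    return out

# Offsets 1 and 2 are re-delivered; only the first occurrences survive.
print(dedupe_by_offset([(0, "a"), (1, "b"), (1, "b"), (2, "c"), (1, "b")]))
# [(0, 'a'), (1, 'b'), (2, 'c')]
```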

Re: Unbounded sources unable to recover from checkpointMark when withMaxReadTime() is used

2020-07-28 Thread Maximilian Michels
implementations. -Original Message- From: Maximilian Michels Sent: Monday, July 27, 2020 3:04 PM To: user@beam.apache.org; Sunny, Mani Kolbe Subject: Re: Unbounded sources unable to recover from checkpointMark when withMaxReadTime() is used

Re: Unbounded sources unable to recover from checkpointMark when withMaxReadTime() is used

2020-07-27 Thread Maximilian Michels
nerate batched outputs. Without ability to resume from a checkpoint, it will be reading entire stream every time. Regards, Mani -Original Message- From: Maximilian Michels mailto:m...@apache.org>> Sent: Tuesday, July 21, 2020 11:38 AM To:user@bea

Re: Unbounded sources unable to recover from checkpointMark when withMaxReadTime() is used

2020-07-21 Thread Maximilian Michels
Hi Mani, BoundedReadFromUnboundedSource was originally intended to be used in batch pipelines. In batch, runners typically do not perform checkpointing. In case of failures, they re-run the entire pipeline. Keep in mind that, even with checkpointing, reading for a finite time in the

Re: ReadFromKafka: UnsupportedOperationException: The ActiveBundle does not have a registered bundle checkpoint handler

2020-07-09 Thread Maximilian Michels
This used to be working but it appears @FinalizeBundle (which KafkaIO requires) was simply ignored for portable (Python) pipelines. It looks relatively easy to fix. -Max On 07.07.20 03:37, Luke Cwik wrote: The KafkaIO implementation relies on checkpointing to be able to update the last

Re: Beam supports Flink Async IO operator

2020-07-08 Thread Maximilian Michels
Just to clarify: We could make the AsyncIO operator also available in Beam but the operator has to be represented by a concept in Beam. Otherwise, there is no way to know when to produce it as part of the translation. On 08.07.20 11:53, Maximilian Michels wrote: Flink's AsyncIO operator

Re: Beam supports Flink Async IO operator

2020-07-08 Thread Maximilian Michels
Flink's AsyncIO operator is useful for processing io-bound operations, e.g. sending network requests. Like Luke mentioned, it is not available in Beam. -Max On 07.07.20 22:11, Luke Cwik wrote: Beam is a layer that sits on top of execution engines like Flink and provides its own programming

Re: Flink/Portable Runner error on AWS EMR

2020-06-24 Thread Maximilian Michels
output as > well but I assume that is just receiving the error message from the flink > cluster. > > Thanks, > Jesse > > On 6/23/20, 8:11 AM, "Maximilian Michels" wrote: > > Hey Jesse, > > Could you share the context of the error? Wh

Re: Flink/Portable Runner error on AWS EMR

2020-06-23 Thread Maximilian Michels
Hey Jesse, Could you share the context of the error? Where does it occur? In the client code or on the cluster? Cheers, Max On 22.06.20 18:01, Jesse Lord wrote: > I am trying to run the wordcount quickstart example on a flink cluster > on AWS EMR. Beam version 2.22, Flink 1.10. > >   > > I

Re: Error restoring Flink checkpoint

2020-06-23 Thread Maximilian Michels
java.lang.RuntimeException: java.lang.RuntimeException: java.lang.NoSuchMethodException: java.time.Instant.&lt;init&gt;() at org.apache.avro.specific.SpecificData.newInstance(SpecificData.java

Re: Beam/Python/Flink: Unable to deserialize UnboundedSource for PubSub source

2020-06-17 Thread Maximilian Michels
You are using a proprietary connector which only works on Dataflow. You will have to use io.external.gcp.pubsub.ReadFromPubsub. PubSub support is experimental from Python. -Max On 09.06.20 06:40, Pradip Thachile wrote: > Quick update: this test code works just fine on Dataflow as well as the >

Re: Error restoring Flink checkpoint

2020-06-05 Thread Maximilian Michels
> >> Is not like that? >> >> Also I don't understand when you said that a reference to the Beam >> Coder is saved into the checkpoint, because the error I'm getting is >> referencing the java model class ("Caused by: >> java.io.InvalidClassException: internal.model.dimension.POJOMode

Re: Error restoring Flink checkpoint

2020-06-04 Thread Maximilian Michels
ting the AVRO schema >> using >> Java reflection, and then that generated schema is saved within the >> Flink checkpoint, right? >> >> On Wed, 2020-06-03 at 18:00 +0200, Maximilian Michels wrote: >>> Hi Ivan, >>> >>> Moving to the new type serializer

Re: Error restoring Flink checkpoint

2020-06-04 Thread Maximilian Michels
ing the AVRO schema using > Java reflection, and then that generated schema is saved within the > Flink checkpoint, right? > > On Wed, 2020-06-03 at 18:00 +0200, Maximilian Michels wrote: >> Hi Ivan, >> >> Moving to the new type serializer snapshot interface is not going

Re: Error restoring Flink checkpoint

2020-06-03 Thread Maximilian Michels
> KafkaIO or Protobuf. *I meant to say "Avro or Protobuf". On 03.06.20 18:00, Maximilian Michels wrote: > Hi Ivan, > > Moving to the new type serializer snapshot interface is not going to > solve this problem because we cannot version the coder through the Beam > c

Re: Error restoring Flink checkpoint

2020-06-03 Thread Maximilian Michels
Hi Ivan, Moving to the new type serializer snapshot interface is not going to solve this problem because we cannot version the coder through the Beam coder interface. That is only possible through Flink. However, it is usually not trivial. In Beam, when you evolve your data model, the only way

Re: Beam First Steps Workshop - 9 June

2020-06-03 Thread Maximilian Michels
Awesome! On 02.06.20 22:09, Austin Bennett wrote: > Hi Beam Users, > > Wanted to share the Workshop that I'll give at Berlin Buzzword's next > week:   > https://berlinbuzzwords.de/session/first-steps-apache-beam-writing-portable-pipelines-using-java-python-go > > Do consider joining if you are

Re: Issue while submitting python beam pipeline on flink - local

2020-06-01 Thread Maximilian Michels
The logs indicate that you are not running the Docker-based execution but the `LOOPBACK` mode. In this mode the Flink cluster needs to connect to the machine that started the pipeline. That will not be possible unless you are running the Flink cluster on the same machine (we bind to `localhost`
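The localhost-binding point above can be illustrated with a minimal stdlib socket sketch (not Beam code):

```python
import socket

# The LOOPBACK environment binds its worker endpoint to localhost, so it is
# only reachable from the same machine. A socket bound to "localhost"
# listens on the loopback interface (127.0.0.1), which a Flink cluster on
# another machine cannot connect to.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("localhost", 0))          # port 0: let the OS pick a free port
host, port = s.getsockname()
print(host)  # 127.0.0.1 -- loopback only, invisible to remote machines
s.close()
```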

Re: Flink Runner with HDFS

2020-05-29 Thread Maximilian Michels
/...">> with beam.Pipeline(options=options) as p: > (p | 'ReadMyFile' >> beam.io.ReadFromText(input_file_hdfs) > |"WriteMyFile" >> beam.io.WriteToText(output_file_hdfs)) -Max On 28.05.20 17:00, Ramanan, Buvana (Nokia - US/Murray Hill) wrote: > Hi Max

Re: [ANNOUNCE] Beam 2.21.0 Released

2020-05-29 Thread Maximilian Michels
Thanks Kyle! On 28.05.20 13:16, Kyle Weaver wrote: > The Apache Beam team is pleased to announce the release of version 2.21.0. > > Apache Beam is an open source unified programming model to define and > execute data processing pipelines, including ETL, batch and stream > (continuous)

Re: Issue while submitting python beam pipeline on flink - local

2020-05-28 Thread Maximilian Michels
You are using the LOOPBACK environment which requires that the Flink cluster can connect back to your local machine. Since the loopback environment by default binds to localhost, that should not be possible. I'd suggest using the default Docker environment. On 28.05.20 14:06, Ashish Raghav

Re: Issue while submitting python beam pipeline on flink cluster - local

2020-05-28 Thread Maximilian Michels
Potentially a Windows issue. Do you have a Unix environment for testing? On 28.05.20 13:35, Ashish Raghav wrote: > Hi Guys, > >   > > I have another issue when I submit the python beam pipeline ( wordcount > example provided by apache beam team) directly on flink cluster running > local. > >  

Re: Flink Runner with HDFS

2020-05-28 Thread Maximilian Michels
The configuration looks good but the HDFS file system implementation is not intended to be used directly. Instead of: > lines = p | 'ReadMyFile' >> beam.Create(hdfs_client.open(input_file_hdfs)) Use: > lines = p | 'ReadMyFile' >> beam.io.ReadFromText(input_file_hdfs) Best, Max On 28.05.20

Re: Webinar: Unlocking the Power of Apache Beam with Apache Flink

2020-05-28 Thread Maximilian Michels
Thanks to everyone who joined and asked questions. Really enjoyed this new format! -Max On 28.05.20 08:09, Marta Paes Moreira wrote: > Thanks for sharing, Aizhamal - it was a great webinar! > > Marta > > On Wed, 27 May 2020 at 23:17, Aizhamal Nurmamat kyzy > mailto:aizha...@apache.org>> wrote:

Re: New Dates For Beam Summit Digital 2020

2020-05-21 Thread Maximilian Michels
> We thank you for your understanding! See you soon! > > -Griselda Cuevas, Brittany Hermann, Maximilian Michels, Austin Bennett, > Matthias Baetens, Alex Van Boxel >

Re: How Beam coders match with runner serialization

2020-05-19 Thread Maximilian Michels
Hi Ivan, Beam does not use Java serialization for checkpoint data. It uses Beam coders which are wrapped in Flink's TypeSerializers. That said, Beam does not support serializer migration yet. I'm curious, what do you consider a "backwards-compatible" change? If you are attempting to upgrade the

Re: Running NexMark Tests

2020-05-19 Thread Maximilian Michels
gards, > Sruthi > > On Tue, May 12, 2020 at 7:21 PM Maximilian Michels <mailto:m...@apache.org>> wrote: > > A heads-up if anybody else sees this, we have removed the flag: > https://jira.apache.org/jira/browse/BEAM-9900 > > Further contributions are very

Re: TextIO. Writing late files

2020-05-19 Thread Maximilian Michels
> Thanks > Jose > > > El lun., 18 may. 2020 a las 18:37, Reuven Lax ( <mailto:re...@google.com>>) escribió: > > This is still confusing to me - why would the messages be dropped as > late in this case? > > On Mon, May 18, 2020 at 6:14 AM Maxim

Re: TextIO. Writing late files

2020-05-18 Thread Maximilian Michels
All runners which use the Beam reference implementation drop the PaneInfo for WriteFilesResult#getPerDestinationOutputFilenames(). That's why we can observe this behavior not only in Flink but also Spark. The WriteFilesResult is returned here:

Re: Running NexMark Tests

2020-05-12 Thread Maximilian Michels
A heads-up if anybody else sees this, we have removed the flag: https://jira.apache.org/jira/browse/BEAM-9900 Further contributions are very welcome :) -Max On 11.05.20 17:05, Sruthi Sree Kumar wrote: > I have opened a PR with the documentation change. >

Re: Beam + Flink + Docker - Write to host system

2020-05-11 Thread Maximilian Michels
Hey Robbe, The issue with a higher parallelism is likely due to the single Python process which processes the data. You may want to use the `sdk_worker_parallelism` pipeline option which brings up multiple Python workers. Best, Max On 30.04.20 23:56, Robbe Sneyders wrote: > Yes, the

Re: GC overhead limit exceeded

2020-05-11 Thread Maximilian Michels
Generally, it is to be expected that the main input is buffered until the side input is available. We really have no other option to correctly process the data. Have you tried using RocksDB as the state backend to prevent too much GC churn? -Max On 07.05.20 06:27, Eleanore Jin wrote: > Please
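The buffering described above can be sketched in plain Python: main-input elements that arrive before the side input is ready must be held in state, which is why memory (or GC) pressure builds up. Names here are illustrative, not Beam runner internals:

```python
# Main-input elements are buffered until the side input becomes available,
# then flushed; later elements are processed immediately.
class SideInputBuffer:
    def __init__(self):
        self.side_input = None
        self.buffer = []
        self.emitted = []

    def on_main(self, element):
        if self.side_input is None:
            self.buffer.append(element)      # must buffer: cannot process yet
        else:
            self.emitted.append((element, self.side_input))

    def on_side(self, value):
        self.side_input = value
        for element in self.buffer:          # flush everything held back
            self.emitted.append((element, value))
        self.buffer.clear()

op = SideInputBuffer()
op.on_main("a")
op.on_main("b")      # both buffered: side input not available yet
op.on_side("config")
op.on_main("c")      # processed immediately
print(op.emitted)    # [('a', 'config'), ('b', 'config'), ('c', 'config')]
```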

Re: Set parallelism for each operator

2020-05-11 Thread Maximilian Michels
Beam and its Flink Runner do not allow setting the parallelism at the operator level. The wish to configure parallelism per operator has come up numerous times over the years. I'm not opposed to allowing for special cases, e.g. via a pipeline option. It doesn't look like it is necessary for the use case

Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-05-05 Thread Maximilian Michels
 2.17.0, 2.18.0, 2.19.0, 2.20.0)? Or I have > to wait for the 2.20.1 release?  > > Thanks a lot! > Eleanore > > On Wed, Apr 22, 2020 at 2:31 AM Maximilian Michels <mailto:m...@apache.org>> wrote: > > Hi Eleanore, > > Exactly-once is not affect

Re: Apache beam job on Flink checkpoint size growing over time

2020-04-30 Thread Maximilian Michels
t. > > Many thanks, > > Steve

Re: Running Nexmark for Flink Streaming

2020-04-29 Thread Maximilian Michels
kend (filesystem) using the config >> file in the config directory specified by the env variable >> ENV_FLINK_CONF_DIR. >> >> Regards, >> Sruthi >> >> On Tue, Apr 28, 2020 at 11:35 AM Maximilian Michels wrote: >>> >>> Hi Sruthi, >&

Re: Running Nexmark for Flink Streaming

2020-04-28 Thread Maximilian Michels
Hi Sruthi, Not possible out-of-the-box at the moment. You'll have to add the RocksDB Flink dependency in flink_runner.gradle, e.g.: compile "org.apache.flink:flink-statebackend-rocksdb_2.11:$flink_version" Also in the Flink config you have to set state.backend: rocksdb Then you can run

Re: Apache beam job on Flink checkpoint size growing over time

2020-04-22 Thread Maximilian Michels
Hi Steve, The Flink Runner buffers data as part of the checkpoint. This was originally due to a limitation of Flink where we weren't able to end the bundle before we persisted the state for a checkpoint. This is due to how checkpoint barriers are emitted, I spare you the details*. Does the data

Re: Beam Digital Summit 2020 -- JUNE 2020!

2020-04-22 Thread Maximilian Michels
 Looking forward to this! Cheers, Max On 22.04.20 21:09, Austin Bennett wrote: > Hi All, > > We are excited to announce the Beam Digital Summit 2020! > > This will occur for partial days during the week of 15-19 June. > > CfP is open and found: 

Re: Running NexMark Tests

2020-04-22 Thread Maximilian Michels
The flag is needed when checkpointing is enabled because Flink is unable to create a new checkpoint when not all operators are running. By default, operators shut down when all input has been read. That will trigger sending out the maximum (final) watermark at the sources. The flag name is a bit

Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-04-22 Thread Maximilian Michels
I assume this will impact the Exactly Once Semantics that beam provided > as in the KafkaExactlyOnceSink, the processElement method is also > annotated with @RequiresStableInput? > > Thanks a lot! > Eleanore > > On Tue, Apr 21, 2020 at 12:58 AM Maximilian Michels &

Re: Flink Runner with RequiresStableInput fails after a certain number of checkpoints

2020-04-21 Thread Maximilian Michels
Hi Stephen, Thanks for reporting the issue! David, good catch! I think we have to resort to only using a single state cell for buffering on checkpoints, instead of using a new one for every checkpoint. I was under the assumption that, if the state cell was cleared, it would not be checkpointed

Re: FlinkStateBackendFactory

2020-03-11 Thread Maximilian Michels
our use case. > > Thanks a lot! > Eleanore > > On Thu, Mar 5, 2020 at 12:46 AM Maximilian Michels <mailto:m...@apache.org>> wrote: > > Hi Eleanore, > > Good question. I think the easiest way is to configure this in the > Flink > config

Re: FlinkStateBackendFactory

2020-03-05 Thread Maximilian Michels
Hi Eleanore, Good question. I think the easiest way is to configure this in the Flink configuration file, i.e. flink-conf.yaml. Then you don't need to set anything in Beam. If you want to go with your approach, then just use getClass().getClassLoader() unless you have some custom

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-03-02 Thread Maximilian Michels
PM Maximilian Michels <mailto:m...@apache.org>> wrote: Hi Tobi, That makes sense to me. My argument was coming from having "exactly-once" semantics for a pipeline. In this regard, the stop functionality does not help. But I think having the option to grace

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-03-02 Thread Maximilian Michels
ght now 31), rollout a new Flink version and start them at the point where they left of with their last committed offset in Kafka. Does that make sense? Best, Tobi On Sun, Mar 1, 2020 at 5:23 PM Maximilian Michels <mailto:m...@apache.org>> wrote: In some sense, stop is differ

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-03-01 Thread Maximilian Michels
p-alive content-length: 2137 Best, Tobias On Fri, Feb 28, 2020 at 10:29 AM Maximilian Michels mailto:m...@apache.org>> wrote: The stop functionality has been removed in Beam. It was semantically identical to using cancel, so we decided to drop suppor

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-02-28 Thread Maximilian Michels
nd test Best, Tobi On Mon, Feb 24, 2020 at 10:13 PM Maximilian Michels mailto:m...@apache.org>> wrote: Thank you for reporting / filing /

Re: Beam 2.19.0 / Flink 1.9.1 - Session cluster error when submitting job "Multiple environments cannot be created in detached mode"

2020-02-24 Thread Maximilian Michels
Thank you for reporting / filing / collecting the issues. There is a fix pending: https://github.com/apache/beam/pull/10950 As for the upgrade issues, the 1.8 and 1.9 upgrade is trivial. I will check out the Flink 1.10 PR tomorrow. Cheers, Max On 24.02.20 09:26, Ismaël Mejía wrote: We are

Re: [BEAM] How does BEAM translate AccumulationMode to Flink Runner implementation?

2020-01-21 Thread Maximilian Michels
Hi Tison, Beam has its own set of libraries to implement windowing. Hence, the Flink Runner does not use Flink's windowing but deploys Beam's windowing logic within a Flink operator. If you want to look in the code, have a look at WindowDoFnOperator. Cheers, Max On 21.01.20 10:35, tison

Re: [Interactive Beam] Changes to local pipeline executions

2019-12-18 Thread Maximilian Michels
Thanks for the heads-up, Ning! I haven't tried out interactive Beam, but this puts it back on my radar :) Cheers, Max On 04.12.19 20:45, Ning Kang wrote: *If you are not an Interactive Beam user, you can ignore this email.* Hi Interactive Beam users, We've recently made some changes to

Re: List Admins: Please unsubscribe me (automatic unsubscribe fails)

2019-12-16 Thread Maximilian Michels
Hi Benjamin, There is a way to do this yourself: To unsubscribe a different e-mail - e.g. you used to be subscribed as user@oldname.example - send a message to list-unsubscribe-user=oldname.exam...@apache.org Source: https://apache.org/foundation/mailinglists.html So send an email to:
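The address rewrite described above can be expressed as a small helper (an illustrative sketch of the pattern from the linked page, with the "user" list as an example; the function name is invented):

```python
# To unsubscribe an old address from an ASF list, the '@' in the subscribed
# address becomes '=' and the result is embedded in the -unsubscribe- address,
# e.g. user@oldname.example -> user-unsubscribe-user=oldname.example@apache.org
def unsubscribe_address(subscribed, list_name="user", domain="apache.org"):
    local, _, old_domain = subscribed.partition("@")
    return f"{list_name}-unsubscribe-{local}={old_domain}@{domain}"

print(unsubscribe_address("user@oldname.example"))
# user-unsubscribe-user=oldname.example@apache.org
```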

Re: No filesystem found for scheme s3 using FileIO

2019-12-11 Thread Maximilian Michels
"Maximilian Michels" mailto:m...@apache.org>> wrote: Hi Preston, Sorry about the name mixup, of course I meant to write Preston not Magnus :) See my reply below. cheers, Max On 25.09.19 08:31, Maximilian Michels wr

Re: How side input will work for streaming application in apache beam.

2019-10-22 Thread Maximilian Michels
Hi Jitendra, Side inputs are materialized based on their windowing. If you assign a 10 second window to the side inputs, they can be renewed every 10 seconds. Whenever you access the side input, the newest instance of the side input will be retrieved. Cheers, Max On 22.10.19 10:42,
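The per-window materialization described above can be sketched in plain Python with 10-second fixed windows (names and the lookup-table shape are illustrative, not Beam internals):

```python
# The side input is materialized per window; an element reads the value
# materialized for the window it falls into, i.e. the freshest instance
# for that window.
WINDOW_SIZE = 10  # seconds

def window_of(timestamp):
    return timestamp - (timestamp % WINDOW_SIZE)

# Side input re-materialized for each 10s window (e.g. a refreshed lookup table).
side_inputs = {0: "v1", 10: "v2", 20: "v3"}

def process(timestamp, element):
    return element, side_inputs[window_of(timestamp)]

print(process(3, "a"))   # ('a', 'v1')
print(process(14, "b"))  # ('b', 'v2')
print(process(25, "c"))  # ('c', 'v3')
```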

Re: Python Portable Runner Issues

2019-10-01 Thread Maximilian Michels
Probably the most stable is running on Dataflow still. But I’m excited to see the progress towards a Spark runner, can’t wait to try TFT on it :) That is debatable. It is also hard to compare because Dataflow is a managed service, whereas you'll have to spin up your own cluster for other

Re: No filesystem found for scheme s3 using FileIO

2019-09-25 Thread Maximilian Michels
Hi Preston, Sorry about the name mixup, of course I meant to write Preston not Magnus :) See my reply below. cheers, Max On 25.09.19 08:31, Maximilian Michels wrote: Hi Magnus, Your observation seems to be correct. There is an issue with the file system registration. The two types

Re: No filesystem found for scheme s3 using FileIO

2019-09-25 Thread Maximilian Michels
familiar about Flink sure why S3 is not properly being registered when running the Flink job. Ccing some folks who are more familiar about Flink. +Ankur Goenka <mailto:goe...@google.com> +Maximilian Michels <mailto:m...@apache.org> Thanks, Cham On Sat, Sep 21, 2019 at 9:18 AM Kopri

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-19 Thread Maximilian Michels
That's even better. On 19.09.19 16:35, Robert Bradshaw wrote: On Thu, Sep 19, 2019 at 4:33 PM Maximilian Michels wrote: This is obviously less than ideal for the user... Should we "fix" the Java SDK? Or is the long-term solution here to have runners do this rewrite? I think i

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-19 Thread Maximilian Michels
code paths for Reads. On 19.09.19 11:46, Robert Bradshaw wrote: On Thu, Sep 19, 2019 at 11:22 AM Maximilian Michels wrote: The flag is insofar relevant to the PortableRunner because it affects the translation of the pipeline. Without the flag we will generate primitive Reads which are unsupported in p

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-18 Thread Maximilian Michels
| Software Engineer | github.com/ibzib <http://github.com/ibzib> | kcwea...@google.com <mailto:kcwea...@google.com> On Tue, Sep 17, 2019 at 3:38 PM Ahmet Altay <mailto:al...@google.com>> wrote: On Tue, Sep 17, 2019 at 2:26 PM Maximilian Michels mailto:m.

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-17 Thread Maximilian Michels
here are for the Create and Read transforms.) > On Tue, Sep 17, 2019 at 11:33 AM Maximilian Michels mailto:m...@apache.org>> wrote: >> >> +dev >> >>

Re: Flink Runner logging FAILED_TO_UNCOMPRESS

2019-09-17 Thread Maximilian Michels
+dev The beam_fn_api flag and the way it is automatically set is error-prone. Is there anything that prevents us from removing it? I understand that some Runners, e.g. Dataflow Runner have two modes of executing Python pipelines (legacy and portable), but at this point it seems clear that

Re: How do I run Beam Python pipelines using Flink deployed on Kubernetes?

2019-09-11 Thread Maximilian Michels
Hi Andrea, You could use the new worker_pool_main.py which was developed for the Kubernetes use case. It works together with the external environment factory. Cheers, Max On 11.09.19 18:51, Lukasz Cwik wrote: Yes. On Wed, Sep 11, 2019 at 3:12 AM Andrea Medeghini

Re: Hackathon @BeamSummit @ApacheCon

2019-08-29 Thread Maximilian Michels
Hey, I'm in as well! Austin and I recently talked about how we could organize the hackathon. Likely it will be an hour per day for exchanging ideas and learning about Beam. For example, there has been interest from the Apache Streams project to discuss points for collaboration. We will soon

Re: [External] Re: --state_backend PipelineOption not supported in python when running on Flink

2019-08-28 Thread Maximilian Michels
Maximilian Michels <mailto:m...@apache.org>> wrote: Hi Catlyn, This option has never worked outside the Java SDK where it originates from. For the upcoming Beam 2.16.0 release, we have replaced this option with a factory class: https://github.com/apache/

Re: --state_backend PipelineOption not supported in python when running on Flink

2019-08-27 Thread Maximilian Michels
Hi Catlyn, This option has never worked outside the Java SDK where it originates from. For the upcoming Beam 2.16.0 release, we have replaced this option with a factory class:

Re: Using Beam 2.14.0 + Flink 1.8 with RocksDB state backend - perhaps missing dependency?

2019-08-14 Thread Maximilian Michels
> mailto:tobias.kay...@ricardo.ch>> > wrote: > > * each time :) > > On Mon, Aug 12, 2019 at 9:48 PM Kaymak, Tobias > <mailto:tobias.kay...@ricardo.ch>> wrote: > >

Dropping support for Flink 1.5/1.6

2019-08-13 Thread Maximilian Michels
Dear users of the Flink Runner, Just a heads-up that we may remove Flink 1.5/1.6 support for future versions of Beam (>=2.16.0): https://jira.apache.org/jira/browse/BEAM-7962 Why? We support 1.5 - 1.8 and have a 1.9 version in the making. The increased number of supported versions manifests in a

Re: Using Beam 2.14.0 + Flink 1.8 with RocksDB state backend - perhaps missing dependency?

2019-08-12 Thread Maximilian Michels
Hi Tobias! I've checked if there were any relevant changes to the RocksDB state backend in 1.8.1, but I couldn't spot anything. Could it be that an old version of RocksDB is still in the Flink cluster path? Cheers, Max On 06.08.19 16:43, Kaymak, Tobias wrote: > And of course the moment I

Re: [python] ReadFromPubSub broken in Flink

2019-07-13 Thread Maximilian Michels
Hi Chad, This stub will only be replaced by the Dataflow service. It's an artifact of the pre-portability era. That said, we now have the option to replace ReadFromPubSub with an external transform which would utilize Java's PubSubIO via the new cross-language feature. Thanks, Max On

Re: Why is my RabbitMq message never acknowledged ?

2019-06-14 Thread Maximilian Michels
r such crucial behavior i would expect the pipeline >     to fail with a clear message stating the reason, like in the same >     way when you implement a new Codec and forget to override the >     verifyDeterministic method (don't recall the right name of it). > >     Just my 2

Re: Why is my RabbitMq message never acknowledged ?

2019-06-14 Thread Maximilian Michels
This has come up before: https://issues.apache.org/jira/browse/BEAM-4520 The issue is that checkpoints won't be acknowledged if checkpointing is disabled in Flink. We throw a WARN when unbounded sources are used without checkpointing. Not all unbounded sources actually need to finalize

Re: Parallel computation of windows in Flink

2019-06-12 Thread Maximilian Michels
0% sure, but it seems like the translation from Beam to Flink > could potentially include the windows in the key of `.keyBy` to allow > parallelizing window computations. Does this sound like a reasonable > feature request? > > Mike > > Ladder <http://bit.ly/1VRtWfS>.

Re: Parallel computation of windows in Flink

2019-06-11 Thread Maximilian Michels
c of the data. > >     If possible, can you share your pipeline code at a high level. > >     On Mon, Jun 10, 2019 at 12:58 PM Mike Kaplinskiy >     mailto:m...@ladderlife.com>> wrote: > > >     Ladder <http://bit.ly/1VRtWfS>. The smart, modern way to insure >     y

Re: Python sdk performance

2019-06-10 Thread Maximilian Michels
Hi Mingliang, You can increase the parallelism of the Python SDK Harness via the pipeline option --experimental worker_threads= Note that the workers are Python threads which suffer from the Global Interpreter Lock. We currently do not use real processes, e.g. via multiprocessing. There

Re: Parallel computation of windows in Flink

2019-06-10 Thread Maximilian Michels
Hi Mike, If you set the number of shards to 1, you should get one shard per window, unless you have "ignore windows" set to true. > (The way I'm checking this is via the Flink UI) I'm curious, how do you check this via the Flink UI? Cheers, Max On 09.06.19 22:33, Mike Kaplinskiy wrote: > Hi

Re: [ANNOUNCE] Apache Beam 2.13.0 released!

2019-06-10 Thread Maximilian Michels
Thanks for managing the release, Ankur! @Chad Thanks for the feedback. I agree that we can improve our release notes. The particular issue you were looking for was part of the detailed list [1] linked in the blog post: https://jira.apache.org/jira/browse/BEAM-7029 Cheers, Max [1]

Re: Question about --environment_type argument

2019-05-29 Thread Maximilian Michels
java#L63 >> >> On Tue, May 28, 2019 at 11:39 AM 青雉(祁明良) > <mailto:m...@xiaohongshu.com>> wrote: >>> >>> Yes, I did (2). Since the job server successfully >>> created the artifact directory, I think I did it correctly. And >>>

Re: Question about --environment_type argument

2019-05-28 Thread Maximilian Michels
the artifact directory is successfully created at HDFS by job server, but fails at task manager when reading. Best, Mingliang 获取 Outlook for iOS <https://aka.ms/o0ukef> On Tue, May 28, 2019 at 11:47 PM +0800, "Maximilian Michels" mailto:m...@apache.org>> wrote: Recen

Re: Question about --environment_type argument

2019-05-28 Thread Maximilian Michels
Recent versions of Flink do not bundle Hadoop anymore, but they are still "Hadoop compatible". You just need to include the Hadoop jars in the classpath. Beam's Hadoop does not bundle Hadoop either; it just provides Beam file system abstractions which are similar to Flink "Hadoop

Re: Problem running a pipeline on a remote Flink cluster

2019-05-14 Thread Maximilian Michels
, and when I tested it for the WordCount example it ran without problems. Also, it runs successfully if I run on [local], if I am using incompatible versions this should fail right? Regards, Jorik -Original Message- From: Maximilian Michels Sent: Monday, May 13, 2019 12:37 PM To: user

Re: Wordcount using Python with Flink runner and Kafka source

2019-05-14 Thread Maximilian Michels
Just saw that the malformed master URL was due to HTML formatting. It looks ok. Please check your Flink JobManager logs. The JobManager might not be reachable and the submission is just blocked on it becoming available. Thanks, Max On 13.05.19 20:05, Maximilian Michels wrote: Hi Averell

Re: Wordcount using Python with Flink runner and Kafka source

2019-05-13 Thread Maximilian Michels
till stuck with it. Again, thanks a lot for your help. Regards, Averell On Sat, 11 May 2019, 12:35 am Maximilian Michels, mailto:m...@apache.org>> wrote: Hi Averell, What you want to do is possible today but at this point is an early experimental feature

Re: Problem running a pipeline on a remote Flink cluster

2019-05-13 Thread Maximilian Michels
Hi Jorik, It looks like the version of the Flink Runner and the Flink cluster version do not match. For example, if you use Flink 1.7, make sure to use the beam-runners-flink-1.7 artifact. For more information: https://beam.apache.org/documentation/runners/flink/ Thanks, Max On 12.05.19
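Concretely, the Flink runner artifact name encodes the targeted Flink major version, so the dependency has to be picked to match the cluster. A hedged Maven example for a Flink 1.7 cluster (the Beam version shown is illustrative; use the release you actually run):

```xml
<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-runners-flink-1.7</artifactId>
  <version>2.12.0</version>
</dependency>
```

A `beam-runners-flink-1.7` job submitted to, say, a Flink 1.8 cluster typically fails at submission or deserialization time with version-mismatch errors like the one reported here.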

Re: Wordcount using Python with Flink runner and Kafka source

2019-05-10 Thread Maximilian Michels
Hi Averell, What you want to do is possible today but at this point is an early experimental feature. The reason for that is that Kafka is a cross-language Java transform in a Python pipeline. We just recently enabled cross-language pipelines. 1. First of all, until 2.13.0 is released you

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-03 Thread Maximilian Michels
5.19 22:44, Jan Lukavský wrote: Hi Max, comments inline. On 5/2/19 3:29 PM, Maximilian Michels wrote: Couple of comments: * Flink transforms It wouldn't be hard to add a way to run arbitrary Flink operators through the Beam API. Like you said, once you go down that road, you lose the abil

Re: Will Beam add any overhead or lack certain API/functions available in Spark/Flink?

2019-05-02 Thread Maximilian Michels
is this for a typical user? Do people stop using a particular tool because it is in an X language? I personally would put features first over language portability, and it's completely fine that this may not be in line with Beam's priorities. All said, I can agree Beam's focus on language portability is gre

Re: Scope of windows?

2019-04-30 Thread Maximilian Michels
eam.apache.org On Sun, Apr 28, 2019 at 7:08 PM Reza Rokni wrote: +1 I recall a fun afternoon a few years ago figuring this out ... On Mon, 11 Mar 2019 at 18:36, Maximilian Michels wrote: Hi, I have seen several users including myself get confused by the "default" triggering

Re: Beam Summit at ApacheCon

2019-04-23 Thread Maximilian Michels
Hi Austin, Thanks for the heads-up! I just want to highlight that this is a great chance for Beam. There will be a _dedicated_ Beam track which means that there is potential for lots of new people to learn about Beam. Of course, there will also be many people already involved in Beam. -Max

Re: Is AvroCoder the right coder for me?

2019-04-02 Thread Maximilian Michels
ugusto > On 2019/03/26 12:31:54, Maximilian Michels wrote: > Hi Augusto, > Generally speaking Avro should provide very good performance. The calls you are seeing should not be significa

Re: Flink 1.7.2 + Beam 2.11 error: The transform beam:transform:create_view:v1 is currently not supported.

2019-03-29 Thread Maximilian Michels
fixes it. On Fri, Mar 29, 2019 at 11:53 AM Maximilian Michels <m...@apache.org> wrote: Hi Tobias, Thanks for reporting. I can confirm this is a regression in the detection of the execution mode. Everything should work fine if you set the "streaming

Re: Imposing parallelism

2019-03-29 Thread Maximilian Michels
Hi Augusto, In Beam there is no way to specify how parallel a specific transform should be. There is only a general indicator for how parallel a pipeline should be, i.e. Dataflow has "numWorkers", Spark/Flink have "parallelism". You should see 16 parallel operations for your Read if you
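As a concrete illustration of the pipeline-wide knob mentioned above, parallelism on the Flink runner is passed as a single pipeline option at submission time (a sketch; the module name and master address are placeholders for your own pipeline and cluster):

```shell
python -m my_pipeline \
  --runner=FlinkRunner \
  --flink_master=localhost:8081 \
  --parallelism=16
```

Every transform in the pipeline then runs with up to 16 parallel instances; there is no per-transform override in Beam itself.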

Re: Flink 1.7.2 + Beam 2.11 error: The transform beam:transform:create_view:v1 is currently not supported.

2019-03-29 Thread Maximilian Michels
Hi Tobias, Thanks for reporting. I can confirm this is a regression in the detection of the execution mode. Everything should work fine if you set the "streaming" flag to true. Will be fixed for the 2.12.0 release. Thanks, Max On 28.03.19 17:28, Lukasz Cwik wrote: +dev
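The workaround described here — forcing streaming execution instead of relying on auto-detection — is a single pipeline option. A hedged invocation sketch for a Java pipeline (the main class and build setup are placeholders):

```shell
# Force streaming mode until the execution-mode detection regression
# is fixed in 2.12.0.
mvn exec:java -Dexec.mainClass=com.example.MyPipeline \
  -Dexec.args="--runner=FlinkRunner --streaming=true"
```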
