Re: Output timestamp for Python event timers

2020-08-11 Thread Boyuan Zhang
Hi Maximilian, It makes sense to set hold_timestamp as fire_timestamp when the fire_timestamp is in the event time domain. Otherwise, the system may advance the watermark incorrectly. I think we can do something similar to Java FnApiRunner[1]: - Expose set_output_timestamp API to python

Re: Stateful Pardo Question

2020-08-11 Thread jmac...@godaddy.com
Ahhh I see. Thank you very much for this additional info. Really helpful! I think after considering further, its probably more appropriate and less risky in my current scenario to try to use the Session combiner. I did really like the Stateful ParDo way of doing things tho, if it were simpler

Re: Stateful Pardo Question

2020-08-11 Thread jmac...@godaddy.com
+1 From: Reza Ardeshir Rokni Reply-To: "dev@beam.apache.org" Date: Sunday, August 9, 2020 at 5:05 PM To: dev Subject: Re: Stateful Pardo Question Notice: This email is from an external sender. +1 on having the behavior clearly documented, would also be great to try and add more stat and

Re: [BEAM-10292] change proposal to DefaultFilenamePolicy.ParamsCoder

2020-08-11 Thread Luke Cwik
The filesystem "fixes" all surmount to removing the "isDirectory" boolean bit and encoding whether something is a directory in the string part of the resource specification which also turns out to be backwards incompatible (just in a different way). Removing the "directory" bit would be great and

Re: Output timestamp for Python event timers

2020-08-11 Thread Yichi Zhang
+1 to expose set_output_timestamp and enrich python set timer api. On Tue, Aug 11, 2020 at 12:01 PM Boyuan Zhang wrote: > Hi Maximilian, > > It makes sense to set hold_timestamp as fire_timestamp when the > fire_timestamp is in the event time domain. Otherwise, the system may > advance the

Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

2020-08-11 Thread Luke Cwik
There shouldn't be any changes required since the wrapper will smoothly transition the execution to be run as an SDF. New IOs should strongly prefer to use SDF since it should be simpler to write and will be more flexible but they can use the "*Source"-based APIs. Eventually we'll deprecate the

Study On Rejected Refactorings

2020-08-11 Thread Jevgenija Pantiuchina
Dear contributors, As part of a research team from Università della Svizzera italiana (Switzerland) and University of Sannio (Italy), we have analyzed refactoring pull requests in apache/beam repository and are looking for developers for a short 5-10 min survey

Re: Output timestamp for Python event timers

2020-08-11 Thread Luke Cwik
+1 on what Boyuan said. It is important that the defaults for processing time domain differ from the defaults for the event time domain. On Tue, Aug 11, 2020 at 12:36 PM Yichi Zhang wrote: > +1 to expose set_output_timestamp and enrich python set timer api. > > On Tue, Aug 11, 2020 at 12:01 PM

Re: [PROPOSAL] Preparing for Beam 2.24.0 release

2020-08-11 Thread Daniel Oliveira
I'd like to send out a last minute reminder to fill out CHANGES.md with any major changes that are going to be in 2.24.0. If you need a quick review for that, just add me as a reviewer to your PR (GitHub username is "youngoli"). I'll keep an

Re: Memory Issue When Running Beam On Flink

2020-08-11 Thread Jan Lukavský
Hi David, what's the state backend you use for Flink? The default probably would be FsStateBackend, which stores whole state in memory of TaskManager. That could explain the behavior you are seeing, as the deduplication has to store all seen keys in memory. I'm afraid that although the key is

Re: Status of dynamic worker scaling with Kafka consumers

2020-08-11 Thread Alexey Romanenko
Hi Adam, 1) Correct. Current KafkaIO.Read implementation is based on Beam “UnboundedSource” which requires to have fixed number of splits at DAG construction time. 2) Correct. Dynamic topics and partitions discovering is a long story in Beam. Since you are interested in this, it would be

Re: Memory Issue When Running Beam On Flink

2020-08-11 Thread Maximilian Michels
Hi! Looks like a potential leak, caused by your code or by Beam itself. Would you be able to supply a heap dump from one of the task managers? That would greatly help debugging this issue. -Max On 07.08.20 00:19, David Gogokhiya wrote: Hi, We recently started using Apache Beam version

Re: Status of dynamic worker scaling with Kafka consumers

2020-08-11 Thread Adam Bellemare
Thank you Alexey, I appreciate your responses. On Tue, Aug 11, 2020 at 10:57 AM Alexey Romanenko wrote: > Hi Adam, > > 1) Correct. Current KafkaIO.Read implementation is based on Beam > “UnboundedSource” which requires to have fixed number of splits at DAG > construction time. > 2) Correct. > >

Re: [DISCUSS][BEAM-10670] Migrating BoundedSource/UnboundedSource to execute as a Splittable DoFn for non-portable Java runners

2020-08-11 Thread Alexey Romanenko
Hi Luke, Great to hear about such progress on this! Talking about opt-out for all runners in the future, will it require any code changes for current “*Source”-based IOs or the wrappers should completely smooth this transition? Do we need to require to create new IOs only based on SDF or

Re: Status of dynamic worker scaling with Kafka consumers

2020-08-11 Thread Adam Bellemare
Hello Alexey Thank you for replying to my questions. A number of my colleagues have been musing about the idea of dynamically changing the partition count of Apache Kafka's input topics for Beam jobs during runtime (We intend to use the Google Dataflow runner for our jobs). I have been hesitant

Output timestamp for Python event timers

2020-08-11 Thread Maximilian Michels
We ran into problems setting event time timers per-element in the Python SDK. Pipeline progress would stall. Turns out, although the Python SDK does not expose the timer output timestamp feature to the user, it sets the timer output timestamp to the current input timestamp of an element.