Re: [VOTE] Release 2.56.0, release candidate #2

2024-04-29 Thread Jan Lukavský
+1 (binding). Tested Java SDK with Flink runner.  Jan On 4/28/24 15:32, XQ Hu via dev wrote: +1 (non-binding). Tested it using the dataflow ML pipeline: https://github.com/google/dataflow-ml-starter/actions/runs/8862170843/job/24334816481 On Sat, Apr 27, 2024 at 7:42 AM Danny McCormick via

Re: [VOTE] Release 2.56.0, release candidate #1

2024-04-26 Thread Jan Lukavský
+1 (binding) Validated Java SDK with Flink runner.  Jan On 4/25/24 05:17, XQ Hu via dev wrote: +1 (non binding). Tested the simple Dataflow ML job: https://github.com/google/dataflow-ml-starter/actions/runs/8824985423/job/24228468173 On Wed, Apr 24, 2024 at 2:01 PM Danny McCormick via dev

Re: PCollection#applyWindowingStrategyInternal

2024-04-25 Thread Jan Lukavský
;retraction combine"). On 4/23/24 18:08, Reuven Lax via dev wrote: On Tue, Apr 23, 2024 at 7:52 AM Jan Lukavský wrote: On 4/22/24 20:40, Kenneth Knowles wrote: I'll go ahead and advertise https://s.apache.org/beam-sink-triggers again for this thread. +1 There a

Re: PCollection#applyWindowingStrategyInternal

2024-04-23 Thread Jan Lukavský
with timers. One could however still propagate the trigger upstream of the stateful ParDo, though I'm not sure if that's the best approach. On Mon, Apr 15, 2024 at 11:31 PM Jan Lukavský wrote: On 4/11/24 18:20, Reuven Lax via dev wrote: I'm not sure it would require all th

Re: PCollection#applyWindowingStrategyInternal

2024-04-16 Thread Jan Lukavský
ream ParDo emits any data. Yes, one can argue that stateful ParDo is supposed to emit data at fast as possible, then this seems to work. On Thu, Apr 11, 2024 at 5:10 AM Jan Lukavský wrote: I've probably heard about it, but I never read the proposal. Sounds great, but that would require

Re: PCollection#applyWindowingStrategyInternal

2024-04-11 Thread Jan Lukavský
to today where we attach triggering to the windowing information. This was a proposal some years back and there was some effort made to implement it, but the implementation never really got off the ground. On Wed, Apr 10, 2024 at 12:43 AM Jan Lukavský wrote: On 4/9/24 18:33, Kenneth Kno

Re: PCollection#applyWindowingStrategyInternal

2024-04-10 Thread Jan Lukavský
iggering semantics is 100% correct"? Probably the - wrong - expectations at the beginning of this thread were due to conflict in my mental model of how things 'could' work as opposed to how they actually work. :)  Jan Kenn On Tue, Apr 9, 2024 at 9:19 AM Jan Lukavský wrote: On 4/6/24 2

Re: PCollection#applyWindowingStrategyInternal

2024-04-09 Thread Jan Lukavský
ly documented, so I doubt many users know about this! Reuven On Sat, Apr 6, 2024 at 7:09 AM Jan Lukavský wrote: Immediate self-correction, although setting the strategy directly via setWindowingStrategyInternal() *seemed* to be working during Pipeline construction time, during runti

Re: PCollection#applyWindowingStrategyInternal

2024-04-06 Thread Jan Lukavský
, but there remains the other question if we can make flattening PCollections with incompatible windowFns more user-friendly. The current approach where we require the same windowFn for all input PCollections creates some unnecessary boilerplate code needed on user side.  Jan On 4/6/24 15:45, Jan

PCollection#applyWindowingStrategyInternal

2024-04-06 Thread Jan Lukavský
Hi, I came across a case where using PCollection#applyWindowingStrategyInternal seems legit in user core. The case is roughly as follows:  a) compute some streaming statistics  b) apply the same transform (say ComputeWindowedAggregation) with different parameters on these statistics

Re: Patch release proposal

2024-03-28 Thread Jan Lukavský
+1 to either doing full release or deferring to 2.56.0.  Jan On 3/28/24 16:52, Yi Hu via dev wrote: > Just releasing Python can break multi-lang by default (unless expansion service is overridden manually) since we match versions across languages when picking the default expansion service.

Re: [VOTE] Release 2.55.0, release candidate #3

2024-03-21 Thread Jan Lukavský
+1 (binding) Tested Java SDK with FlinkRunner.  Jan On 3/20/24 22:40, Chamikara Jayalath via dev wrote: +1 (binding) Tested multi-lang Java/Python pipelines and upgrading BQ/Kafka transforms from 2.53.0 to 2.55.0 using the Transform Service. Thanks, Cham On Tue, Mar 19, 2024 at 2:10 PM

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-29 Thread Jan Lukavský
not create the problem of "vector watermarks". The Throttle transform would then use the backling for feedback loop to slowdown the request rate. On 2/29/24 14:57, Jan Lukavský wrote: From my understanding Flink rate limits based on local information only. On the other hand - in cas

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-29 Thread Jan Lukavský
(the processor/server + DoFn client) could be built, but not purely be limited to Rate limiting/Throttling. Possibly mumble mumble StatePipe? But that feels like a harder problem for the time being. Robert Burke On 2024/02/28 08:25:35 Jan Lukavský wrote: On 2/27/24 19:49, Robert Bradshaw via dev wrote

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-28 Thread Jan Lukavský
On 2/27/24 19:49, Robert Bradshaw via dev wrote: On Tue, Feb 27, 2024 at 10:39 AM Jan Lukavský wrote: On 2/27/24 19:22, Robert Bradshaw via dev wrote: On Mon, Feb 26, 2024 at 11:45 AM Kenneth Knowles wrote: Pulling out focus points: On Fri, Feb 23, 2024 at 7:21 PM Robert Bradshaw via dev

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-27 Thread Jan Lukavský
at are too ahead of time do not blow up downstream state. These might be related concepts. We'd need a discussion of what an SDK must do if the runner doesn't support the central clock for completeness, and consistency. On Tue, Feb 27, 2024, 6:58 AM Jan Lukavský wrote: On 2/27/24 14:51, Kenneth

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-27 Thread Jan Lukavský
I'm tailing logs and was eagerly started before they were fully written, or waiting for some kind of (non-data-dependent) quiessence or other operation to finish). On Fri, Feb 23, 2024 at 12:36 AM Jan Lukavský wrote: For me it always helps to seek analogy in our physical reality. Stream

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-27 Thread Jan Lukavský
he central clock for completeness, and consistency. On Tue, Feb 27, 2024, 6:58 AM Jan Lukavský wrote: On 2/27/24 14:51, Kenneth Knowles wrote: I very much like the idea of processing time clock as a parameter to @ProcessElement. That will be obviously useful and remove a

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-27 Thread Jan Lukavský
yes, I was just trying to restate the streaming processing time semantics in the limited batch case. Kenn On Tue, Feb 27, 2024 at 2:40 AM Jan Lukavský wrote: I think that before we introduce a possibly somewhat duplicate new feature we should be certain that it is really semantically

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-26 Thread Jan Lukavský
; > can't think of a batch or streaming scenario where it would be correct > > to not wait at least that long (even in batch inputs, e.g. suppose I'm > > tailing logs and was eagerly started before they were fully wri

Re: [DISCUSS] Processing time timers in "batch" (faster-than-wall-time [re]processing)

2024-02-23 Thread Jan Lukavský
For me it always helps to seek analogy in our physical reality. Stream processing actually has quite a good analogy for both event-time and processing-time - the simplest model for this being relativity theory. Event-time is the time at which events occur _at distant locations_. Due to finite

Re: Pipeline upgrade to 2.55.0-SNAPSHOT broken for FlinkRunner

2024-02-21 Thread Jan Lukavský
be left in their original packages for backwards compatibility reasons? On Wed, Feb 21, 2024 at 7:32 AM Jan Lukavský wrote: > > Hi, > > while implementing FlinkRunner for Flink 1.17 I tried to verify that a

Re: Throttle PTransform

2024-02-21 Thread Jan Lukavský
, but severely limiting the state size. However I wouldn't start here - we would want to build the simpler implementation first and see how it performs. +1 On Wed, Feb 21, 2024 at 8:53 AM Robert Bradshaw via dev wrote: On Wed, Feb 21, 2024 at 12:48 AM Jan Lukavský wrote: > &

Re: Throttle PTransform

2024-02-21 Thread Jan Lukavský
On 2/21/24 17:52, Robert Bradshaw via dev wrote: On Wed, Feb 21, 2024 at 12:48 AM Jan Lukavský wrote: Hi, I have left a note regarding the proposed splitting of batch and streaming expansion of this transform. In general, a need for such split triggers doubts in me. This signals that either

Pipeline upgrade to 2.55.0-SNAPSHOT broken for FlinkRunner

2024-02-21 Thread Jan Lukavský
Hi, while implementing FlinkRunner for Flink 1.17 I tried to verify that a running Pipeline is able to successfully upgrade from Flink 1.16 to Flink 1.17. There is some change regarding serialization needed for Flink 1.17, so this was a concern. Unfortunately recently we merged

Re: Throttle PTransform

2024-02-21 Thread Jan Lukavský
Hi, I have left a note regarding the proposed splitting of batch and streaming expansion of this transform. In general, a need for such split triggers doubts in me. This signals that either  a) the transform does something is should not, or  b) Beam model is not complete in terms of being

Re: [ANNOUNCE] New Committer: Svetak Sundhar

2024-02-15 Thread Jan Lukavský
Congrats Svetak! On 2/14/24 16:11, Yi Hu via dev wrote: Congrats, Svetak! On Wed, Feb 14, 2024 at 9:50 AM John Casey via dev wrote: Congrats Svetak! On Wed, Feb 14, 2024 at 9:00 AM Ahmed Abualsaud wrote: Congrats Svetak! On 2024/02/14 02:05:02 Priyans Desai

Re: [VOTE] Release 2.54.0, release candidate #2

2024-02-07 Thread Jan Lukavský
+1 (binding) Validated Java SDK with Flink runner.  Jan On 2/7/24 06:23, Robert Burke via dev wrote: Hi everyone, Please review and vote on the release candidate #2 for the version 2.54.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific

Re: [DESIGN PROPOSAL] Reshuffle Allowing Duplicates

2024-01-31 Thread Jan Lukavský
Hi, if I understand this proposal correctly, the motivation is actually reducing latency by bypassing bundle atomic guarantees, bundles after "at least once" Reshuffle would be reconstructed independently of the pre-shuffle bundling. Provided this is correct, it seems that the behavior is

Re: @RequiresTimeSortedInput adoption by runners

2024-01-20 Thread Jan Lukavský
the Java Validates Runner suite. Robert Burke Beam Go Busybody On Fri, Jan 19, 2024, 6:41 AM Jan Lukavský wrote: I was primarily focused on Java SDK (and core-contruction-java), but generally speaking, any SDK can provide default expansion that runners can use so that it is not (shou

Re: @RequiresTimeSortedInput adoption by runners

2024-01-19 Thread Jan Lukavský
eam_runner_api.proto#L1132 [4]https://github.com/apache/beam/issues?q=is%3Aissue+is%3Aopen+OrderedListState [5]https://github.com/search?q=repo%3Aapache%2Fbeam+RequiresTimeSortedInput=code=2 [6]https://github.com/apache/beam/blob/b4c23b32f2b80ce052c8a235e5064c69f37df992/website/www/site/content/e

@RequiresTimeSortedInput adoption by runners

2024-01-18 Thread Jan Lukavský
Hi, recently I came across the fact that most runners do not support @RequiresTimeSortedInput annotation for sorting per-key data by event timestamp [1]. Actually, runners supporting it seem to be Direct java, Flink and Dataflow batch (as it is a noop there). The annotation has use-cases in

Re: [VOTE] Release 2.53.0, release candidate #2

2023-12-28 Thread Jan Lukavský
+1 (binding) Tested Java SDK with Flink Runner.  Jan On 12/27/23 14:13, Danny McCormick via dev wrote: +1 (non-binding) Tested with some example ML notebooks. Thanks, Danny On Tue, Dec 26, 2023 at 6:41 PM XQ Hu via dev wrote: +1 (non-binding) Tested with the simple RunInference

Re: [VOTE] Release 2.52.0, release candidate #5

2023-11-15 Thread Jan Lukavský
+1 (binding) Validated Java SDK with Flink runner on own use cases.  Jan On 11/15/23 11:35, Jean-Baptiste Onofré wrote: +1 (binding) Quickly tested Java SDK and checked the legal part (hash, signatures, headers). Regards JB On Tue, Nov 14, 2023 at 12:06 AM Danny McCormick via dev wrote:

Re: [VOTE] Release 2.52.0, release candidate #4

2023-11-13 Thread Jan Lukavský
+1 (binding) Validated Java SDK with Flink runner on own use cases.  Jan On 11/12/23 00:44, Danny McCormick via dev wrote: Hi everyone, Please review and vote on the release candidate #3 for the version 2.52.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release

Re: [VOTE] Release 2.52.0, release candidate #3

2023-11-09 Thread Jan Lukavský
+1 (binding) Validated Java SDK with Flink runner on own use cases.  Jan On 11/9/23 03:31, Danny McCormick via dev wrote: Hi everyone, Please review and vote on the release candidate #3 for the version 2.52.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release

Re: [VOTE] Release 2.52.0, release candidate #2

2023-11-08 Thread Jan Lukavský
+1 (binding) Validated Java SDK with Flink runner on own use cases.  Jan On 11/8/23 00:24, Danny McCormick via dev wrote: Hi everyone, Please review and vote on the release candidate #2 for the version 2.52.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release

[LAZY CONSENSUS] Deprecate Euphoria extension

2023-11-02 Thread Jan Lukavský
Hi, according to discussion [1], because no objections were raised and the overall usage (artifact download stats) is negligible compared to other Beam artifacts, I'll proceed with deprecating the Euphoria extension, unless there are any objections within 72 hours (excluding weekend). Best,

Re: Processing time watermarks in KinesisIO

2023-11-01 Thread Jan Lukavský
but let users specify it manually, provided they know the consequences. Jan [1] https://issues.apache.org/jira/browse/BEAM-591 On 10/31/23 21:36, Robert Bradshaw via dev wrote: On Tue, Oct 31, 2023 at 10:28 AM Jan Lukavský wrote: On 10/31/23 17:44, Robert Bradshaw via dev wrote: There are rea

Re: Processing time watermarks in KinesisIO

2023-10-31 Thread Jan Lukavský
a bit broken if I understand correctly. +1 On Tue, Oct 31, 2023 at 1:16 AM Jan Lukavský wrote: I think that instead of deprecating and creating new version, we could leverage the proposed update compatibility flag for this [1]. I still have some doubts if the processing-time watermarking

Re: Processing time watermarks in KinesisIO

2023-10-31 Thread Jan Lukavský
deprecated /“org.apache.beam.sdk.io.kinesis.KinesisIO”/ one. — Alexey On 27 Oct 2023, at 17:42, Jan Lukavský wrote: No, I'm referring to this [1] policy which has unexpected (and hardly avoidable on the user-code side) data loss issues. The problem is that assigning timestamps t

Re: Processing time watermarks in KinesisIO

2023-10-27 Thread Jan Lukavský
/releases/javadoc/current/org/apache/beam/sdk/io/kinesis/KinesisIO.Read.html#withProcessingTimeWatermarkPolicy-- On 10/27/23 16:51, Alexey Romanenko wrote: Why not just to create a custom watermark policy for that? Or you mean to make it as a default policy? — Alexey On 27 Oct 2023, at 10:25, Jan

Processing time watermarks in KinesisIO

2023-10-27 Thread Jan Lukavský
Hi, when discussing about [1] we found out, that the issue is actually caused by processing time watermarks in KinesisIO. Enabling this watermark outputs watermarks based on current processing time, _but event timestamps are derived from ingestion timestamp_. This can cause unbounded

Re: Reshuffle PTransform Design Doc

2023-10-20 Thread Jan Lukavský
ing to have one Beam primitive for each thing that is probably a runner primitive. On Thu, Oct 19, 2023 at 2:25 PM Kenneth Knowles wrote: On Fri, Oct 13, 2023 at 12:51 PM Jan Lukavský wrote: Hi, I think there's been already said nearly everything in this thread, but ... it is time for Friday disc

Re: [YAML] Aggregations

2023-10-19 Thread Jan Lukavský
On 10/19/23 19:41, Robert Bradshaw via dev wrote: On Thu, Oct 19, 2023 at 10:25 AM Jan Lukavský wrote: On 10/19/23 18:28, Robert Bradshaw via dev wrote: On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis wrote: Rill is definitely SQL-oriented but I think that's going to be the most common

Re: [YAML] Aggregations

2023-10-19 Thread Jan Lukavský
On 10/19/23 18:28, Robert Bradshaw via dev wrote: On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis wrote: Rill is definitely SQL-oriented but I think that's going to be the most common. Dataframes are explicitly modeled on the relational approach so that's going to look a lot like SQL, I think

Re: KafkaIO does not make use of Kafka Consumer Groups [kafka] [java] [io]

2023-10-18 Thread Jan Lukavský
Hi, my two cents on this. While it would perfectly possible to use consumer group in KafkaIO, it has its own issues. The most visible would be, that using subscriptions might introduce unnecessary duplicates in downstream processing. The reason for this is that consumer in a consumer group

Re: [ANNOUNCE] New Committer: Sam Whittle

2023-10-17 Thread Jan Lukavský
Congrats Sam! On 10/16/23 22:34, Austin Bennett wrote: Thanks, Sam! On Mon, Oct 16, 2023 at 12:39 PM XQ Hu via dev wrote: Congratulations! On Mon, Oct 16, 2023 at 1:58 PM Ahmet Altay via dev wrote: Congratulations Sam! On Mon, Oct 16, 2023 at 10:42 AM Byron

Re: [ANNOUNCE] New Committer: Byron Ellis

2023-10-17 Thread Jan Lukavský
Congrats Byron! On 10/16/23 22:33, Austin Bennett wrote: thanks, Byron! On Mon, Oct 16, 2023 at 12:38 PM XQ Hu via dev wrote: Congratulations! On Mon, Oct 16, 2023 at 1:58 PM Ahmet Altay via dev wrote: Congratulations Byron! On Mon, Oct 16, 2023 at 10:35 AM

Re: [DISCUSS] Drop Euphoria extension

2023-10-16 Thread Jan Lukavský
/16/23 15:10, Alexey Romanenko wrote: Can we just deprecate it for a while and then remove completely? — Alexey On 13 Oct 2023, at 18:59, Jan Lukavský wrote: Hi, it has been some time since Euphoria extension [1] has been adopted by Beam as a possible "Java 8 API". Beam has evo

[DISCUSS] Drop Euphoria extension

2023-10-13 Thread Jan Lukavský
Hi, it has been some time since Euphoria extension [1] has been adopted by Beam as a possible "Java 8 API". Beam has evolved from that time a lot, the current API seems actually more elegant than the original Euphoria's and last but not least, it has no maintainers and no known users. If

Re: Reshuffle PTransform Design Doc

2023-10-13 Thread Jan Lukavský
ri, Oct 6, 2023 at 3:07 PM Jan Lukavský wrote: On 10/6/23 15:11, Kenneth Knowles wrote: On Fri, Oct 6, 2023 at 3:20 AM Jan Lukavský wrote: Hi, there is also one other thing to mention with relation to Reshuffle/RequiresStableinput and that

Re: Reshuffle PTransform Design Doc

2023-10-06 Thread Jan Lukavský
On 10/6/23 15:11, Kenneth Knowles wrote: On Fri, Oct 6, 2023 at 3:20 AM Jan Lukavský wrote: Hi, there is also one other thing to mention with relation to Reshuffle/RequiresStableinput and that is that our current implementation of RequiresStableInput can break without

Re: Reshuffle PTransform Design Doc

2023-10-06 Thread Jan Lukavský
Hi, there is also one other thing to mention with relation to Reshuffle/RequiresStableinput and that is that our current implementation of RequiresStableInput can break without Reshuffle in some corner cases on most portable runners, at least with Java GreedyPipelineFuser, see [1]. The only

Re: [VOTE] Release 2.51.0, release candidate #1

2023-10-05 Thread Jan Lukavský
+1 (binding) Tested Java SDK with Flink Runner on own test-cases.  Jan On 10/4/23 21:10, Bruno Volpato via dev wrote: +1 (non-binding). Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates (Java SDK 11, Dataflow Runner using both legacy and v2). Thanks Kenn! On Wed, Oct

Re: [ANNOUNCE] New PMC Member: Robert Burke

2023-10-04 Thread Jan Lukavský
Congrats Robert! On 10/4/23 10:29, Alexey Romanenko wrote: Congrats Robert, very well deserved! — Alexey On 4 Oct 2023, at 00:39, Austin Bennett wrote: Thanks for all you do @Robert Burke  ! On Tue, Oct 3, 2023 at 12:53 PM Ahmed Abualsaud wrote: Congrats

Re: [ANNOUNCE] New PMC Member: Valentyn Tymofieiev

2023-10-04 Thread Jan Lukavský
Congrats Valentyn! On 10/4/23 10:26, Alexey Romanenko wrote: Congrats Valentyn, very well deserved! — Alexey On 4 Oct 2023, at 00:39, Austin Bennett wrote: Thanks for everything @Valentyn Tymofieiev  ! On Tue, Oct 3, 2023 at 12:53 PM Ahmed Abualsaud wrote:

Re: [ANNOUNCE] New PMC Member: Alex Van Boxel

2023-10-04 Thread Jan Lukavský
Congrats Alex! On 10/4/23 10:29, Alexey Romanenko wrote: Congrats Alex, very well deserved! — Alexey On 4 Oct 2023, at 00:38, Austin Bennett wrote: Thanks for all you do, @Alex Van Boxel  ! On Tue, Oct 3, 2023 at 12:50 PM Ahmed Abualsaud via dev wrote:

Re: Runner Bundling Strategies

2023-09-27 Thread Jan Lukavský
er bundles retry. On Wed, Sep 27, 2023 at 11:20 AM Jan Lukavský wrote: What is the reason to rely on StartBundle and not Setup in this case? If the life-cycle of bundle is not "closed" (i.e. start - finish), then it seems to be ill defined and Setup should do? I'm tr

Re: Runner Bundling Strategies

2023-09-27 Thread Jan Lukavský
robably have to include @StartBundle in that consideration. Kenn On Tue, Sep 26, 2023 at 8:54 AM Kenneth Knowles wrote: On Mon, Sep 25, 2023 at 1:19 PM Jan Lukavský wrote: Hi Kenn and Reuven, I agree wi

Re: Runner Bundling Strategies

2023-09-25 Thread Jan Lukavský
eam/issues/28650 On 9/25/23 18:31, Reuven Lax via dev wrote: On Mon, Sep 25, 2023 at 6:19 AM Jan Lukavský wrote: On 9/23/23 18:16, Reuven Lax via dev wrote: Two separate things here: 1. Yes, a watermark can update in the middle of a bundle. 2. The records in the bundle

Re: Runner Bundling Strategies

2023-09-25 Thread Jan Lukavský
watermark propagation until a checkpoint (which is typically the order of seconds). This delay would add up after each stage. Reuven On Sat, Sep 23, 2023 at 12:03 AM Jan Lukavský wrote: > Watermarks shouldn't be (visibly) advanced until @FinishBundle is committed, as there's

Re: Runner Bundling Strategies

2023-09-23 Thread Jan Lukavský
rks is runner-dependent (e.g. Flink does not store watermarks in checkpoints, they are always recomputed from scratch on restore). [1] https://lists.apache.org/thread/10db7l9bhnhmo484myps723sfxtjwwmv On 9/22/23 21:47, Robert Bradshaw via dev wrote: On Fri, Sep 22, 2023 at 10:58 AM Jan Lukav

Re: Runner Bundling Strategies

2023-09-22 Thread Jan Lukavský
On 9/22/23 18:07, Robert Bradshaw via dev wrote: On Fri, Sep 22, 2023 at 7:23 AM Byron Ellis via dev wrote: I've actually wondered about this specifically for streaming... if you're writing a pipeline there it seems like you're often going to want to put high fixed cost things

Re: Runner Bundling Strategies

2023-09-22 Thread Jan Lukavský
basically) On Fri, Sep 22, 2023 at 5:09 AM Jan Lukavský wrote: Flink defines bundles in terms of number of elements and processing time, by default 1000 elements or 1000 milliseconds, whatever happens first. But bundles are not a "natural" concept in Flink,

Re: Runner Bundling Strategies

2023-09-22 Thread Jan Lukavský
Flink defines bundles in terms of number of elements and processing time, by default 1000 elements or 1000 milliseconds, whatever happens first. But bundles are not a "natural" concept in Flink, it uses them merely to comply with the Beam model. By default, checkpoints are unaligned with

Re: Stateful Beam Job with Flink Runner - Checkpoint Size Increasing Over Time

2023-09-19 Thread Jan Lukavský
Hi, Hemant, can you please share the code of the Pipeline? Do you use side inputs? Besides what Kenn already described: > 2.  When is the state information cleared on the WindowDoFn (TUMBLE windows)  on window closure ? When will global states and timers get cleared? The state and timers

Re: [ANNOUNCE] New committer: Ahmed Abualsaud

2023-08-25 Thread Jan Lukavský
Congrats Ahmed! On 8/25/23 07:56, Anand Inguva via dev wrote: Congratulations Ahmed :) On Fri, Aug 25, 2023 at 1:17 AM Damon Douglas wrote: Well deserved! Congratulations, Ahmed! I'm so happy for you. On Thu, Aug 24, 2023, 5:46 PM Byron Ellis via dev wrote:

Re: [VOTE] Release 2.49.0, release candidate #2

2023-07-13 Thread Jan Lukavský
+1 (binding) Tested Java SDK with FlinkRunner.  Jan On 7/13/23 02:30, Bruno Volpato via dev wrote: +1 (non-binding). Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates (Java SDK 11, Dataflow runner). Thanks Yi! On Tue, Jul 11, 2023 at 4:23 PM Yi Hu via dev wrote:

Re: [DISCUSS] Enable Github Discussions?

2023-07-04 Thread Jan Lukavský
-1 Totally agree with Byron and Alexey.  Jan On 7/3/23 21:18, Byron Ellis via dev wrote: -1. This just leads to needless fragmentation not to mention being at the mercy of a specific technology provider. On Mon, Jul 3, 2023 at 11:39 AM XQ Hu via dev wrote: +1 with GH discussion.

Re: Is Flink >1.14 really supported by the runners?

2023-06-13 Thread Jan Lukavský
Probably better for dev@ <mailto:dev@beam.apache.org> (added).  Jan On 6/13/23 12:43, Edgar H wrote: Got you, thanks! Are there any plans on supporting 1.17 anytime soon too? El mar, 13 jun 2023, 12:27, Jan Lukavský escribió: Hi Edgar, the website seems to be mist

Re: Watermark Alignment on Flink Runner's UnboundedSourceWrapper

2023-05-23 Thread Jan Lukavský
  Custom Watermark ? Do I need to set AutoWatermarkInterval for stateful Beam Flink Jobs. Or Beam timers can handle it without setting that param ? Thanks On Tue, May 23, 2023 at 12:03 AM Jan Lukavský wrote: Hi Talat, your analysis is correct, aligning watermarks for jobs with high

Re: Watermark Alignment on Flink Runner's UnboundedSourceWrapper

2023-05-23 Thread Jan Lukavský
Hi Talat, your analysis is correct, aligning watermarks for jobs with high watermark skew in input partitions really results in faster checkpoints and reduces the size of state. There are generally two places you can implement this - in user code (the source) or inside runner. The user code

Re: [VOTE] Release 2.47.0, release candidate #3

2023-05-10 Thread Jan Lukavský
+1 (binding) Tested with Java SDK and FlinkRunner.  Jan On 5/9/23 08:44, Chamikara Jayalath via dev wrote: Verified that new containers are valid. Changing my vote to +1 Thanks for fixing this Jack. - Cham On Mon, May 8, 2023 at 2:05 PM Jack McCluskey wrote: I've spent the day

Re: Thoughts on coder evolution

2023-05-04 Thread Jan Lukavský
the Pipeline (state). Not input/output formats. On Wed, May 3, 2023 at 6:58 AM Jan Lukavský wrote: Hi, I'd like to discuss a topic, that from time to time appears in different contexts (e.g. [1]). I'd like restate the problem in a slightly more

Thoughts on coder evolution

2023-05-03 Thread Jan Lukavský
Hi, I'd like to discuss a topic, that from time to time appears in different contexts (e.g. [1]). I'd like restate the problem in a slightly more generic way as: "Should we have a way to completely exchange coder of a PCollection/state of a _running_ Pipeline?". First my motivation for this

Re: [ANNOUNCE] New committer: Damon Douglas

2023-04-25 Thread Jan Lukavský
Congrats Damon! On 4/25/23 06:15, Alex Kosolapov wrote: Congratulations, Damon! *From: *Kenneth Knowles *Reply-To: *"dev@beam.apache.org" *Date: *Monday, April 24, 2023 at 12:52 PM *To: *"dev@beam.apache.org" *Subject: *[EXTERNAL] [ANNOUNCE] New committer: Damon Douglas Hi all, Please

Re: Is there any way to set the parallelism of operators like group by, join?

2023-04-21 Thread Jan Lukavský
Let's use this thread to discuss how to configure a pipeline for runners so that they can scale workers appropriately without exposing runner-specific details to the Beam model. Ning. On Thu, Apr 20, 2023 at 1:41 PM Jan Lukavský wrote: Hi Ning,

Re: [ANNOUNCE] New committer: Anand Inguva

2023-04-21 Thread Jan Lukavský
Congrats Anand! On 4/21/23 20:05, Robert Burke wrote: Congratulations Anand! On Fri, Apr 21, 2023, 10:55 AM Danny McCormick via dev wrote: Woohoo, congrats Anand! This is very well deserved! On Fri, Apr 21, 2023 at 1:54 PM Chamikara Jayalath wrote: Hi all,

Re: [DISCUSS] @Experimental, @Internal, @Stable, etc annotations

2023-04-03 Thread Jan Lukavský
Hi, removing @Experimental and adding explicit @Stable annotation makes sense to me. FWIW, when we were designing Euphoria API, we adopted the following convention:  - the default stability of "evolving", @Experimental for really experimental code [1]  - target @Audience of API [2]

Re: [DESIGN] Beam Triggered side input specification

2023-03-29 Thread Jan Lukavský
own, but would also require some what more complicated logic in GBK (splitting window into panes, holding state for each pane independently, merging state for accumulating triggers, ...).   Jan On 3/28/23 17:26, Reuven Lax via dev wrote: On Tue, Mar 28, 2023 at 12:39 AM Jan Lukavský wr

Re: [DESIGN] Beam Triggered side input specification

2023-03-28 Thread Jan Lukavský
carry timestamp defined by TimestampCombiner. Makes sense now, thanks. +1  Jan On 3/28/23 09:39, Jan Lukavský wrote: On 3/27/23 19:44, Reuven Lax via dev wrote: On Mon, Mar 27, 2023 at 5:43 AM Jan Lukavský wrote: Hi, I'd like to clarify my understanding. Side inputs generally

Re: [DESIGN] Beam Triggered side input specification

2023-03-28 Thread Jan Lukavský
On 3/27/23 19:44, Reuven Lax via dev wrote: On Mon, Mar 27, 2023 at 5:43 AM Jan Lukavský wrote: Hi, I'd like to clarify my understanding. Side inputs generally perform a left (outer) join, LHS side is the main input, RHS is the side input. Not completely - it's more

Re: [DESIGN] Beam Triggered side input specification

2023-03-27 Thread Jan Lukavský
Hi, I'd like to clarify my understanding. Side inputs generally perform a left (outer) join, LHS side is the main input, RHS is the side input. Doing streaming left join requires watermark synchronization, thus elements from the main input are buffered until main_input_timestamp >

Re: [VOTE] Release 2.46.0, release candidate #1

2023-03-08 Thread Jan Lukavský
+1 (binding) Tested Java SDK with Flink and Spark 3 runner. Thanks,  Jan On 3/8/23 01:53, Valentyn Tymofieiev via dev wrote: +1. Verified the composition of Python containers and ran Python pipelines on Dataflow runner v1 and runner v2. On Tue, Mar 7, 2023 at 4:11 PM Ritesh Ghorse via dev

Re: Consuming one PCollection before consuming another with Beam

2023-02-28 Thread Jan Lukavský
In the case that the data is too large for side input, you could do the same by reassigning timestamps of the BQ input to BoundedWindow.TIMESTAMP_MIN_VALUE (you would have to do that in a stateful DoFn with a timer having outputTimestamp set to TIMESTAMP_MIN_VALUE to hold watermark, or using

Re: [ANNOUNCE] New PMC Member: Jan Lukavský

2023-02-17 Thread Jan Lukavský
Hi all, Please join me and the rest of the Beam PMC in welcoming Jan Lukavský as our newest PMC member. Jan has been a pa

Re: [VOTE] Release 2.44.0, release candidate #1

2023-01-11 Thread Jan Lukavský
+1 (non-binding) Tested Java SDK with Flink runner. Thanks,  Jan On 1/11/23 17:27, Bruno Volpato via dev wrote: +1 (non-binding) Tested with https://github.com/GoogleCloudPlatform/DataflowTemplates (Java SDK 11, Dataflow runner). Thanks! On Wed, Jan 11, 2023 at 11:08 AM Alexey

Re: SparkRunner - ensure SDF output does not need to fit in memory

2023-01-03 Thread Jan Lukavský
it on Spark not to use 2 thread execution and possibly apply memory pressure. On Mon, Jan 2, 2023 at 4:49 PM Jan Lukavský wrote: There are different translations of streaming and batch Pipelines in SparkRunner, this thread was focused on the batch part, if I understand it correctly

Re: SparkRunner - ensure SDF output does not need to fit in memory

2023-01-02 Thread Jan Lukavský
by the model to actually limit the amount of data produced in a bundle. If unsupported, then unbounded-per-element SDFs wouldn't be supported at all. -Daniel On Mon, Jan 2, 2023 at 7:46 AM Jan Lukavský wrote: Hi Jozef, I agree that this issue is most likely related to Spark for the reason

Re: SparkRunner - ensure SDF output does not need to fit in memory

2023-01-02 Thread Jan Lukavský
Hi Jozef, I agree that this issue is most likely related to Spark for the reason how Spark uses functional style for doing flatMap(). It could be fixed with the following two options:  a) SparkRunner's SDF implementation does not use splitting - it could be fixed so that the SDF is stopped

Re: @RequiresStableInput and Pipeline fusion

2022-12-14 Thread Jan Lukavský
/77af3237521d94f0399ab405ebac09bbbeded38c/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto#L111 On Tue, Dec 13, 2022 at 1:44 AM Jan Lukavský wrote: Hi, I have a question about @RequiresStableInput functionality. I'm trying to make it work for portable Flink runner [1

@RequiresStableInput and Pipeline fusion

2022-12-13 Thread Jan Lukavský
Hi, I have a question about @RequiresStableInput functionality. I'm trying to make it work for portable Flink runner [1], [2]. We have an integration test (which should probably be turned into Validates runner test, but that is a different story) [3]. The test creates random key for input

Re: Questions on primitive transforms hierarchy

2022-11-14 Thread Jan Lukavský
docs, which should be fine for most cases.  Jan On 11/14/22 15:25, Sachin Agarwal via dev wrote: Would it be helpful to add these answers to the Beam docs? On Mon, Nov 14, 2022 at 4:35 AM Jan Lukavský wrote: I somehow missed these answers, Reuven and Kenn, thanks for the discussion

Re: Questions on primitive transforms hierarchy

2022-11-14 Thread Jan Lukavský
I somehow missed these answers, Reuven and Kenn, thanks for the discussion, it helped me clarify my understanding.  Jan On 10/26/22 21:10, Kenneth Knowles wrote: On Tue, Oct 25, 2022 at 5:53 AM Jan Lukavský wrote: > Not quite IMO. It is a subtle difference. Perh

Re: Questions on primitive transforms hierarchy

2022-10-25 Thread Jan Lukavský
ke to know that for sure. :)  Jan On 10/24/22 19:59, Kenneth Knowles wrote: On Mon, Oct 24, 2022 at 5:51 AM Jan Lukavský wrote: On 10/22/22 21:47, Reuven Lax via dev wrote: I think we stated that CoGroupbyKey was also a primitive, though in practice it's implemented in terms of G

Re: Questions on primitive transforms hierarchy

2022-10-24 Thread Jan Lukavský
On 10/22/22 21:47, Reuven Lax via dev wrote: I think we stated that CoGroupbyKey was also a primitive, though in practice it's implemented in terms of GroupByKey today. On Fri, Oct 21, 2022 at 3:05 PM Kenneth Knowles wrote: On Fri, Oct 21, 2022 at 5:24 AM Jan Lukavský wrote

Questions on primitive transforms hierarchy

2022-10-21 Thread Jan Lukavský
Hi, I have some missing pieces in my understanding of the set of Beam's primitive transforms, which I'd like to fill. First a quick recap of what I think is the current state. We have (basically) the following primitive transforms:  - DoFn (stateless, stateful, splittable)  - Window  -

Re: PeriodImpulse not updating watermark

2022-09-20 Thread Jan Lukavský
other issues with the transform, I'm happy to review a similar change in Java (assuming nobody with more context has objections) Thanks, Danny On Tue, Sep 20, 2022 at 5:15 AM Jan Lukavský wrote: Hi, looking into the code of PeriodicSequence (which is used by PeriodicImpulse

PeriodImpulse not updating watermark

2022-09-20 Thread Jan Lukavský
Hi, looking into the code of PeriodicSequence (which is used by PeriodicImpulse) it seems it uses SDF, but does not update downstream watermark using WatermarkEstimator in the @ProcessElement method. That seems to produce outputs only after the Pipeline terminates (at least it seems to work

  1   2   3   4   5   6   >