Re: [ANNOUNCE] New PMC Member: Valentyn Tymofieiev

2023-10-03 Thread Boyuan Zhang
Well deserved!! Congrats  On Tue, Oct 3, 2023 at 12:47 Byron Ellis via dev wrote: > Congrats! > > On Tue, Oct 3, 2023 at 12:40 PM Danielle Syse via dev > wrote: > >> Congrats Valentyn!! >> >> On Tue, Oct 3, 2023 at 2:59 PM Ahmet Altay via dev >> wrote: >> >>> Congratulations Valentyn! Well

Re: [ANNOUNCE] New PMC Member: Robert Burke

2023-10-03 Thread Boyuan Zhang
Congratulations  Well deserved! On Tue, Oct 3, 2023 at 12:48 Byron Ellis via dev wrote: > Congrats! > > On Tue, Oct 3, 2023 at 12:40 PM Danielle Syse via dev > wrote: > >> Congrats Rebo!!! Can't wait to work more closely with you. >> >> On Tue, Oct 3, 2023 at 2:56 PM Ahmet Altay via dev >>

Re: KafkaIO SDF reader not respecting consumer group persisted offsets

2021-07-20 Thread Boyuan Zhang
Hi Alisdair, There are several ways to configure SDF implementation (ReadSourceDescriptors/ReadFromKafkaDoFn) to commit the offset: - Set `enable.auto.commit` in your consumer config, or - Configure your KafkaIO with commitOffsetsInFinalize

Re: PreCommit tests not running

2021-06-07 Thread Boyuan Zhang
There is an infra issue ongoing: https://issues.apache.org/jira/browse/INFRA-21976, which affects all beam tests. On Mon, Jun 7, 2021 at 9:21 PM Reuven Lax wrote: > https://github.com/apache/beam/pull/14949 > > Java PreCommit has been pending for 2 days. Is something wrong > with Jenkins? > >

Re: [Question] Best Practice of Handling Null Key for KafkaRecordCoder

2021-06-03 Thread Boyuan Zhang
chance to revisit the KafkaRecordCoder implementation. On Thu, Jun 3, 2021 at 2:17 PM Chamikara Jayalath wrote: > > > On Thu, Jun 3, 2021 at 2:06 PM Boyuan Zhang wrote: > >> Supporting the x-lang boundary is a good point. So you are suggesting >> that: >> >>

Re: [Question] Best Practice of Handling Null Key for KafkaRecordCoder

2021-06-03 Thread Boyuan Zhang
Supporting the x-lang boundary is a good point. So you are suggesting that: 1. We make NullableCoder as a standard coder. 2. KafkaIO wraps the keyCoder with NullabeCoder directly if it requires. Is that correct? On Wed, Jun 2, 2021 at 6:47 PM Chamikara Jayalath wrote: > I think we

Re: beam dev community

2021-05-24 Thread Boyuan Zhang
Welcome! Please checkout the contribution guide if you plan to contribute to beam: https://beam.apache.org/contribute/ On Mon, May 24, 2021 at 11:30 AM shahnawaz aziz wrote: > Please add me to this community as we are using the beam for our big data > project. > > Thanks >

Re: Question about SplittableDoFn

2021-05-19 Thread Boyuan Zhang
Thanks for sharing. I'll comment on the PR. On Tue, May 18, 2021 at 3:44 PM Miguel Anzo Palomo wrote: > Boyuan Zhang, It's about this issue > <https://issues.apache.org/jira/browse/BEAM-11996>, the code can be found > here <https://github.com/apache/beam/pull/14811/files&g

Re: Question about SplittableDoFn

2021-05-18 Thread Boyuan Zhang
Would you like to share your draft code? Iterating on the code might be easier to figure out the issue. On Tue, May 18, 2021 at 3:28 PM Robert Burke wrote: > IIRC the Initial Restrictions method gives you an element and you return > the restrictions relative to that element. > > It's entirely

Re: Extremely Slow DirectRunner

2021-05-12 Thread Boyuan Zhang
the pipeline took *2 >>>>>> seconds*: >>>>>> >>>>>> *May 12, 2021 11:03:39 A.M.* com.myOrg.myPipeline.PipelineLeg$1 >>>>>> processElement >>>>>> INFO: Got file pattern: gs://my-bucket/my-dir/5004728247517184/

Re: [DISCUSS] Warn when KafkaIO is used as a bounded source

2021-05-10 Thread Boyuan Zhang
Just added more details on BEAM-6466 . In short, BEAM-6466 looks more like a FR instead of a bug to me. On Fri, Apr 30, 2021 at 12:48 PM Pablo Estrada wrote: > I suppose a production-ready

Re: Extremely Slow DirectRunner

2021-05-10 Thread Boyuan Zhang
Hi Evan, What do you mean startup delay? Is it the time that from you start the pipeline to the time that you notice the first output record from PubSub? On Sat, May 8, 2021 at 12:50 AM Ismaël Mejía wrote: > Can you try running direct runner with the option >

Re: Window Assignment Across SplittableDoFn

2021-05-05 Thread Boyuan Zhang
Hi, Yes, just like normal DoFn, Splittable DoFn preserves the window information as well. On Wed, May 5, 2021 at 8:04 PM Evan Galpin wrote: > Hi folks, > > I’d just like to confirm what happens to window assignments through a > SplittableDoFn. Are output elements automatically assigned to the

Re: Question about transformOverride

2021-04-20 Thread Boyuan Zhang
+1 to use pipeline options. Alternatively, you can also change your KafkaReadTransform to perform different expansion(override expand()) based on your pipeline options. On Tue, Apr 20, 2021 at 9:51 PM Reuven Lax wrote: > It would be simpler to create a custom pipeline option, and swap out the

Re: BEAM-3415 JUnit5 support

2021-03-22 Thread Boyuan Zhang
Hi Emils, Thanks for your contribution. Luke is on vacation till June. I'm adding some java folks into your PR, who might be able to help review your PR as well. On Mon, Mar 22, 2021 at 9:17 AM Emils Solmanis wrote: > Hi all, > > The JUnit5 (BEAM-3415

Re: User-related questions in dev@ list

2021-03-10 Thread Boyuan Zhang
I subscribed to both user@ and dev@. It seems to me that we have more active folks who have knowledge to offer help on the dev@ than the user@. It's always good to have dev@ and user@ for different usages as long as dev@ and user@ are almost equivalent helpful for dev/users to look for help. On

Re: Java Postcommit: MapClassIntegrationIT has been failing

2021-03-04 Thread Boyuan Zhang
+Reuven Lax Reuven, do you have some insights on why the test is failing? On Thu, Mar 4, 2021 at 12:28 PM Tomo Suzuki wrote: > Hi Beam developers, > > MapClassIntegrationIT in the Java Postcommit job has been failing since > Feb 27th (5 days ago). > >

Re: Issue when FnApiDoFnRunner executes Read.BoundedSourceAsSDFWrapperFn

2021-02-04 Thread Boyuan Zhang
create a ticket to update Samza runner. > > Best, > Ke > > On Feb 4, 2021, at 12:07 PM, Boyuan Zhang wrote: > > Hi Ke, > > is it expected that Create.of will be expanded to a SDF > > In Java SDK, Create.of will be expanded into CreateSource, which will be > wrapped

Re: Issue when FnApiDoFnRunner executes Read.BoundedSourceAsSDFWrapperFn

2021-02-04 Thread Boyuan Zhang
Hi Ke, is it expected that Create.of will be expanded to a SDF In Java SDK, Create.of will be expanded into CreateSource, which will be wrapped into SDF implementation. with regular pardo:v1 urn? No, the runner should run SplittableParDoExpander[1] to expand SDF into

[Proposal] Portable OrderedListState

2021-02-03 Thread Boyuan Zhang
Hi team, I'm working on supporting OrderedListState over fnapi and I'm starting the design proposal here: doc . This doc focuses on fnapi proto changes and Java SDK harness support. Please feel

Re: Separating Data from Kafka by Keying Strategy in a Kafka Splittable DoFn

2021-02-01 Thread Boyuan Zhang
nce it is using an > event-time property to handle that, that multiple tenants/source > combinations could impact others, cause windows to close unexpected/early, > data to be missed, etc. This is why I believe that perhaps a SDF that was > evaluated prior to reading from Kafka could allow m

Re: Beam support Flink Async I/O operator

2021-01-26 Thread Boyuan Zhang
+dev On Tue, Jan 26, 2021 at 1:07 PM Eleanore Jin wrote: > Hi community, > > Does Beam support Flink Async I/O operator? if so, can you please share > the doc, and if not, is there any workaround to achieve the same in Beam > semantics? > > >

Re: outputReceiver.output() does not emit the result immediately

2021-01-26 Thread Boyuan Zhang
+dev Hi Yu, Which runner are you using for your pipeline? Also it would be helpful to share your pipeline code as well. On Mon, Jan 25, 2021 at 10:19 PM wrote: > Hi Beam Community, > > I have a splittable `DoFn` that reads message from some stream and output > the result to down stream. The

Re: [ANNOUNCE] New PMC Member: Chamikara Jayalath

2021-01-22 Thread Boyuan Zhang
Congrats Cham! On Fri, Jan 22, 2021 at 9:42 AM Yichi Zhang wrote: > Congrats Cham! > > On Fri, Jan 22, 2021 at 8:34 AM Alexey Romanenko > wrote: > >> Congrats, Cham! Thank you for your work! >> >> On 22 Jan 2021, at 09:53, Gleb Kanterov wrote: >> >> Congratulations! >> >> On Fri, Jan 22, 2021

Re: [ANNOUNCE] New committer: Piotr Szuberski

2021-01-22 Thread Boyuan Zhang
Congrats Piotr! On Fri, Jan 22, 2021 at 10:47 AM Yichi Zhang wrote: > Congrats Piotrek! > > On Fri, Jan 22, 2021 at 10:02 AM Robert Burke wrote: > >> Congrats Piotr! >> >> On Fri, Jan 22, 2021, 10:00 AM Tobiasz Kędzierski < >> tobiasz.kedzier...@polidea.com> wrote: >> >>> Congrats Piotrek! >>>

Re: Support Kafka Commit Transform for Python sdk x-lang

2021-01-20 Thread Boyuan Zhang
I opened https://github.com/apache/beam/pull/13779 for exxposing built-in timestamp policy and commitOffsetInFinalize to ReadFromKafka. On Fri, Jan 15, 2021 at 3:09 PM Chamikara Jayalath wrote: > > > On Fri, Jan 15, 2021 at 2:55 PM Boyuan Zhang wrote: > >> Re Cham, >&

Re: Problems with E2E test

2021-01-18 Thread Boyuan Zhang
It does seem like the Dataflow will do some validation around PubSub params before actually creating the pipeline. That's fair for Dataflow because Dataflow will swap the PubSubIO from beam implementation into Dataflow native one. I think if you really want to run your virtual PubSub with

Re: Support Kafka Commit Transform for Python sdk x-lang

2021-01-15 Thread Boyuan Zhang
much work would need to output KafkaRecord to python SDK? On Fri, Jan 15, 2021 at 2:44 PM Chamikara Jayalath wrote: > Thanks for bringing this up Sameer. > > On Fri, Jan 15, 2021 at 1:18 PM Boyuan Zhang wrote: > >> +Chamikara Jayalath >> >> Hi Sameer,

Re: Support Kafka Commit Transform for Python sdk x-lang

2021-01-15 Thread Boyuan Zhang
+Chamikara Jayalath Hi Sameer, Thanks for reaching out! We will expose *commitOffsetsInFinalize *to py ReadFromKafka transform when we have CustomWindow support in python SDK, which should be coming soon. In terms of *TimestampPolicyFactory*,if you are using the built-in types, like

Re: Huge memory usage increase after 2.26.

2021-01-14 Thread Boyuan Zhang
Thanks for reporting such an issue. Do you happen to have a heap dump when OOM happens? Than might help us to identify which part causes huge memory usage/loss. On Thu, Jan 14, 2021 at 8:35 AM Yuhong Cheng wrote: > Hi Beam, > > We are running a beam pipeline on spark, whose pipeline is mainly

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2021-01-12 Thread Boyuan Zhang
3K TPS > >>>>>> - with --experiments="beam_fn_api,use_sdf_kafka_read" > >>>>>> --fasterCopy=true, I am able to achieve 10K > >>>>>> - with --fasterCopy=true alone, I am only able to achieve 5K TPS > >>>>>> > >>

Re: Null checking in Beam

2021-01-11 Thread Boyuan Zhang
Yeah it seems like the checker is enabled: https://issues.apache.org/jira/browse/BEAM-10402. I used @SuppressWarnings({"nullness" )}) to suppress the error when I think it's not really a concern. On Mon, Jan 11, 2021 at 8:28 PM Reuven Lax wrote: > Has extra Nullable checking been enabled in the

Re: Understanding KafkaIO.Read In Batch Mode

2021-01-11 Thread Boyuan Zhang
fsetsInFinalize() > .withConsumerConfigUpdates(map) > .commitOffsetsInFinalize() > .withoutMetadata(); > > > [1] > https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/translation/MultiDoFnFunction.java#L24

Re: please add me in beam project

2021-01-11 Thread Boyuan Zhang
Hi Keisuke, Welcome! For creating a PR, there is no additional permission required. You may want to create a JIRA account and ask for permission for task tracking. You may also want to check this out to get familiar with how to do contribution: https://beam.apache.org/contribute/ On Mon, Jan 11,

Re: Understanding KafkaIO.Read In Batch Mode

2021-01-11 Thread Boyuan Zhang
+dev Hi Shrikant, If you look into the expansion of BoundedReadfromUnboundedSource[1], you will notice that it will expand into Create single shard -> Split into multiple shard -> read from one shard. The number of records from one shard will not be larger than 1 and the number of shards

Support of KafkaIO Dynamic Read

2021-01-07 Thread Boyuan Zhang
Hi team, I'm working on KafkaIO dynamic read support which is tracked by BEAM-11325 and I started the documentation here: https://docs.google.com/document/d/1FU3GxVRetHPLVizP3Mdv6mP5tpjZ3fd99qNjUI5DT5k/edit?usp=sharing, which states the problem I

Re: Compatibility between Beam v2.23 and Beam v2.26

2021-01-05 Thread Boyuan Zhang
https://github.com/apache/beam/pull/13240 seems suspicious to me. +Maximilian Michels Any insights here? On Tue, Jan 5, 2021 at 8:48 AM Antonio Si wrote: > Hi, > > I would like to followup with this question to see if there is a > solution/workaround for this issue. > > Thanks. > > Antonio.

Re: Add E2E test for Kafka to Pub/Sub complete example

2020-12-30 Thread Boyuan Zhang
Thanks for your contribution! I can help review PRs related to this topic. On Wed, Dec 30, 2020 at 7:31 AM Ramazan Yapparov < ramazan.yappa...@akvelon.com> wrote: > Hi Beam Community, > > Recently our team added Kafka to Pub/Sub example pipeline to Apache Beam > repository. > We decided to move

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-28 Thread Boyuan Zhang
Hi Steve, We have one wrapper optimization[1] merged in and it will be released with 2.27.0. Would you like to verify whether it helps improve the performance on DirectRunner? [1] https://github.com/apache/beam/pull/13592 On Mon, Dec 28, 2020 at 12:17 PM Boyuan Zhang wrote: > Hi Anto

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-28 Thread Boyuan Zhang
a lot higher traffics than others. We look at the case > with --experiments="beam_fn_api,use_sdf_kafka_read" --fasterCopy=true > options a little. Based on our observation, each consumer poll() in > ReadFromKafkaDoFn.processElement() takes about 0.8ms. So for topic with > high

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-21 Thread Boyuan Zhang
frequency. On Mon, Dec 21, 2020 at 1:16 PM Jan Lukavský wrote: > Sure. My ID is je-ik. > > Thanks, > > Jan > On 12/21/20 8:43 PM, Boyuan Zhang wrote: > > Thanks for your explanation, Jan. Now I can see what you mean here. I can > try to have a PR to do such optimizati

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-21 Thread Boyuan Zhang
ely keep the last CheckpointMark, because it is better safe than >> sorry :)) >> >> Jan >> On 12/21/20 7:54 PM, Boyuan Zhang wrote: >> >> Hi Jan, >>> >>> it seems that what we would want is to couple the lifecycle of the >>> Reader

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-17 Thread Boyuan Zhang
thods - namely open and close, which is what defines the scope of the > object. I think still think it should be possible to implement that > generally. > > Jan > On 12/17/20 11:19 PM, Boyuan Zhang wrote: > > Hi Jan, thanks for the quick followup! > > I'm not sure if I see the diffe

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-17 Thread Boyuan Zhang
Flink has use cases, where user really > *wants* to configure quite often checkpoints (has relation to how Flink > implements @RequiresStableInput). > > Jan > On 12/17/20 9:04 PM, Boyuan Zhang wrote: > > Sorry for the confusion. > > >> Are you saying it

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-17 Thread Boyuan Zhang
duce checkpoint frequency? > > On Thu, Dec 17, 2020 at 10:39 AM Boyuan Zhang wrote: > >> Thanks for your investigation, Steve! It seems like preventing the >> checkpoint from happening so frequently would be one workaround for you. >> Making the checkpoint frequency confi

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-17 Thread Boyuan Zhang
affects everyone, but >> > well that's a subject worth its own thread. >> > >> > On Thu, Dec 17, 2020 at 10:55 AM Jan Lukavský wrote: >> >> Hi, >> >> >> >> from my point of view the number in DirectRunner are set correctly. >> Pri

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-16 Thread Boyuan Zhang
ndles being >> committed by ParDoEvaluator.finishBundle, but after that, they seem to just >> kind of disappear somewhere. >> >> On Wed, Dec 16, 2020 at 4:18 PM Boyuan Zhang wrote: >> >>> Making it as the PipelineOptions was my another proposal but it might >&

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-16 Thread Boyuan Zhang
; I know we are a bit anti knobs but maybe it makes sense to make this > configurable via PipelineOptions at least for Direct runner. > > On Wed, Dec 16, 2020 at 7:29 PM Boyuan Zhang wrote: > > > > I agree, Ismael. > > > > From my current investigation, the perform

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-16 Thread Boyuan Zhang
considering its removal (--experiments=use_deprecated_read) and > probably have consensus when this happens. > > > On Fri, Dec 11, 2020 at 11:33 PM Boyuan Zhang wrote: > > > > > From what I've seen, the direct runner initiates a checkpoint after > every elemen

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-11 Thread Boyuan Zhang
y 5 seconds will help? On Fri, Dec 11, 2020 at 2:22 PM Steve Niemitz wrote: > From what I've seen, the direct runner initiates a checkpoint after every > element output. > > On Fri, Dec 11, 2020 at 5:19 PM Boyuan Zhang wrote: > >> Hi Antonio, >> >> Thanks for th

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-11 Thread Boyuan Zhang
to wait for 0.8ms before it can process the next > batch of records again. > > Any suggestions would be appreciated. > > Hope that helps. > > Thanks and regards, > > Antonio. > > On 2020/12/04 19:17:46, Boyuan Zhang wrote: > > Opened https://issues.apache.

Re: Dynamic timers in python sdk.

2020-12-10 Thread Boyuan Zhang
Thanks for the explanation! That makes sense. We may also want to update pydoc to state the usage explicitly since it's quite different from how Java SDK does it. On Thu, Dec 10, 2020 at 11:13 AM Robert Bradshaw wrote: > On Thu, Dec 10, 2020 at 11:05 AM Boyuan Zhang wrote: > >> We

Re: Dynamic timers in python sdk.

2020-12-10 Thread Boyuan Zhang
> > We should also consider whether the default used in set() should be the > empty string, or some value completely disjoint from any other string. (I'd > lean towards the latter.) >From API layer, the default value could be None for non-dynamic timer. When translating to timer data message,

Re: Throttling stream outputs per trigger?

2020-12-08 Thread Boyuan Zhang
I think your understanding is correct. Does the CommitOffset transform have side-effects on your pipeline? On Tue, Dec 8, 2020 at 3:35 PM Vincent Marquez wrote: > > *~Vincent* > > > On Tue, Dec 8, 2020 at 3:13 PM Boyuan Zhang wrote: > >> Please note that

Re: Throttling stream outputs per trigger?

2020-12-08 Thread Boyuan Zhang
ent Marquez wrote: > > *~Vincent* > > > On Tue, Dec 8, 2020 at 1:34 PM Boyuan Zhang wrote: > >> Hi Vicent, >> >> Window.into(FixedWindows.of(Duration.standardMinutes(5))) operation just >> applies the window information to each element, not really

Re: Throttling stream outputs per trigger?

2020-12-08 Thread Boyuan Zhang
Hi Vicent, Window.into(FixedWindows.of(Duration.standardMinutes(5))) operation just applies the window information to each element, not really does the grouping operation. And in the commit transform, there is a combine transform applied(Max.longsPerKey()).

Re: Proposal: Redis Stream Connector

2020-12-04 Thread Boyuan Zhang
ow to write a test > for my PTransform that uses a Splittable DoFn. > > Thanks again. > > > *~Vincent* > > > On Mon, Nov 30, 2020 at 10:25 AM Boyuan Zhang wrote: > >> In Splittable DoFn, the trySplit[1] API in RestrictionTracker is for >> performing check

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-04 Thread Boyuan Zhang
Opened https://issues.apache.org/jira/browse/BEAM-11403 for tracking. On Fri, Dec 4, 2020 at 10:52 AM Boyuan Zhang wrote: > Thanks for the pointer, Steve! I'll check it out. The execution paths for > UnboundedSource and SDF wrapper are different. It's highly possible that > the r

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-04 Thread Boyuan Zhang
slack.com/archives/C9H0YNP3P/p1607057900393900 > > On Thu, Dec 3, 2020 at 1:56 PM Boyuan Zhang wrote: > >> Hi Steve, >> >> I think the major performance regression comes from >> OutputAndTimeBoundedSplittableProcessElementInvoker[1], which will >> checkpoint

[Proposal] Remove @Experimental from Splittable DoFn APIs

2020-12-03 Thread Boyuan Zhang
Hi folks, As we are reaching a stable state on Splittable DoFn APIs both in Java and Python SDK, I'm proposing to remove Experimental annotations from these APIs. I have opened one PR[1] to do so. Please feel free to drop any comments on that PR. Thanks for your help! [1]

Re: Usability regression using SDF Unbounded Source wrapper + DirectRunner

2020-12-03 Thread Boyuan Zhang
Hi Steve, I think the major performance regression comes from OutputAndTimeBoundedSplittableProcessElementInvoker[1], which will checkpoint the DoFn based on time/output limit and use timers/state to reschedule works. [1]

Re: Proposal: Redis Stream Connector

2020-11-30 Thread Boyuan Zhang
really needed >> considering that we don't have many Unbounded SDF connectors to use as >> reference. >> >> On Thu, Nov 19, 2020 at 11:16 PM Boyuan Zhang wrote: >> > >> > >> > >> > On Thu, Nov 19, 2020 at 1:29 PM Vincent Marquez < >&

Re: Create External Transform with WindowFn

2020-11-30 Thread Boyuan Zhang
rke wrote: >>> >>>> Coders should only be checked over the language boundaries. >>>> >>>> On Wed, Aug 26, 2020, 6:24 PM Boyuan Zhang wrote: >>>> >>>>> Thanks Cham! >>>>> >>>>> I just realized that

Re: CheckDone in unbounded SDF?

2020-11-28 Thread Boyuan Zhang
ker subclass with offsets already claimed? Does the runtime >>> call (and importantly, is it required to call): >>> >>> MyRestrictionTracker tracker = >>> MyDoFn.newTracker(splitResult.getPrimary()); >>> tracker.tryClaim(previousClaimed); >&g

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Boyuan Zhang
completed if the trySplit call was triggered by a call > to resume()? > > On Sat, Nov 28, 2020 at 1:58 AM Boyuan Zhang wrote: > >> And if you look into the RestrictionTracker javadoc[1], it mentions that >> what means when you return null from trySplit. >> >> [1] >

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Boyuan Zhang
at 10:54 PM Boyuan Zhang wrote: > To the extent I can see, this never mentions the restriction that you >> to implement a split() that returns a bounded restriction if >> returning resume() from an SDF. Nor does this restriction particularly make >> sense if the range bein

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Boyuan Zhang
> > > Please note that the when an SDF is processing one element restriction > pair, the start of the restriction is never changed. You will always get > the same offset when you call currentRestriction().getFrom(). > > This is good to know. So to rephrase, could I periodic

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Boyuan Zhang
there any way I can yield control to the runtime > if I have no new data to enable rescheduling? For example, can I call > tracker.tryClaim(tracker.currentRestriction().getFrom()) ? > > On Sat, Nov 28, 2020 at 12:57 AM Boyuan Zhang wrote: > >> > IIUC, this should never happen as l

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Boyuan Zhang
972f8d0dd1db4c4ce38e702e4e9f6a88ef69c2f1eab9870d9512cbe48e22ce67> > for > the sdf and restriction tracker implementations) > > -Dan > > On Sat, Nov 28, 2020 at 12:42 AM Boyuan Zhang wrote: > >> >> >> On Fri, Nov 27, 2020 at 9:34 PM Daniel Collins &g

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Boyuan Zhang
rtition); > return ProcessContinuation.stop(); > } > } > } finally { > committer.stopAsync().awaitTerminated(); > } > ``` > >From your code, what will you do if we reach the finally block? Would you like to open a WIP PR to show mo

Re: CheckDone in unbounded SDF?

2020-11-27 Thread Boyuan Zhang
Hi Daniel, The checkDone is invoked by the SDK harness to guarantee that when you exit you SplittableDoFn.process(either you return stop() or resume()), you must have completed all the work in the current restriction. This is one of major ways for SplittableDoFn to prevent data loss. In your

Re: Implementing an IO Connector for Debezium

2020-11-27 Thread Boyuan Zhang
you think we need rewrite this section completely according to a large > progress with moving to SDF-based connectors in last time? Though, it would > be useful to keep an old (current) one since Source API is still used. > > > [1] https://beam.apache.org/documentation/io/developing-io

Re: Implementing an IO Connector for Debezium

2020-11-25 Thread Boyuan Zhang
+dev Hi Bashir, Most recently we are recommending to use Splittable DoFn[1] to build new IO connectors. We have several examples for that in our codebase: Java examples: - Kafka

Re: RabbitMq Read - fail ack message

2020-11-23 Thread Boyuan Zhang
Thanks for updating the stackoverflow thread. I just filed the jira here: https://issues.apache.org/jira/browse/BEAM-11328 for tracking progress. On Mon, Nov 23, 2020 at 4:10 PM Rafael Ribeiro wrote: > @Boyuan Zhang you make my day happy. > > The workaround worked as a charm > >

Re: RabbitMq Read - fail ack message

2020-11-23 Thread Boyuan Zhang
As a workaround, you can add --experiments=use_deprecated_read when launching your pipeline to bypass the sdf unbounded source wrapper here. On Mon, Nov 23, 2020 at 2:52 PM Boyuan Zhang wrote: > Hi Rafael, > > As you mentioned, within withMaxNumRecords, the unbounded source will be &

Re: RabbitMq Read - fail ack message

2020-11-23 Thread Boyuan Zhang
Hi Rafael, As you mentioned, within withMaxNumRecords, the unbounded source will be executed as the bounded one. It may not be ideal for you. It seems like a bug for direct runner and sdf unbounded source wrapper when doing finalizeCheckpoint. Do you want to file a JIRA on this problem? On Mon,

Re: Proposal: Redis Stream Connector

2020-11-19 Thread Boyuan Zhang
On Thu, Nov 19, 2020 at 1:29 PM Vincent Marquez wrote: > > > > On Thu, Nov 19, 2020 at 10:38 AM Boyuan Zhang wrote: > >> Hi Vincent, >> >> Thanks for your contribution! I'm happy to work with you on this when you >> contribute the code into Beam. >&

Re: Proposal: Redis Stream Connector

2020-11-19 Thread Boyuan Zhang
Hi Vincent, Thanks for your contribution! I'm happy to work with you on this when you contribute the code into Beam. Another thing is that it would be preferable to use Splittable DoFn instead of using UnboundedSource to

Re: Cross language pipeline example

2020-11-11 Thread Boyuan Zhang
I was taking KafkaIO as an example of x-lang when I was doing something related: - python transform: https://github.com/apache/beam/blob/e1fdb9884898a673285b3b932fa99ae019b9c7b5/sdks/python/apache_beam/io/kafka.py#L110 - java transform:

Re: Self-checkpoint Support on Portable Flink

2020-10-23 Thread Boyuan Zhang
e > > > program the user needs to set a special flag when enabling > > > checkpointing:|env.enableCheckpointing(interval, > > > CheckpointingMode.EXACTLY_ONCE, force = true)|. > > > > > > Please note that records in flight in the loop edges (and the > &

Re: Self-checkpoint Support on Portable Flink

2020-10-06 Thread Boyuan Zhang
applicable and could be used to implement >> dynamic splitting. >> >> -Max >> >> On 05.10.20 23:13, Luke Cwik wrote: >> > Thanks Boyuan, I left a few comments. >> > >> > On Mon, Oct 5, 2020 at 11:12 AM Boyuan Zhang > > <mailto:bo

Re: [DISCUSS] Clearing timers (https://github.com/apache/beam/pull/12836)

2020-09-18 Thread Boyuan Zhang
Hi Reuven, Would you like to share the links to potential fixes? We can figure out what we can do there. On Fri, Sep 18, 2020 at 4:21 PM Reuven Lax wrote: > > > On Fri, Sep 18, 2020 at 3:14 PM Luke Cwik wrote: > >> PR 12836[1] is adding support for clearing timers and there is a >> discussion

Re: Clear Timer in Java SDK

2020-09-03 Thread Boyuan Zhang
remove timers. > > On Wed, Sep 2, 2020 at 11:00 AM Boyuan Zhang wrote: > >> Hi team, >> >> I'm looking for something similar to timer.clear() from Python SDK[1] in >> Java SDK but it seems like we haven't exposed clearing timer API from Java >> Timer. Does J

Clear Timer in Java SDK

2020-09-02 Thread Boyuan Zhang
Hi team, I'm looking for something similar to timer.clear() from Python SDK[1] in Java SDK but it seems like we haven't exposed clearing timer API from Java Timer. Does Java SDK have another way to clear a timer or we just haven't worked on this API? [1]

Re: Create External Transform with WindowFn

2020-08-26 Thread Boyuan Zhang
hat use external WindowingStrategies might be failing during >> proto -> object -> proto conversion we do today. This limitation will go >> away once Dataflow directly starts reading Beam protos. We are working on >> this now. >> >> Thanks, >> Cham &g

Re: Create External Transform with WindowFn

2020-08-25 Thread Boyuan Zhang
wing operations (e.g. global, fixed, sliding, sessions) in an > external transform. You should also be able to window into an > arbitrary WindowFn as long as it produces standards window types, but > if there's a bug here you could possibly work around it by windowing > into a more st

Create External Transform with WindowFn

2020-08-25 Thread Boyuan Zhang
Hi team, I'm trying to create an External transform in Java SDK, which expands into several ParDo and a Window.into(FixWindow). When I use this transform in Python SDK, I get an pipeline construction error: apache_beam/utils/urns.py", line 186, in from_runner_api parameter_type, constructor

Re: Output timestamp for Python event timers

2020-08-12 Thread Boyuan Zhang
t; > > On Tue, Aug 11, 2020 at 12:36 PM Yichi Zhang > <mailto:zyi...@google.com>> wrote: > > > > +1 to expose set_output_timestamp and enrich python set timer api. > > > > On Tue, Aug 11, 2020 at 12:01 PM Boyuan Zhang > <mailto:boyu...@google.com>

Re: Output timestamp for Python event timers

2020-08-11 Thread Boyuan Zhang
Hi Maximilian, It makes sense to set hold_timestamp as fire_timestamp when the fire_timestamp is in the event time domain. Otherwise, the system may advance the watermark incorrectly. I think we can do something similar to Java FnApiRunner[1]: - Expose set_output_timestamp API to python

Re: Unknown accumulator coder error when running cross-language SpannerIO Write

2020-08-04 Thread Boyuan Zhang
Hi Piotr, Are you using the beam master head to dev? Can you share your code? The x-lang transform can be tested with Flink runner, where SDF is also supported, such as https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/flink_runner_test.py#L205-L261 On Tue,

Re: Needed help identifying a error in running a SDF

2020-08-04 Thread Boyuan Zhang
Hi Mayank, Which runner do you want to run your pipeline? You should add 'beam_fn_api' when you launch the pipeline --experiments=beam_fn_api. In your code: class TestDoFn(beam.DoFn): def process( self, element, restriction_tracker=beam.DoFn.RestrictionParam(

Re: KafkaIO sending KafkaRecords in CrossLanguage - where is the coder registered?

2020-07-16 Thread Boyuan Zhang
Hi Piotr, X-Lang uses TypedWithoutMetadata , which outputs the KV directly instead of KafkaRecord: see here

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-29 Thread Boyuan Zhang
language transforms is a slight tangent >>>> here. But I think, in general, it's great if we can use existing transforms >>>> (for example, IO connectors) as cross-language transforms without having to >>>> build more composites (irrespective of whether in ExternalTr

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-25 Thread Boyuan Zhang
duplication and error-prone approach related to this. It > makes much sense since usually we have all needed configuration set in Read > objects and, as Ismaeil mentioned, ReadAll will consist mostly of only > Split-Shuffle-Read stages. So this case usually can be unified by using >

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Boyuan Zhang
Sorry for the typo. I mean I think we can go with *(3)* and (4): use the data type that is schema-aware as the input of ReadAll. On Wed, Jun 24, 2020 at 7:42 PM Boyuan Zhang wrote: > Thanks for the summary, Cham! > > I think we can go with (2) and (4): use the data type that is sch

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Boyuan Zhang
whenever a PTransform is used >>> as a data object. >>> Also, agree with Boyuan that using simple Java objects that are >>> convertible to Beam Rows allow us to make these transforms available to >>> other SDKs through the cross-language transforms. Using transforms or &

Re: [DISCUSS] ReadAll pattern and consistent use in IO connectors

2020-06-24 Thread Boyuan Zhang
Hi Ismael, I think the ReadAll in the IO connector refers to the IO with SDF implementation despite the type of input, where Read refers to UnboundedSource. One major pushback of using KafkaIO.Read as source description is that not all configurations of KafkaIO.Read are meaningful to populate

Re: Python Cross-language wrappers for Java IOs

2020-06-15 Thread Boyuan Zhang
eam/coders/row_coder.py#L177-L189 >>> [2] >>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/RowCoderGenerator.java#L341-L356 >>> [3] >>> https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/bea

Re: Python Cross-language wrappers for Java IOs

2020-06-15 Thread Boyuan Zhang
Thanks Cham. Standard coder is a good point. Does it mean non-standard coder doesn't work when crossing language boundaries even if it is implemented in both Java and Python sdk? On Mon, Jun 15, 2020 at 10:08 AM Chamikara Jayalath wrote: > Thanks. +1 for using RowCoder. We should try to use

Re: [Discuss] Build Kafka read transform on top of SplittableDoFn

2020-06-15 Thread Boyuan Zhang
rce / available at >> this moment? >> >> Trying to implement here an IO to a custom source here using >> SplittableDoFn, and it would be helpful to see more examples :) >> >> Thanks, >> Pedro >> >> >> On 2020/05/29 02:16:49,

  1   2   >