Re: [Python SDK] Feedback for deferred side inputs + combiners

2024-04-11 Thread Joey Tran
elevant. >- proposed approach and considered alternatives >- any runner-specific considerations. > > Thanks, > Valentyn > > On Fri, Mar 29, 2024 at 5:06 AM Joey Tran > wrote: > >> I posted a PoC PR [1] for fixing deferred side inputs with combiners in >> the

Re: tox issues in dev container

2024-04-05 Thread Joey Tran
Yeah that was the tox command I was running On Fri, Apr 5, 2024, 4:37 PM XQ Hu via dev wrote: > > https://cwiki.apache.org/confluence/display/BEAM/Python+Tips#PythonTips-LintandFormattingChecks > > This generally works well. Have you checked this? > > On Fri, Apr 5, 2024 at

tox issues in dev container

2024-04-05 Thread Joey Tran
I think I might be doing something silly with my environment. I'm trying to lint using tox in a dev container, but running tox ends with this error: ``` (env) jtran@[Beam Build Env.]:~/beam {flatmapdefault} ] $ tox File "/usr/lib/python3/dist-packages/tox/reporter.py", line 32, in __init__

[Python SDK] Feedback for deferred side inputs + combiners

2024-03-29 Thread Joey Tran
I posted a PoC PR [1] for fixing deferred side inputs with combiners in the python SDK. Would someone be willing to take a look at it? I have it working but could use some feedback on where to take it next. It looks like bundle processor combiner operations don't currently support side inputs [2]

Re: container dev environment: go get issue

2024-03-22 Thread Joey Tran
> would remove that 'go get' line. > > There's a different issue at play here too since it was written for > pre-module Go in mind. I'm unfamiliar with that script though. > > I'll take a proper look in a few hours. > > On Fri, Mar 22, 2024, 5:25 AM Joey Tran wrote: > >>

container dev environment: go get issue

2024-03-22 Thread Joey Tran
Hi, I've been banging my head trying to get a dev environment working. I gave up trying to get a local python environment working after I got some weird clang errors and proto generation issues so I've been trying to just use the docker container by running `bash start-build-env.sh` but I'm

Re: Python API: FlatMap default -> lambda x:x?

2024-03-21 Thread Joey Tran
> > seems a bit error prone. > > > On Thu, Mar 21, 2024 at 2:23 PM Joey Tran > wrote: > >> Ah, I misunderstood your original suggestion then. That makes sense then. >> I have already seen someone get a little confused about the names and >> surprised that Flatten does

Re: Python API: FlatMap default -> lambda x:x?

2024-03-21 Thread Joey Tran
ion time in Python if you pass a single > PCollection to Flatten. The scenario you describe concerns a one-element > list. > > On Thu, Mar 21, 2024, 13:43 Joey Tran wrote: > >> I think it'd be quite surprising if beam.Flatten would become equivalent >> to FlatMap if passed

Re: Python API: FlatMap default -> lambda x:x?

2024-03-21 Thread Joey Tran
t arg of lambda x: x is an interesting idea. The >> only downside I see is a less clear error if one forgets to provide this >> (now mandatory) parameter, but maybe that's low enough to be worth the >> convenience? >> >> On Thu, Mar 21, 2024 at 12:02 PM Joey Tran

Re: Python API: FlatMap default -> lambda x:x?

2024-03-21 Thread Joey Tran
wrote: > Hi, you can use beam.Flatten() instead. > > On Thu, Mar 21, 2024 at 10:55 AM Joey Tran > wrote: > >> Hey all, >> >> Using an identity function for FlatMap comes up more often than using >> FlatMap without an identity function. Would it make sense to use the >> identity function as a default? >> >> >> >>

Python API: FlatMap default -> lambda x:x?

2024-03-21 Thread Joey Tran
Hey all, Using an identity function for FlatMap comes up more often than using FlatMap without an identity function. Would it make sense to use the identity function as a default?
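
To make the proposal concrete, here is a minimal plain-Python sketch of a flat-map whose function argument defaults to the identity. It is not Beam code: `flat_map` and `identity` are hypothetical stand-ins used only to illustrate the suggested default, and how the real `beam.FlatMap` would expose it is left open by the thread.

```python
from typing import Callable, Iterable, Iterator


def identity(x):
    """Return the argument unchanged (the proposed default function)."""
    return x


def flat_map(elements: Iterable, fn: Callable = identity) -> Iterator:
    """Apply fn to each element and flatten the iterables it returns.

    Hypothetical stand-in for the behaviour discussed in the thread;
    this is not the Beam API.
    """
    for element in elements:
        yield from fn(element)


nested = [[1, 2], [3], []]

# With the identity default, flat_map simply flattens one level of nesting.
flattened_default = list(flat_map(nested))

# Spelling out the identity function gives the same result.
flattened_explicit = list(flat_map(nested, lambda x: x))
```

With a default like this, the common "flatten a collection of iterables" case needs no explicit lambda.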

Re: Hiding logging for beam playground examples

2024-03-08 Thread Joey Tran
yat...@akvelon.com> wrote: > Hi Joey, > > Thanks for reaching out! I see that your changes haven't been deployed yet, > so I've triggered the corresponding job and Playground will be updated soon. > > Thanks, > Andrey > > > > *From: *Joey Tran > *Reply to: *"

Re: Hiding logging for beam playground examples

2024-03-07 Thread Joey Tran
""""""""""""""""""""""""""""""""""""""""""""""""""

Issue building python SDK with M2 Mac

2024-03-07 Thread Joey Tran
Hey all, I'm trying to get a beam python SDK dev environment going but I'm a bit stuck. I'm just setting things up with a virtual env as specified in the docs[1], but `pip install -e .[gcp,test]` ends with a clang error: ``` clang -Wsign-compare -Wunreachable-code -fno-common -dynamic

Re: Playground: File Explorer?

2024-02-08 Thread Joey Tran
2.54.0. > > On Thu, Feb 8, 2024, 7:18 AM Joey Tran wrote: > >> Here's two: >> >> https://play.beam.apache.org/?path=SDK_PYTHON_MultipleOutputPardo=python >> https://play.beam.apache.org/?path=SDK_PYTHON_WordCount=python >> >> Also, how often does playground get

Re: Playground: File Explorer?

2024-02-08 Thread Joey Tran
sually we use beam.Map(print) to display some output values. > > On Wed, Feb 7, 2024 at 8:55 PM Joey Tran > wrote: > >> Hey all, >> >> I've been really trying to use Playground for educating new Beam users >> but it feels like there's something missing. A lot o

Playground: File Explorer?

2024-02-07 Thread Joey Tran
Hey all, I've been really trying to use Playground for educating new Beam users but it feels like there's something missing. A lot of examples (e.g. Multiple ParDo Outputs) for at least the python API don't seem to do anything observable. For example, the Multiple ParDo Outputs example writes to

Re: [python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Joey Tran
groupbykey. Thanks for the clarification! On Fri, Jan 26, 2024 at 12:03 PM Robert Bradshaw wrote: > On Fri, Jan 26, 2024 at 8:43 AM Joey Tran > wrote: > > > > Hmm, I think I might still be missing something. CombinePerKey is made > up of "GBK() | CombineValues". P

Re: [python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Joey Tran
the first > worker will only emit [A, B] and the second [B, C] and only the B > needs to be deduplicated post-shuffle. > > Wouldn't hurt to have a comment to that effect there. > > https://beam.apache.org/documentation/programming-guide/#combine > > On Fri, Jan 26, 2024 at 8:22 AM Joey Tra
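
The reply above describes each worker deduplicating locally before the shuffle. A rough plain-Python sketch of that idea follows. It is not Beam's implementation: `pre_combine`, `shuffle_and_merge`, and the two worker inputs are made up for illustration, chosen so that the workers emit [A, B] and [B, C] as in the message.

```python
def pre_combine(bundle):
    """Worker-local partial combine: keep each key once, discard the values."""
    seen = {}
    for element in bundle:
        seen.setdefault(element, None)  # the combined value is always None
    return list(seen)


def shuffle_and_merge(partials):
    """Post-shuffle step: merge the per-worker outputs, deduplicating across workers."""
    merged = {}
    for partial in partials:
        for key in partial:
            merged.setdefault(key, None)
    return list(merged)


# Hypothetical inputs seen by two workers (eight elements in total).
worker_1 = ["A", "A", "B", "A"]
worker_2 = ["B", "C", "C", "B"]

partials = [pre_combine(worker_1), pre_combine(worker_2)]

# Only the locally deduplicated keys cross the shuffle boundary.
shuffled_count = sum(len(partial) for partial in partials)

distinct = shuffle_and_merge(partials)
```

In this toy run four elements are shuffled instead of eight, and only B still has to be deduplicated after the shuffle.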

[python] Why CombinePerKey(lambda vs: None)?

2024-01-26 Thread Joey Tran
Hey all, I was poking around and looking at `Distinct` and was confused about why it was implemented the way it was. Reproduced here: @ptransform_fn @typehints.with_input_types(T) @typehints.with_output_types(T) def Distinct(pcoll): # pylint: disable=invalid-name """Produces a PCollection

Re: (python SDK) "Any" coder bypasses registry coders

2024-01-05 Thread Joey Tran
Oh actually, overriding the fallback coder doesn't actually do anything because the issue is not with the fallback coders in the registry but the fastprimitivescoder's fallback coder On Fri, Jan 5, 2024 at 12:42 PM Joey Tran wrote: > > I think my original message made it sound like

Re: (python SDK) "Any" coder bypasses registry coders

2024-01-05 Thread Joey Tran
in non-obvious downstream issues.* On Fri, Jan 5, 2024 at 12:05 PM Robert Bradshaw via dev wrote: > On Fri, Jan 5, 2024 at 7:38 AM Joey Tran > wrote: > >> I've been working with a few data types that are in practice >> unpicklable and I've run into a couple issues stemmin

(python SDK) "Any" coder bypasses registry coders

2024-01-05 Thread Joey Tran
I've been working with a few data types that are in practice unpicklable and I've run into a couple issues stemming from the `Any` type hint, which when used, will result in the PickleCoder getting used even if there's a coder in the coder registry that matches the data element. This was pretty

Re: Constant for beam:runner:executable_stage:v1 ?

2023-12-20 Thread Joey Tran
art of the model so much as an implementation > detail, but it likely does make sense to put somewhere common. > > On Wed, Dec 20, 2023 at 12:55 PM Joey Tran > wrote: > >> Hey all, >> >> Is there a particular reason we hard code >> "beam:runner:executable

Constant for beam:runner:executable_stage:v1 ?

2023-12-20 Thread Joey Tran
Hey all, Is there a particular reason we hard code "beam:runner:executable_stage:v1" everywhere in the python SDK instead of putting it in common_urns?

Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Joey Tran
ow them over time. > > On Fri, Dec 15, 2023 at 5:57 AM Joey Tran > wrote: > >> Yeah I can confirm for the python runners (based on my reading of the >> translations.py [1]) that only identical environments are merged together. >> >> The funny thing is that we _origi

Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Joey Tran
ge hints as they like. > Transform annotations might be an alternative, but how those are managed > would be more SDK specific. > > On Fri, Dec 15, 2023, 5:21 AM Joey Tran wrote: > >> I figured out my issue. I thought side inputs were breaking up my >> pipeline but after exp

Re: How do side inputs relate to stage fusion?

2023-12-15 Thread Joey Tran
oFns, those > are usually relegated to the root of a fused stage, and avoids fusions with > each other. That can also cause additional stages. > > If Beam adopted a rigorous notion of Key Preserving for transforms, > multiple stateful transforms could be fused in the same s

Re: How do side inputs relate to stage fusion?

2023-12-14 Thread Joey Tran
ons), but they principally boil down to > shapes that look like this. > > Though this does not introduce a global barrier in streaming, there is > still the analogous per window/watermark barrier that prevents fusion for > the same reasons. > > > > > On Thu, Dec 14, 20

How do side inputs relate to stage fusion?

2023-12-14 Thread Joey Tran
Hey all, We have a pretty big pipeline and while I was inspecting the stages, I noticed there is less fusion than I expected. I suspect it has to do with the heavy use of side inputs in our workflow. In the python sdk, I see that side inputs are considered when determining whether two stages are

Re: Hiding logging for beam playground examples

2023-11-16 Thread Joey Tran
d try to make it > crash and maybe find a stacktrace? Setting logging could look like so: > https://github.com/apache/beam/blob/729c4de416b8252ec99f0a1253ac7af3023733df/sdks/python/apache_beam/examples/wordcount.py#L110 > > On Wed, Nov 15, 2023 at 12:06 PM Joey Tran > wrote: >

Re: Hiding logging for beam playground examples

2023-11-15 Thread Joey Tran
>> +1 to at least setting the log level to higher than info. Some runner >> logging (e.g. job started/done) may be useful. >> >> On Tue, Nov 14, 2023 at 9:37 AM Joey Tran >> wrote: >> > >> > Hi all, >> > >> > I just had a worksho

Hiding logging for beam playground examples

2023-11-14 Thread Joey Tran
Hi all, I just had a workshop to demo beam for people at my company and there was a bit of confusion about whether the beam python playground examples were even working and it turned out they just got confused by all the runner logging that is output. Is this worth keeping? It seems like it'd be

Re: [PYTHON] partitioner utilities?

2023-10-23 Thread Joey Tran
PR for top: https://github.com/apache/beam/pull/29106 On Mon, Oct 23, 2023 at 10:11 AM XQ Hu via dev wrote: > +1 on this idea. Thanks! > > On Thu, Oct 19, 2023 at 3:40 PM Joey Tran > wrote: > >> Yeah, I already implemented these partitioners for my use case (I just >

Re: [QUESTION] Why no auto labels?

2023-10-20 Thread Joey Tran
che/beam/blob/e7a6405800a83dd16437b8b1b372e020e010a042/sdks/java/core/src/main/java/org/apache/beam/sdk/Pipeline.java#L630 On Fri, Oct 13, 2023 at 1:32 PM Joey Tran wrote: > > > On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw > wrote: > >> On Fri, Oct 13, 2023 at 10:0

Re: [PYTHON] partitioner utilities?

2023-10-19 Thread Joey Tran
/github.com/apache/beam/blob/68e9c997a9085b0cb045238ae406d534011e7c21/sdks/python/apache_beam/transforms/combiners.py#L191 > > On Thu, Oct 19, 2023 at 3:21 PM Joey Tran > wrote: > >> Yes, both need to be small enough to fit into state. >> >> Yeah a percentage sam

Re: [PYTHON] partitioner utilities?

2023-10-19 Thread Joey Tran
> Danny > > On Thu, Oct 19, 2023 at 10:06 AM Joey Tran > wrote: > >> Hey all, >> >> While writing a few pipelines, I was surprised by how few partitioners >> there were in the python SDK. I wrote a couple that are pretty generic and >> possibl

[PYTHON] partitioner utilities?

2023-10-19 Thread Joey Tran
Hey all, While writing a few pipelines, I was surprised by how few partitioners there were in the python SDK. I wrote a couple that are pretty generic and possibly generally useful. Just wanted to do a quick poll to see if they seem useful enough to be in the sdk's library of transforms. If so, I

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Joey Tran
On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw wrote: > On Fri, Oct 13, 2023 at 10:08 AM Joey Tran > wrote: > Are there places on the SDK side that expect unique labels? Or in >> non-updateable runners? >> > > That's a good question. The label eventually ends up

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Joey Tran
ttribute old B_2's state > to the new B_2 (and also possibly mis-direct any inflight messages). At > least with the old, intersecting names we can detect this problem > rather than silently give corrupt data. > > > On Fri, Oct 13, 2023 at 7:15 AM Joey Tran > wrote: > >>

Re: [QUESTION] Why no auto labels?

2023-10-13 Thread Joey Tran
ior similar to java, I'm happy >> to put up a PR >> >> On Thu, Oct 5, 2023, 12:49 PM Joey Tran >> wrote: >> >>> Is it really toggleable in Java? I imagine that if it's a toggle it'd be >>> a very sticky toggle since it'd be easy for PTransforms to

Re: [QUESTION] Why no auto labels?

2023-10-10 Thread Joey Tran
Bump on this. Sorry to pester - I'm trying to get a few teams to adopt Apache Beam at my company and I'm trying to foresee parts of the API they might find inconvenient. If there's a conclusion to make the behavior similar to java, I'm happy to put up a PR On Thu, Oct 5, 2023, 12:49 PM Joey Tran

Re: [QUESTION] Why no auto labels?

2023-10-05 Thread Joey Tran
ble > with an option now. We should probably add the option to toggle Python too. > (Unclear what the default should be, but this probably ties into > re-thinking how pipeline update should work.) > > On Thu, Oct 5, 2023 at 4:58 AM Joey Tran > wrote: > >> Makes sense th

Re: [QUESTION] Why no auto labels?

2023-10-05 Thread Joey Tran
alified transform name, so the > naming only has to be distinct within a composite transform (or at the top > level--the pipeline itself is isomorphic to a single composite transform). > > On Wed, Oct 4, 2023 at 3:43 AM Joey Tran > wrote: > >> Cross posting this thread to dev@

Re: [QUESTION] Why no auto labels?

2023-10-04 Thread Joey Tran
On Tue, Oct 3, 2023 at 9:15 PM Joey Tran > wrote: > >> Not sure what that suggests >> >> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user wrote: >> >>> Looks like this is the current behaviour. If you have `t = >>> beam.Filter(identity_filter)`, `t.label`

Re: Runner Bundling Strategies

2023-09-22 Thread Joey Tran
then it is > the runner's job to put as many calls to @ProcessElement as possible to > amortize. > > Kenn > > On Fri, Sep 22, 2023 at 9:39 AM Joey Tran > wrote: > >> Whoops, I typoed my last email. I meant to write "this isn't the >> greatest strategy for

Re: Runner Bundling Strategies

2023-09-22 Thread Joey Tran
g/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.BatchElements On Thu, Sep 21, 2023 at 7:23 PM Joey Tran wrote: > Writing a runner and the first strategy for determining bundling size was > to just start with a bundle size of one and double it until

Runner Bundling Strategies

2023-09-21 Thread Joey Tran
Writing a runner and the first strategy for determining bundling size was to just start with a bundle size of one and double it until we reach a size that we expect to take some target per-bundle runtime (e.g. maybe 10 minutes). I realize that this isn't the greatest strategy for high sized cost
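
The doubling strategy described in this message can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: `grow_bundle_size`, the `max_size` cap, and the simulated cost model `fake_run_bundle` are hypothetical and are not taken from any actual runner.

```python
def grow_bundle_size(run_bundle, target_seconds, max_size=1 << 20):
    """Start at a bundle size of one and double until a bundle reaches the target runtime.

    run_bundle(size) must return the runtime in seconds of a bundle of that size.
    max_size is an assumed safety cap so the loop always terminates.
    """
    size = 1
    while size < max_size:
        elapsed = run_bundle(size)
        if elapsed >= target_seconds:
            break  # this size already meets the per-bundle runtime target
        size *= 2
    return size


def fake_run_bundle(size, seconds_per_element=0.5):
    """Simulated cost model: every element costs a fixed amount of time."""
    return size * seconds_per_element


# Target of 10 minutes per bundle, as in the example from the message.
chosen = grow_bundle_size(fake_run_bundle, target_seconds=600)
```

With the simulated cost of 0.5 seconds per element, the loop settles on a bundle size of 2048. When individual elements are very expensive, even the first doublings can overshoot the target badly.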

Re: [Bug?] Combiner components don't inherit annotations of source CombineByKey

2023-09-15 Thread Joey Tran
Ended up just filing a PR [1] [1] https://github.com/apache/beam/pull/28489 On Fri, Sep 15, 2023 at 12:51 PM Joey Tran wrote: > While implementing a runner, we tried annotating a CombineByKey transform. > I noticed that the annotations for the CBK are then lost in the fusion > opt

[Bug?] Combiner components don't inherit annotations of source CombineByKey

2023-09-15 Thread Joey Tran
While implementing a runner, we tried annotating a CombineByKey transform. I noticed that the annotations for the CBK are then lost in the fusion optimization stage when the CBK is broken into components. Is this intentional? I can put up a PR if this seems worth fixing (it is for us at least),