elevant.
>- proposed approach and considered alternatives
>- any runner-specific considerations.
>
> Thanks,
> Valentyn
>
> On Fri, Mar 29, 2024 at 5:06 AM Joey Tran
> wrote:
>
>> I posted a PoC PR [1] for fixing deferred side inputs with combiners in
>> the
Yeah that was the tox command I was running
On Fri, Apr 5, 2024, 4:37 PM XQ Hu via dev wrote:
>
> https://cwiki.apache.org/confluence/display/BEAM/Python+Tips#PythonTips-LintandFormattingChecks
>
> This generally works well. Have you checked this?
>
> On Fri, Apr 5, 2024 at
I think I might be doing something silly with my environment.
I'm trying to lint using tox in a dev container, but running tox ends with
this error:
```
(env) jtran@[Beam Build Env.]:~/beam {flatmapdefault} ]
$ tox
File "/usr/lib/python3/dist-packages/tox/reporter.py", line 32, in
__init__
I posted a PoC PR [1] for fixing deferred side inputs with combiners in the
python SDK. Would someone be willing to take a look at it?
I have it working but could use some feedback on where to take it next. It
looks like bundle processor combiner operations don't currently support
side inputs [2]
> would remove that 'go get' line.
>
> There's a different issue at play here too since it was written for
> pre-module Go in mind. I'm unfamiliar with that script though.
>
> I'll take a proper look in a few hours.
>
> On Fri, Mar 22, 2024, 5:25 AM Joey Tran wrote:
>
>
Hi,
I've been banging my head trying to get a dev environment working. I gave
up trying to get a local python environment working after I got some weird
clang errors and proto generation issues so I've been trying to just use
the docker container by running `bash start-build-env.sh` but I'm
> seems a bit error prone.
>
>
> On Thu, Mar 21, 2024 at 2:23 PM Joey Tran
> wrote:
>
>> Ah, I misunderstood your original suggestion then. That makes sense then.
>> I have already seen someone get a little confused about the names and
>> surprised that Flatten does
ion time in Python if you pass a single
> PCollection to Flatten. The scenario you describe concerns a one-element
> list.
>
> On Thu, Mar 21, 2024, 13:43 Joey Tran wrote:
>
>> I think it'd be quite surprising if beam.Flatten would become equivalent
>> to FlatMap if passed
t arg of lambda x: x is an interesting idea. The
>> only downside I see is a less clear error if one forgets to provide this
>> (now mandatory) parameter, but maybe that's low enough to be worth the
>> convenience?
>>
>> On Thu, Mar 21, 2024 at 12:02 PM Joey Tran
wrote:
> Hi, you can use beam.Flatten() instead.
>
> On Thu, Mar 21, 2024 at 10:55 AM Joey Tran
> wrote:
>
>> Hey all,
>>
>> Using an identity function for FlatMap comes up more often than using
>> FlatMap without an identity function. Would it make sense to use the
>> identity function as a default?
>>
>>
>>
>>
Hey all,
Using an identity function for FlatMap comes up more often than using
FlatMap without an identity function. Would it make sense to use the
identity function as a default?
yat...@akvelon.com> wrote:
> Hi Joey,
>
> Thanks for reaching out! I see that your changes haven't been deployed yet,
> so I've triggered the corresponding job and Playground will be updated soon.
>
> Thanks,
> Andrey
>
>
>
> *From: *Joey Tran
> *Reply to: *
Hey all,
I'm trying to get a beam python SDK dev environment going but I'm a bit
stuck. I'm just setting things up with a virtual env as specified in the
docs[1], but `pip install -e .[gcp,test]` ends with a clang error:
```
clang -Wsign-compare -Wunreachable-code -fno-common -dynamic
2.54.0.
>
> On Thu, Feb 8, 2024, 7:18 AM Joey Tran wrote:
>
>> Here's two:
>>
>> https://play.beam.apache.org/?path=SDK_PYTHON_MultipleOutputPardo=python
>> https://play.beam.apache.org/?path=SDK_PYTHON_WordCount=python
>>
>> Also, how often does playground get
sually we use beam.Map(print) to display some output values.
>
> On Wed, Feb 7, 2024 at 8:55 PM Joey Tran
> wrote:
>
>> Hey all,
>>
>> I've been really trying to use Playground for educating new Beam users
>> but it feels like there's something missing. A lot o
Hey all,
I've been really trying to use Playground for educating new Beam users but
it feels like there's something missing. A lot of examples (e.g. Multiple
ParDo Outputs) for at least the python API don't seem to do anything
observable. For example, the Multiple ParDo Outputs example writes to
groupbykey.
Thanks for the clarification!
On Fri, Jan 26, 2024 at 12:03 PM Robert Bradshaw
wrote:
> On Fri, Jan 26, 2024 at 8:43 AM Joey Tran
> wrote:
> >
> > Hmm, I think I might still be missing something. CombinePerKey is made
> up of "GBK() | CombineValues". P
the first
> worker will only emit [A, B] and the second [B, C] and only the B
> needs to be deduplicated post-shuffle.
>
> Wouldn't hurt to have a comment to that effect there.
>
> https://beam.apache.org/documentation/programming-guide/#combine
>
> On Fri, Jan 26, 2024 at 8:22 AM Joey Tra
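The pre-shuffle deduplication described in the reply above can be sketched in plain Python. This is only an illustration of the idea, not Beam's actual `Distinct` implementation: the function names and the per-worker inputs (`["A", "A", "B"]` and `["B", "C", "C"]`) are assumptions chosen so that the workers emit [A, B] and [B, C] as in the reply.

```python
def dedup_per_worker(bundle):
    """Drop duplicates seen by a single worker before anything is shuffled."""
    # dict.fromkeys keeps first-seen order while removing repeats.
    return list(dict.fromkeys(bundle))


def distinct(bundles):
    """Return the distinct elements and how many elements crossed the shuffle."""
    pre_shuffle = [dedup_per_worker(bundle) for bundle in bundles]
    shuffled_count = sum(len(bundle) for bundle in pre_shuffle)
    # Post-shuffle step: only values emitted by more than one worker remain to merge.
    merged = dict.fromkeys(x for bundle in pre_shuffle for x in bundle)
    return list(merged), shuffled_count


# Hypothetical worker inputs; worker 1 emits [A, B], worker 2 emits [B, C].
result, shuffled = distinct([["A", "A", "B"], ["B", "C", "C"]])
```

In this sketch six input elements shrink to four before the shuffle, and only the repeated B still has to be merged afterwards.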
Hey all,
I was poking around and looking at `Distinct` and was confused about why it
was implemented the way it was.
Reproduced here:
@ptransform_fn
@typehints.with_input_types(T)
@typehints.with_output_types(T)
def Distinct(pcoll): # pylint: disable=invalid-name
"""Produces a PCollection
Oh actually, overriding the fallback coder doesn't actually do anything
because the issue is not with the fallback coders in the registry but the
fastprimitivescoder's fallback coder
On Fri, Jan 5, 2024 at 12:42 PM Joey Tran wrote:
>
> I think my original message made it sound like
in
non-obvious downstream issues.*
On Fri, Jan 5, 2024 at 12:05 PM Robert Bradshaw via dev
wrote:
> On Fri, Jan 5, 2024 at 7:38 AM Joey Tran
> wrote:
>
>> I've been working with a few data types that are in practice
>> unpicklable and I've run into a couple issues stemmin
I've been working with a few data types that are in practice
unpicklable and I've run into a couple issues stemming from the `Any` type
hint, which when used, will result in the PickleCoder getting used even if
there's a coder in the coder registry that matches the data element.
This was pretty
art of the model so much as an implementation
> detail, but it likely does make sense to put somewhere common.
>
> On Wed, Dec 20, 2023 at 12:55 PM Joey Tran
> wrote:
>
>> Hey all,
>>
>> Is there a particular reason we hard code
>> "beam:runner:executable
Hey all,
Is there a particular reason we hard code "beam:runner:executable_stage:v1"
everywhere in the python SDK instead of putting it in common_urns?
ow them over time.
>
> On Fri, Dec 15, 2023 at 5:57 AM Joey Tran
> wrote:
>
>> Yeah I can confirm for the python runners (based on my reading of the
>> translations.py [1]) that only identical environments are merged together.
>>
>> The funny thing is that we _origi
ge hints as they like.
> Transform annotations might be an alternative, but how those are managed
> would be more SDK specific.
>
> On Fri, Dec 15, 2023, 5:21 AM Joey Tran wrote:
>
>> I figured out my issue. I thought side inputs were breaking up my
>> pipeline but after exp
oFns, those
> are usually relegated to the root of a fused stage, and avoids fusions with
> each other. That can also cause additional stages.
>
> If Beam adopted a rigorous notion of Key Preserving for transforms,
> multiple stateful transforms could be fused in the same s
ons), but they principally boil down to
> shapes that look like this.
>
> Though this does not introduce a global barrier in streaming, there is
> still the analogous per window/watermark barrier that prevents fusion for
> the same reasons.
>
>
>
>
> On Thu, Dec 14, 20
Hey all,
We have a pretty big pipeline and while I was inspecting the stages, I
noticed there is less fusion than I expected. I suspect it has to do with
the heavy use of side inputs in our workflow. In the python sdk, I see that
side inputs are considered when determining whether two stages are
d try to make it
> crash and maybe find a stacktrace? Setting logging could look like so:
> https://github.com/apache/beam/blob/729c4de416b8252ec99f0a1253ac7af3023733df/sdks/python/apache_beam/examples/wordcount.py#L110
>
> On Wed, Nov 15, 2023 at 12:06 PM Joey Tran
> wrote:
>
>> +1 to at least setting the log level to higher than info. Some runner
>> logging (e.g. job started/done) may be useful.
>>
>> On Tue, Nov 14, 2023 at 9:37 AM Joey Tran
>> wrote:
>> >
>> > Hi all,
>> >
>> > I just had a worksho
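As a rough sketch of the "higher than info" suggestion, the standard-library `logging` module can raise the root logger's threshold before a pipeline runs. The helper name `quiet_runner_logging` is hypothetical, and whether a given runner routes its messages through the root logger is an assumption here.

```python
import logging


def quiet_runner_logging(level=logging.WARNING):
    """Raise the root logger threshold so INFO-level chatter is dropped."""
    root = logging.getLogger()
    root.setLevel(level)
    return root


root = quiet_runner_logging()
logging.info("job started")  # below the threshold, so not emitted
logging.warning("this message would still be shown")
```

Passing `logging.INFO` instead would keep messages such as job started/done visible.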
Hi all,
I just had a workshop to demo beam for people at my company and there was a
bit of confusion about whether the beam python playground examples were
even working and it turned out they just got confused by all the runner
logging that is output.
Is this worth keeping? It seems like it'd be
PR for top: https://github.com/apache/beam/pull/29106
On Mon, Oct 23, 2023 at 10:11 AM XQ Hu via dev wrote:
> +1 on this idea. Thanks!
>
> On Thu, Oct 19, 2023 at 3:40 PM Joey Tran
> wrote:
>
>> Yeah, I already implemented these partitioners for my use case (I just
>
che/beam/blob/e7a6405800a83dd16437b8b1b372e020e010a042/sdks/java/core/src/main/java/org/apache/beam/sdk/Pipeline.java#L630
On Fri, Oct 13, 2023 at 1:32 PM Joey Tran wrote:
>
>
> On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw
> wrote:
>
>> On Fri, Oct 13, 2023 at 10:0
/github.com/apache/beam/blob/68e9c997a9085b0cb045238ae406d534011e7c21/sdks/python/apache_beam/transforms/combiners.py#L191
>
> On Thu, Oct 19, 2023 at 3:21 PM Joey Tran
> wrote:
>
>> Yes, both need to be small enough to fit into state.
>>
>> Yeah a percentage sam
> Danny
>
> On Thu, Oct 19, 2023 at 10:06 AM Joey Tran
> wrote:
>
>> Hey all,
>>
>> While writing a few pipelines, I was surprised by how few partitioners
>> there were in the python SDK. I wrote a couple that are pretty generic and
>> possibl
Hey all,
While writing a few pipelines, I was surprised by how few partitioners
there were in the python SDK. I wrote a couple that are pretty generic and
possibly generally useful. Just wanted to do a quick poll to see if they
seem useful enough to be in the sdk's library of transforms. If so, I
On Fri, Oct 13, 2023 at 1:18 PM Robert Bradshaw wrote:
> On Fri, Oct 13, 2023 at 10:08 AM Joey Tran
> wrote:
>
>> Are there places on the SDK side that expect unique labels? Or in
>> non-updateable runners?
>>
>
> That's a good question. The label eventually ends up
ttribute old B_2's state
> to the new B_2 (and also possibly mis-direct any inflight messages). At
> least with the old, intersecting names we can detect this problem
> rather than silently give corrupt data.
>
>
> On Fri, Oct 13, 2023 at 7:15 AM Joey Tran
> wrote:
>
>>
ior similar to java, I'm happy
>> to put up a PR
>>
>> On Thu, Oct 5, 2023, 12:49 PM Joey Tran
>> wrote:
>>
>>> Is it really toggleable in Java? I imagine that if it's a toggle it'd be
>>> a very sticky toggle since it'd be easy for PTransforms to
Bump on this. Sorry to pester - I'm trying to get a few teams to adopt
Apache Beam at my company and I'm trying to foresee parts of the API they
might find inconvenient.
If there's a conclusion to make the behavior similar to java, I'm happy to
put up a PR
On Thu, Oct 5, 2023, 12:49 PM Joey Tran
ble
> with an option now. We should probably add the option to toggle Python too.
> (Unclear what the default should be, but this probably ties into
> re-thinking how pipeline update should work.)
>
> On Thu, Oct 5, 2023 at 4:58 AM Joey Tran
> wrote:
>
>> Makes sense th
alified transform name, so the
> naming only has to be distinct within a composite transform (or at the top
> level--the pipeline itself is isomorphic to a single composite transform).
>
> On Wed, Oct 4, 2023 at 3:43 AM Joey Tran
> wrote:
>
>> Cross posting this thread to dev@
On Tue, Oct 3, 2023 at 9:15 PM Joey Tran
> wrote:
>
>> Not sure what that suggests
>>
>> On Tue, Oct 3, 2023, 6:24 PM XQ Hu via user wrote:
>>
>>> Looks like this is the current behaviour. If you have `t =
>>> beam.Filter(identity_filter)`, `t.label`
then it is
> the runner's job to put as many calls to @ProcessElement as possible to
> amortize.
>
> Kenn
>
> On Fri, Sep 22, 2023 at 9:39 AM Joey Tran
> wrote:
>
>> Whoops, I typoed my last email. I meant to write "this isn't the
>> greatest strategy for
g/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.BatchElements
On Thu, Sep 21, 2023 at 7:23 PM Joey Tran wrote:
> Writing a runner and the first strategy for determining bundling size was
> to just start with a bundle size of one and double it until
Writing a runner and the first strategy for determining bundling size was
to just start with a bundle size of one and double it until we reach a size
that we expect to take some target per-bundle runtime (e.g. maybe 10
minutes). I realize that this isn't the greatest strategy for high sized
cost
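The doubling strategy described above can be sketched as follows. This is a minimal illustration, not the runner's actual code: the function names, the `max_rounds` cap, and the linear cost model (runtime = bundle size × per-element cost) are all assumptions made for the example.

```python
def next_bundle_size(current_size, observed_seconds, target_seconds):
    """Double the bundle size while the last bundle ran faster than the target."""
    if observed_seconds < target_seconds:
        return current_size * 2
    return current_size


def ramp_up(cost_per_element, target_seconds, max_rounds=64):
    """Simulate the ramp-up, assuming runtime grows linearly with bundle size."""
    size = 1
    for _ in range(max_rounds):
        observed = size * cost_per_element  # assumed cost model
        new_size = next_bundle_size(size, observed, target_seconds)
        if new_size == size:
            break  # the target runtime has been reached
        size = new_size
    return size


cheap = ramp_up(cost_per_element=1.0, target_seconds=600.0)
expensive = ramp_up(cost_per_element=700.0, target_seconds=600.0)
```

With a 10-minute target, elements costing one second each settle at a bundle size of 1024, while elements costing 700 seconds each never grow past a bundle size of one.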
Ended up just filing a PR [1]
[1] https://github.com/apache/beam/pull/28489
On Fri, Sep 15, 2023 at 12:51 PM Joey Tran
wrote:
> While implementing a runner, we tried annotating a CombineByKey transform.
> I noticed that the annotations for the CBK are then lost in the fusion
> opt
While implementing a runner, we tried annotating a CombineByKey transform.
I noticed that the annotations for the CBK are then lost in the fusion
optimization stage when the CBK is broken into components. Is this
intentional?
I can put up a PR if this seems worth fixing (it is for us at least),