New JIRA Component Request

2019-10-01 Thread Sam Rohde
Hi All, I am working improvements to the InteractiveRunner along side with +David Yan , +Ning Kang , and Alexey Strokach. I am requesting on behalf of this working group to add a new Jira component "runner-interactive" as the current list of components is insufficient. Regards, Sam

Re: [VOTE] Sign a pledge to discontinue support of Python 2 in 2020.

2019-10-01 Thread Sam Rohde
+1 On Tue, Oct 1, 2019 at 2:55 PM Ismaël Mejía wrote: > +1 > > On Tue, Oct 1, 2019, 10:40 PM Lukasz Cwik wrote: > >> +1 >> >> On Tue, Oct 1, 2019 at 10:39 AM Ning Kang wrote: >> >>> +1 >>> >>> On Tue, Oct 1, 2019 at 10:17 AM Pablo Estrada >>> wrote: >>> +1 I guess it was http:/

Re: New JIRA Component Request

2019-10-01 Thread Sam Rohde
; On Tue, Oct 1, 2019 at 3:16 PM Ning Kang wrote: > >> +1 >> FYI, I'm temporarily using examples-python component. >> >> On Tue, Oct 1, 2019 at 3:04 PM Sam Rohde wrote: >> >>> Hi All, >>> >>> I am working improvements to the In

A new contributor

2018-10-04 Thread Sam Rohde
Hi all! My name is Sam and I work for Google Cloud Dataflow. I'm going to be starting work on Apache Beam soon and I wish to be added as a contributor in the Beam issue tracker for JIRA as well as any other necessary permissions to start work. Thanks, Sam

Re: A new contributor

2018-10-04 Thread Sam Rohde
Thank you for creating Jira account. > > --Mikhail > > Have feedback <http://go/migryz-feedback>? > > > On Thu, Oct 4, 2018 at 12:06 PM Kenneth Knowles wrote: > >> I added you to the Contributor role, so you can be assigned JIRAs. >> (assuming your JIRA username i

Is portable.DirectGroupByKey supposed to be private?

2018-10-17 Thread Sam Rohde
Hi I'm working on deduplicating code from when the portable worker code was donated to the project. I found that the portable DirectGroupByKey is private and its inner classes. Is this by design? If so, why? If not, then I'm going to change it to be public. Regards, Sam

Re: Is portable.DirectGroupByKey supposed to be private?

2018-10-17 Thread Sam Rohde
Example: org.apache.beam.runners.direct.portable.DirectGroupByKey is not declared public (so it's private by default). Meaning that it ca't be used in the org.apache.beam.runners.direct package. On Wed, Oct 17, 2018 at 11:25 AM Sam Rohde wrote: > Hi I'm working on deduplicat

Add all tests to release validation

2019-01-07 Thread Sam Rohde
Hi All, There are a number of tests in our system that are either flaky or permanently red. I am suggesting to add, if not all, then most of the tests (style, unit, integration, etc) to the release validation step. In this way, we will add a regular cadence to ensuring greenness and no flaky tests

Re: Add all tests to release validation

2019-01-09 Thread Sam Rohde
odification to your proposal is that after manual >>>>> verification that it is safe to release I would move Fix Version to the >>>>> next release instead of closing, unless the issue really is fixed or >>>>> otherwise not reproducible. >>>>>

Beam JobService Problem

2019-01-14 Thread Sam Rohde
Hello all, While going through the codebase I noticed a problem with the Beam JobService. In particular, the API allows for the possibility of never seeing some messages or states with Get(State|Message)Stream. This is because the Get(State|Message)Stream calls need to have the job id which can o

Re: Add all tests to release validation

2019-01-15 Thread Sam Rohde
plumb through. >> >> --Mikhail >> >> Have feedback <http://go/migryz-feedback>? >> >> >> On Thu, Jan 10, 2019 at 11:44 AM Scott Wegner wrote: >> >>> +1, this sounds good to me. >>> >>> I believe the next step wou

Re: Beam JobService Problem

2019-01-15 Thread Sam Rohde
rate IDs, as the IDs are > already > >> scoped by preparation and invocation phase. > >> > >> - Would it be possible to just pass the preparation id as the > invocation id at > >> JobInvoker#invoke(..)? > >> > >> - Alternatively, we could

Magic number explanation in ParDoTest.java

2019-01-22 Thread Sam Rohde
Hi all, Does anyone have context why there is a magic number of "1774" milliseconds in the ParDoTest.java on line 2618? This is in the testEventTimeTimerAlignBounded method. File at master: https://github.com/apache/beam/blob/master/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/ParD

Re: Magic number explanation in ParDoTest.java

2019-01-22 Thread Sam Rohde
e commit comes from this PR: https://github.com/apache/beam/pull/2273 > > Kenn > > On Tue, Jan 22, 2019 at 10:21 AM Sam Rohde wrote: > >> Hi all, >> >> Does anyone have context why there is a magic number of "1774" >> milliseco

Re: Thoughts on a reference runner to invest in?

2019-02-11 Thread Sam Rohde
Thanks for starting this thread. If I had to guess, I would say there is more of a demand for Python as it's more widely used for data scientists/ analytics. Being pragmatic, the FnApiRunner already has more feature work than the Java so we should go with that. -Sam On Fri, Feb 8, 2019 at 10:07 A

Re: Add exception handling to MapElements

2019-02-11 Thread Sam Rohde
Interesting ideas! I think you're really honing in on what the Apache Beam API is missing: error handling for bad data and runtime errors. I like your method because it coalesces all the errors into a single collection to be looked at later. Also easy to add a PAssert on the errors collection. Loo

Re: [PROPOSAL] Prepare Beam 2.11.0 release

2019-02-11 Thread Sam Rohde
Thanks Ahmet! The 2.11.0 release will also be using the revised release process from PR-7529 that I authored. Let me know if you have any questions or if I can help in any way. I would love feedback on how to improve on the modifications I made and the rel

Re: Correct way to implement ProcessBundleProgressResponse in the Java SDK

2019-02-11 Thread Sam Rohde
Yeah, take a look at the ProcessRemoteBundleOperation.java class. This is the class that is in charge of handling

Re: Add exception handling to MapElements

2019-02-11 Thread Sam Rohde
ise spec into > the apply method"? I'm failing to parse what that means. > > On Mon, Feb 11, 2019 at 4:23 PM Sam Rohde wrote: > >> Interesting ideas! I think you're really honing in on what the Apache >> Beam API is missing: error handling for bad data and ru

Ordered PCollections eventually?

2021-05-10 Thread Sam Rohde
Hi All, I was wondering if there had been any plans for creating ordered PCollections in the Beam model? Or if there might be plans for them in the future? I know that Beam SQL and Beam DataFrames would directly benefit from an ordered PCollection. Regards, Sam

Re: Ordered PCollections eventually?

2021-05-10 Thread Sam Rohde
> > On Mon, May 10, 2021 at 2:56 PM Reuven Lax wrote: > >> There has been talk, but nothing concrete. >> >> On Mon, May 10, 2021 at 1:42 PM Sam Rohde wrote: >> >>> Hi All, >>> >>> I was wondering if there had been any plans for creating

Re: [design] A streaming Fn API runner for Python

2019-10-15 Thread Sam Rohde
Thanks for picking this up again Pablo, I wrote some small comments concerning the TestStream. On Tue, Oct 15, 2019 at 4:42 PM Robert Bradshaw wrote: > Very excited to see this! I've added some comments to the doc. > > On Tue, Oct 15, 2019 at 3:43 PM Pablo Estrada wrote: > >> I've just been inf

Multiple Outputs from Expand in Python

2019-10-24 Thread Sam Rohde
Hey All, I'm trying to implement an expand override with multiple output PCollections. The kicker is that I want to insert a new transform for each output PCollection. How can I do this? Regards, Sam

Re: Multiple Outputs from Expand in Python

2019-10-25 Thread Sam Rohde
. On Thu, Oct 24, 2019 at 5:20 PM Sam Rohde wrote: > Hey All, > > I'm trying to implement an expand override with multiple output > PCollections. The kicker is that I want to insert a new transform for each > output PCollection. How can I do this? > > Regards, > Sam >

Re: Multiple Outputs from Expand in Python

2019-10-28 Thread Sam Rohde
ngWithA = \ > >>> p | 'Words starting with A' >> beam.Create(['apple', 'ant', > 'arrow']) > >>> > >>> wordsStartingWithB = \ > >>> p | 'Words starting with B' >> beam.Create(['ba

Confusing multiple output semantics in Python

2019-11-07 Thread Sam Rohde
Hi All, In the Python SDK there are three ways of representing the output of a PTransform with multiple PCollections: - dictionary: PCollection tag --> PCollection - tuple: index --> PCollection - DoOutputsTuple: tag, index, or field name --> PCollection I find this inconsistent way of

Re: Confusing multiple output semantics in Python

2019-11-11 Thread Sam Rohde
> Internally, we have AppliedPTransform, where the output is always a > dictionary: > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pipeline.py#L770 > And it seems to me that with key 'None', the output will be the main > output. > > Ning. > >

Re: Date/Time Ranges & Protobuf

2019-11-14 Thread Sam Rohde
My two cents are we just need a proto representation for timestamps and durations that includes units. The underlying library can then determine what to do with it. Then further, we can have a standard across Beam SDKs and Runners of how to interpret the proto. Using a raw int64 for timestamps and

Re: Date/Time Ranges & Protobuf

2019-11-18 Thread Sam Rohde
tamps together to > compute what the global watermark is for a PCollection. > > On Thu, Nov 14, 2019 at 3:15 PM Sam Rohde wrote: > >> My two cents are we just need a proto representation for timestamps and >> durations that includes units. The underlying library can then d

Re: Date/Time Ranges & Protobuf

2019-11-18 Thread Sam Rohde
s it makes sense to still use the format but work around > the limitation that is imposed. > > On Mon, Nov 18, 2019 at 11:25 AM Sam Rohde wrote: > >> Timestamp related question: I want to modify Python's utils/timestamp.py >> <https://github.com/apache/beam/blo

Re: Date/Time Ranges & Protobuf

2019-11-18 Thread Sam Rohde
Also made a JIRA: https://issues.apache.org/jira/projects/BEAM/issues/BEAM-8738 On Mon, Nov 18, 2019 at 2:32 PM Sam Rohde wrote: > Cool I wrote up https://github.com/apache/beam/pull/10146 > > On Mon, Nov 18, 2019 at 2:09 PM Luke Cwik wrote: > >> Sam, I think doing that ma

Re: [VOTE] Beam Mascot animal choice: vote for as many as you want

2019-11-20 Thread Sam Rohde
[ ] Beaver [ ] Hedgehog [ ] Lemur [ ] Owl [ ] Salmon [ ] Trout [x] Robot dinosaur [ ] Firefly [ ] Cuttlefish [ ] Dumbo Octopus [ ] Angler fish On Wed, Nov 20, 2019 at 9:22 AM Alex Amato wrote: > [ ] Beaver > [ ] Hedgehog > [ ] Lemur > [ ] Owl > [ ] Salmon > [ ] Trout > [X] Robot dinosaur > [ ] F

Python2.7 Beam End-of-Life Date

2020-02-04 Thread Sam Rohde
Hi All, Just curious when Beam will drop support for Python 2.7? Not being able to use all the nice features of 3.x and appeasing both 2.7 and 3.x linters is somewhat troublesome. Not to mention that all the nice work for the type hints will have to be redone in the for 3.x. It seems the faster we

Time precision in Python

2020-02-06 Thread Sam Rohde
Hi All, I saw that in the Python SDK we encode WindowedValues and Timestamps

Re: Time precision in Python

2020-02-06 Thread Sam Rohde
ntly, it won't result in out-of-order windows, but > it may result in timestamp truncation and (for sub-millisecond small > windows) even window identifiaction. > > On Thu, Feb 6, 2020 at 1:42 PM Sam Rohde wrote: > > > > Hi All, > > > > I saw that in the Python

[Interactive Runner] now available on master

2020-03-18 Thread Sam Rohde
Hi All! I am happy to announce that an improved Interactive Runner is now available on master. This Python runner allows for the interactive development of Beam pipelines in a notebook (and IPython) environment. The runner still has some bugs that need to be fixed as well as some refactoring,

[BEAM-9322] Python SDK discussion on correct output tag names

2020-03-24 Thread Sam Rohde
Hi All, *Problem* I would like to discuss BEAM-9322 and the correct way to set the output tags of a transform with nested PCollections, e.g. a dict of PCollections, a tuple of dicts of PCollections. Before the fixing of BEAM-1833

Unportable Dataflow Pipeline Questions

2020-03-31 Thread Sam Rohde
Hi All, I am currently investigating making the Python DataflowRunner to use a portable pipeline representation so that we can eventually get rid of the Pipeline(runner) weirdness. In that case, I have a lot questions about the Python DataflowRunner: *PValueCache* - Why does this exist? *Da

Re: [BEAM-9322] Python SDK discussion on correct output tag names

2020-03-31 Thread Sam Rohde
ay to go > here. I don't know if this would limit expressiveness. > > >> * Have a "best" effort naming system (note the example I give can have >> many of the "rules" re-ordered) e.g. if all the PCollection tags are unique >> then use only t

Re: Unportable Dataflow Pipeline Questions

2020-04-01 Thread Sam Rohde
oable. > > On Tue, Mar 31, 2020, 5:48 PM Robert Bradshaw wrote: > >> On Tue, Mar 31, 2020 at 12:06 PM Sam Rohde wrote: >> >>> Hi All, >>> >>> I am currently investigating making the Python DataflowRunner to use a >>> portable pipeline r

Re: [BEAM-9322] Python SDK discussion on correct output tag names

2020-04-01 Thread Sam Rohde
aw > wrote: > >> > >> On Tue, Mar 31, 2020 at 1:13 PM Sam Rohde wrote: > >> >>> > >> >>> * Don't allow arbitrary nestings returned during expansion, force > composite transforms to always provide an unambiguous name (either a tuple

Re: [BEAM-9322] Python SDK discussion on correct output tag names

2020-04-01 Thread Sam Rohde
esn't address the original issue.) > > On Wed, Apr 1, 2020 at 12:41 PM Sam Rohde wrote: > >> So then how does the proposal sound? >> >> Writing again here: >> PTransform.expand: (...) -> Union[PValue, NamedTuple[str, PCollection], >> Tuple[str, PCol

Re: [BEAM-9322] Python SDK discussion on correct output tag names

2020-04-01 Thread Sam Rohde
s in the PR too). > On Wed, Apr 1, 2020 at 2:57 PM Robert Bradshaw > wrote: > >> On Wed, Apr 1, 2020 at 1:48 PM Sam Rohde wrote: >> >>> To restate the original issue it is that the current method of setting >>> the output tags on PCollections from compos

More metadata in Coder Proto

2020-05-19 Thread Sam Rohde
Hi all, Should there be more metadata in the Coder Proto? For example, adding an "is_deterministic" boolean field. This will allow for a language-agnostic way for SDKs to infer properties about a coder received from the expansion service. My motivation for this is that I recently ran into a probl

Re: More metadata in Coder Proto

2020-05-19 Thread Sam Rohde
fore submission to the job server >> perform the expansion telling it all the limitations that the SDK has >> imposed on it. >> >> >> >> >> On Tue, May 19, 2020 at 3:45 PM Sam Rohde wrote: >> >>> Hi all, >>> >>> Should there be more

Re: More metadata in Coder Proto

2020-05-20 Thread Sam Rohde
what >>> your output shape must be during xlang pipeline expansion which is yet to >>> be defined to support such a case. Your suggested solution of adding >>> properties to coders is one possible solution but I think we have to take a >>> step back and c

Re: More metadata in Coder Proto

2020-05-20 Thread Sam Rohde
+Robert Bradshaw who is the reviewer on https://github.com/apache/beam/pull/11503. How does that sound to you? Skip the "is input deterministic" check for GBKs embedded in x-lang transforms? On Wed, May 20, 2020 at 10:56 AM Sam Rohde wrote: > Thanks for your comments, here'

Re: Proposal: ToStringFn

2020-10-28 Thread Sam Rohde
+Lukasz Cwik On Tue, Oct 27, 2020 at 12:04 PM Sam Rohde wrote: > Hi All, > > I'm working on a project in Dataflow that requires the runner to translate > an element to a human-readable form. To do this, I want to add a new > well-known transform that allows any runn

Re: Proposal: ToStringFn

2020-10-28 Thread Sam Rohde
done! On Wed, Oct 28, 2020 at 3:54 PM Tyson Hamilton wrote: > Can you open up comment access please? > > On Wed, Oct 28, 2020 at 3:40 PM Sam Rohde wrote: > >> +Lukasz Cwik >> >> On Tue, Oct 27, 2020 at 12:04 PM Sam Rohde wrote: >> >>> Hi All, >

Re: Unable to run python formater (Are the instructions out of date?)

2020-11-02 Thread Sam Rohde
I personally run `tox -e py37-lint` and `tox -e py3-yapf` from the root/sdks/python directory and that catches most stuff. If you are adding type annotations then also running `tox -e py37-mypy` is a good choice. Note that tox supports tab completion, so you can see all the different options by dou

Docker Development Environment

2020-11-25 Thread Sam Rohde
Hi All, I got tired of my local dev environment being ruined by updates so I made a container for Apache Beam development work. What this does is create a Docker container from the Ubuntu Groovy image and load it up with all the necessary libraries/utilities for Apache Beam development. Then I run

Re: Making preview (sample) time consistent on Direct runner

2021-01-05 Thread Sam Rohde
Hi Ismael, Those are good points. Do you know if the Interactive Runner has been tried in those instances? If so, what were the shortcomings? I can also see the use of sampling for a performance benchmarking reason. We have seen others send in known elements which are tracked throughout the pipel