Re: Apache Pulsar connector for Beam

2019-10-25 Thread Taher Koitawala
I would be interested in contributing to the Pulsar Beam connector. That's one of the reasons I started the email thread. Regards, Taher Koitawala On Sat, Oct 26, 2019, 9:41 AM Sijie Guo wrote: > This is Sijie Guo from StreamNative and Pulsar PMC. > > Maximilian - thank you for adding us in th

Re: Apache Pulsar connector for Beam

2019-10-25 Thread Sijie Guo
This is Sijie Guo from StreamNative and Pulsar PMC. Maximilian - thank you for adding us in the email thread! We do have one roadmap item for adding a Beam connector for Pulsar. It was planned for this quarter, but we haven’t started the implementation yet. If the Beam community is interested in

Re: Python Precommit duration pushing 2 hours

2019-10-25 Thread Valentyn Tymofieiev
Thanks, Brian. +Udi Meiri As a next step, it would be good to know whether the slowdown is caused by tests in this PR, or its effect on other tests, and to confirm that only Python 2 codepaths were affected. On Fri, Oct 25, 2019 at 6:35 PM Brian Hulette wrote: > I did a bisect based on the runtime of

Re: Python Precommit duration pushing 2 hours

2019-10-25 Thread Brian Hulette
I did a bisect based on the runtime of `./gradlew :sdks:python:test-suites:tox:py2:testPy2Gcp` around the commits between 9/1 and 9/15 to see if I could find the source of the spike that happened around 9/6. It looks like it was due to PR#9283 [1]. I thought maybe this search would reveal some mis-
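
The runtime bisect described in this message can be sketched roughly as follows. This is a minimal illustration, assuming a hypothetical `measure` callable that returns the suite's runtime in seconds for a given commit (in practice it would check out the commit and time the Gradle task). The function name, the threshold and the sample timings are assumptions for illustration, not values from the thread.

```python
from typing import Callable, Sequence


def first_slow_commit(
    commits: Sequence[str],
    measure: Callable[[str], float],
    threshold_seconds: float,
) -> str:
    """Return the first commit whose measured runtime exceeds the threshold.

    Assumes commits are ordered oldest to newest, the oldest one is fast,
    the newest one is slow, and the runtime stays slow once it regresses.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] fast, commits[hi] slow
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if measure(commits[mid]) > threshold_seconds:
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return commits[hi]


# Hypothetical timings (seconds) standing in for real test-suite runs.
fake_runtimes = {"c1": 900.0, "c2": 910.0, "c3": 905.0, "c4": 2100.0, "c5": 2080.0}
culprit = first_slow_commit(list(fake_runtimes), fake_runtimes.__getitem__, 1500.0)
print(culprit)
```

Because each probe is a full test-suite run, the binary search matters: it needs about log2(N) measurements rather than one per commit.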

Re: [EXT] Re: Interactive Beam Example Failing [BEAM-8451]

2019-10-25 Thread Chuck Yang
IMO returning the input PCollection in a PTransform should be a valid albeit trivial use case. I have put a suggested fix for supporting these kinds of transforms in the interactive runner here: https://github.com/apache/beam/pull/9865 . I'm also new to Beam dev so if there's something I'm missing

Re: Python Precommit duration pushing 2 hours

2019-10-25 Thread Pablo Estrada
I think it makes sense to remove some of the extra FnApiRunner configurations. Perhaps some of the multiworkers and some of the grpc versions? Best -P. On Fri, Oct 25, 2019 at 12:27 PM Robert Bradshaw wrote: > It looks like fn_api_runner_test.py is quite expensive, taking 10-15+ > minutes on eac

Re: JIRA priorities explaination

2019-10-25 Thread Pablo Estrada
That SGTM On Fri, Oct 25, 2019 at 4:18 PM Robert Bradshaw wrote: > +1 to both. > > On Fri, Oct 25, 2019 at 3:58 PM Valentyn Tymofieiev > wrote: > > > > On Fri, Oct 25, 2019 at 3:39 PM Kenneth Knowles wrote: > >> > >> Suppose, hypothetically, we say that if Fix Version is set, then > P0/Blocker

Re: JIRA priorities explaination

2019-10-25 Thread Robert Bradshaw
+1 to both. On Fri, Oct 25, 2019 at 3:58 PM Valentyn Tymofieiev wrote: > > On Fri, Oct 25, 2019 at 3:39 PM Kenneth Knowles wrote: >> >> Suppose, hypothetically, we say that if Fix Version is set, then P0/Blocker >> and P1/Critical block release and lower priorities get bumped. > > > +1 to Kenn'

Re: Multiple Outputs from Expand in Python

2019-10-25 Thread Robert Bradshaw
You can literally return a Python tuple of outputs from a composite transform as well. (Dicts with PCollections as values are also supported, if you want things to be named rather than referenced by index.) On Fri, Oct 25, 2019 at 4:06 PM Ahmet Altay wrote: > > Is DoOutputsTuple what you are look

Re: Multiple Outputs from Expand in Python

2019-10-25 Thread Ahmet Altay
Is DoOutputsTuple what you are looking for? [1] You can look at this expand function using it [2]. [1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/pvalue.py#L204 [2] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L1283 On Fri, Oct 25,

Re: JIRA priorities explaination

2019-10-25 Thread Valentyn Tymofieiev
On Fri, Oct 25, 2019 at 3:39 PM Kenneth Knowles wrote: > Suppose, hypothetically, we say that if Fix Version is set, then > P0/Blocker and P1/Critical block release and lower priorities get bumped. > +1 to Kenn's suggestion. In addition, we can discourage setting Fix version for non-critical is

Re: Multiple Outputs from Expand in Python

2019-10-25 Thread Luke Cwik
On further investigation, it seems my example is about multiple inputs and not multiple outputs, so I don't know. The documentation online[1] doesn't seem to specify how to do this for composite transforms either. All the examples are of the single output variety as well[2]. 1: https:/

Re: JIRA priorities explaination

2019-10-25 Thread Kenneth Knowles
Suppose, hypothetically, we say that if Fix Version is set, then P0/Blocker and P1/Critical block release and lower priorities get bumped. Most likely the release manager still pings and asks about all those before bumping. Which means that in effect they were part of the burn down and do block th

Re: [DISCUSS] How to stopp SdkWorker in SdkHarness

2019-10-25 Thread Luke Cwik
Approach 3 is about caching the bundle descriptor forever but tearing down a "live" instance of the DoFns at some SDK chosen arbitrary point in time. This way if a future ProcessBundleRequest comes in, a new "live" instance can be constructed. Approach 2 is still needed so that when the workers are
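
Approach 3 as described here might look roughly like the sketch below: descriptors are cached forever, while "live" processor instances are torn down after an SDK-chosen idle period and rebuilt when a later request arrives. All names (`LiveProcessor`, `ProcessorCache`, `idle_limit`, `evict_idle`) are hypothetical and do not correspond to the actual SDK harness code. The idle-time policy is only one possible choice of "arbitrary point in time".

```python
from typing import Callable, Dict, List


class LiveProcessor:
    """Hypothetical stand-in for a 'live' set of DoFn instances."""

    def __init__(self, descriptor: str, log: List[str]) -> None:
        self.descriptor = descriptor
        self._log = log
        log.append(f"setup {descriptor}")

    def process(self, element: str) -> str:
        return f"{self.descriptor}:{element}"

    def teardown(self) -> None:
        self._log.append(f"teardown {self.descriptor}")


class ProcessorCache:
    """Keeps descriptors forever; keeps live processors only while recently used."""

    def __init__(self, idle_limit: float, clock: Callable[[], float], log: List[str]) -> None:
        self._idle_limit = idle_limit
        self._clock = clock
        self._log = log
        self._descriptors: Dict[str, str] = {}  # never evicted
        self._live: Dict[str, LiveProcessor] = {}
        self._last_used: Dict[str, float] = {}

    def register(self, descriptor_id: str, descriptor: str) -> None:
        self._descriptors[descriptor_id] = descriptor

    def process_bundle(self, descriptor_id: str, elements: List[str]) -> List[str]:
        processor = self._live.get(descriptor_id)
        if processor is None:
            # Rebuild a live instance from the cached descriptor.
            processor = LiveProcessor(self._descriptors[descriptor_id], self._log)
            self._live[descriptor_id] = processor
        self._last_used[descriptor_id] = self._clock()
        return [processor.process(e) for e in elements]

    def evict_idle(self) -> None:
        """Tear down live processors that have been idle longer than the limit."""
        now = self._clock()
        for descriptor_id in list(self._live):
            if now - self._last_used[descriptor_id] > self._idle_limit:
                self._live.pop(descriptor_id).teardown()


# Usage with a fake clock so the example is deterministic.
now = [0.0]
log: List[str] = []
cache = ProcessorCache(idle_limit=60.0, clock=lambda: now[0], log=log)
cache.register("d1", "upper")
first = cache.process_bundle("d1", ["a"])
now[0] = 120.0
cache.evict_idle()  # tears down the idle live instance
second = cache.process_bundle("d1", ["b"])  # descriptor still cached, instance rebuilt
print(log)
```

The point of the design is that no new API message is needed: the decision of when to tear down is local to the SDK.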

Re: JIRA priorities explaination

2019-10-25 Thread Kenneth Knowles
My takeaway from this thread is that priorities should have a shared community intuition and/or policy around how they are treated, which could eventually be formalized into SLOs. At a practical level, I do think that build breaks are higher priority than release blockers. If you are on this threa

Re: [DISCUSS] How to stopp SdkWorker in SdkHarness

2019-10-25 Thread Robert Burke
Approach 2 isn't incompatible with approach 3. 3 simply sets down a convention/configuration for the conditions when the SDK will do this after process bundle has completed. On Fri, Oct 25, 2019, 12:34 PM Robert Bradshaw wrote: > I think we'll still need approach (2) for when the pipeline finish

Re: JIRA priorities explaination

2019-10-25 Thread Robert Bradshaw
I'm fine with that, but in that case we should have a priority for release blockers, below which bugs get automatically bumped to the next release (and which becomes the burndown list). On Fri, Oct 25, 2019 at 1:58 PM Kenneth Knowles wrote: > > My takeaway from this thread is that priorities shou

Re: contributor permission for Beam Jira tickets

2019-10-25 Thread Kenneth Knowles
Assuming your Jira id is pbhosale87, I have added you to Contributors role, so you can be assigned a Jira. Kenn On Fri, Oct 25, 2019 at 1:22 PM Pradeep Bhosale < bhosale.pradeep1...@gmail.com> wrote: > Hi, > > This is Pradeep from Conde Nast. > I'm currently working on Apache beam and would li

contributor permission for Beam Jira tickets

2019-10-25 Thread Pradeep Bhosale
Hi, This is Pradeep from Conde Nast. I'm currently working on Apache Beam and would like to suggest/fix a few issues for AWS IOs. Can someone add me as a contributor for Beam's Jira issue tracker? I would like to create/assign tickets for my work. Thanks, Pradeep

Re: DynamoDBIO related issue

2019-10-25 Thread Pradeep Bhosale
On 2019/10/25 16:04:53, Luke Cwik wrote: > If you create a JIRA account and share your user id with us, we will grant > you contributor access which will allow you to create a JIRA issue. > > Please take a look at our contribution guide, it mentions how to > connect with the Beam communit

Re: [DISCUSS] How to stopp SdkWorker in SdkHarness

2019-10-25 Thread Robert Bradshaw
I think we'll still need approach (2) for when the pipeline finishes and a runner is tearing down workers. On Fri, Oct 25, 2019 at 10:36 AM Maximilian Michels wrote: > > Hi Jincheng, > > Thanks for bringing this up and capturing the ideas in the doc. > > Intuitively, I would have also considered

Re: Python Precommit duration pushing 2 hours

2019-10-25 Thread Robert Bradshaw
It looks like fn_api_runner_test.py is quite expensive, taking 10-15+ minutes on each version of Python. This test consists of a base class that is basically a validates runner suite, and is then run in several configurations, many more of which (including some expensive ones) have been added latel
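
The structure described here, a shared base suite re-run under several configurations, can be illustrated with plain `unittest`. This is a sketch under assumed names (`RunnerSuiteBase`, the `config` dictionaries and the two placeholder tests are hypothetical and are not the contents of fn_api_runner_test.py). It shows why each added configuration multiplies the number of test cases that run.

```python
import unittest


class RunnerSuiteBase:
    """Hypothetical shared suite; each subclass supplies a runner configuration."""

    config = {}

    def test_create(self):
        self.assertIn("workers", self.config)

    def test_group_by_key(self):
        self.assertGreaterEqual(self.config["workers"], 1)


class DefaultConfigTest(RunnerSuiteBase, unittest.TestCase):
    config = {"workers": 1, "grpc": False}


class GrpcConfigTest(RunnerSuiteBase, unittest.TestCase):
    config = {"workers": 1, "grpc": True}


class MultiWorkerConfigTest(RunnerSuiteBase, unittest.TestCase):
    config = {"workers": 2, "grpc": True}


def count_cases(*case_classes):
    """Count how many test cases the given TestCase classes expand to."""
    loader = unittest.TestLoader()
    return sum(loader.loadTestsFromTestCase(c).countTestCases() for c in case_classes)


# Every test in the base suite runs once per configuration.
total = count_cases(DefaultConfigTest, GrpcConfigTest, MultiWorkerConfigTest)
print(total)
```

Dropping a configuration class removes its whole copy of the base suite, which is why trimming configurations is an effective way to cut the runtime.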

Re: Interactive Beam Example Failing [BEAM-8451]

2019-10-25 Thread David Yan
+Sam Rohde is working on streaming support for Interactive Beam. The high-level idea is to capture a bounded segment of the unbounded data source for replayability and determinism, and to use TestStream, which has the ability to control the clock of the pipeline, to replay the data, so streaming s

Re: JIRA priorities explaination

2019-10-25 Thread Robert Bradshaw
We cut a release every 6 weeks, according to schedule, making it easy to plan for, and the release manager typically sends out a warning email to remind everyone. I don't think it makes sense to do that for every ticket. Blockers should be reserved for things we really shouldn't release without. O

Re: Intermittent No FileSystem found exception

2019-10-25 Thread Koprivica,Preston Blake
I misspoke on the temporary workaround. You should use the #withIgnoreWindowing() option on FileIO. On 10/25/19, 11:33 AM, "Maximilian Michels" wrote: Hi Maulik, Thanks for reporting. As Preston already pointed out, this is fixed in the upcoming 2.17.0 release. Thanks, Max

Re: JIRA priorities explaination

2019-10-25 Thread Pablo Estrada
I mentioned on the PR that I had been using the 'blocker' priority along with the 'fix version' field to mark issues that I want to get in the release. Of course, this little practice of mine only matters much around release branch cutting time - and has been useful for me to track which things I w

Re: [DISCUSS] New Beam pipeline diagrams

2019-10-25 Thread Kenneth Knowles
These are really clean and clear. Nice! On Thu, Oct 24, 2019 at 9:05 AM Cyrus Maden wrote: > Hi all, > > Thank you to everyone who reached out to me and/or commented on my > proposal for new Beam pipeline diagrams[1]. I've compiled all of your > suggestions and updated the design accordingly. *H

Re: Python Precommit duration pushing 2 hours

2019-10-25 Thread Valentyn Tymofieiev
I took another look at this and precommit ITs are already running in parallel, albeit in the same suite. However it appears Python precommits became slower, especially Python 2 precommits [35 min per suite x 3 suites], see [1]. Not sure yet what caused the increase, but precommits used to be faster

Re: [DISCUSS] How to stopp SdkWorker in SdkHarness

2019-10-25 Thread Maximilian Michels
Hi Jincheng, Thanks for bringing this up and capturing the ideas in the doc. Intuitively, I would have also considered adding a new Proto message for the teardown, but I think the idea to trigger this logic when the SDK Harness evicts process bundle descriptors is more elegant. Thanks, Max

Re: Multiple Outputs from Expand in Python

2019-10-25 Thread Luke Cwik
I believe PCollectionTuple should be unnecessary since Python has first class support for tuples as shown in the example below[1]. Can we use tuples to solve your issue? wordsStartingWithA = \ p | 'Words starting with A' >> beam.Create(['apple', 'ant', 'arrow']) wordsStartingWithB = \ p |

Re: Multiple Outputs from Expand in Python

2019-10-25 Thread Sam Rohde
Talked to Daniel offline and it looks like the Python SDK is missing PCollection Tuples like the one Java has: https://github.com/rohdesamuel/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollectionTuple.java . I'll go ahead and implement that for the Python SDK. On Th

Re: Apache Pulsar connector for Beam

2019-10-25 Thread Maximilian Michels
It would be great to have a Pulsar connector. We might want to ask the folks from StreamNative (in CC). Any plans? :) Cheers, Max On 24.10.19 18:31, Pablo Estrada wrote: There's a JIRA issue to track this: https://issues.apache.org/jira/browse/BEAM-8218 Alex was kind enough to file it. +Alex

Re: Intermittent No FileSystem found exception

2019-10-25 Thread Maximilian Michels
Hi Maulik, Thanks for reporting. As Preston already pointed out, this is fixed in the upcoming 2.17.0 release. Thanks, Max On 24.10.19 15:20, Koprivica,Preston Blake wrote: Hi Maulik, I believe you may be witnessing this issue: https://issues.apache.org/jira/browse/BEAM-8303.  We ran into

Re: DynamoDBIO related issue

2019-10-25 Thread Luke Cwik
If you create a JIRA account and share your user id with us, we will grant you contributor access which will allow you to create a JIRA issue. Please take a look at our contribution guide, it mentions how to connect with the Beam community including creating a JIRA account[1]. 1: https://beam

Re: [DISCUSS] How to stopp SdkWorker in SdkHarness

2019-10-25 Thread Luke Cwik
I like approach 3 since it doesn't add additional complexity to the API and individual SDKs can choose to implement any clean-up strategy they want or none at all which is the simplest. On Thu, Oct 24, 2019 at 8:46 PM jincheng sun wrote: > Hi, > > Thanks for your comments in doc, I have added Appr

Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Ryan Skraba
One more thing to try -- depending on your pipeline, you can disable the "auto-reshuffle" of JdbcIO.Read by setting withOutputParallelization(false). This is particularly useful if (1) you do aggressive and cheap filtering immediately after the read or (2) you do your own repartitioning action like

Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Eugene Kirpichov
Yeah - in this case your primary option is to use JdbcIO.readAll() and shard your query, as suggested above. Alternative hypothesis: is the result set of your query actually big enough that it *shouldn't* fit in memory? Or could it be a matter of inefficient storage of its elements? Could you brie

Re: JdbcIO read needs to fit in memory

2019-10-25 Thread Jozef Vilcek
I agree I might have been too quick to claim that DoFn output needs to fit in memory. Actually I am not sure what the Beam model says on this matter and what output managers of particular runners do about it. But SparkRunner definitely has an issue here. I did try setting a small `fetchSize` for JdbcIO as well as change `

Re: Jenkin jobs taking too long

2019-10-25 Thread Kyle Weaver
Thanks Alexey. Forgot what magic word. 🙂 On Fri, Oct 25, 2019 at 11:48 AM Alexey Romanenko wrote: > > On 25 Oct 2019, at 09:18, Kyle Weaver wrote: > Does anyone know if there's a PR command for "rerun all"? > > > Do you mean "Retest this please"? > > > > On Fri, Oct 25, 2019 at 9:09 AM Rehman

Re: Jenkin jobs taking too long

2019-10-25 Thread Alexey Romanenko
> On 25 Oct 2019, at 09:18, Kyle Weaver wrote: > Does anyone know if there's a PR command for "rerun all"? Do you mean "Retest this please"? > > On Fri, Oct 25, 2019 at 9:09 AM Rehman Murad Ali > <rehman.murad...@venturedive.com> > wrote: > Hello, > > It's been more than 17 hours,

Re: Jenkin jobs taking too long

2019-10-25 Thread Rehman Murad Ali
Thank you Kyle for letting me know the workaround. *Thanks & Regards* *Rehman Murad Ali* Software Engineer Mobile: +92 3452076766 Skype: rehman,muradali On Fri, Oct 25, 2019 at 12:26 PM Kyle Weaver wrote: > Looks like Jenkins might have got stuck for a while (this h

Re: Jenkin jobs taking too long

2019-10-25 Thread Kyle Weaver
Looks like Jenkins might have got stuck for a while (this happened on my PR as well), but it seems to be working again now. You can manually re-run individual jobs with a comment on the PR containing one of the commands listed (such as Run Python PreCommit). Does anyone know if there's a PR comman

Jenkin jobs taking too long

2019-10-25 Thread Rehman Murad Ali
Hello, It's been more than 17 hours and the Jenkins jobs are still running. Here is the PR: https://github.com/apache/beam/pull/9677. Any help would be appreciated. Rehman