Re: Jacek's new Apache Beam Internals Project

2020-05-03 Thread Holden Karau
That's my understanding. I believe he's starting to explore Beam & Kafka in
the same way he did with Spark, which is exciting to me.

On Thu, Apr 30, 2020 at 10:17 AM Reuven Lax  wrote:

> I took a look at the book - there's not really much there. Maybe he's
> planning on adding more over time?
>
> On Tue, Apr 28, 2020 at 1:48 PM Ismaël Mejía  wrote:
>
>> The tweet URL for ref in case someone wants to like/RT
>>
>> https://twitter.com/jaceklaskowski/status/1255046717277376512?s=19
>>
>> On Tue, Apr 28, 2020, 8:04 PM Holden Karau  wrote:
>>
>>> Hi Folks,
>>>
>>> I just saw Jacek's tweet about his new Beam Internals project (he's done
>>> a great job on his Spark Internals documentation and blog posts) and I
>>> figured I'd share the link
>>> https://leanpub.com/the-internals-of-apache-beam in case folks are
>>> interested :)
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Jacek's new Apache Beam Internals Project

2020-04-28 Thread Holden Karau
Hi Folks,

I just saw Jacek's tweet about his new Beam Internals project (he's done a
great job on his Spark Internals documentation and blog posts) and I
figured I'd share the link https://leanpub.com/the-internals-of-apache-beam in
case folks are interested :)

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Reference to Beam in upcoming Kubeflow Book

2020-04-17 Thread Holden Karau
On Fri, Apr 17, 2020 at 3:52 PM Robert Bradshaw  wrote:

> On Fri, Apr 17, 2020 at 2:56 PM Holden Karau  wrote:
>
>>
>> On Fri, Apr 17, 2020 at 2:45 PM Robert Bradshaw 
>> wrote:
>>
>>> Hi Holden!
>>>
>>> I agree with Kyle that it makes sense to have some caveat about Flink
>>> and Spark, though at this point they're not /that/ new (at least not
>>> Flink).
>>>
>> True, maybe "early-stage" would be better wording? The TFX PyBeam Flink
>> support isn't yet mature enough (although I believe there is interest in
>> integrating it into Kubeflow, it hasn't happened yet).
>>
>
> I might just say "not as mature." Most of the work being done now is
> fit-n-finish. There are also some extra flags that need to be passed to work
> around bugs in Flink itself encountered when running TFX jobs.
>
Does this currently work at scale? The last time I tried to use TFX on Beam
on Flink, it had difficulty with data above ~10 MB.

> (There's the separate question of using Kubernetes to deploy/manage the
> Flink cluster itself, but the mode where Flink workers invoke docker to
> start up the Python binaries is pretty stable at this point.)
>
So maybe we would say the OSS path is to run TFX on Beam on Flink on
YARN (like EMR)?

>
>
>> I am curious what extra support Kubeflow is "missing" (or, conversely,
>>> what extra support it has for Dataflow that goes beyond just specifying a
>>> different runner) to the point that these runners are declared
>>> "unsupported." Or it it literally a matter of not providing user support?
>>>
>> So the Kubeflow TFX components (in
>> https://github.com/kubeflow/pipelines/tree/master/components) are
>> limited to local mode.
>>
>
> So in that sense it's not less supported than Dataflow?
>
From the component side it's the same. But if someone wanted to do it "by
hand", Dataflow offers better support.

>
>
>>
>>> On Fri, Apr 17, 2020 at 12:27 PM Kyle Weaver 
>>> wrote:
>>>
>>>> Hi Holden,
>>>>
>>>> The note on Flink & Spark support sounds reasonable to me. I am
>>>> optimistic about getting Flink + TFX + Kubeflow working fairly soon, but I
>>>> agree that we don't want to over-promise.
>>>>
>>>> I'm not so sure about the status of Dataflow here, perhaps someone else
>>>> can comment on that.
>>>>
>>>> Looking forward to the book :)
>>>>
>>>> Kyle
>>>>
>>>> On Fri, Apr 17, 2020 at 1:14 PM Holden Karau 
>>>> wrote:
>>>>
>>>>> Hi Apache Beam Developers,
>>>>>
>>>>> I'm working on a book about Kubeflow, which naturally has a section on
>>>>> TFX. I want to set users' expectations correctly, so I wanted to know what
>>>>> y'all thought of this NOTE we were thinking of including in the early
>>>>> release:
>>>>>
>>>>> Apache Beam's Python support outside of Google Cloud's Dataflow is
>>>>> relatively new. TFX is a Python tool, so scaling it depends on Apache
>>>>> Beam's Python support. You can scale your job by using the non-portable
>>>>> Dataflow component, but this requires changing your pipeline code and
>>>>> isn't supported by Kubeflow's current TFX components. As Apache Beam's
>>>>> support for Apache Flink & Spark improves, support may be added for
>>>>> scaling the TFX components in a portable manner.
>>>>>
>>>>> Does this sound reasonable to folks? I don't want to over-promise but
>>>>> I also don't want to scare people away given all of the progress that is
>>>>> being made in supporting the open-source runners with language 
>>>>> portability.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Holden :)
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Reference to Beam in upcoming Kubeflow Book

2020-04-17 Thread Holden Karau
On Fri, Apr 17, 2020 at 2:45 PM Robert Bradshaw  wrote:

> Hi Holden!
>
> I agree with Kyle that it makes sense to have some caveat about Flink and
> Spark, though at this point they're not /that/ new (at least not Flink).
>
True, maybe "early-stage" would be better wording? The TFX PyBeam Flink
support isn't yet mature enough (although I believe there is interest in
integrating it into Kubeflow, it hasn't happened yet).

>
> I am curious what extra support Kubeflow is "missing" (or, conversely,
> what extra support it has for Dataflow that goes beyond just specifying a
> different runner) to the point that these runners are declared
> "unsupported." Or it it literally a matter of not providing user support?
>
So the Kubeflow TFX components (in
https://github.com/kubeflow/pipelines/tree/master/components) are limited
to local mode.

>
> On Fri, Apr 17, 2020 at 12:27 PM Kyle Weaver  wrote:
>
>> Hi Holden,
>>
>> The note on Flink & Spark support sounds reasonable to me. I am
>> optimistic about getting Flink + TFX + Kubeflow working fairly soon, but I
>> agree that we don't want to over-promise.
>>
>> I'm not so sure about the status of Dataflow here, perhaps someone else
>> can comment on that.
>>
>> Looking forward to the book :)
>>
>> Kyle
>>
>> On Fri, Apr 17, 2020 at 1:14 PM Holden Karau 
>> wrote:
>>
>>> Hi Apache Beam Developers,
>>>
>>> I'm working on a book about Kubeflow, which naturally has a section on
>>> TFX. I want to set users' expectations correctly, so I wanted to know what
>>> y'all thought of this NOTE we were thinking of including in the early
>>> release:
>>>
>>> Apache Beam's Python support outside of Google Cloud's Dataflow is
>>> relatively new. TFX is a Python tool, so scaling it depends on Apache
>>> Beam's Python support. You can scale your job by using the non-portable
>>> Dataflow component, but this requires changing your pipeline code and isn't
>>> supported by Kubeflow's current TFX components. As Apache Beam's support
>>> for Apache Flink & Spark improves, support may be added for scaling the TFX
>>> components in a portable manner.
>>>
>>> Does this sound reasonable to folks? I don't want to over-promise but I
>>> also don't want to scare people away given all of the progress that is
>>> being made in supporting the open-source runners with language portability.
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Reference to Beam in upcoming Kubeflow Book

2020-04-17 Thread Holden Karau
On Fri, Apr 17, 2020 at 2:32 PM Ahmet Altay  wrote:

> Hi Holden, nice to hear from you. Thanks a lot for this email. Adding some
> TFX folks as well. +Robert Crowe  +Irene
> Giannoumis  +Zhitao Li  +Anusha
> Ramesh 
>
> Would it be possible for TFX folks to review the TFX section of your book?
>
Sure. Currently we only cover TFT and TFDV; I can share the draft of
that chapter with the TFX folks, but we might cover more later.

>
> On Fri, Apr 17, 2020 at 12:27 PM Kyle Weaver  wrote:
>
>> Hi Holden,
>>
>> The note on Flink & Spark support sounds reasonable to me. I am
>> optimistic about getting Flink + TFX + Kubeflow working fairly soon, but I
>> agree that we don't want to over-promise.
>>
>> I'm not so sure about the status of Dataflow here, perhaps someone else
>> can comment on that.
>>
>
> I believe TFX/KFP works on Dataflow with the same pipeline. (They have an
> example of this at
> https://github.com/tensorflow/tfx/blob/master/docs/tutorials/tfx/template.ipynb
> - step 8)
>
>
So that is only the TFX pipeline; if you want to use Kubeflow Pipelines
with the TFX components, that's not supported.

>
>> Looking forward to the book :)
>>
>> Kyle
>>
>> On Fri, Apr 17, 2020 at 1:14 PM Holden Karau 
>> wrote:
>>
>>> Hi Apache Beam Developers,
>>>
>>> I'm working on a book about Kubeflow, which naturally has a section on
>>> TFX. I want to set users' expectations correctly, so I wanted to know what
>>> y'all thought of this NOTE we were thinking of including in the early
>>> release:
>>>
>>> Apache Beam's Python support outside of Google Cloud's Dataflow is
>>> relatively new. TFX is a Python tool, so scaling it depends on Apache
>>> Beam's Python support. You can scale your job by using the non-portable
>>> Dataflow component, but this requires changing your pipeline code and isn't
>>> supported by Kubeflow's current TFX components. As Apache Beam's support
>>> for Apache Flink & Spark improves, support may be added for scaling the TFX
>>> components in a portable manner.
>>>
>>> Does this sound reasonable to folks? I don't want to over-promise but I
>>> also don't want to scare people away given all of the progress that is
>>> being made in supporting the open-source runners with language portability.
>>>
>>> Cheers,
>>>
>>> Holden :)
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>> --
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Reference to Beam in upcoming Kubeflow Book

2020-04-17 Thread Holden Karau
Hi Apache Beam Developers,

I'm working on a book about Kubeflow, which naturally has a section on TFX.
I want to set users' expectations correctly, so I wanted to know what y'all
thought of this NOTE we were thinking of including in the early release:

Apache Beam's Python support outside of Google Cloud's Dataflow is
relatively new. TFX is a Python tool, so scaling it depends on Apache
Beam's Python support. You can scale your job by using the non-portable
Dataflow component, but this requires changing your pipeline code and isn't
supported by Kubeflow's current TFX components. As Apache Beam's support
for Apache Flink & Spark improves, support may be added for scaling the TFX
components in a portable manner.

Does this sound reasonable to folks? I don't want to over-promise but I
also don't want to scare people away given all of the progress that is
being made in supporting the open-source runners with language portability.

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: [ANNOUNCE] New committer: Robert Burke

2019-07-16 Thread Holden Karau
Congratulations! :)

On Tue, Jul 16, 2019 at 10:50 AM Mikhail Gryzykhin 
wrote:

> Congratulations!
>
> On Tue, Jul 16, 2019 at 10:36 AM Ankur Goenka  wrote:
>
>> Congratulations Robert!
>>
>> Go GO!
>>
>> On Tue, Jul 16, 2019 at 10:34 AM Rui Wang  wrote:
>>
>>> Congrats!
>>>
>>>
>>> -Rui
>>>
>>> On Tue, Jul 16, 2019 at 10:32 AM Udi Meiri  wrote:
>>>
 Congrats Robert B.!

 On Tue, Jul 16, 2019 at 10:23 AM Ahmet Altay  wrote:

> Hi,
>
> Please join me and the rest of the Beam PMC in welcoming a new
> committer: Robert Burke.
>
> Robert has been contributing to Beam and actively involved in the
> community for over a year. He has been actively working on the Go SDK, helping
> users, and making it easier for others to contribute [1].
>
> In consideration of Robert's contributions, the Beam PMC trusts him
> with the responsibilities of a Beam committer [2].
>
> Thank you, Robert, for your contributions and looking forward to many
> more!
>
> Ahmet, on behalf of the Apache Beam PMC
>
> [1]
> https://lists.apache.org/thread.html/8f729da2d3009059d7a8b2d8624446be161700dcfa953939dd3530c6@%3Cdev.beam.apache.org%3E
> [2] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>


-- 
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: New Edit button on beam.apache.org pages

2018-10-29 Thread Holden Karau
So awesome :)

On Mon, Oct 29, 2018, 7:50 AM Etienne Chauchot wrote:

> Cool !
> Thanks
> Etienne
> On Wednesday, October 24, 2018 at 14:24 -0700, Alan Myrvold wrote:
>
> To make small documentation changes easier, there is now an Edit button at
> the top right of the pages on https://beam.apache.org. This button opens
> the source .md file on the master branch of the beam repository in the
> GitHub web editor. After making changes, you can create a pull request to
> have them merged.
>
> Thanks to Scott for the suggestion to add this in [BEAM-4431]
> 
>
> Let me know if you run into any issues.
>
> Alan
>
>
>


Re: Live coding & reviewing adventures

2018-08-02 Thread Holden Karau
Ok, Gris's laptop has been delayed even further, so I'm going to push it out
a week and hope it shows up by then. Sorry about that one, and thanks to
everyone who tuned in for the Go SDK one :)

On Mon, Jul 30, 2018 at 1:54 PM, Holden Karau  wrote:

> So small schedule changes.
> I’ll be doing some poking at the Go SDK at 2pm today -
> https://www.youtube.com/watch?v=9UAu1DOZJhM and the one with Gris setting
> up Beam on a new machine will be moved to Friday because her laptop got
> delayed - https://www.youtube.com/watch?v=x8Wg7qCDA5k
>
> On Tue, Jul 24, 2018 at 8:41 PM Holden Karau  wrote:
>
>> I'll be doing this again this week & next looking at a few different
>> topics.
>>
>> Tomorrow (July 25th @ 10am pacific) Gris & I will be updating the PR from
>> my last live stream (adding Python dependency handling) -
>> https://www.twitch.tv/events/P92irbgYR9Sx6nMQ-lGY3g /
>> https://www.youtube.com/watch?v=4xDsY5QL2zM
>>
>> In the afternoon @ 3 pm pacific I'll be looking at the dev tools we've
>> had some discussions around with respect to reviews -
>> https://www.twitch.tv/events/vNzcZ7DdSuGFNYURW_9WEQ /
>> https://www.youtube.com/watch?v=6cTmC_fP9B0
>>
>> Next week on Thursday August 1st @ 2pm pacific Gris & I will be setting
>> up Beam on her new laptop together, so for any new users looking to see how
>> to install Beam from source this one is for you (or for devs looking to see
>> how painful setup is) - https://www.twitch.tv/events/YAYvNp3tT0COkcpNBxnp6A /
>> https://www.youtube.com/watch?v=x8Wg7qCDA5k
>>
>> P.S.
>>
>> As always I'll be doing my regular Friday code reviews in Spark -
>> https://www.youtube.com/watch?v=O4rRx-3PTiM . You can see the other ones
>> I have planned on my twitch <https://www.twitch.tv/holdenkarau> events
>> <https://www.twitch.tv/holdenkarau/events> and youtube
>> <https://www.youtube.com/user/holdenkarau>.
>>
>> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
>> wrote:
>>
>>> Hi folks! I've been doing some live coding in my other projects and I
>>> figured I'd do some with Apache Beam as well.
>>>
>>> Today @ 3pm pacific I'm going to be doing some impromptu exploration of
>>> better review tooling possibilities (looking at forking spark-pr-dashboard for
>>> other projects like beam and setting up mentionbot to work with ASF infra)
>>> - https://www.youtube.com/watch?v=ff8_jbzC8JI
>>>
>>> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
>>> trying to get easier dependency management for the Python portable runner
>>> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>>>
>>> If you're interested in seeing more of the development process I hope you
>>> will join me :)
>>>
>>> P.S.
>>>
>>> You can also follow on twitch which does a better job of notifications
>>> https://www.twitch.tv/holdenkarau
>>>
>>> Also one of the other things I do is "live reviews" of PRs, but they are
>>> generally opt-in and I don't have enough opt-ins from the Beam community to
>>> do live reviews in Beam. If you work on Beam and would be OK with me doing
>>> a live streamed review of your PRs let me know (if you're curious what
>>> they look like you can see some of them here in Spark land
>>> <https://www.youtube.com/watch?v=keDa4NPoUj0&index=15&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw>
>>> ).
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>


-- 
Twitter: https://twitter.com/holdenkarau


Re: Live coding & reviewing adventures

2018-07-30 Thread Holden Karau
So small schedule changes.
I’ll be doing some poking at the Go SDK at 2pm today -
https://www.youtube.com/watch?v=9UAu1DOZJhM and the one with Gris setting
up Beam on a new machine will be moved to Friday because her laptop got
delayed - https://www.youtube.com/watch?v=x8Wg7qCDA5k

On Tue, Jul 24, 2018 at 8:41 PM Holden Karau  wrote:

> I'll be doing this again this week & next looking at a few different
> topics.
>
> Tomorrow (July 25th @ 10am pacific) Gris & I will be updating the PR from
> my last live stream (adding Python dependency handling) -
> https://www.twitch.tv/events/P92irbgYR9Sx6nMQ-lGY3g /
> https://www.youtube.com/watch?v=4xDsY5QL2zM
>
> In the afternoon @ 3 pm pacific I'll be looking at the dev tools we've had
> some discussions around with respect to reviews -
> https://www.twitch.tv/events/vNzcZ7DdSuGFNYURW_9WEQ /
> https://www.youtube.com/watch?v=6cTmC_fP9B0
>
> Next week on Thursday August 1st @ 2pm pacific Gris & I will be setting up
> Beam on her new laptop together, so for any new users looking to see how to
> install Beam from source this one is for you (or for devs looking to see
> how painful setup is) - https://www.twitch.tv/events/YAYvNp3tT0COkcpNBxnp6A /
> https://www.youtube.com/watch?v=x8Wg7qCDA5k
>
> P.S.
>
> As always I'll be doing my regular Friday code reviews in Spark -
> https://www.youtube.com/watch?v=O4rRx-3PTiM . You can see the other ones
> I have planned on my twitch <https://www.twitch.tv/holdenkarau> events
> <https://www.twitch.tv/holdenkarau/events> and youtube
> <https://www.youtube.com/user/holdenkarau>.
>
> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
> wrote:
>
>> Hi folks! I've been doing some live coding in my other projects and I
>> figured I'd do some with Apache Beam as well.
>>
>> Today @ 3pm pacific I'm going to be doing some impromptu exploration of
>> better review tooling possibilities (looking at forking spark-pr-dashboard for
>> other projects like beam and setting up mentionbot to work with ASF infra)
>> - https://www.youtube.com/watch?v=ff8_jbzC8JI
>>
>> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
>> trying to get easier dependency management for the Python portable runner
>> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>>
>> If you're interested in seeing more of the development process I hope you
>> will join me :)
>>
>> P.S.
>>
>> You can also follow on twitch which does a better job of notifications
>> https://www.twitch.tv/holdenkarau
>>
>> Also one of the other things I do is "live reviews" of PRs, but they are
>> generally opt-in and I don't have enough opt-ins from the Beam community to
>> do live reviews in Beam. If you work on Beam and would be OK with me doing
>> a live streamed review of your PRs let me know (if you're curious what
>> they look like you can see some of them here in Spark land
>> <https://www.youtube.com/watch?v=keDa4NPoUj0&index=15&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw>
>> ).
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


Re: Live coding & reviewing adventures

2018-07-24 Thread Holden Karau
I'll be doing this again this week & next looking at a few different topics.

Tomorrow (July 25th @ 10am pacific) Gris & I will be updating the PR from
my last live stream (adding Python dependency handling) -
https://www.twitch.tv/events/P92irbgYR9Sx6nMQ-lGY3g /
https://www.youtube.com/watch?v=4xDsY5QL2zM

In the afternoon @ 3 pm pacific I'll be looking at the dev tools we've had
some discussions around with respect to reviews -
https://www.twitch.tv/events/vNzcZ7DdSuGFNYURW_9WEQ / https://www.youtube.com/watch?v=6cTmC_fP9B0

Next week on Thursday August 1st @ 2pm pacific Gris & I will be setting up
Beam on her new laptop together, so for any new users looking to see how to
install Beam from source this one is for you (or for devs looking to see
how painful setup is) - https://www.twitch.tv/events/YAYvNp3tT0COkcpNBxnp6A
/ https://www.youtube.com/watch?v=x8Wg7qCDA5k

P.S.

As always I'll be doing my regular Friday code reviews in Spark -
https://www.youtube.com/watch?v=O4rRx-3PTiM . You can see the other ones I
have planned on my twitch <https://www.twitch.tv/holdenkarau> events
<https://www.twitch.tv/holdenkarau/events> and youtube
<https://www.youtube.com/user/holdenkarau>.

On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau  wrote:

> Hi folks! I've been doing some live coding in my other projects and I
> figured I'd do some with Apache Beam as well.
>
> Today @ 3pm pacific I'm going to be doing some impromptu exploration of
> better review tooling possibilities (looking at forking spark-pr-dashboard for
> other projects like beam and setting up mentionbot to work with ASF infra)
> - https://www.youtube.com/watch?v=ff8_jbzC8JI
>
> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
> trying to get easier dependency management for the Python portable runner
> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>
> If you're interested in seeing more of the development process I hope you
> will join me :)
>
> P.S.
>
> You can also follow on twitch which does a better job of notifications
> https://www.twitch.tv/holdenkarau
>
> Also one of the other things I do is "live reviews" of PRs, but they are
> generally opt-in and I don't have enough opt-ins from the Beam community to
> do live reviews in Beam. If you work on Beam and would be OK with me doing
> a live streamed review of your PRs let me know (if you're curious what
> they look like you can see some of them here in Spark land
> <https://www.youtube.com/watch?v=keDa4NPoUj0&index=15&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw>
> ).
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: Proof-of-concept Beam PR dashboard (based off of Spark's PR dashboard) to improve discoverability

2018-07-24 Thread Holden Karau
That one's probably going to be more work, but the code is open and I'd be
happy to help where I can.

On Tue, Jul 24, 2018, 12:58 PM Huygaa Batsaikhan  wrote:

> This is great. From the previous thread
> <https://lists.apache.org/thread.html/6138d08c551e254b5f13b26c6ba06579a49a4694f4d13ad6d164689a@%3Cdev.beam.apache.org%3E>,
> "whose turn" feature was a popular request for the dashboard because it is
> hard to know whose attention is needed at any moment.
> How much effort is needed to implement such a feature on top of the
> dashboard?
>
> On Fri, Jul 13, 2018 at 5:56 PM Holden Karau  wrote:
>
>> Took me waaay longer than planned, and the regexes and components could
>> use some work, but I've got a quick Beam PR dashboard up at
>> https://boos-demo-projects-are-rad.appspot.com/. The code is a fork of
>> the Spark one, and it's at
>> https://github.com/holdenk/spark-pr-dashboard/tree/support-beam in the
>> beam support branch. I don't know how useful this will be for folks, but given
>> the discussion going on around CODEOWNERS I figured people were feeling the
>> pain of trying to keep on top of reviews.
>>
>> I'm still working on trying to get mentionbot working (it's being a bit
>> frustrating to upgrade to recent versions of dependencies as a non-JS
>> programmer), but hopefully I can do something there too.
>>
>> If anyone has thoughts about what good tags would be for the review
>> dashboard let me know, I just kicked it off with some tabs which I
>> personally care about.
>>
>> Twitter: https://twitter.com/holdenkarau
>>
>


Re: Live coding & reviewing adventures

2018-07-18 Thread Holden Karau
Ok, so as a follow-up I'll be doing part 2 at noon pacific today -
https://www.youtube.com/watch?v=6krU3YWsgYQ . If you're @ OSCON come see the
talk w/ the demo (and other things) at 2:30 pm pacific in Portland 251 -
https://conferences.oreilly.com/oscon/oscon-or/public/schedule/speaker/128567

As for the venv reqs:

absl-py==0.2.2
apache-beam==2.6.0.dev0
astor==0.7.1
avro==1.8.2
backports-abc==0.5
backports.shutil-get-terminal-size==1.0.0
backports.weakref==1.0.post1
bleach==2.1.3
cachetools==2.1.0
certifi==2018.4.16
chardet==3.0.4
-e git+https://github.com/holdenk/model-analysis.git@2cee83428d4db58fbba987f3e268114f3ac0e694#egg=chicago_taxi_setup&subdirectory=examples/chicago_taxi
configparser==3.5.0
crcmod==1.7
decorator==4.3.0
dill==0.2.6
docopt==0.6.2
entrypoints==0.2.3
enum34==1.1.6
fastavro==0.19.7
fasteners==0.14.1
funcsigs==1.0.2
functools32==3.2.3.post2
future==0.16.0
futures==3.2.0
gapic-google-cloud-pubsub-v1==0.15.4
gast==0.2.0
google-api-core==1.2.1
google-apitools==0.5.20
google-auth==1.5.0
google-auth-httplib2==0.0.3
google-cloud-bigquery==0.25.0
google-cloud-core==0.25.0
google-cloud-pubsub==0.26.0
google-gax==0.15.16
google-resumable-media==0.3.1
googleapis-common-protos==1.5.3
googledatastore==7.0.1
grpc-google-iam-v1==0.11.1
grpcio==1.13.0
hdfs==2.1.0
html5lib==0.999
httplib2==0.9.2
idna==2.7
ipykernel==4.8.2
ipython==5.7.0
ipython-genutils==0.2.0
ipywidgets==7.2.1
Jinja2==2.10
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.3
jupyter-console==5.2.0
jupyter-core==4.4.0
Markdown==2.6.11
MarkupSafe==1.0
mistune==0.8.3
mock==2.0.0
monotonic==1.5
nbconvert==5.3.1
nbformat==4.4.0
notebook==5.5.0
numpy==1.14.5
oauth2client==4.1.2
pandocfilters==1.4.2
pathlib2==2.3.2
pbr==4.1.0
pexpect==4.6.0
pickleshare==0.7.4
pkg-resources==0.0.0
ply==3.8
prompt-toolkit==1.0.15
proto-google-cloud-datastore-v1==0.90.4
proto-google-cloud-pubsub-v1==0.15.4
protobuf==3.6.0
ptyprocess==0.6.0
pyasn1==0.4.3
pyasn1-modules==0.2.2
Pygments==2.2.0
python-dateutil==2.7.3
pytz==2018.4
PyVCF==0.6.8
PyYAML==3.13
pyzmq==17.0.0
qtconsole==4.3.1
requests==2.19.1
rsa==3.4.2
scandir==1.7
Send2Trash==1.5.0
simplegeneric==0.8.1
singledispatch==3.4.0.3
six==1.11.0
tensorboard==1.6.0
tensorflow==1.6.0
tensorflow-model-analysis==0.6.0
tensorflow-serving-api==1.6.0
tensorflow-transform==0.6.0
termcolor==1.1.0
terminado==0.8.1
testpath==0.3.1
tornado==5.0.2
traitlets==4.3.2
typing==3.6.4
urllib3==1.23
wcwidth==0.1.7
Werkzeug==0.14.1
widgetsnbextension==3.2.1
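
A minimal sketch for reproducing a similar environment, assuming the list
above is saved as requirements.txt (the virtualenv name is arbitrary):

virtualenv -p python2.7 beam-venv
. beam-venv/bin/activate
pip install -r requirements.txt

Pinned versions like these drift quickly, so expect to adjust them against
whatever Beam snapshot you are building.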



On Wed, Jul 18, 2018 at 8:19 AM, Holden Karau  wrote:

> That’s a thing I’ve been thinking about but haven’t had the time to do
> yet. It’s a bit tricky because I don’t always know what I’m doing before I
> start and remembering to go back and tag things after a long stream is hard.
>
> On Tue, Jul 17, 2018 at 11:11 PM Ismaël Mejía  wrote:
>
>> Have you thought about creating some sort of index page for your past
>> live streams?
>> At least for the non-review ones it can provide great value given that
>> searching videos is not the easiest thing to do.
>> On Wed, Jul 18, 2018 at 12:51 AM Holden Karau 
>> wrote:
>> >
>> > Sure! I’ll respond with a pip freeze when I land.
>
> also, oops, forgot to do that, I'll do it today.
>
>>
>> >
>> > On Tue, Jul 17, 2018 at 2:28 PM Suneel Marthi 
>> wrote:
>> >>
>> >> Could u publish the python transitive deps some place that has the
>> Beam-Flink runner working?
>> >>
>> >> On Tue, Jul 17, 2018 at 5:26 PM, Holden Karau 
>> wrote:
>> >>>
>> >>> And I've got an hour to kill @ SFO today so at some of the
>> suggestions from folks I'm going to do a more user-focused one, trying
>> to get the TFT demo to work with the portable Flink runner (hopefully) -
>> https://www.youtube.com/watch?v=wL9mvQeN36E
>> >>>
>> >>> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
>> wrote:
>> >>>>
>> >>>> Hi folks! I've been doing some live coding in my other projects and
>> I figured I'd do some with Apache Beam as well.
>> >>>>
>> >>>> Today @ 3pm pacific I'm going to be doing some impromptu exploration
>> of better review tooling possibilities (looking at forking spark-pr-dashboard
>> for other projects like beam and setting up mentionbot to work with ASF
>> infra) - https://www.youtube.com/watch?v=ff8_jbzC8JI
>> >>>>
>> >>>> Next week (Thursday the 19th at 2pm pacific) I'm going to be working
>> on trying to get easier dependency management for the Python portable
>> runner in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>> >>>>
>> >>>> If you're interested in seeing more of the development process I hope
>> you will join me :)

Re: Live coding & reviewing adventures

2018-07-18 Thread Holden Karau
That’s a thing I’ve been thinking about but haven’t had the time to do yet.
It’s a bit tricky because I don’t always know what I’m doing before I start
and remembering to go back and tag things after a long stream is hard.

On Tue, Jul 17, 2018 at 11:11 PM Ismaël Mejía  wrote:

> Have you thought about creating some sort of index page for your past
> live streams?
> At least for the non-review ones it can provide great value given that
> searching videos is not the easiest thing to do.
> On Wed, Jul 18, 2018 at 12:51 AM Holden Karau 
> wrote:
> >
> > Sure! I’ll respond with a pip freeze when I land.

also, oops, forgot to do that, I'll do it today.

>
> >
> > On Tue, Jul 17, 2018 at 2:28 PM Suneel Marthi 
> wrote:
> >>
> >> Could u publish the python transitive deps some place that has the
> Beam-Flink runner working?
> >>
> >> On Tue, Jul 17, 2018 at 5:26 PM, Holden Karau 
> wrote:
> >>>
> >>> And I've got an hour to kill @ SFO today so at some of the suggestions
> from folks I'm going to do a more user-focused one, trying to get the TFT
> demo to work with the portable Flink runner (hopefully) -
> https://www.youtube.com/watch?v=wL9mvQeN36E
> >>>
> >>> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
> wrote:
> >>>>
> >>>> Hi folks! I've been doing some live coding in my other projects and I
> figured I'd do some with Apache Beam as well.
> >>>>
> >>>> Today @ 3pm pacific I'm going to be doing some impromptu exploration
> of better review tooling possibilities (looking at forking spark-pr-dashboard
> for other projects like beam and setting up mentionbot to work with ASF
> infra) - https://www.youtube.com/watch?v=ff8_jbzC8JI
> >>>>
> >>>> Next week (Thursday the 19th at 2pm pacific) I'm going to be working
> on trying to get easier dependency management for the Python portable
> runner in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
> >>>>
> >>>> If you're interested in seeing more of the development process I hope
> you will join me :)
> >>>>
> >>>> P.S.
> >>>>
> >>>> You can also follow on twitch which does a better job of
> notifications https://www.twitch.tv/holdenkarau
> >>>>
> >>>> Also one of the other things I do is "live reviews" of PRs, but they
> are generally opt-in and I don't have enough opt-ins from the Beam
> community to do live reviews in Beam. If you work on Beam and would be OK
> with me doing a live streamed review of your PRs let me know (if you're
> curious what they look like you can see some of them here in Spark land).
> >>>>
> >>>> --
> >>>> Twitter: https://twitter.com/holdenkarau
> >>>
> >>>
> >>>
> >>>
> >>> --
> >>> Twitter: https://twitter.com/holdenkarau
> >>
> >>
> > --
> > Twitter: https://twitter.com/holdenkarau
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: Live coding & reviewing adventures

2018-07-17 Thread Holden Karau
Sure! I’ll respond with a pip freeze when I land.

On Tue, Jul 17, 2018 at 2:28 PM Suneel Marthi  wrote:

> Could u publish the python transitive deps some place that has the
> Beam-Flink runner working?
>
> On Tue, Jul 17, 2018 at 5:26 PM, Holden Karau 
> wrote:
>
>> And I've got an hour to kill @ SFO today so at some of the suggestions
>> from folks I'm going to do a more user-focused one, trying to get the TFT
>> demo to work with the portable Flink runner (hopefully) -
>> https://www.youtube.com/watch?v=wL9mvQeN36E
>>
>> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
>> wrote:
>>
>>> Hi folks! I've been doing some live coding in my other projects and I
>>> figured I'd do some with Apache Beam as well.
>>>
>>> Today @ 3pm pacific I'm going to be doing some impromptu exploration of
>>> better review tooling possibilities (looking at forking spark-pr-dashboard for
>>> other projects like beam and setting up mentionbot to work with ASF infra)
>>> - https://www.youtube.com/watch?v=ff8_jbzC8JI
>>>
>>> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
>>> trying to get easier dependency management for the Python portable runner
>>> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>>>
>>> If you're interested in seeing more of the development process I hope you
>>> will join me :)
>>>
>>> P.S.
>>>
>>> You can also follow on twitch which does a better job of notifications
>>> https://www.twitch.tv/holdenkarau
>>>
>>> Also one of the other things I do is "live reviews" of PRs, but they are
>>> generally opt-in and I don't have enough opt-ins from the Beam community to
>>> do live reviews in Beam. If you work on Beam and would be OK with me doing
>>> a live streamed review of your PRs let me know (if you're curious what
>>> they look like you can see some of them here in Spark land
>>> <https://www.youtube.com/watch?v=keDa4NPoUj0&index=15&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw>
>>> ).
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
> --
Twitter: https://twitter.com/holdenkarau


Re: Live coding & reviewing adventures

2018-07-17 Thread Holden Karau
Yup, I've got a fork of that (I skip re-uploading the input data files since
conference WiFi is slow) and store to GCS so I can see the output.

Didn't finish everything in today's stream because the WiFi was a bit
flaky during the first container build and stalled apt, but I'll do a
follow-up to finish it up.
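
A rough sketch of that idea, for anyone following along; the job endpoint,
bucket, and transforms below are illustrative placeholders, not the exact
pipeline from the stream:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical values: point these at your own Flink job server and bucket.
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.Create(["hello", "beam"])
     | beam.Map(lambda s: s.upper())
     # Writing to GCS rather than a local path keeps the output visible
     # outside the Docker container the SDK worker runs in.
     | beam.io.WriteToText("gs://some-bucket/outputs/demo"))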

On Tue, Jul 17, 2018 at 2:34 PM Ankur Goenka  wrote:

> +1
> For reference here is a sample job
> https://github.com/axelmagn/model-analysis/blob/axelmagn-hacks/examples/chicago_taxi/preprocess_flink.sh
>
> Also one quick heads-up: the output file will be created in the Docker
> container if you use the local file system.
>
>
>
>
> On Tue, Jul 17, 2018 at 2:27 PM Holden Karau  wrote:
>
>> And I've got an hour to kill @ SFO today so at some of the suggestions
>> from folks I'm going to do a more user-focused one, trying to get the TFT
>> demo to work with the portable Flink runner (hopefully) -
>> https://www.youtube.com/watch?v=wL9mvQeN36E
>>
>> On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau 
>> wrote:
>>
>>> Hi folks! I've been doing some live coding in my other projects and I
>>> figured I'd do some with Apache Beam as well.
>>>
>>> Today @ 3pm pacific I'm going to be doing some impromptu exploration of
>>> better review tooling possibilities (looking at forking spark-pr-dashboard for
>>> other projects like beam and setting up mentionbot to work with ASF infra)
>>> - https://www.youtube.com/watch?v=ff8_jbzC8JI
>>>
>>> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
>>> trying to get easier dependency management for the Python portable runner
>>> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>>>
>>> If you're interested in seeing more of the development process I hope you
>>> will join me :)
>>>
>>> P.S.
>>>
>>> You can also follow on twitch which does a better job of notifications
>>> https://www.twitch.tv/holdenkarau
>>>
>>> Also one of the other things I do is "live reviews" of PRs, but they are
>>> generally opt-in and I don't have enough opt-ins from the Beam community to
>>> do live reviews in Beam. If you work on Beam and would be OK with me doing
>>> a live streamed review of your PRs let me know (if you're curious what
>>> they look like you can see some of them here in Spark land
>>> <https://www.youtube.com/watch?v=keDa4NPoUj0&index=15&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw>
>>> ).
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
> --
Twitter: https://twitter.com/holdenkarau


Re: Live coding & reviewing adventures

2018-07-17 Thread Holden Karau
And I've got an hour to kill @ SFO today so at some of the suggestions from
folks I'm going to do a more user-focused one, trying to get the TFT demo
to work with the portable Flink runner (hopefully) -
https://www.youtube.com/watch?v=wL9mvQeN36E

On Fri, Jul 13, 2018 at 11:54 AM, Holden Karau  wrote:

> Hi folks! I've been doing some live coding in my other projects and I
> figured I'd do some with Apache Beam as well.
>
> Today @ 3pm pacific I'm going to be doing some impromptu exploration of
> better review tooling possibilities (looking at forking spark-pr-dashboard for
> other projects like beam and setting up mentionbot to work with ASF infra)
> - https://www.youtube.com/watch?v=ff8_jbzC8JI
>
> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
> trying to get easier dependency management for the Python portable runner
> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>
> If you're interested in seeing more of the development process I hope you
> will join me :)
>
> P.S.
>
> You can also follow on twitch which does a better job of notifications
> https://www.twitch.tv/holdenkarau
>
> Also one of the other things I do is "live reviews" of PRs, but they are
> generally opt-in and I don't have enough opt-ins from the Beam community to
> do live reviews in Beam. If you work on Beam and would be OK with me doing
> a live streamed review of your PRs let me know (if you're curious what
> they look like you can see some of them here in Spark land
> <https://www.youtube.com/watch?v=keDa4NPoUj0&index=15&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw>
> ).
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: CODEOWNERS for apache/beam repo

2018-07-17 Thread Holden Karau
So it doesn’t support doing that right now, although if we find it’s a
problem we can specify an exclude file with folks who haven’t contributed
in the past year. Would people want me to generate that first?

On Tue, Jul 17, 2018 at 10:22 AM Ismaël Mejía  wrote:

> Is there a way to mark inactive people as non-reviewers for the blame
> case? I think it can be useful considering that a good amount of our
> committers are not active at the moment, and auto-assigning reviews to
> them seems like a waste of energy/time.
> On Tue, Jul 17, 2018 at 1:58 AM Eugene Kirpichov 
> wrote:
> >
> > We did not, but I think we should. So far, in 100% of the PRs I've
> authored, the default functionality of CODEOWNERS did the wrong thing and I
> had to fix something up manually.
> >
> > On Mon, Jul 16, 2018 at 3:42 PM Andrew Pilloud 
> wrote:
> >>
> >> This sounds like a good plan. Did we want to rename the CODEOWNERS file
> to disable github's mass adding of reviewers while we figure this out?
> >>
> >> Andrew
> >>
> >> On Mon, Jul 16, 2018 at 10:20 AM Jean-Baptiste Onofré 
> wrote:
> >>>
> >>> +1
> >>>
> >>> On 16 Jul 2018, at 19:17, Holden Karau
> wrote:
> >>>>
> >>>> Ok if no one objects I'll create the INFRA ticket after OSCON and we
> can test it for a week and decide if it helps or hinders.
> >>>>
> >>>> On Mon, Jul 16, 2018, 7:12 PM Jean-Baptiste Onofré < j...@nanthrax.net>
> wrote:
> >>>>>
> >>>>> Agree to test it for a week.
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>> On 16 Jul 2018, at 18:59, Holden Karau < holden.ka...@gmail.com>
> wrote:
> >>>>>>
> >>>>>> Would folks be OK with me asking infra to turn on blame based
> suggestions for Beam and trying it out for a week?
> >>>>>>
> >>>>>> On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez <
> rfern...@google.com> wrote:
> >>>>>>>
> >>>>>>> +1 using blame -- nifty :)
> >>>>>>>
> >>>>>>> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan <
> bat...@google.com> wrote:
> >>>>>>>>
> >>>>>>>> +1. This is great.
> >>>>>>>>
> >>>>>>>> On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri < eh...@google.com>
> wrote:
> >>>>>>>>>
> >>>>>>>>> Mention bot looks cool, as it tries to guess the reviewer using
> blame.
> >>>>>>>>> I've written a quick and dirty script that uses only CODEOWNERS.
> >>>>>>>>>
> >>>>>>>>> Its output looks like:
> >>>>>>>>> $ python suggest_reviewers.py --pr 5940
> >>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
> (path_pattern: /runners/core-construction-java*)
> >>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
> (path_pattern: /runners/core-construction-java*)
> >>>>>>>>> INFO:root:Selected reviewer @echauchot for:
> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
> (path_pattern: /runners/core-java*)
> >>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
> /runners/flink/build.gradle (path_pattern: */build.gradle*)
> >>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
> (path_pattern: *.java)
> >>>>>>>>> INFO:root:Selected reviewer @pabloem for:
> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
> (path_pattern: /runners/google-cloud-dataflow-java*)
> >>>>>>>>> INFO:root:Selected reviewer @lukecwik for:
> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
> (path_pattern: /sdks/java/core*)
> >>>>>>>>> Suggested reviewers: @echauchot, @lukecwik, @pabloem
> >>>>>>>>>
>>>>>>>>> Script is in: https://github.com/apache/beam/pull/5951

Re: CODEOWNERS for apache/beam repo

2018-07-16 Thread Holden Karau
Ok if no one objects I'll create the INFRA ticket after OSCON and we can
test it for a week and decide if it helps or hinders.

On Mon, Jul 16, 2018, 7:12 PM Jean-Baptiste Onofré  wrote:

> Agree to test it for a week.
>
> Regards
> JB
>> On 16 Jul 2018, at 18:59, Holden Karau wrote:
>>
>> Would folks be OK with me asking infra to turn on blame based suggestions
>> for Beam and trying it out for a week?
>>
>> On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez < rfern...@google.com>
>> wrote:
>>
>>> +1 using blame -- nifty :)
>>>
>>> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan < bat...@google.com>
>>> wrote:
>>>
>>>> +1. This is great.
>>>>
>>>> On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri < eh...@google.com> wrote:
>>>>
>>>>> Mention bot looks cool, as it tries to guess the reviewer using blame.
>>>>> I've written a quick and dirty script that uses only CODEOWNERS.
>>>>>
>>>>> Its output looks like:
>>>>> $ python suggest_reviewers.py --pr 5940
>>>>> INFO:root:Selected reviewer @lukecwik for:
>>>>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
>>>>> (path_pattern: /runners/core-construction-java*)
>>>>> INFO:root:Selected reviewer @lukecwik for:
>>>>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
>>>>> (path_pattern: /runners/core-construction-java*)
>>>>> INFO:root:Selected reviewer @echauchot for:
>>>>> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
>>>>> (path_pattern: /runners/core-java*)
>>>>> INFO:root:Selected reviewer @lukecwik for: /runners/flink/build.gradle
>>>>> (path_pattern: */build.gradle*)
>>>>> INFO:root:Selected reviewer @lukecwik for:
>>>>> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
>>>>> (path_pattern: *.java)
>>>>> INFO:root:Selected reviewer @pabloem for:
>>>>> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
>>>>> (path_pattern: /runners/google-cloud-dataflow-java*)
>>>>> INFO:root:Selected reviewer @lukecwik for:
>>>>> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
>>>>> (path_pattern: /sdks/java/core*)
>>>>> Suggested reviewers: @echauchot, @lukecwik, @pabloem
>>>>>
>>>>> Script is in: https://github.com/apache/beam/pull/5951
>>>>>
>>>>>
>>>>> What does the community think? Do you prefer blame-based or
>>>>> rules-based reviewer suggestions?
>>>>>
>>>>> On Fri, Jul 13, 2018 at 11:13 AM Holden Karau < hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> I'm looking at something similar in the Spark project, and while it's
>>>>>> now archived by FB it seems like something like
>>>>>> https://github.com/facebookarchive/mention-bot might do what we
>>>>>> want. I'm going to spin up a version on my K8 cluster and see if I can 
>>>>>> ask
>>>>>> infra to add a webhook and if it works for Spark we could ask INFRA to 
>>>>>> add
>>>>>> a second webhook for Beam. (Or if the Beam folks are more interested in
>>>>>> experimenting I can do Beam first as a smaller project and roll with 
>>>>>> that).
>>>>>>
>>>>>> Let me know :)
>>>>>>
>>>>>> On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov <
>>>>>> kirpic...@google.com> wrote:
>>>>>>
>>>>>>> Sounds reasonable for now, thanks!
>>>>>>> It's unfortunate that Github's CODEOWNERS feature appears to be
>>>>>>> effectively unusable for Beam but I'd hope that Github might pay 
>>>>>>> attention
>>>>>>> and fix things if we submit feedback, with us being one of the most 
>>>>>>> active
>>>>>>> Apache projects - did anyone do this yet / planning to?
>>>>>>>
>>>>>>> On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri < eh...@google.com>

Re: CODEOWNERS for apache/beam repo

2018-07-16 Thread Holden Karau
Would folks be OK with me asking infra to turn on blame based suggestions
for Beam and trying it out for a week?

On Mon, Jul 16, 2018, 6:53 PM Rafael Fernandez  wrote:

> +1 using blame -- nifty :)
>
> On Mon, Jul 16, 2018 at 2:31 AM Huygaa Batsaikhan 
> wrote:
>
>> +1. This is great.
>>
>> On Sat, Jul 14, 2018 at 7:44 AM Udi Meiri  wrote:
>>
>>> Mention bot looks cool, as it tries to guess the reviewer using blame.
>>> I've written a quick and dirty script that uses only CODEOWNERS.
>>>
>>> Its output looks like:
>>> $ python suggest_reviewers.py --pr 5940
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/PTransformMatchers.java
>>> (path_pattern: /runners/core-construction-java*)
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/SplittableParDoNaiveBounded.java
>>> (path_pattern: /runners/core-construction-java*)
>>> INFO:root:Selected reviewer @echauchot for:
>>> /runners/core-java/src/main/java/org/apache/beam/runners/core/SplittableParDoViaKeyedWorkItems.java
>>> (path_pattern: /runners/core-java*)
>>> INFO:root:Selected reviewer @lukecwik for: /runners/flink/build.gradle
>>> (path_pattern: */build.gradle*)
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkTransformOverrides.java
>>> (path_pattern: *.java)
>>> INFO:root:Selected reviewer @pabloem for:
>>> /runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java
>>> (path_pattern: /runners/google-cloud-dataflow-java*)
>>> INFO:root:Selected reviewer @lukecwik for:
>>> /sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/SplittableDoFnTest.java
>>> (path_pattern: /sdks/java/core*)
>>> Suggested reviewers: @echauchot, @lukecwik, @pabloem
>>>
>>> Script is in: https://github.com/apache/beam/pull/5951
>>>
>>>
>>> What does the community think? Do you prefer blame-based or rules-based
>>> reviewer suggestions?
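
A minimal, hypothetical sketch of the rules-based approach, not the actual
script from the PR above; it treats each CODEOWNERS line as a glob pattern
followed by owners, with the last matching pattern winning, mirroring
GitHub's CODEOWNERS semantics:

import fnmatch

def suggest_reviewers(codeowners_lines, changed_files):
    # Parse "pattern owner..." lines, skipping blanks and comments.
    rules = []
    for line in codeowners_lines:
        parts = line.split()
        if len(parts) >= 2 and not parts[0].startswith("#"):
            rules.append((parts[0], parts[1:]))
    suggested = set()
    for path in changed_files:
        owners = []
        for pattern, rule_owners in rules:
            # Last matching pattern wins, so more specific rules go later.
            if fnmatch.fnmatch(path, pattern):
                owners = rule_owners
        suggested.update(owners)
    return sorted(suggested)

print(suggest_reviewers(
    ["*.java @lukecwik", "/runners/core-java* @echauchot"],
    ["/runners/core-java/src/main/java/org/apache/beam/runners/core/Foo.java"]))
# -> ['@echauchot']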
>>>
>>> On Fri, Jul 13, 2018 at 11:13 AM Holden Karau 
>>> wrote:
>>>
>>>> I'm looking at something similar in the Spark project, and while it's
>>>> now archived by FB it seems like something like
>>>> https://github.com/facebookarchive/mention-bot might do what we want.
>>>> I'm going to spin up a version on my K8 cluster and see if I can ask infra
>>>> to add a webhook and if it works for Spark we could ask INFRA to add a
>>>> second webhook for Beam. (Or if the Beam folks are more interested in
>>>> experimenting I can do Beam first as a smaller project and roll with that).
>>>>
>>>> Let me know :)
>>>>
>>>> On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov <
>>>> kirpic...@google.com> wrote:
>>>>
>>>>> Sounds reasonable for now, thanks!
>>>>> It's unfortunate that Github's CODEOWNERS feature appears to be
>>>>> effectively unusable for Beam but I'd hope that Github might pay attention
>>>>> and fix things if we submit feedback, with us being one of the most active
>>>>> Apache projects - did anyone do this yet / planning to?
>>>>>
>>>>> On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri  wrote:
>>>>>
>>>>>> While I like the idea of having a CODEOWNERS file, the Github
>>>>>> implementation is lacking:
>>>>>> 1. Reviewers are automatically assigned at each push.
>>>>>> 2. Reviewer assignment can be excessive (e.g. 5 reviewers in Eugene's
>>>>>> PR 5940).
>>>>>> 3. Non-committers aren't assigned as reviewers.
>>>>>> 4. Non-committers can't change the list of reviewers.
>>>>>>
>>>>>> I propose renaming the file to disable the auto-reviewer assignment
>>>>>> feature.
>>>>>> In its place I'll add a script that suggests reviewers.
>>>>>>
>>>>>> On Fri, Jul 13, 2018 at 9:09 AM Udi Meiri  wrote:
>>>>>>
>>>>>>> Hi Etienne,
>>>>>>>
>>>>>>> Yes you could be as precise as you want. The paths I listed are just
>>>>>>> suggestions. :)
>>>>>>>
>>>>>>>

Proof-of-concept Beam PR dashboard (based off of Spark's PR dashboard) to improve discoverability

2018-07-13 Thread Holden Karau
Took me waaay longer than planned, and the regexes and components could use
some work, but I've got a quick Beam PR dashboard up at
https://boos-demo-projects-are-rad.appspot.com/. The code is a fork of the
Spark one, and it's at
https://github.com/holdenk/spark-pr-dashboard/tree/support-beam in the beam
support branch. I don't know how useful this will be for folks, but given the
discussion going on around CODEOWNERS I figured people were feeling the
pain of trying to keep on top of reviews.

I'm still working on trying to get mentionbot working (it's being a bit
frustrating to upgrade to recent versions of dependencies as a non-JS
programmer), but hopefully I can do something there too.

If anyone has thoughts about what good tags would be for the review
dashboard let me know, I just kicked it off with some tabs which I
personally care about.

Twitter: https://twitter.com/holdenkarau


Re: Live coding & reviewing adventures

2018-07-13 Thread Holden Karau
That's a great idea! I did something like that earlier focused on the Go SDK
only (see - https://www.youtube.com/watch?v=g0Iq4np-WVk &
https://www.youtube.com/watch?v=P4jIfhPTKQo ), and I'll try and do some
more general ones later on as we get more stuff working on the portability
framework with Flink :)

On Fri, Jul 13, 2018 at 12:39 PM, Ankur Goenka  wrote:

> Thanks Holden for doing this.
> Looking forward to attend the live session.
> Suggestion: It will be super useful to do a live session for beam setup for
> users and another for contributors.
>
>
> On Fri, Jul 13, 2018 at 12:33 PM Innocent Djiofack 
> wrote:
>
>> Thanks I think this will be super useful. I will tune in.
>>
>> On Fri, Jul 13, 2018 at 2:54 PM Holden Karau 
>> wrote:
>>
>>> Hi folks! I've been doing some live coding in my other projects and I
>>> figured I'd do some with Apache Beam as well.
>>>
>>> Today @ 3pm pacific I'm going to be doing some impromptu exploration of
>>> better review tooling possibilities (looking at forking spark-pr-dashboard for
>>> other projects like beam and setting up mentionbot to work with ASF infra)
>>> - https://www.youtube.com/watch?v=ff8_jbzC8JI
>>>
>>> Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
>>> trying to get easier dependency management for the Python portable runner
>>> in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA
>>>
>>> If you're interested in seeing more of the development process I hope you
>>> will join me :)
>>>
>>> P.S.
>>>
>>> You can also follow on twitch which does a better job of notifications
>>> https://www.twitch.tv/holdenkarau
>>>
>>> Also one of the other things I do is "live reviews" of PRs, but they are
>>> generally opt-in and I don't have enough opt-ins from the Beam community to
>>> do live reviews in Beam. If you work on Beam and would be OK with me doing
>>> a live streamed review of your PRs let me know (if you're curious what
>>> they look like you can see some of them here in Spark land
>>> <https://www.youtube.com/watch?v=keDa4NPoUj0&index=15&list=PLRLebp9QyZtYF46jlSnIu2x1NDBkKa2uw>
>>> ).
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>> --
>>
>> *DJIOFACK INNOCENT*
>> *"Be better than the day before!" -*
>> *+1 404 751 8024*
>>
>


-- 
Twitter: https://twitter.com/holdenkarau


Live coding & reviewing adventures

2018-07-13 Thread Holden Karau
Hi folks! I've been doing some live coding in my other projects and I
figured I'd do some with Apache Beam as well.

Today @ 3pm pacific I'm going to be doing some impromptu exploration of
better review tooling possibilities (looking at forking spark-pr-dashboard for
other projects like beam and setting up mentionbot to work with ASF infra)
- https://www.youtube.com/watch?v=ff8_jbzC8JI

Next week (Thursday the 19th at 2pm pacific) I'm going to be working on
trying to get easier dependency management for the Python portable runner
in place - https://www.youtube.com/watch?v=Sv0XhS2pYqA

If you're interested in seeing more of the development process I hope you
will join me :)

P.S.

You can also follow on twitch which does a better job of notifications
https://www.twitch.tv/holdenkarau

Also one of the other things I do is "live reviews" of PRs, but they are
generally opt-in and I don't have enough opt-ins from the Beam community to
do live reviews in Beam. If you work on Beam and would be OK with me doing
a live streamed review of your PRs let me know (if you're curious what
they look like you can see some of them here in Spark land).

-- 
Twitter: https://twitter.com/holdenkarau


Re: CODEOWNERS for apache/beam repo

2018-07-13 Thread Holden Karau
I'm looking at something similar in the Spark project, and while it's now
archived by FB it seems like something like
https://github.com/facebookarchive/mention-bot might do what we want. I'm
going to spin up a version on my K8 cluster and see if I can ask infra to
add a webhook and if it works for Spark we could ask INFRA to add a second
webhook for Beam. (Or if the Beam folks are more interested in
experimenting I can do Beam first as a smaller project and roll with that).

Let me know :)

On Fri, Jul 13, 2018 at 10:53 AM, Eugene Kirpichov 
wrote:

> Sounds reasonable for now, thanks!
> It's unfortunate that Github's CODEOWNERS feature appears to be
> effectively unusable for Beam but I'd hope that Github might pay attention
> and fix things if we submit feedback, with us being one of the most active
> Apache projects - did anyone do this yet / planning to?
>
> On Fri, Jul 13, 2018 at 10:23 AM Udi Meiri  wrote:
>
>> While I like the idea of having a CODEOWNERS file, the Github
>> implementation is lacking:
>> 1. Reviewers are automatically assigned at each push.
>> 2. Reviewer assignment can be excessive (e.g. 5 reviewers in Eugene's PR
>> 5940).
>> 3. Non-committers aren't assigned as reviewers.
>> 4. Non-committers can't change the list of reviewers.
>>
>> I propose renaming the file to disable the auto-reviewer assignment
>> feature.
>> In its place I'll add a script that suggests reviewers.
>>
>> On Fri, Jul 13, 2018 at 9:09 AM Udi Meiri  wrote:
>>
>>> Hi Etienne,
>>>
>>> Yes you could be as precise as you want. The paths I listed are just
>>> suggestions. :)
>>>
>>>
>>> On Fri, Jul 13, 2018 at 1:12 AM Jean-Baptiste Onofré 
>>> wrote:
>>>
 Hi,

 I think it's already do-able just providing the expected path.

 It's a good idea especially for the core.

 Regards
 JB

 On 13/07/2018 09:51, Etienne Chauchot wrote:
 > Hi Udi,
 >
 > I also have a question, related to what Eugene asked : I see that the
 > code paths are the ones of the modules. Can we be more precise than
 that
 > to assign reviewers ? As an example, I added myself to runner/core
 > because I wanted to take a look at the PRs related to
 > runner/core/metrics but I'm getting assigned to all runner-core PRs.
 Can
 > we specify paths like
 > runners/core-java/src/main/java/org/apache/beam/runners/core/metrics
 ?
 > I know it is a bit too precise so a bit risky, but in that particular
 > case, I doubt that the path will change.
 >
 > Etienne
 >
 > Le jeudi 12 juillet 2018 à 16:49 -0700, Eugene Kirpichov a écrit :
 >> Hi Udi,
 >>
 >> I see that the PR was merged - thanks! However it seems to have some
 >> unintended effects.
 >>
 >> On my PR https://github.com/apache/beam/pull/5940 , I assigned a
 >> reviewer manually, but the moment I pushed a new commit, it
 >> auto-assigned a lot of other people to it, and I had to remove them.
 >> This seems like a big inconvenience to me, is there a way to disable
 this?
 >>
 >> Thanks.
 >>
 >> On Thu, Jul 12, 2018 at 2:53 PM Udi Meiri >>> >> > wrote:
 >>> :/ That makes it a little less useful.
 >>>
 >>> On Thu, Jul 12, 2018 at 11:14 AM Tim Robertson
 >>> mailto:timrobertson...@gmail.com>>
 wrote:
  Hi Udi
 
  I asked the GH helpdesk and they confirmed that only people with
  write access will actually be automatically chosen.
 
 I don't expect it should stop us using it, but we should be aware
  that there are non-committers also willing to review.
 
  Thanks,
  Tim
 
  On Thu, Jul 12, 2018 at 7:24 PM, Mikhail Gryzykhin
  mailto:mig...@google.com>> wrote:
 > Idea looks good in general.
 >
 > Did you look into ways to keep this file up-to-date? For example
 we
 > can run monthly job to see if owner was active during this period.
 >
 > --Mikhail
 >
 > Have feedback ?
 >
 >
 > On Thu, Jul 12, 2018 at 9:56 AM Udi Meiri >>> > > wrote:
 >> Thanks all!
 >> I'll try to get the file merged today and see how it works out.
 >> Please surface any issues, such as with auto-assignment, here or
 >> in JIRA.
 >>
 >> On Thu, Jul 12, 2018 at 2:12 AM Etienne Chauchot
 >> mailto:echauc...@apache.org>> wrote:
 >>> Hi,
 >>>
 >>> I added myself as a reviewer for some modules.
 >>>
 >>> Etienne
 >>>
 >>> Le lundi 09 juillet 2018 à 17:06 -0700, Udi Meiri a écrit :
  Hi everyone,
 
  I'm proposing to add auto-reviewer-assignment using Github's
  CODEOWNERS mechanism.
  Initial version is
  

Re: Python Development Environments for Apache Beam

2018-06-20 Thread Holden Karau
Do you happen to have a tweet we should RT for reach?

On Wed, Jun 20, 2018, 11:26 AM Josh McGinley  wrote:

> Beam Users and Dev -
>
> I recently published a medium article
> 
>  showing how to set up Python Apache Beam pipelines for debugging in an
> IDE.
>
> I thought I would share the article with this community.  If you have any
> feedback let me know.  Otherwise keep up the great work on Beam!
>
> --
> Josh McGinley
>


Re: [ANNOUNCEMENT] New committers, May 2018 edition!

2018-06-01 Thread Holden Karau
Congrats all!

On Fri, Jun 1, 2018 at 12:12 AM Ismaël Mejía  wrote:

> Congratulations!
>
> On Fri, Jun 1, 2018 at 8:26 AM Pei HE  wrote:
> >
> > Congrats!
> >
> > On Fri, Jun 1, 2018 at 2:12 PM, Charles Chen  wrote:
> > > Congratulations everyone!
> > >
> > >
> > > On Thu, May 31, 2018, 10:14 PM Pablo Estrada 
> wrote:
> > >>
> > >> Thanks to the PMC! Very humbled and excited to keep taking part in
> this
> > >> great community.
> > >> :)
> > >> -P.
> > >>
> > >>
> > >> On Thu, May 31, 2018, 10:10 PM Tim  wrote:
> > >>>
> > >>> Congratulations!
> > >>>
> > >>>
> > >>> Tim
> > >>>
> > >>> On 1 Jun 2018, at 07:05, Andrew Psaltis 
> wrote:
> > >>>
> > >>> Congrats!
> > >>>
> > >>> On Fri, Jun 1, 2018 at 12:26 AM, Thomas Weise 
> wrote:
> > 
> >  Congrats!
> > 
> > 
> >  On Thu, May 31, 2018 at 9:25 PM, Alan Myrvold 
> >  wrote:
> > >
> > > Congrats Gris+Pablo+Jason. Well deserved.
> > >
> > > On Thu, May 31, 2018 at 9:15 PM Jason Kuster <
> jasonkus...@google.com>
> > > wrote:
> > >>
> > >> Thank you to Davor and the PMC; I'm excited to be able to help
> Beam in
> > >> this new capacity. Bring on the PRs. :D
> > >>
> > >> On Thu, May 31, 2018 at 8:55 PM Xin Wang 
> > >> wrote:
> > >>>
> > >>> Congrats!
> > >>>
> > >>> - Xin Wang
> > >>>
> > >>> 2018-06-01 11:52 GMT+08:00 Rui Wang :
> > 
> >  Congrats!
> > 
> >  -Rui
> > 
> >  On Thu, May 31, 2018 at 8:23 PM Jean-Baptiste Onofré
> >   wrote:
> > >
> > > Congrats !
> > >
> > > Regards
> > > JB
> > >
> > > On 01/06/2018 04:08, Davor Bonaci wrote:
> > > > Please join me and the rest of Beam PMC in welcoming the
> > > > following
> > > > contributors as our newest committers. They have
> significantly
> > > > contributed to the project in different ways, and we look
> forward
> > > > to
> > > > many more contributions in the future.
> > > >
> > > > * Griselda Cuevas
> > > > * Pablo Estrada
> > > > * Jason Kuster
> > > >
> > > > (Apologizes for a delayed announcement, and the lack of the
> usual
> > > > paragraph summarizing individual contributions.)
> > > >
> > > > Congratulations to all three! Welcome!
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> --
> > >>> Thanks,
> > >>> Xin
> > >>
> > >>
> > >>
> > >> --
> > >> ---
> > >> Jason Kuster
> > >> Apache Beam / Google Cloud Dataflow
> > >>
> > >> See something? Say something. go/jasonkuster-feedback
> > 
> > 
> > >>>
> > >> --
> > >> Got feedback? go/pabloem-feedback
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Go SDK

2018-05-22 Thread Holden Karau
+1 (non-binding), I've had a chance to work with the SDK and it's pretty
neat to see Beam add support for a language before most of the big data
ecosystem.

On Mon, May 21, 2018 at 10:29 PM, Jean-Baptiste Onofré 
wrote:

> Hi Henning,
>
> SGA has been filed for the entire project during the incubation period.
>
> Here, we have to check if SGA/IP donation is clean for the Go SDK.
>
> We don't have a lot to do, just checked that we are clean on this front.
>
> Regards
> JB
>
> On 22/05/2018 06:42, Henning Rohde wrote:
>
>> Thanks everyone!
>>
>> Davor -- regarding your two comments:
>>* Robert mentioned that "SGA should have probably already been filed"
>> in the previous thread. I got the impression that nothing further was
>> needed. I'll follow up.
>>* The standard Go tooling basically always pulls directly from github,
>> so there is no real urgency here.
>>
>> Thanks,
>>   Henning
>>
>>
>> On Mon, May 21, 2018 at 9:30 PM Jean-Baptiste Onofré > > wrote:
>>
>> +1 (binding)
>>
>> I just want to check about SGA/IP/Headers.
>>
>> Thanks !
>> Regards
>> JB
>>
>> On 22/05/2018 03:02, Henning Rohde wrote:
>>  > Hi everyone,
>>  >
>>  > Now that the remaining issues have been resolved as discussed,
>> I'd like
>>  > to propose a formal vote on accepting the Go SDK into master. The
>> main
>>  > practical difference is that the Go SDK would be part of the
>> Apache Beam
>>  > release going forward.
>>  >
>>  > Highlights of the Go SDK:
>>  >   * Go user experience with natively-typed DoFns with (simulated)
>>  > generic types
>>  >   * Covers most of the Beam model: ParDo, GBK, CoGBK, Flatten,
>> Combine,
>>  > Windowing, ..
>>  >   * Includes several IO connectors: Datastore, BigQuery, PubSub,
>>  > extensible textio.
>>  >   * Supports the portability framework for both batch and
>> streaming,
>>  > notably the upcoming portable Flink runner
>>  >   * Supports a direct runner for small batch workloads and testing.
>>  >   * Includes pre-commit tests and post-commit integration tests.
>>  >
>>  > And last but not least
>>  >   *  includes contributions from several independent users and
>>  > developers, notably an IO connector for Datastore!
>>  >
>>  > Website: https://beam.apache.org/documentation/sdks/go/
>>  > Code: https://github.com/apache/beam/tree/master/sdks/go
>>  > Design: https://s.apache.org/beam-go-sdk-design-rfc
>>  >
>>  > Please vote:
>>  > [ ] +1, Approve that the Go SDK becomes an official part of Beam
>>  > [ ] -1, Do not approve (please provide specific comments)
>>  >
>>  > Thanks,
>>  >   The Gophers of Apache Beam
>>  >
>>  >
>>
>>


-- 
Twitter: https://twitter.com/holdenkarau


Re: Splittable DoFN in Spark discussion

2018-04-26 Thread Holden Karau
Yeah that's been the implied source of being able to be continuous: you
union with a receiver which produces an infinite number of batches (the
"never ending queue stream", though not actually a queue stream since those
have some limitations, but our own implementation thereof).
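
A minimal sketch of that shape (illustrative only; TickReceiver is a made-up
name, not something that exists in the Spark runner):

  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.StreamingContext
  import org.apache.spark.streaming.dstream.DStream
  import org.apache.spark.streaming.receiver.Receiver

  // A receiver that stores a unit "tick" once a second, so the resulting
  // DStream never runs out of batches.
  class TickReceiver extends Receiver[Unit](StorageLevel.MEMORY_ONLY) {
    override def onStart(): Unit = new Thread("tick") {
      override def run(): Unit =
        while (!isStopped()) { store(()); Thread.sleep(1000) }
    }.start()
    override def onStop(): Unit = ()
  }

  // Union the finite work with the infinite tick stream so the combined
  // DStream keeps producing batches forever.
  def keepAlive(ssc: StreamingContext, work: DStream[Unit]): DStream[Unit] =
    work.union(ssc.receiverStream(new TickReceiver))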

On Tue, Apr 24, 2018 at 11:54 PM, Reuven Lax <re...@google.com> wrote:

> Could we do this behind the scenes by writing a Receiver that publishes
> periodic pings?
>
> On Tue, Apr 24, 2018 at 10:09 PM Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> Kenn - I'm arguing that in Spark SDF style computation can not be
>> expressed at all, and neither can Beam's timers.
>>
>> Spark, unlike Flink, does not have a timer facility (only state), and as
>> far as I can tell its programming model has no other primitive that can map
>> a finite RDD into an infinite DStream - the only way to create a new
>> infinite DStream appears to be to write a Receiver.
>>
>> I cc'd you because I'm wondering whether you've already investigated this
>> when considering whether timers can be implemented on the Spark runner.
>>
>> On Tue, Apr 24, 2018 at 2:53 PM Kenneth Knowles <k...@google.com> wrote:
>>
>>> I don't think I understand what the limitations of timers are that you
>>> are referring to. FWIW I would say implementing other primitives like SDF
>>> is an explicit non-goal for Beam state & timers.
>>>
>>> I got lost at some point in this thread, but is it actually necessary
>>> that a bounded PCollection maps to a finite/bounded structure in Spark?
>>> Skimming, I'm not sure if the problem is that we can't transliterate Beam
>>> to Spark (this might be a good sign) or that we can't express SDF style
>>> computation at all (seems far-fetched, but I could be convinced). Does
>>> doing a lightweight analysis and just promoting some things to be some kind
>>> of infinite representation help?
>>>
>>> Kenn
>>>
>>> On Tue, Apr 24, 2018 at 2:37 PM Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>> Would like to revive this thread one more time.
>>>>
>>>> At this point I'm pretty certain that Spark can't support this out of
>>>> the box and we're gonna have to make changes to Spark.
>>>>
>>>> Holden, could you advise who would be some Spark experts (yourself
>>>> included :) ) who could advise what kind of Spark change would both support
>>>> this AND be useful to the regular Spark community (non-Beam) so that it has
>>>> a chance of finding support? E.g. is there any plan in Spark regarding
>>>> adding timers similar to Flink's or Beam's timers, maybe we could help out
>>>> with that?
>>>>
>>>> +Kenneth Knowles <k...@google.com> because timers suffer from the same
>>>> problem.
>>>>
>>>> On Thu, Apr 12, 2018 at 2:28 PM Eugene Kirpichov <kirpic...@google.com>
>>>> wrote:
>>>>
>>>>> (resurrecting thread as I'm back from leave)
>>>>>
>>>>> I looked at this mode, and indeed as Reuven points out it seems that
>>>>> it affects execution details, but doesn't offer any new APIs.
>>>>> Holden - your suggestions of piggybacking an unbounded-per-element SDF
>>>>> on top of an infinite stream would work if 1) there was just 1 element and
>>>>> 2) the work was guaranteed to be infinite.
>>>>>
>>>>> Unfortunately, both of these assumptions are insufficient. In
>>>>> particular:
>>>>>
>>>>> - 1: The SDF is applied to a PCollection; the PCollection itself may
>>>>> be unbounded; and the unbounded work done by the SDF happens for every
>>>>> element. E.g. we might have a Kafka topic on which names of Kafka topics
>>>>> arrive, and we may end up concurrently reading a continuously growing
>>>>> number of topics.
>>>>> - 2: The work per element is not necessarily infinite, it's just *not
>>>>> guaranteed to be finite* - the SDF is allowed at any moment to say
>>>>> "Okay, this restriction is done for real" by returning stop() from the
>>>>> @ProcessElement method. Continuing the Kafka example, e.g., it could do
>>>>> that if the topic/partition being watched is deleted. Having an infinite
>>>>> stream as a driver of this process would require being able to send a
>>>>> signal to the stream to stop itself.

Re: Add a (temporary) Portable Flink branch to the ASF repo?

2018-04-12 Thread Holden Karau
So I would be strongly in favour of adding it as a branch on the Apache
repo. This way other folks are more likely to be able to help with the
splitting-up and merging process, and also, now that Flink Forward is behind
us, getting in the practice of doing feature branches on the ASF repo for
collaboration instead of personal GitHub accounts seems like a worthy goal.

On Thu, Apr 12, 2018 at 4:21 PM Robert Bradshaw  wrote:

> I suppose with the hackathon and flink forward behind us, I'm thinking we
> should start shifting gears more getting what we have into master in
> production state and less on continuing working on a hacking branch. If we
> think it'll fairly quick there's no big need to create an official branch,
> and if it's going to be long lived perhaps we should rethink our process.
> On Thu, Apr 12, 2018 at 3:44 PM Aljoscha Krettek 
> wrote:
>
> > I would also be in favour of adding a branch to our main repo. A random
> branch on some personal GitHub account can seem a bit sketchy and adding a
> branch to our repo could make it more visible for people that are
> interested.
>
>
>
> > On 12. Apr 2018, at 15:29, Ben Sidhom  wrote:
>
> > I would say that most of it is not suitable for direct merging. There are
> several reasons for this:
>
> > Most changes are built on upstream PRs that are either not submitted or
> have been rebased before submission.
> > There are some very hacky changes in the Python and Java SDKs to get
> portable pipelines working. For example, hard coding certain options and/or
> baking dependencies into the SDK harness images. These need to be actually
> implemented correctly in their respective SDKs.
> > Much of the code does not have proper tests and fails simple lint tests.
>
> > As a concrete example, I tried cherry-picking the changes from
> https://github.com/bsidhom/beam/pull/46 into master. This is a relatively
> simple change, but there were so many merge conflicts that in the end it
> was easier to just reimplement the changes atop master. More importantly,
> most changes will require refactoring before actually going in.
>
> > On Thu, Apr 12, 2018 at 3:16 PM, Robert Bradshaw 
> wrote:
>
> >> How much of this is not suitable to merging into master directly (not as
> >> is, but as separate PRs)?
> >> On Thu, Apr 12, 2018 at 3:10 PM Ben Sidhom  wrote:
>
> >> > Hey all,
>
> >> > I've been working on a proof-of-concept portable Flink runner with
> some
> >> other Beam contributors. We would like to have a point of reference for
> the
> >> rest of the Beam community as we integrate this work into master. It
> >> currently lives under
> >> https://github.com/bsidhom/beam/tree/hacking-job-server.
>
> >> > I would suggest pulling this into the main ASF repo under an
> >> appropriately-named branch (flink-portable-hacking?). The name should
> >> suggest the intention that this branch is not intended to be pulled into
> >> master as-is and that it should rather be used as a reference for now.
>
> >> > Thoughts?
>
> >> > --
> >> > -Ben
>
>
>
>
> > --
> > -Ben
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: [PROPOSAL] Python 3 support

2018-03-27 Thread Holden Karau
On Tue, Mar 27, 2018 at 4:27 AM Robbe Sneyders <robbe.sneyd...@ml6.eu>
wrote:

> Hi Anand,
>
> Thanks for the feedback.
>
> It should be no problem to run everything on DataflowRunner as well.
> Are there any performance tests in place to check for performance
> regressions?
>
> Some questions were raised in the proposal document which I want to add to
> this conversation:
>
> The first comment was about the targeted python 3 versions. We proposed to
> target 3.6 since it is the latest version available and added 3.5 because
> 3.6 adoption seems rather low (hard to find any relevant sources on this
> though).
> If the beam community prefers 3.4, I would propose to target 3.4 only
> during porting and add 3.5 and 3.6 later so we don't slow down the porting
> progress. 3.4 has the advantage of already being installed on the workers
> and allows pySpark pipelines to be moved over to beam more easily.
> It would be great to get some opinions on this.
>
> Another comment was made on how to avoid regression during the porting
> progress.
> After applying step 1 and step 2, no python 3 compatibility lint warnings
> should remain, so it would be great if we could enforce this check for
> every pull request on an already updated subpackage.
> After applying step 3, all tests should run on python 3, so again it would
> be great if we can enforce these per updated subpackage.
> Any insights on how to best accomplish this?
>
So you can look at some of the recent changes to tox.ini in the git log to
see what we’ve done so far around this; I suspect you can repeat that same
pattern.
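
(The rough shape of it in tox.ini, for illustration only; the lint_py2 and
lint_py3 env names are the ones we already run, while the exact deps and
flake8 selection are whatever the current file actually says:

  [tox]
  envlist = py27,lint_py2,lint_py3

  [testenv:lint_py3]
  deps = flake8
  commands = flake8 apache_beam --select=E9,F821,F823

and the same pattern can be repeated per ported subpackage.)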

>
> Thanks,
> Robbe
>
> On Fri, 23 Mar 2018 at 19:59 Ahmet Altay <al...@google.com> wrote:
>
>> Thank you Robbe.
>>
>> I reviewed the document it looks reasonable to me. I will touch on some
>> points that were not mentioned:
>> - Runner exercise different code paths. Doing auto conversions and
>> focusing on DirectRunner is not enough. It is worthwhile to run things on
>> DataflowRunner as well. This can be triggered from Jenkins. It will
>> validate that we are still compatible for python 2.
>> - Similar to above but with an eye on perf regressions.
>>
>> For project tracking on JIRA, please feel free to create any new issues,
>> close stale ones, or take ownership of any open issues. All JIRAs should be
>> assigned to the people actively working on them. If you wan to track it in
>> a separate way, you can also propose that. (For example a kanban board is
>> used for portability effort which is fully supported in JIRA.)
>>
>> I will also call out to a few other people in addition to Holden who
>> helped out or showed interest in helping with Python 3. @cclaus, @luke-zhu,
>> @udim, @robertwb, @charlesccychen, @tvalentyn. You can include these
>> people (and myself) for reviews and other questions that you have.
>>
>> Welcome again, and looking forward to your contributions.
>>
>> Thank you,
>> Ahmet
>>
>>
>>
>> On Fri, Mar 23, 2018 at 9:27 AM, Robbe Sneyders <robbe.sneyd...@ml6.eu>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> In the next month(s), me and my colleague Matthias will commit a lot of
>>> time and effort to python 3 support for beam and we would like to discuss
>>> the best way to go forward with this.
>>>
>>> We have drawn up a document [1] with a high level outline of the
>>> proposed approach and would like to get your feedback on this.
>>>
>>> The main Jira issue [2] for python 3 support has been mostly inactive
>>> for the past year. Other smaller issues have been opened, but it's hard to
>>> track the general progress. It would be great if anyone could offer some
>>> insights on how to best handle this project on Jira.
>>>
>>> @Holden Karau, you seem to have already put in a lot of effort to add
>>> python 3 support, so it would be great to get your insights and find a way
>>> to merge our efforts.
>>>
>>> Kind regards,
>>> Robbe
>>>
>>> [1]
>>> https://docs.google.com/document/d/1xDG0MWVlDKDPu_IW9gtMvxi2S9I0GB0VDTkPhjXT0nE/edit?usp=sharing
>>>
>>> [2] https://issues.apache.org/jira/browse/BEAM-1251
>>> --
>>>
>>> [image: https://ml6.eu] <https://ml6.eu/>
>>>
>>> * Robbe Sneyders*
>>>
>>> ML6 Gent
>>> <https://www.google.be/maps/place/ML6/@51.037408,3.7044893,17z/data=!3m1!4b1!4m5!3m4!1s0x47c37161feeca14b:0xb8f72585fdd21c90!8m2!3d51.037408!4d3.706678?hl=nl>
>>>
>>> M: +32 474 71 31 08 <+32%20474%2071%2031%2008>
>>>
>>
>> --
>
> [image: https://ml6.eu] <https://ml6.eu/>
>
> * Robbe Sneyders*
>
> ML6 Gent
> <https://www.google.be/maps/place/ML6/@51.037408,3.7044893,17z/data=!3m1!4b1!4m5!3m4!1s0x47c37161feeca14b:0xb8f72585fdd21c90!8m2!3d51.037408!4d3.706678?hl=nl>
>
> M: +32 474 71 31 08
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: Splittable DoFN in Spark discussion

2018-03-25 Thread Holden Karau
I mean the new mode is very much part of the Dataset API, not the DStream API
(although you can use the Dataset API with the old modes too).

On Sun, Mar 25, 2018 at 9:11 PM, Reuven Lax <re...@google.com> wrote:

> But this new mode isn't a semantic change, right? It's moving away from
> micro batches into something that looks a lot like what Flink does -
> continuous processing with asynchronous snapshot boundaries.
>
> On Sun, Mar 25, 2018 at 9:01 PM Thomas Weise <t...@apache.org> wrote:
>
>> Hopefully the new "continuous processing mode" in Spark will enable SDF
>> implementation (and real streaming)?
>>
>> Thanks,
>> Thomas
>>
>>
>> On Sat, Mar 24, 2018 at 3:22 PM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>>
>>> On Sat, Mar 24, 2018 at 1:23 PM Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Fri, Mar 23, 2018, 11:17 PM Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> On Fri, Mar 23, 2018 at 7:00 PM Eugene Kirpichov <kirpic...@google.com>
>>>>> wrote:
>>>>>
>>>>>> On Fri, Mar 23, 2018 at 6:49 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>> wrote:
>>>>>>
>>>>>>> On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov <
>>>>>>> kirpic...@google.com> wrote:
>>>>>>>
>>>>>>>> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <
>>>>>>>>> kirpic...@google.com> wrote:
>>>>>>>>>
>>>>>>>>>> Reviving this thread. I think SDF is a pretty big risk for Spark
>>>>>>>>>> runner streaming. Holden, is it correct that Spark appears to have 
>>>>>>>>>> no way
>>>>>>>>>> at all to produce an infinite DStream from a finite RDD? Maybe we can
>>>>>>>>>> somehow dynamically create a new DStream for every initial 
>>>>>>>>>> restriction,
>>>>>>>>>> said DStream being obtained using a Receiver that under the hood 
>>>>>>>>>> actually
>>>>>>>>>> runs the SDF? (this is of course less efficient than a timer-capable 
>>>>>>>>>> runner
>>>>>>>>>> would do, and I have doubts about the fault tolerance)
>>>>>>>>>>
>>>>>>>>> So on the streaming side we could simply do it with a fixed number
>>>>>>>>> of levels on DStreams. It’s not great but it would work.
>>>>>>>>>
>>>>>>>> Not sure I understand this. Let me try to clarify what SDF demands
>>>>>>>> of the runner. Imagine the following case: a file contains a list of
>>>>>>>> "master" Kafka topics, on which there are published additional Kafka 
>>>>>>>> topics
>>>>>>>> to read.
>>>>>>>>
>>>>>>>> PCollection masterTopics = TextIO.read().from(
>>>>>>>> masterTopicsFile)
>>>>>>>> PCollection nestedTopics = masterTopics.apply(ParDo(
>>>>>>>> ReadFromKafkaFn))
>>>>>>>> PCollection records = nestedTopics.apply(ParDo(
>>>>>>>> ReadFromKafkaFn))
>>>>>>>>
>>>>>>>> This exemplifies both use cases of a streaming SDF that emits
>>>>>>>> infinite output for every input:
>>>>>>>> - Applying it to a finite set of inputs (in this case to the result
>>>>>>>> of reading a text file)
>>>>>>>> - Applying it to an infinite set of inputs (i.e. having an
>>>>>>>> unbounded number of streams being read concurrently, each of the 
>>>>>>>> streams
>>>>>>>> themselves is unbounded too)
>>>>>>>>
>>>>>>>> Does the multi-level solution you have in mind work for this case?
>>>>>>>> I suppose the second case is harder, so we can focus on that.
>>>>>>>>
>>>>>>> So none of those are a splittable DoFn, right?

Re: Splittable DoFN in Spark discussion

2018-03-25 Thread Holden Karau
That would certainly be good.

On Sun, Mar 25, 2018 at 9:01 PM, Thomas Weise <t...@apache.org> wrote:

> Hopefully the new "continuous processing mode" in Spark will enable SDF
> implementation (and real streaming)?
>
> Thanks,
> Thomas
>
>
> On Sat, Mar 24, 2018 at 3:22 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>>
>> On Sat, Mar 24, 2018 at 1:23 PM Eugene Kirpichov <kirpic...@google.com>
>> wrote:
>>
>>>
>>>
>>> On Fri, Mar 23, 2018, 11:17 PM Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> On Fri, Mar 23, 2018 at 7:00 PM Eugene Kirpichov <kirpic...@google.com>
>>>> wrote:
>>>>
>>>>> On Fri, Mar 23, 2018 at 6:49 PM Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov <
>>>>>> kirpic...@google.com> wrote:
>>>>>>
>>>>>>> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <
>>>>>>>> kirpic...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Reviving this thread. I think SDF is a pretty big risk for Spark
>>>>>>>>> runner streaming. Holden, is it correct that Spark appears to have no 
>>>>>>>>> way
>>>>>>>>> at all to produce an infinite DStream from a finite RDD? Maybe we can
>>>>>>>>> somehow dynamically create a new DStream for every initial 
>>>>>>>>> restriction,
>>>>>>>>> said DStream being obtained using a Receiver that under the hood 
>>>>>>>>> actually
>>>>>>>>> runs the SDF? (this is of course less efficient than a timer-capable 
>>>>>>>>> runner
>>>>>>>>> would do, and I have doubts about the fault tolerance)
>>>>>>>>>
>>>>>>>> So on the streaming side we could simply do it with a fixed number
>>>>>>>> of levels on DStreams. It’s not great but it would work.
>>>>>>>>
>>>>>>> Not sure I understand this. Let me try to clarify what SDF demands
>>>>>>> of the runner. Imagine the following case: a file contains a list of
>>>>>>> "master" Kafka topics, on which there are published additional Kafka 
>>>>>>> topics
>>>>>>> to read.
>>>>>>>
>>>>>>> PCollection masterTopics = TextIO.read().from(masterTopic
>>>>>>> sFile)
>>>>>>> PCollection nestedTopics = masterTopics.apply(ParDo(ReadF
>>>>>>> romKafkaFn))
>>>>>>> PCollection records = nestedTopics.apply(ParDo(ReadF
>>>>>>> romKafkaFn))
>>>>>>>
>>>>>>> This exemplifies both use cases of a streaming SDF that emits
>>>>>>> infinite output for every input:
>>>>>>> - Applying it to a finite set of inputs (in this case to the result
>>>>>>> of reading a text file)
>>>>>>> - Applying it to an infinite set of inputs (i.e. having an unbounded
>>>>>>> number of streams being read concurrently, each of the streams 
>>>>>>> themselves
>>>>>>> is unbounded too)
>>>>>>>
>>>>>>> Does the multi-level solution you have in mind work for this case? I
>>>>>>> suppose the second case is harder, so we can focus on that.
>>>>>>>
>>>>>> So none of those are a splittable DoFn, right?
>>>>>>
>>>>> Not sure what you mean? ReadFromKafkaFn in these examples is a
>>>>> splittable DoFn and we're trying to figure out how to make Spark run it.
>>>>>
>>>>>
>>>> Ah ok, sorry I saw that and for some reason parsed them as old style
>>>> DoFns in my head.
>>>>
>>>> To effectively allow us to union back into the “same” DStream we’d
>>>> have to end up using Spark’s queue streams (or their equivalent custom
>>>> source because of some queue stream limitations), which invites some
>>>> reliability challenges. This might be at the point where I should send a
>>>> diagram/some sample code since it’s a bit convoluted.

Re: Splittable DoFN in Spark discussion

2018-03-24 Thread Holden Karau
On Sat, Mar 24, 2018 at 1:23 PM Eugene Kirpichov <kirpic...@google.com>
wrote:

>
>
> On Fri, Mar 23, 2018, 11:17 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> On Fri, Mar 23, 2018 at 7:00 PM Eugene Kirpichov <kirpic...@google.com>
>> wrote:
>>
>>> On Fri, Mar 23, 2018 at 6:49 PM Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov <kirpic...@google.com>
>>>> wrote:
>>>>
>>>>> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <hol...@pigscanfly.ca>
>>>>> wrote:
>>>>>
>>>>>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <
>>>>>> kirpic...@google.com> wrote:
>>>>>>
>>>>>>> Reviving this thread. I think SDF is a pretty big risk for Spark
>>>>>>> runner streaming. Holden, is it correct that Spark appears to have no 
>>>>>>> way
>>>>>>> at all to produce an infinite DStream from a finite RDD? Maybe we can
>>>>>>> somehow dynamically create a new DStream for every initial restriction,
>>>>>>> said DStream being obtained using a Receiver that under the hood 
>>>>>>> actually
>>>>>>> runs the SDF? (this is of course less efficient than a timer-capable 
>>>>>>> runner
>>>>>>> would do, and I have doubts about the fault tolerance)
>>>>>>>
>>>>>> So on the streaming side we could simply do it with a fixed number of
>>>>>> levels on DStreams. It’s not great but it would work.
>>>>>>
>>>>> Not sure I understand this. Let me try to clarify what SDF demands of
>>>>> the runner. Imagine the following case: a file contains a list of "master"
>>>>> Kafka topics, on which there are published additional Kafka topics to 
>>>>> read.
>>>>>
>>>>> PCollection masterTopics = TextIO.read().from(masterTopicsFile)
>>>>> PCollection nestedTopics =
>>>>> masterTopics.apply(ParDo(ReadFromKafkaFn))
>>>>> PCollection records =
>>>>> nestedTopics.apply(ParDo(ReadFromKafkaFn))
>>>>>
>>>>> This exemplifies both use cases of a streaming SDF that emits infinite
>>>>> output for every input:
>>>>> - Applying it to a finite set of inputs (in this case to the result of
>>>>> reading a text file)
>>>>> - Applying it to an infinite set of inputs (i.e. having an unbounded
>>>>> number of streams being read concurrently, each of the streams themselves
>>>>> is unbounded too)
>>>>>
>>>>> Does the multi-level solution you have in mind work for this case? I
>>>>> suppose the second case is harder, so we can focus on that.
>>>>>
>>>> So none of those are a splittable DoFn, right?
>>>>
>>> Not sure what you mean? ReadFromKafkaFn in these examples is a
>>> splittable DoFn and we're trying to figure out how to make Spark run it.
>>>
>>>
>> Ah ok, sorry I saw that and for some reason parsed them as old style
>> DoFns in my head.
>>
>> To effectively allow us to union back into the “same” DStream we’d have
>> to end up using Spark’s queue streams (or their equivalent custom source
>> because of some queue stream limitations), which invites some reliability
>> challenges. This might be at the point where I should send a diagram/some
>> sample code since it’s a bit convoluted.
>>
>> The more I think about the jumps required to make the “simple” union
>> approach work, the more it seems just using the state mapping for streaming
>> is probably more reasonable. Although the state tracking in Spark can be
>> somewhat expensive so it would probably make sense to benchmark to see if
>> it meets our needs.
>>
> So the problem is, I don't think this can be made to work using
> mapWithState. It doesn't allow a mapping function that emits infinite
> output for an input element, directly or not.
>
So, provided there is an infinite input (e.g. pick a never ending queue
stream), and each call produces a finite output, we would have an infinite
number of calls.
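
Roughly like this, as a sketch (Restriction and processBounded are stand-ins
for whatever the SDF machinery would provide, not existing code):

  import org.apache.spark.streaming.{State, StateSpec}

  case class Restriction(offset: Long)

  // Assumed hook: do a bounded chunk of work for one restriction, returning
  // the outputs produced plus the residual restriction to resume from.
  def processBounded(r: Restriction): (Seq[String], Restriction) = ???

  val spec = StateSpec.function(
    (key: String, tick: Option[Unit], state: State[Restriction]) => {
      val current = state.getOption().getOrElse(Restriction(0L))
      val (outputs, residual) = processBounded(current)
      state.update(residual) // resume from here on the next tick
      outputs
    })

  // Given ticks: DStream[(String, Unit)] driven by an infinite source as
  // above, each micro-batch triggers one bounded round of work per key:
  // val results = ticks.mapWithState(spec)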

>
> Dataflow and Flink, for example, had timer support even before SDFs, and a
> timer can set another timer and thus end up doing an infinite amount of
> work in a fault tolerant way - so SDF could be implemented on top of that.
> But AFAIK spark doe

Re: Splittable DoFN in Spark discussion

2018-03-23 Thread Holden Karau
On Fri, Mar 23, 2018 at 7:00 PM Eugene Kirpichov <kirpic...@google.com>
wrote:

> On Fri, Mar 23, 2018 at 6:49 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov <kirpic...@google.com>
>> wrote:
>>
>>> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <hol...@pigscanfly.ca>
>>> wrote:
>>>
>>>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <kirpic...@google.com>
>>>> wrote:
>>>>
>>>>> Reviving this thread. I think SDF is a pretty big risk for Spark
>>>>> runner streaming. Holden, is it correct that Spark appears to have no way
>>>>> at all to produce an infinite DStream from a finite RDD? Maybe we can
>>>>> somehow dynamically create a new DStream for every initial restriction,
>>>>> said DStream being obtained using a Receiver that under the hood actually
>>>>> runs the SDF? (this is of course less efficient than a timer-capable 
>>>>> runner
>>>>> would do, and I have doubts about the fault tolerance)
>>>>>
>>>> So on the streaming side we could simply do it with a fixed number of
>>>> levels on DStreams. It’s not great but it would work.
>>>>
>>> Not sure I understand this. Let me try to clarify what SDF demands of
>>> the runner. Imagine the following case: a file contains a list of "master"
>>> Kafka topics, on which there are published additional Kafka topics to read.
>>>
>>> PCollection masterTopics = TextIO.read().from(masterTopicsFile)
>>> PCollection nestedTopics =
>>> masterTopics.apply(ParDo(ReadFromKafkaFn))
>>> PCollection records = nestedTopics.apply(ParDo(ReadFromKafkaFn))
>>>
>>> This exemplifies both use cases of a streaming SDF that emits infinite
>>> output for every input:
>>> - Applying it to a finite set of inputs (in this case to the result of
>>> reading a text file)
>>> - Applying it to an infinite set of inputs (i.e. having an unbounded
>>> number of streams being read concurrently, each of the streams themselves
>>> is unbounded too)
>>>
>>> Does the multi-level solution you have in mind work for this case? I
>>> suppose the second case is harder, so we can focus on that.
>>>
>> So none of those are a splittable DoFn, right?
>>
> Not sure what you mean? ReadFromKafkaFn in these examples is a splittable
> DoFn and we're trying to figure out how to make Spark run it.
>
>
Ah ok, sorry I saw that and for some reason parsed them as old style DoFns
in my head.

To effectively allow us to union back into the “same” DStream we’d have to
end up using Spark’s queue streams (or their equivalent custom source
because of some queue stream limitations), which invites some reliability
challenges. This might be at the point where I should send a diagram/some
sample code since it’s a bit convoluted.

The more I think about the jumps required to make the “simple” union
approach work, the more it seems just using the state mapping for streaming
is probably more reasonable. Although the state tracking in Spark can be
somewhat expensive so it would probably make sense to benchmark to see if
it meets our needs.

But these still are both DStream based rather than Dataset based, which we
might want to support (depends on what direction folks take with the runners).

If we wanted to do this in the Dataset world, looking at a custom
sink/source would also be an option (which is effectively what a custom
queue-stream-like thing for DStreams requires), but the datasource APIs are
a bit in flux, so if we ended up doing things at the edge of what’s allowed
there’s a good chance we’d have to rewrite it a few times.


>> Assuming that we have a given dstream though in Spark we can get the
>> underlying RDD implementation for each microbatch and do our work inside of
>> that.
>>
>>>
>>>
>>>>
>>>> More generally this does raise an important question if we want to
>>>> target datasets instead of rdds/DStreams in which case i would need to do
>>>> some more poking.
>>>>
>>>>
>>>>> On Wed, Mar 14, 2018 at 10:26 PM Reuven Lax <re...@google.com> wrote:
>>>>>
>>>>>> How would timers be implemented? By outputing and reprocessing, the
>>>>>> same way you proposed for SDF?
>>>>>>
>>>>> i mean the timers could be inside the mappers within the system. Could
>>>> use a singleton so if a partition is re-executed it doesn’t end up as a
>>>> straggler.
>>

Re: Splittable DoFN in Spark discussion

2018-03-23 Thread Holden Karau
On Fri, Mar 23, 2018 at 6:20 PM Eugene Kirpichov <kirpic...@google.com>
wrote:

> On Fri, Mar 23, 2018 at 6:12 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> On Fri, Mar 23, 2018 at 5:58 PM Eugene Kirpichov <kirpic...@google.com>
>> wrote:
>>
>>> Reviving this thread. I think SDF is a pretty big risk for Spark runner
>>> streaming. Holden, is it correct that Spark appears to have no way at all
>>> to produce an infinite DStream from a finite RDD? Maybe we can somehow
>>> dynamically create a new DStream for every initial restriction, said
>>> DStream being obtained using a Receiver that under the hood actually runs
>>> the SDF? (this is of course less efficient than a timer-capable runner
>>> would do, and I have doubts about the fault tolerance)
>>>
>> So on the streaming side we could simply do it with a fixed number of
>> levels on DStreams. It’s not great but it would work.
>>
> Not sure I understand this. Let me try to clarify what SDF demands of the
> runner. Imagine the following case: a file contains a list of "master"
> Kafka topics, on which there are published additional Kafka topics to read.
>
> PCollection masterTopics = TextIO.read().from(masterTopicsFile)
> PCollection nestedTopics =
> masterTopics.apply(ParDo(ReadFromKafkaFn))
> PCollection records = nestedTopics.apply(ParDo(ReadFromKafkaFn))
>
> This exemplifies both use cases of a streaming SDF that emits infinite
> output for every input:
> - Applying it to a finite set of inputs (in this case to the result of
> reading a text file)
> - Applying it to an infinite set of inputs (i.e. having an unbounded
> number of streams being read concurrently, each of the streams themselves
> is unbounded too)
>
> Does the multi-level solution you have in mind work for this case? I
> suppose the second case is harder, so we can focus on that.
>
So none of those are a splittable DoFn, right?

Assuming that we have a given DStream though, in Spark we can get the
underlying RDD implementation for each microbatch and do our work inside of
that.
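
i.e. something like this sketch (stream and processPartition are stand-ins
for whatever DStream and per-partition work we'd actually have):

  // Each micro-batch surfaces as a plain RDD, so per-batch work can be done
  // with the regular RDD API.
  val processed = stream.transform { rdd => rdd.mapPartitions(processPartition) }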

>
>
>>
>> More generally this does raise an important question if we want to target
>> datasets instead of rdds/DStreams in which case i would need to do some
>> more poking.
>>
>>
>>> On Wed, Mar 14, 2018 at 10:26 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> How would timers be implemented? By outputing and reprocessing, the
>>>> same way you proposed for SDF?
>>>>
>>> i mean the timers could be inside the mappers within the system. Could
>> use a singleton so if a partition is re-executed it doesn’t end up as a
>> straggler.
>>
>>>
>>>>
>>>> On Wed, Mar 14, 2018 at 7:25 PM Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> So the timers would have to be in our own code.
>>>>>
>>>>> On Wed, Mar 14, 2018 at 5:18 PM Eugene Kirpichov <kirpic...@google.com>
>>>>> wrote:
>>>>>
>>>>>> Does Spark have support for timers? (I know it has support for state)
>>>>>>
>>>>>> On Wed, Mar 14, 2018 at 4:43 PM Reuven Lax <re...@google.com> wrote:
>>>>>>
>>>>>>> Could we alternatively use a state mapping function to keep track of
>>>>>>> the computation so far instead of outputting V each time? (also the
>>>>>>> progress so far is probably of a different type R rather than V).
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 14, 2018 at 4:28 PM Holden Karau <hol...@pigscanfly.ca>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> So we had a quick chat about what it would take to add something
>>>>>>>> like SplittableDoFns to Spark. I'd done some sketchy thinking about 
>>>>>>>> this
>>>>>>>> last year but didn't get very far.
>>>>>>>>
>>>>>>>> My back-of-the-envelope design was as follows:
>>>>>>>> For input type T
>>>>>>>> Output type V
>>>>>>>>
>>>>>>>> Implement a mapper which outputs type (T, V)
>>>>>>>> and if the computation finishes T will be populated otherwise V
>>>>>>>> will be
>>>>>>>>
>>>>>>>> For determining how long to run we'd run up to either K seconds or
>>>>>>>> listen for a signal on a port
>>>>>>>>
>>>>>>>> Once we're done running we take the result and filter for the ones
>>>>>>>> with T and V into separate collections, re-run until finished,
>>>>>>>> and then union the results
>>>>>>>>
>>>>>>>>
>>>>>>>> This is maybe not a great design but it was minimally complicated
>>>>>>>> and I figured terrible was a good place to start and improve from.
>>>>>>>>
>>>>>>>>
>>>>>>>> Let me know your thoughts, especially the parts where this is worse
>>>>>>>> than I remember because it's been a while since I thought about this.
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>
>>>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>> --
>> Twitter: https://twitter.com/holdenkarau
>>
> --
Twitter: https://twitter.com/holdenkarau


Re: Splittable DoFN in Spark discussion

2018-03-14 Thread Holden Karau
So the timers would have to be in our own code.

On Wed, Mar 14, 2018 at 5:18 PM Eugene Kirpichov <kirpic...@google.com>
wrote:

> Does Spark have support for timers? (I know it has support for state)
>
> On Wed, Mar 14, 2018 at 4:43 PM Reuven Lax <re...@google.com> wrote:
>
>> Could we alternatively use a state mapping function to keep track of the
>> computation so far instead of outputting V each time? (also the progress so
>> far is probably of a different type R rather than V).
>>
>>
>> On Wed, Mar 14, 2018 at 4:28 PM Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>>
>>> So we had a quick chat about what it would take to add something like
>>> SplittableDoFns to Spark. I'd done some sketchy thinking about this last
>>> year but didn't get very far.
>>>
>>> My back-of-the-envelope design was as follows:
>>> For input type T
>>> Output type V
>>>
>>> Implement a mapper which outputs type (T, V)
>>> and if the computation finishes T will be populated otherwise V will be
>>>
>>> For determining how long to run we'd run up to either K seconds or listen
>>> for a signal on a port
>>>
>>> Once we're done running we take the result and filter for the ones with
>>> T and V into separate collections, re-run until finished,
>>> and then union the results
>>>
>>>
>>> This is maybe not a great design but it was minimally complicated and I
>>> figured terrible was a good place to start and improve from.
>>>
>>>
>>> Let me know your thoughts, especially the parts where this is worse than
>>> I remember because it's been a while since I thought about this.
>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>>
>> --
Twitter: https://twitter.com/holdenkarau


Splittable DoFN in Spark discussion

2018-03-14 Thread Holden Karau
So we had a quick chat about what it would take to add something like
SplittableDoFns to Spark. I'd done some sketchy thinking about this last
year but didn't get very far.

My back-of-the-envelope design was as follows:
For input type T
Output type V

Implement a mapper which outputs type (T, V)
and if the computation finishes T will be populated otherwise V will be

For determining how long to run we'd run up to either K seconds or listen for
a signal on a port

Once we're done running we take the result and filter for the ones with T
and V into separate collections, re-run until finished,
and then union the results
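
A rough Scala sketch of that loop (one reading of the above, with T as the
not-yet-finished residual and V as the finished outputs; processFor is an
assumed stand-in, not real runner code):

  import scala.reflect.ClassTag
  import org.apache.spark.rdd.RDD

  // Assumed hook: run one element for at most `seconds`, returning either
  // the residual input to resume later (Left) or the finished outputs (Right).
  def processFor[T, V](t: T, seconds: Int): Either[T, Seq[V]] = ???

  def runToCompletion[T: ClassTag, V: ClassTag](
      initial: RDD[T], seconds: Int): RDD[V] = {
    var pending = initial
    var done: RDD[V] = initial.sparkContext.emptyRDD[V]
    while (!pending.isEmpty()) {
      val round = pending.map(t => processFor[T, V](t, seconds)).cache()
      pending = round.collect { case Left(t) => t } // unfinished: re-run
      done = done.union(round.flatMap {
        case Right(vs) => vs
        case Left(_)   => Nil
      })
    }
    done
  }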


This is maybe not a great design but it was minimally complicated and I
figured terrible was a good place to start and improve from.


Let me know your thoughts, especially the parts where this is worse than I
remember because it's been a while since I thought about this.


-- 
Twitter: https://twitter.com/holdenkarau


Re: Merging Python code? Help avoid Python 3 regressions with these two simple steps :)

2018-03-02 Thread Holden Karau
3.4.3 is from Feb 2015, and for what it’s worth the minimum version of
Python in Spark is 3.4. We could enable lint tests in Jenkins and see how
they go?

On Fri, Mar 2, 2018 at 10:06 AM Alan Myrvold <amyrv...@google.com> wrote:

> I ran "python3 --version" on each worker and all showed python 3.4.3. Is
> that too old?
>
>
> On Fri, Mar 2, 2018 at 10:04 AM Ahmet Altay <al...@google.com> wrote:
>
>> That is my understanding as well, it is requires attention from infra.
>> Could anyone help with this? I know we worked with infra before, what is
>> the best way to approach this?
>>
>> On Fri, Mar 2, 2018 at 9:50 AM, Holden Karau <holden.ka...@gmail.com>
>> wrote:
>>
>>> I agree, however I'm of the impression it's blocked on infra? (e.g. it's
>>> important but out of my hands).
>>>
>>> On Mar 1, 2018 11:05 PM, "Ahmet Altay" <al...@google.com> wrote:
>>>
>>>> I think we should prioritize the issue of installing Python 3 on the
>>>> workers (https://issues.apache.org/jira/browse/BEAM-3671). I would
>>>> appreciate if folks pay attention to these 2 steps but I am worried that it
>>>> will be easily forgotten.
>>>>
>>>> On Thu, Mar 1, 2018 at 6:51 PM, Holden Karau <hol...@pigscanfly.ca>
>>>> wrote:
>>>>
>>>>> I may have watched too many buzzfeed videos this week but the steps
>>>>> are:
>>>>> 1) git checkout the PR in question
>>>>> 2) Run tox -e lint_py2,lint_py3
>>>>>
>>>>> This is important since Python 3 isn't installed on the Jenkins
>>>>> workers just yet and we have some tests to catch basic invalid Python 3
>>>>> which we can slowly grow as we fix the issues and you can help us keep
>>>>> moving forward!
>>>>>
>>>>> If step 1 is too much work I like using the hub program I find it
>>>>> helps me with this part of my workflow in other projects. That being said
>>>>> you don't have to do this, we'll fix whatever errors come up, so if this 
>>>>> is
>>>>> going to slow your workflow down or you otherwise don't like it feel free
>>>>> to pass along.
>>>>>
>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>
>>>>
>>>>
>> --
Twitter: https://twitter.com/holdenkarau


Re: Merging Python code? Help avoid Python 3 regressions with these two simple steps :)

2018-03-02 Thread Holden Karau
I agree, however I'm of the impression it's blocked on infra? (e.g. it's
important but out of my hands).

On Mar 1, 2018 11:05 PM, "Ahmet Altay" <al...@google.com> wrote:

> I think we should prioritize the issue of installing Python 3 on the
> workers (https://issues.apache.org/jira/browse/BEAM-3671). I would
> appreciate if folks pay attention to these 2 steps but I am worried that it
> will be easily forgotten.
>
> On Thu, Mar 1, 2018 at 6:51 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> I may have watched too many buzzfeed videos this week but the steps are:
>> 1) git checkout the PR in question
>> 2) Run tox -e lint_py2,lint_py3
>>
>> This is important since Python 3 isn't installed on the Jenkins workers
>> just yet and we have some tests to catch basic invalid Python 3 which we
>> can slowly grow as we fix the issues and you can help us keep moving
>> forward!
>>
>> If step 1 is too much work I like using the hub program I find it helps
>> me with this part of my workflow in other projects. That being said you
>> don't have to do this, we'll fix whatever errors come up, so if this is
>> going to slow your workflow down or you otherwise don't like it feel free
>> to pass along.
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


Merging Python code? Help avoid Python 3 regressions with these two simple steps :)

2018-03-01 Thread Holden Karau
I may have watched too many buzzfeed videos this week but the steps are:
1) git checkout the PR in question
2) Run tox -e lint_py2,lint_py3

This is important since Python 3 isn't installed on the Jenkins workers
just yet and we have some tests to catch basic invalid Python 3 which we
can slowly grow as we fix the issues and you can help us keep moving
forward!

If step 1 is too much work, I like using the hub program; I find it helps me
with this part of my workflow in other projects. That being said you don't
have to do this, we'll fix whatever errors come up, so if this is going to
slow your workflow down or you otherwise don't like it feel free to pass
along.
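
(Concretely, the hub flow I mean is roughly: hub checkout
https://github.com/apache/beam/pull/NNNN, where NNNN is the PR number,
followed by the tox command above from sdks/python.)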

-- 
Twitter: https://twitter.com/holdenkarau


Re: Python 3 flake 8: splitting up on the errors?

2018-02-28 Thread Holden Karau
Awesome, I'll get that kicked off in JIRA

On Feb 28, 2018 2:43 PM, "Ahmet Altay" <al...@google.com> wrote:

> I think this is a great idea. I would encourage everyone who would like to
> help with Python 3 migration to help with this effort. Holden, if you
> already have a list, could you either share the list or create individual
> JIRAs so that we can track the work among us.
>
> On Tue, Feb 27, 2018 at 4:53 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
>> How would folks feel about splitting up some of the Python 3 migration
>> work by the different flake8 errors in Py3? This might allow us to
>> parallelize some of the work while still keeping things fairly small?
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


Python 3 flake 8: splitting up on the errors?

2018-02-27 Thread Holden Karau
How would folks feel about splitting up some of the Python 3 migration work
by the different flake8 errors in Py3? This might allow us to parallelize
some of the work while still keeping things fairly small?
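
For example: one JIRA for the E999 syntax errors (print statements and
friends), another for the F821 undefined names (xrange, basestring, etc.),
and so on, so each chunk stays small and reviewable.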

-- 
Twitter: https://twitter.com/holdenkarau


Re: Python 3 reviewers

2018-02-22 Thread Holden Karau
So I've been hesitant to do large chunks of Python 3 reviews for the same
reason I've been hesitant to do large chunks of Python 3 work (the
commit side seems to move more slowly than the devs), but I suppose that
does become somewhat of a self-fulfilling prophecy. I'll dedicate some more
cycles to helping with the reviews :)

On Fri, Feb 23, 2018 at 1:49 PM, Ahmet Altay <al...@google.com> wrote:

> Thank you Holden for doing this work. I agree with Robert's comment. I
> know there are a few folks working on this now (you, @luke-zhu and
> @cclauss). Perhaps you could do python 3 related code reviews within that
> group. I would be happy to chime in and review some chunks as well.
>
> On Thu, Feb 22, 2018 at 4:54 PM, Robert Bradshaw <rober...@google.com>
> wrote:
>
>> I'd really like to see Python 3 support sooner rather than later, and
>> have been reviewing some (simple) PRs in this direction. As long as
>> they're broken up into small enough chunks, feel free to send some my
>> way.
>>
>> On Thu, Feb 22, 2018 at 3:59 PM, Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>> > Hi Y'all,
>> >
>> > I'm trying to make some progress on Python 3 support for Beam but I'm
>> having
>> > a bit of difficulty finding people with review bandwidth. Are there any
>> > committers with time to spare who would be willing to work on this? If
>> not
>> > no worries I'll refocus my efforts elsewhere :)
>> >
>> > Cheers,
>> >
>> > Holden :)
>> >
>> > --
>> > Twitter: https://twitter.com/holdenkarau
>>
>
>


-- 
Twitter: https://twitter.com/holdenkarau


Python 3 reviewers

2018-02-22 Thread Holden Karau
Hi Y'all,

I'm trying to make some progress on Python 3 support for Beam but I'm
having a bit of difficulty finding people with review bandwidth. Are there
any committers with time to spare who would be willing to work on this? If
not no worries I'll refocus my efforts elsewhere :)

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau


Re: [PROPOSAL] Switch from Guava futures vs Java 8 futures

2018-02-02 Thread Holden Karau
For what it's worth there exists a relatively easy Java 8 to Scala future
conversion, so this shouldn't cause an issue on the Spark runner.
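
For example, with the scala-java8-compat library it's one call in each
direction (a quick sketch):

  import java.util.concurrent.CompletableFuture
  import scala.compat.java8.FutureConverters._
  import scala.concurrent.Future

  val jf: CompletableFuture[Int] = CompletableFuture.completedFuture(42)
  val sf: Future[Int] = toScala(jf) // Java 8 -> Scala
  val back = toJava(sf)             // Scala -> Java 8 CompletionStage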

On Thu, Feb 1, 2018 at 11:22 PM, Alexey Romanenko 
wrote:

> +1, sounds great!
>
> Regards,
> Alexey
>
>
> On 2 Feb 2018, at 07:14, Thomas Weise  wrote:
>
> +1
>
>
> On Thu, Feb 1, 2018 at 9:07 PM, Jean-Baptiste Onofré 
> wrote:
>
>> +1
>>
>> Regards
>> JB
>>
>> On 02/01/2018 07:54 PM, Kenneth Knowles wrote:
>> > Hi all,
>> >
>> > Luke, Thomas, and I had some in-person discussions about the use of
>> Java 8
>> > futures and Guava futures in the portability support code. I wanted to
>> bring our
>> > thoughts to the dev list for feedback.
>> >
>> > As background:
>> >
>> >  - Java 5+ "Future" lacks the main purpose of future, which is async
>> chaining.
>> >  - Guava introduced ListenableFuture to do real future-oriented
>> programming
>> >  - Java 8 added CompletionStage which is more-or-less the expected
>> interface
>> >
>> > It is still debatable whether Java got it right [1]. But since it is
>> > standardized, doesn't need to be shaded, etc, it is worth trying to
>> just use it
>> > carefully in the right ways. So we thought to propose that we migrate
>> most uses
>> > of Guava futures to Java 8 futures.
>> >
>> > What do you think? Have we missed an important problem that would make
>> this a
>> > deal-breaker?
>> >
>> > Kenn
>> >
>> > [1]
>> > e.g. https://stackoverflow.com/questions/38744943/listenable
>> future-vs-completablefuture#comment72041244_39250452
>> > and such discussions are likely to occur whenever you bring it up with
>> someone
>> > who cares a lot about futures :-)
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
>
>


-- 
Twitter: https://twitter.com/holdenkarau


FOSDEM mini office hour?

2018-01-31 Thread Holden Karau
Hi BEAM Friends,

If any folks are around for FOSDEM this year I was planning on doing a
coffee office hour on the last day after my talks. Maybe like 6pm?
I'm also going to see if any Spark folks are around and interested :)

Cheers,

Holden :)


-- 
Twitter: https://twitter.com/holdenkarau


Re: Strata Conference this March 6-8

2018-01-16 Thread Holden Karau
How would folks feel about meeting during the afternoon break (3:20-4:20) on the
Wednesday (same day as Eugene's talk)? We could do the Philz which is a bit
of a walk but gets us away from the big crowd and also lets folks not
attending the conference but in the area join us.

On Tue, Jan 16, 2018 at 5:29 PM, Ron Gonzalez <zlgonza...@yahoo.com> wrote:

> Cool, let me know if you guys finally schedule it. I will definitely try
> to make it to Eugene's talk but having an informal BoF in the area would be
> nice...
>
> Thanks,
> Ron
>
> On Tuesday, January 16, 2018, 5:06:53 PM PST, Boris Lublinsky <
> boris.lublin...@lightbend.com> wrote:
>
>
> All for it
>
> Boris Lublinsky
> FDP Architect
> boris.lublin...@lightbend.com
> https://www.lightbend.com/
>
> On Jan 16, 2018, at 7:01 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> +1 to BoF
>
> On Tue, Jan 16, 2018 at 5:00 PM, Dmitry Demeshchuk <dmi...@postmates.com>
> wrote:
>
> Probably won't be attending the conference, but totally down for a BoF.
>
> On Tue, Jan 16, 2018 at 4:58 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
> Do interested folks have any timing constraints around a BoF?
>
> On Tue, Jan 16, 2018 at 4:30 PM, Jesse Anderson <je...@bigdatainstitute.io
> > wrote:
>
> +1 to BoF. I don't know if any Beam talks will be on the schedule.
>
> > We could do an informal BoF at the Philz nearby or similar?
>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>
>
>
>
> --
> Best regards,
> Dmitry Demeshchuk.
>
>
>
>


-- 
Twitter: https://twitter.com/holdenkarau


Re: Strata Conference this March 6-8

2018-01-16 Thread Holden Karau
Do interested folks have any timing constraints around a BoF?

On Tue, Jan 16, 2018 at 4:30 PM, Jesse Anderson 
wrote:

> +1 to BoF. I don't know if any Beam talks will be on the schedule.
>
> > We could do an informal BoF at the Philz nearby or similar?
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: Strata Conference this March 6-8

2018-01-16 Thread Holden Karau
We could do an informal BoF at the Philz nearby or similar?

On Wed, Jan 17, 2018 at 11:23 AM Eugene Kirpichov 
wrote:

> I'm giving a talk about splittable DoFns
> https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63696?locale=zh
>
> There are no other talks with the word "Beam" in the title, unless I
> missed something.
>
> On Tue, Jan 16, 2018 at 4:11 PM Ron Gonzalez  wrote:
>
>> Hi,
>>   Will there be some talks or representation of Apache Beam at the coming
>> Strata Conference this March 6-8?
>>   Would be great to hear someone talk about how Beam's been used at their
>> company as their core data integration platform.
>>
>> Thanks,
>> Ron
>>
>>
> --
Twitter: https://twitter.com/holdenkarau


Re: Pushing daily/test containers for python

2017-12-21 Thread Holden Karau
So I think we (or more accurately the PMC) need to be careful with how we
post the container artifacts from an Apache POV since they most likely
contain non-Apache licensed code (and also posting dailies can be
complicated since the PMC hasn’t voted on each one).

For just testing it should probably be OK but we need to make sure users
aren’t confused and think they are releases.


On Thu, Dec 21, 2017 at 10:03 AM Valentyn Tymofieiev 
wrote:

> The GCR repository can be configured with public pull access, which I
> think will be required to use the container.
>
> On Thu, Dec 21, 2017 at 2:34 AM, David Sabater Dinter <
> david.saba...@gmail.com> wrote:
>
>> +1
>> Hi,
>> It makes sense to use GCR (locality with GCP services and works like any
>> other container repository), the only caveat being that the images will
>> be private; anyone who needs to debug locally will need access to pull
>> the image, or will have to build it locally and push.
>> I agree getting closer to (a) is preferable assuming the build time
>> doesn't increase dramatically in the post commit process.
>>
>> On Thu, Dec 21, 2017 at 1:59 AM Henning Rohde  wrote:
>>
>>> +1
>>>
>>> It would be great to be able to test this aspect of portability. For
>>> testing purposes, I think whatever container registry is convenient to use
>>> for distribution is fine.
>>>
>>> Regarding frequency, I think we should consider something closer to (a).
>>> The container images -- although usually quite stable -- are part of the
>>> SDK at that commit and are not guaranteed to work with any other version.
>>> Breaking changes in their interaction would cause confusion and create
>>> noise. Any local tests can also in theory just build the container images
>>> directly and not use any registry, so it might make sense to set up the
>>> tests so that pushing occurs less frequently than building.
>>>
>>> Henning
>>>
>>>
>>>
>>> On Wed, Dec 20, 2017 at 3:10 PM, Ahmet Altay  wrote:
>>>
 Hi all,

 After some recent changes (e.g. [1]) we have a feasible container that
 we can use to test Python SDK on portability framework. Until now we were
 using Google provided container images for testing and for the released
 product. We can gradually move away from that (at least partially) for
 Python SDK.

 I would like to propose building containers for testing purposes only
 and pushing them to gcr.io as part of jenkins jobs. I would like to
 clarify two points with the team first:

 1. Use of GCR, I am proposing it for a few reasons:
 - Beam's jenkins workers run on GCP, and it would be easy to push them
 to gcr from there.
 - If we use another service (perhaps with a free tier for open source
 projects) we might be overusing it by pushing/pulling from our daily tests.
 - This is similar to how we stage some artifacts to GCS as part of the
 testing process.

 2. Frequency of building and pushing containers

 a. We can run it at every PR, by integrating with python post commit
 tests.
 b. We can run it daily, by having a new Jenkins job.
 c. We can run it manually, by having a parameterized Jenkins job that
 can build and push a new container from a tag/commit. Given that we
 infrequently change container code, I would suggest choosing this option.

 What do you think about this? To be clear, this is just a proposal
 about the testing environment. I am not suggesting anything about the
 release artifacts.

 Thank you,
 Ahmet

 [1] https://github.com/apache/beam/pull/4286

>>>
>>>
> --
Twitter: https://twitter.com/holdenkarau


Re: Introduction + interest in helping Beam builds, tests, and releases

2017-12-07 Thread Holden Karau
Also, and I know this is maybe a bit beyond the scope of what would make
sense initially, but if you wanted to set up something to test BEAM against
the new Spark/Flink RCs we could give feedback about any breaking changes
we see in upstream projects and I’d be happy to help with that :)

On Fri, Dec 8, 2017 at 9:44 AM Eugene Kirpichov 
wrote:

> Awesome, excited to see release validation automated! Please let me know
> if you need help getting Flink and Spark runner validation on Dataproc - I
> did that manually and it involved some non-obvious steps.
>
> On Thu, Dec 7, 2017 at 5:29 PM Alan Myrvold  wrote:
>
>> Hi, I'm Alan.
>>
>> I've been working with Google Cloud engineering productivity, and I'm
>> keen on improving the Beam release process and build/test infrastructure.
>>
>> I will be first looking into scripting some of the release validation
>> steps for the nightly java snapshot releases, but hope to learn and improve
>> the whole development and testing experience for Beam.
>>
>> Look forward to working with everyone!
>>
>> Alan Myrvold
>>
>> --
Twitter: https://twitter.com/holdenkarau


Re: Schema-Aware PCollections

2017-11-30 Thread Holden Karau
Rocking, I'll start leaving some comments on this. I'm excited to see work
being done in this area as well :)

On Thu, Nov 30, 2017 at 9:20 AM, Tyler Akidau <taki...@google.com> wrote:

> On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax <re...@google.com> wrote:
>
>> There has been a lot of conversation about schemas on PCollections
>> recently. There are a number of reasons for this. Schemas as first-class
>> objects in Beam provide a nice base for building BeamSQL. Spark has
>> provided schema-support via Dataframes for over two years, and it has
>> proved to be very popular among Spark users; it turns out that FlumeJava -
>> the original inspiration for the Beam API - has had schema support for even
>> longer, though this feature was not included in the Beam (at that time
>> Dataflow) API. It turns out that most records have structure, and allowing
>> the system to understand record structure can both simplify usage of the
>> system and allow for new performance optimizations.
>>
>> After discussion with JB, Eugene, Kenn, Robert, and a number of others on
>> the list, I've started a proposal document here
>> <https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit?usp=sharing>
>> describing how schemas can be added to Beam in a manner that integrates
>> with the existing Beam API. The goal is not to blindly copy existing systems
>> that have schemas, but rather to ensure that we get the best fit for Beam.
>> Please comment on this proposal - as much feedback as possible is valuable.
>>
>> In addition, you may notice this document is incomplete. While it does
>> sketch out how schemas can fit into Beam semantically, many portions of
>> this design remain to be fleshed out. In particular, the API signatures are
>> only sketched at a high level; exactly what all these APIs will look
>> like has not yet been defined. I would welcome help from interested members
>> of the community to define these APIs, and to make sure we're covering all
>> relevant use cases.
>>
>
> Thanks for sharing this Reuven, I'm excited to see this being discussed.
> One global comment: all of the existing examples are in Java. It would be
> great if we could design this with Python in mind (and how it could
> interact cleanly with Pandas) at the same time. +Robert Bradshaw
> <rober...@google.com> , +Holden Karau <hka...@google.com> , and +Ahmet
> Altay <al...@google.com> , all whom I've spoken with regarding this and
> other Python things recently, just to be sure they see it. But of course
> it'd be great if anyone working on Python could jump in.
>
> -Tyler
>
>
>
>>
>> Thanks all,
>>
>> Reuven
>>
>>
>>


-- 
Twitter: https://twitter.com/holdenkarau
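
For concreteness, a small illustrative sketch of the kind of named-field
access such schemas enable in Java -- the Schema/Row shapes below are an
assumption about where the API could land, not the proposal's settled
design:

import org.apache.beam.sdk.schemas.Schema;
import org.apache.beam.sdk.values.Row;

public class SchemaSketch {
  public static void main(String[] args) {
    // Declare the record structure once, so the system understands it.
    Schema userSchema =
        Schema.builder()
            .addStringField("userId")
            .addInt64Field("purchaseCount")
            .build();

    // A schema-aware element: fields are addressable by name instead of
    // being an opaque blob that only user code can decode.
    Row row =
        Row.withSchema(userSchema)
            .addValues("alice", 42L)
            .build();

    System.out.println(row.getString("userId"));       // alice
    System.out.println(row.getInt64("purchaseCount")); // 42
  }
}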


Re: [VOTE] Fixing @yyy.com.INVALID mailing addresses

2017-11-22 Thread Holden Karau
+1 (non-binding)

On Wed, Nov 22, 2017 at 4:06 PM Kenneth Knowles 
wrote:

> +1
>
> On Wed, Nov 22, 2017 at 3:43 PM, Lukasz Cwik 
> wrote:
>
> > +1
> >
> > On Wed, Nov 22, 2017 at 3:35 PM, Reuven Lax 
> > wrote:
> >
> > > +1
> > >
> > > On Nov 22, 2017 3:29 PM, "Ben Sidhom" 
> wrote:
> > >
> > > > I'm not a PMC member, but this would be especially valuable if it
> > > > propagated DKIM signatures properly.
> > > >
> > > > On Wed, Nov 22, 2017 at 3:25 PM, Lukasz Cwik
>  > >
> > > > wrote:
> > > >
> > > > > I have noticed that some e-mail addresses (notably @google.com)
> get
> > > > > .INVALID suffixed onto it so per...@yyy.com become
> > > > per...@yyy.com.INVALID
> > > > > in the From: header.
> > > > >
> > > > > I have figured out that this is an issue with the way that our mail
> > > > server
> > > > > is configured and opened https://issues.apache.org/
> > > > jira/browse/INFRA-15529
> > > > > .
> > > > >
> > > > > For those of us that are impacted, it makes it more difficult for
> > users
> > > > to
> > > > > reply directly to the originator.
> > > > >
> > > > > Infra has asked to get consensus from PMC members before making the
> > > > change
> > > > > which I figured it would be easiest with a vote.
> > > > >
> > > > > Please vote:
> > > > > +1 Update mail server to stop suffixing .INVALID
> > > > > -1 Don't change mail server settings.
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > -Ben
> > > >
> > >
> >
>
-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Choose the "new" Spark runner

2017-11-20 Thread Holden Karau
[ ] Use Spark 1 & Spark 2 Support Branch
[ X ] Use Spark 2 Only Branch

non-binding

On Mon, Nov 20, 2017 at 1:00 AM, Etienne Chauchot 
wrote:

> [ ] Use Spark 1 & Spark 2 Support Branch
> [X] Use Spark 2 Only Branch
>
> Best
> Etienne
>
>
>
> Le 19/11/2017 à 13:56, Tyler Akidau a écrit :
>
>> [ ] Use Spark 1 & Spark 2 Support Branch
>> [X] Use Spark 2 Only Branch
>>
>
>


-- 
Twitter: https://twitter.com/holdenkarau


Re: Questions with containerized runners plans?

2017-11-18 Thread Holden Karau
Cool, thanks!

It seems like some good follow-ups might exist to simplify things for
Python users so they don’t have to roll their own Dockerfiles (like
allowing them to provide a requirements.txt which is used in the
Dockerfile) :)

I’m really excited about the direction with the containerized runners :)

On Sat, Nov 18, 2017 at 6:12 PM Henning Rohde <hero...@google.com.invalid>
wrote:

> A benefit of using docker containers is that (nearly) arbitrary native
> dependencies can be installed in the container image itself by either the
> user or SDK. For example, the (minimal, in progress) Python container
> Dockerfile is here:
>
>
>
> https://github.com/apache/beam/blob/1039f5b9682fa6aa5fba256110c63caf4d0da41f/sdks/python/container/Dockerfile
>
> Any user could simply augment it with "pip install" commands, say, or use
> something else entirely (although the corresponding boot program may also
> need to change in that case). The Python SDK itself might also include
> options/scripts/etc to make common customizations easier to use to avoid
> installing them at runtime. Multiple Dockerfiles can also co-exist. For
> actually passing the container image to the runner, it's a choice made by
> each SDK, which is why it's not discussed much in the portability context.
> But a uniform flag along the lines of --sdk_harness_container_image to
> include the image into the pipeline proto would seem desirable. That said,
> I don't think how all these capabilities would best be exposed to users has
> been much explored yet in any SDK.
>
> Finally, there have been several thoughts on cross-language pipelines and I
> think it's a very exciting aspect of the portability framework. A doc is
> here:
>
>https://s.apache.org/beam-mixed-language-pipelines.
>
> It is also linked from design section in the portability page.
>
> Thanks,
>  Henning
>
>
> On Sat, Nov 18, 2017 at 6:33 AM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
> > So I was looking through https://beam.apache.org/contribute/portability/
> > which led me to BEAM-2900, and then to
> > https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit#
> > .
> >
> > I was wondering if there are any considerations being given to native
> > dependencies that user code may have (especially things like Python
> > packages which can be super painful to deal with in a Spark cluster
> unless
> > you use one of the vendor solutions)?
> >
> > Also, and this may be a terrible idea, but has there been thought given
> to
> > the idea of cross-language pipelines (I see these in Spark occasionally
> > but with the DL stuff happening I suspect we might see users wanting
> > cross-language functionality more often)?
> >
> > I also saw "Proposal: introduce an option to pass SDK harness container
> > image in Beam SDKs" & it seems like Robert brought up the benefits of
> using
> > Docker for Python runners, but I don't see the details on how we would
> > expose that to users in the design docs I've found yet (which could
> very
> > well be I'm not looking at the right docs).
> >
> > Cheers,
> >
> > Holden :)
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
> >
>
-- 
Twitter: https://twitter.com/holdenkarau


Questions with containerized runners plans?

2017-11-18 Thread Holden Karau
So I was looking through https://beam.apache.org/contribute/portability/
which led me to BEAM-2900, and then to
https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit#
.

I was wondering if there are any considerations being given to native
dependencies that user code may have (especially things like Python
packages which can be super painful to deal with in a Spark cluster unless
you use one of the vendor solutions)?

Also, and this may be a terrible idea, but has there been thought given to
the idea of cross-language pipelines (I see these in Spark occasionally
but with the DL stuff happening I suspect we might see users wanting
cross-language functionality more often)?

I also saw "Proposal: introduce an option to pass SDK harness container
image in Beam SDKs" & it seems like Robert brought up the benefits of using
Docker for Python runners, but I don't see the details on how we would
expose that to users in the design docs I've found yet (which could very
well be I'm not looking at the right docs).

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-08 Thread Holden Karau
That's a good point about Oozie only supporting Spark 1 or 2 at a time on a
cluster -- but do we know of people using Oozie and Spark 1 that
would still be using Spark 1 by the time of the next BEAM release? The last
Spark 1 release was a year ago (and last non-maintenance release almost 20
months ago).

On Wed, Nov 8, 2017 at 9:30 PM, NerdyNick <nerdyn...@gmail.com> wrote:

> I don't know if ditching Spark 1 outright right now would be a great move
> given that a lot of the main support applications around Spark haven't
> fully moved to Spark 2 yet, let alone support having a cluster with both.
> Oozie, for example, is still pre-stable-release for its Spark 1
> integration and can't support a cluster with mixed Spark versions. I
> think maybe doing as suggested above with the common, spark1, spark2
> packaging might be best during this carry-over phase. Maybe even just
> flagging Spark 1 as deprecated and maintenance-only might be enough.
>
> On Wed, Nov 8, 2017 at 10:25 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
>
> > Also, upgrading Spark 1 to 2 is generally easier than changing JVM
> > versions. For folks using YARN or the hosted environments it's pretty much
> > trivial since you can effectively have distinct Spark clusters for each
> > job.
> >
> > On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau <hol...@pigscanfly.ca>
> wrote:
> >
> > > I'm +1 on dropping Spark 1. There are a lot of exciting improvements in
> > > Spark 2, and trying to write efficient code that runs between Spark 1
> and
> > > Spark 2 is super painful in the long term. It would be one thing if
> there
> > > were a lot of people available to work on the Spark runners, but it
> seems
> > > like we'd be better served focusing our energy on the future.
> > >
> > > I don't know a lot of folks who are stuck on Spark 1, and the few that
> I
> > > know are planning to migrate in the next few months anyways.
> > >
> > > Note: this is a non-binding vote as I'm not a committer or PMC member.
> > >
> > > On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu <yuzhih...@gmail.com> wrote:
> > >
> > >> Having both Spark1 and Spark2 modules would benefit a wider user base.
> > >>
> > >> I would vote for that.
> > >>
> > >> Cheers
> > >>
> > >> On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > >> wrote:
> > >>
> > >> > Hi Robert,
> > >> >
> > >> > Thanks for your feedback !
> > >> >
> > >> > From a user perspective, with the current state of the PR, the same
> > >> > pipelines can run on both Spark 1.x and 2.x: the only difference is
> > the
> > >> > dependencies set.
> > >> >
> > >> > I'm calling the vote to get this kind of feedback: if we consider
> > Spark
> > >> > 1.x still need to be supported, no problem, I will improve the PR to
> > >> have
> > >> > three modules (common, spark1, spark2) and let users pick the
> desired
> > >> > version.
> > >> >
> > >> > Let's wait a bit other feedbacks, I will update the PR accordingly.
> > >> >
> > >> > Regards
> > >> > JB
> > >> >
> > >> >
> > >> > On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
> > >> >
> > >> >> I'm generally a -0.5 on this change, or at least doing so hastily.
> > >> >>
> > >> >> As with dropping Java 7 support, I think this should at least be
> > >> >> announced in release notes that we're considering dropping support
> in
> > >> >> the subsequent release, as this dev list likely does not reach a
> > >> >> substantial portion of the userbase.
> > >> >>
> > >> >> How much work is it to move from a Spark 1.x cluster to a Spark 2.x
> > >> >> cluster? I get the feeling it's not nearly as transparent as
> > upgrading
> > >> >> Java versions. Can Spark 1.x pipelines be run on Spark 2.x
> clusters,
> > >> >> or is a new cluster (and/or upgrading all pipelines) required (e.g.
> > >> >> for those who operate spark clusters shared among their many
> users)?
> > >> >>
> > >> >> Looks like the latest release of Spark 1.x was about a year ago,
> > >> >> overlapping a bit with the 2.x series which is coming up 

Re: Portability overview webpage

2017-11-08 Thread Holden Karau
Awesome! Out of interest is there any discussion around common formats for
interchange going on?

On Tue, Nov 7, 2017 at 9:15 AM, Henning Rohde 
wrote:

> Thanks everyone! The page is now live at:
>
>https://beam.apache.org/contribute/portability/
>
> Henning
>
> On Thu, Nov 2, 2017 at 8:22 AM, Kenneth Knowles 
> wrote:
>
> > This is a superb high-level overview of the effort, understandable at a
> > glance. I think it is the first time someone has made it clear what we
> are
> > actually doing!
> >
> > Kenn
> >
> > On Wed, Nov 1, 2017 at 10:23 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > Thanks for the update. I will take a look.
> > >
> > > Regards
> > > JB
> > >
> > > On Nov 1, 2017, 18:21, at 18:21, Henning Rohde
> > 
> > > wrote:
> > > >Hi everyone,
> > > >
> > > >Although portability is a large and involved effort, it seems it
> > > >doesn't
> > > >have a high-level overview and plan written down anywhere. I added a
> > > >proposed page with a 10,000 ft view and links to the website under
> > > >'Contribute (technical references)'. There is a page for ongoing
> > > >projects,
> > > >but portability is much more encompassing and seems to be more suited
> > > >for
> > > >its own page.
> > > >
> > > >The PR is:
> > > >
> > > > https://github.com/apache/beam-site/pull/340
> > > >
> > > >I'm sending it out to the dev list for more visibility. Please let me
> > > >know
> > > >if you have any comments or objections, or if there is a better place
> > > >for
> > > >this content.
> > > >
> > > >Thanks,
> > > > Henning
> > >
> >
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-08 Thread Holden Karau
Also, upgrading Spark 1 to 2 is generally easier than changing JVM
versions. For folks using YARN or the hosted environments it's pretty much
trivial since you can effectively have distinct Spark clusters for each job.

On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> I'm +1 on dropping Spark 1. There are a lot of exciting improvements in
> Spark 2, and trying to write efficient code that runs between Spark 1 and
> Spark 2 is super painful in the long term. It would be one thing if there
> were a lot of people available to work on the Spark runners, but it seems
> like we'd be better served focusing our energy on the future.
>
> I don't know a lot of folks who are stuck on Spark 1, and the few that I
> know are planning to migrate in the next few months anyways.
>
> Note: this is a non-binding vote as I'm not a committer or PMC member.
>
> On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Having both Spark1 and Spark2 modules would benefit a wider user base.
>>
>> I would vote for that.
>>
>> Cheers
>>
>> On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>> > Hi Robert,
>> >
>> > Thanks for your feedback !
>> >
>> > From a user perspective, with the current state of the PR, the same
>> > pipelines can run on both Spark 1.x and 2.x: the only difference is the
>> > dependencies set.
>> >
>> > I'm calling the vote to get this kind of feedback: if we consider Spark
>> > 1.x still need to be supported, no problem, I will improve the PR to
>> have
>> > three modules (common, spark1, spark2) and let users pick the desired
>> > version.
>> >
>> > Let's wait a bit other feedbacks, I will update the PR accordingly.
>> >
>> > Regards
>> > JB
>> >
>> >
>> > On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
>> >
>> >> I'm generally a -0.5 on this change, or at least doing so hastily.
>> >>
>> >> As with dropping Java 7 support, I think this should at least be
>> >> announced in release notes that we're considering dropping support in
>> >> the subsequent release, as this dev list likely does not reach a
>> >> substantial portion of the userbase.
>> >>
>> >> How much work is it to move from a Spark 1.x cluster to a Spark 2.x
>> >> cluster? I get the feeling it's not nearly as transparent as upgrading
>> >> Java versions. Can Spark 1.x pipelines be run on Spark 2.x clusters,
>> >> or is a new cluster (and/or upgrading all pipelines) required (e.g.
>> >> for those who operate spark clusters shared among their many users)?
>> >>
>> >> Looks like the latest release of Spark 1.x was about a year ago,
>> >> overlapping a bit with the 2.x series which is coming up on 1.5 years
>> >> old, so I could see a lot of people still using 1.x even if 2.x is
>> >> clearly the future. But it sure doesn't seem very backwards
>> >> compatible.
>> >>
>> >> Mostly I'm not comfortable with dropping 1.x in the same release as
>> >> adding support for 2.x, giving no transition period, but could be
>> >> convinced if this transition is mostly a no-op or no one's still using
>> >> 1.x. If there's non-trivial code complexity issues, I would perhaps
>> >> revisit the issue of having a single Spark Runner that does chooses
>> >> the backend implicitly in favor of simply having two runners which
>> >> share the code that's easy to share and diverge otherwise (which seems
>> >> it would be much simpler both to implement and explain to users). I
>> >> would be OK with even letting the Spark 1.x runner be somewhat
>> >> stagnant (e.g. few or no new features) until we decide we can kill it
>> >> off.
>> >>
>> >> On Tue, Nov 7, 2017 at 11:27 PM, Jean-Baptiste Onofré <j...@nanthrax.net
>> >
>> >> wrote:
>> >>
>> >>> Hi all,
>> >>>
>> >>> as you might know, we are working on Spark 2.x support in the Spark
>> >>> runner.
>> >>>
>> >>> I'm working on a PR about that:
>> >>>
>> >>> https://github.com/apache/beam/pull/3808
>> >>>
>> >>> Today, we have something working with both Spark 1.x and 2.x from a
>> code
>> >>> standpoint, but I have to deal with dependencies. It's the first step
>> of
&

Re: python3 support schedule

2017-11-05 Thread Holden Karau
If anyone wants to help on the inference stuff:
https://issues.apache.org/jira/browse/BEAM-3143 + WIP PR @
https://github.com/apache/beam/pull/4079 .

On Sat, Nov 4, 2017 at 11:31 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

> I've got some work in progress, although right now I'm stuck on handling the
> for comprehensions in Python 3 type inference (see
> https://github.com/holdenk/beam/tree/try-py3-by-hand )
>
> Personally I'll probably take a look into Python on other runners after
> this rather than Dataflow.
>
> On Sat, Nov 4, 2017 at 11:17 PM, Ahmet Altay <al...@google.com.invalid>
> wrote:
>
>> For reference https://issues.apache.org/jira/browse/BEAM-1251 is the
>> umbrella issue tracking python3 support in the core SDK. There needs to be
>> additional runner specific work (e.g. DataflowRunner needs to use python3
>> binary on its workers) once the core work is completed.
>>
>> Ahmet
>>
>> On Thu, Nov 2, 2017 at 12:58 PM, Jesse Anderson <
>> je...@bigdatainstitute.io>
>> wrote:
>>
>> > Holden is being modest in her contributions to Python frameworks,
>> > especially Apache Spark.
>> >
>> > On Thu, Nov 2, 2017 at 12:55 PM Holden Karau <hol...@pigscanfly.ca>
>> wrote:
>> >
>> > > Hi! So this is something I'm currently working on (e.g. in between
>> > checking
>> > > my e-mails :p). If you want to help join in we can split up the work
>> into
>> > > smaller components and parallelize the process a bit :) Always happy
>> to
>> > see
>> > > more folks who care about Python 3 support.
>> > >
>> > > On Thu, Nov 2, 2017 at 12:44 PM, Lukasz Cwik <lc...@google.com.invalid
>> >
>> > > wrote:
>> > >
>> > > > Contributions are always welcome to improve progress.
>> > > >
>> > > > You can always vote/watch the Python 3 JIRA issue as this helps
>> people
>> > > know
>> > > > what others are looking for.
>> > > >
>> > > > On Thu, Nov 2, 2017 at 10:33 AM, Yue Yang <shoomi...@gmail.com>
>> wrote:
>> > > >
>> > > > > Hello,
>> > > > >   I wonder what is the schedule to support python 3. It seems that
>> > the
>> > > > > progress is very slow.
>> > > > >   Thanks.
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Twitter: https://twitter.com/holdenkarau
>> > >
>> > --
>> > Thanks,
>> >
>> > Jesse
>> >
>>
>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: python3 support schedule

2017-11-05 Thread Holden Karau
I've got some work in progress, although right now I'm stuck on handling the
for comprehensions in Python 3 type inference (see
https://github.com/holdenk/beam/tree/try-py3-by-hand )

Personally I'll probably take a look into Python on other runners after
this rather than Dataflow.

On Sat, Nov 4, 2017 at 11:17 PM, Ahmet Altay <al...@google.com.invalid>
wrote:

> For reference https://issues.apache.org/jira/browse/BEAM-1251 is the
> umbrella issue tracking python3 support in the core SDK. There needs to be
> additional runner specific work (e.g. DataflowRunner needs to use python3
> binary on its workers) once the core work is completed.
>
> Ahmet
>
> On Thu, Nov 2, 2017 at 12:58 PM, Jesse Anderson <je...@bigdatainstitute.io
> >
> wrote:
>
> > Holden is being modest in her contributions to Python frameworks,
> > especially Apache Spark.
> >
> > On Thu, Nov 2, 2017 at 12:55 PM Holden Karau <hol...@pigscanfly.ca>
> wrote:
> >
> > > Hi! So this is something I'm currently working on (e.g. in between
> > checking
> > > my e-mails :p). If you want to help join in we can split up the work
> into
> > > smaller components and parallelize the process a bit :) Always happy to
> > see
> > > more folks who care about Python 3 support.
> > >
> > > On Thu, Nov 2, 2017 at 12:44 PM, Lukasz Cwik <lc...@google.com.invalid
> >
> > > wrote:
> > >
> > > > Contributions are always welcome to improve progress.
> > > >
> > > > You can always vote/watch the Python 3 JIRA issue as this helps
> people
> > > know
> > > > what others are looking for.
> > > >
> > > > On Thu, Nov 2, 2017 at 10:33 AM, Yue Yang <shoomi...@gmail.com>
> wrote:
> > > >
> > > > > Hello,
> > > > >   I wonder what is the schedule to support python 3. It seems that
> > the
> > > > > progress is very slow.
> > > > >   Thanks.
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Twitter: https://twitter.com/holdenkarau
> > >
> > --
> > Thanks,
> >
> > Jesse
> >
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: python3 support schedule

2017-11-02 Thread Holden Karau
Hi! So this is something I'm currently working on (e.g. in between checking
my e-mails :p). If you want to help join in we can split up the work into
smaller components and parallelize the process a bit :) Always happy to see
more folks who care about Python 3 support.

On Thu, Nov 2, 2017 at 12:44 PM, Lukasz Cwik 
wrote:

> Contributions are always welcome to improve progress.
>
> You can always vote/watch the Python 3 JIRA issue as this helps people know
> what others are looking for.
>
> On Thu, Nov 2, 2017 at 10:33 AM, Yue Yang  wrote:
>
> > Hello,
> >   I wonder what is the schedule to support python 3. It seems that the
> > progress is very slow.
> >   Thanks.
> >
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: [VOTE] Migrate to gitbox

2017-10-11 Thread Holden Karau
+1 (non-binding)

On Wed, Oct 11, 2017 at 12:25 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> +1
>
> On Wed, Oct 11, 2017 at 10:53 AM, Lukasz Cwik 
> wrote:
>
> > +1
> >
> > On Tue, Oct 10, 2017 at 12:55 PM, Jason Kuster <
> > jasonkus...@google.com.invalid> wrote:
> >
> > > +1 (non-binding)
> > >
> > > Once we move to gitbox I'm happy to look at enabling MergeBot for the
> > main
> > > repo too. I'll have an email out about that later this week.
> > >
> > > On Tue, Oct 10, 2017 at 9:44 AM, Ahmet Altay  >
> > > wrote:
> > >
> > > > +1
> > > >
> > > > On Tue, Oct 10, 2017 at 9:35 AM, Thomas Groh
>  > >
> > > > wrote:
> > > >
> > > > > +1
> > > > >
> > > > > On Tue, Oct 10, 2017 at 9:12 AM, Kenneth Knowles
> > >  > > > >
> > > > > wrote:
> > > > >
> > > > > > +1
> > > > > >
> > > > > > On Tue, Oct 10, 2017 at 8:33 AM, Tyler Akidau
> > > >  > > > > >
> > > > > > wrote:
> > > > > >
> > > > > > > +1
> > > > > > >
> > > > > > > On Tue, Oct 10, 2017 at 2:13 AM Ismaël Mejía <
> ieme...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > +1 (non-binding)
> > > > > > > >
> > > > > > > > On Tue, Oct 10, 2017 at 10:42 AM, Aljoscha Krettek <
> > > > > > aljos...@apache.org>
> > > > > > > > wrote:
> > > > > > > > > +1
> > > > > > > > >
> > > > > > > > >> On 10. Oct 2017, at 09:42, Jean-Baptiste Onofré <
> > > > j...@nanthrax.net>
> > > > > > > > wrote:
> > > > > > > > >>
> > > > > > > > >> Hi all,
> > > > > > > > >>
> > > > > > > > >> following the discussion, here's the formal vote to
> migrate
> > to
> > > > > > gitbox:
> > > > > > > > >>
> > > > > > > > >> [ ] +1, Approve to migrate to gitbox
> > > > > > > > >> [ ] -1, Do not migrate (please provide specific comments)
> > > > > > > > >>
> > > > > > > > >> The vote will be open for at least 36 hours. It is adopted
> > by
> > > > > > majority
> > > > > > > > >> approval, with at least 3 PMC affirmative votes.
> > > > > > > > >>
> > > > > > > > >> Thanks,
> > > > > > > > >> Regards
> > > > > > > > >> JB
> > > > > > > > >> --
> > > > > > > > >> Jean-Baptiste Onofré
> > > > > > > > >> jbono...@apache.org
> > > > > > > > >> http://blog.nanthrax.net
> > > > > > > > >> Talend - http://www.talend.com
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > ---
> > > Jason Kuster
> > > Apache Beam / Google Cloud Dataflow
> > >
> >
>



-- 
Twitter: https://twitter.com/holdenkarau


Re: Beam spark 2.x runner status

2017-08-21 Thread Holden Karau
I'd love to take a look at the PR when it comes in (<3 BEAM + SPARK :)).

On Mon, Aug 21, 2017 at 11:33 AM, Jean-Baptiste Onofré 
wrote:

> Hi
>
> I did a new runner supporting spark 2.1.x. I changed code for that.
>
> I'm still on vacation this week. I will send an update when back.
>
> Regards
> JB
>
> On Aug 21, 2017, 09:01, at 09:01, Pei HE  wrote:
> >Any updates for upgrading to spark 2.x?
> >
> >I tried to replace the dependency and found a compile error from
> >implementing a Scala trait:
> >org.apache.beam.runners.spark.io.SourceRDD.SourcePartition is not
> >abstract
> >and does not override abstract method
> >org$apache$spark$Partition$$super$equals(java.lang.Object) in
> >org.apache.spark.Partition
> >
> >(The spark side change was introduced in
> >https://github.com/apache/spark/pull/12157.)
> >
> >Does anyone have ideas about this compile error?
> >
> >
> >On Wed, May 3, 2017 at 1:32 PM, Jean-Baptiste Onofré 
> >wrote:
> >
> >> Hi Ted,
> >>
> >> My branch used Spark 2.1.0 and I just updated to 2.1.1.
> >>
> >> As discussed with Aviem, I should be able to create the pull request
> >later
> >> today.
> >>
> >> Regards
> >> JB
> >>
> >>
> >> On 05/03/2017 02:50 AM, Ted Yu wrote:
> >>
> >>> Spark 2.1.1 has been released.
> >>>
> >>> Consider using the new release in this work.
> >>>
> >>> Thanks
> >>>
> >>> On Wed, Mar 29, 2017 at 5:43 AM, Jean-Baptiste Onofré
> >
> >>> wrote:
> >>>
> >>> Cool for the PR merge, I will rebase my branch on it.
> 
>  Thanks !
>  Regards
>  JB
> 
> 
>  On 03/29/2017 01:58 PM, Amit Sela wrote:
> 
>  @Ted definitely makes sense.
> > @JB I'm merging https://github.com/apache/beam/pull/2354 soon so
> >any
> > deprecated Spark API issues should be resolved.
> >
> > On Wed, Mar 29, 2017 at 2:46 PM Ted Yu 
> >wrote:
> >
> > This is what I did over HBASE-16179:
> >
> >>
> >> -f.call((asJavaIterator(it), conn)).iterator()
> >> +// the return type is different in spark 1.x & 2.x, we handle both cases
> >> +f.call(asJavaIterator(it), conn) match {
> >> +  // spark 1.x
> >> +  case iterable: Iterable[R] => iterable.iterator()
> >> +  // spark 2.x
> >> +  case iterator: Iterator[R] => iterator
> >> +}
> >>)
> >>
> >> FYI
> >>
> >> On Wed, Mar 29, 2017 at 1:47 AM, Amit Sela 
> >> wrote:
> >>
> >> Just tried to replace dependencies and see what happens:
> >>
> >>>
> >>> Most required changes are about the runner using deprecated
> >Spark
> >>> APIs,
> >>>
> >>> and
> >>
> >> after fixing them the only real issue is with the Java API for
> >>> Pair/FlatMapFunction that changed return value to Iterator (in
> >1.6 its
> >>> Iterable).
> >>>
> >>> So I'm not sure that a profile that simply sets dependency on
> >>> 1.6.3/2.1.0
> >>> is feasible.
> >>>
> >>> On Thu, Mar 23, 2017 at 10:22 AM Kobi Salant
> >
> >>> wrote:
> >>>
> >>> So, if everything is in place in Spark 2.X and we use provided
> >>>
> 
>  dependencies
> >>>
> >>> for Spark in Beam.
>  Theoretically, you can run the same code in 2.X without any
> >need for
>  a
>  branch?
> 
>  2017-03-23 9:47 GMT+02:00 Amit Sela :
> 
>  If StreamingContext is valid and we don't have to use
> >SparkSession,
> 
> >
> > and
> 
> >>>
> >> Accumulators are valid as well and we don't need AccumulatorsV2,
> >I
> >>>
> 
> > don't
> 
> >>>
> >>> see a reason this shouldn't work (which means there are still
> >tons of
> 
> > reasons this could break, but I can't think of them off the
> >top of
> > my
> >
> > head
> 
>  right now).
> >
> > @JB simply add a profile for the Spark dependencies and run
> >the
> >
> > tests -
> 
> >>>
> >> you'll have a very definitive answer ;-) .
> >>>
>  If this passes, try on a cluster running Spark 2 as well.
> >
> > Let me know of I can assist.
> >
> > On Thu, Mar 23, 2017 at 6:55 AM Jean-Baptiste Onofré <
> >
> > j...@nanthrax.net>
> 
> >>>
> >> wrote:
> >>>
> 
> > Hi guys,
> >
> >>
> >> Ismaël summarize well what I have in mind.
> >>
> >> I'm a bit late on the PoC around that (I started a branch
> >already).
> >> I will move forward over the week end.
> >>
> >> Regards
> >> JB
> >>
> >> On 03/22/2017 11:42 PM, Ismaël
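
For concreteness, an untested Java sketch of the two Spark 1.x -> 2.x API
issues raised in this thread; the class names are made up for illustration
and this shows one possible workaround, not necessarily what the Beam
runner ended up doing:

import java.util.Arrays;
import java.util.Iterator;
import org.apache.spark.Partition;
import org.apache.spark.api.java.function.FlatMapFunction;

public class Spark2MigrationSketch {

  // Pei's compile error: in Spark 2.x the Partition trait overrides
  // equals() with a body that calls super.equals(), and scalac encodes
  // that trait-super call as an abstract method named
  // org$apache$spark$Partition$$super$equals. Scala subclasses get it
  // generated for them; a Java subclass has to spell it out by hand
  // ('$' is a legal character in Java identifiers).
  static class SourcePartitionSketch implements Partition {
    private final int index;

    SourcePartitionSketch(int index) {
      this.index = index;
    }

    @Override
    public int index() {
      return index;
    }

    public boolean org$apache$spark$Partition$$super$equals(Object other) {
      return super.equals(other);
    }
  }

  // Amit's Iterable -> Iterator issue: FlatMapFunction.call() returns
  // Iterable<T> in Spark 1.6 but Iterator<T> in Spark 2.x, so the same
  // code cannot compile against both APIs.
  static final FlatMapFunction<String, String> SPLIT_WORDS =
      line -> Arrays.asList(line.split(" ")).iterator(); // Spark 2.x form
}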