Re: Reference to Beam in upcoming Kubeflow Book

2020-04-17 Thread Robert Bradshaw
On Fri, Apr 17, 2020 at 4:58 PM Holden Karau  wrote:

>
> On Fri, Apr 17, 2020 at 3:52 PM Robert Bradshaw 
> wrote:
>
>> On Fri, Apr 17, 2020 at 2:56 PM Holden Karau 
>> wrote:
>>
>>>
>>> On Fri, Apr 17, 2020 at 2:45 PM Robert Bradshaw 
>>> wrote:
>>>
>>>> Hi Holden!
>>>>
>>>> I agree with Kyle that it makes sense to have some caveat about Flink
>>>> and Spark, though at this point they're not /that/ new (at least not
>>>> Flink).

>>> True, maybe "early-stage" would be better wording?  The TFX PyBeam Flink
>>> support isn't yet mature enough (although there is interest in integrating
>>> it in Kubeflow I believe, it hasn't happened yet).
>>>
>>
>> I might just say "not as mature." Most of the work being done now is
>> fit-n-finish. There's also some extra flags that need to be passed to work
>> around bugs in Flink itself encountered when running TFX jobs.
>>
> Does this currently work at scale? The last time I tried to use TFX on
> Beam on Flink it had difficulty with data above ~10 MB.
>

The largest TFX job I've personally run on Flink is about 1 GB (local
cluster), but that was quite a while ago. As mentioned, there is a flag or
two (BATCH_FORCED IIRC) you have to pass to work around Flink getting stuck
in its memory allocation routines. (I don't remember what the final status
of the TFX benchmarks on Flink is, though...)

>> (There's the separate question of using Kubernetes to deploy/manage the
>> Flink cluster itself, but the mode where Flink workers invoke docker to
>> start up the Python binaries is pretty stable at this point.)
>>
> So we would say maybe the OSS path would be to run TFX on Beam on Flink on
> YARN (like EMR)?
>

Flink has several deployment options, and Beam doesn't care which one you
use. The basic mode of operation is that you submit an uber jar just like an
"ordinary" Flink job, and the docker command must be available on the
workers. (There are more complicated setups, like the one that Lyft uses to
avoid docker-in-docker on their Kubernetes deployment, but that's more
advanced usage...)
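
As a rough illustration of the Python side of that basic mode, here is a
minimal sketch that only assembles the command-line flags. The helper name
flink_runner_args and the localhost:8081 address are hypothetical, and the
flag spellings (runner, flink_master, environment_type,
execution_mode_for_batch) are assumptions based on Beam's documented Flink
runner options; check them against the Beam release in use.

```python
def flink_runner_args(flink_master, force_batch=True):
    """Assemble Beam pipeline flags for a Python job on an existing Flink cluster.

    Hypothetical helper for illustration; it only builds strings and does not
    import Beam or contact a cluster.
    """
    args = [
        # Hand the pipeline to Beam's Flink runner.
        "--runner=FlinkRunner",
        # Address of the Flink cluster the job is submitted to.
        "--flink_master=" + flink_master,
        # Flink workers invoke docker to start the Python SDK workers.
        "--environment_type=DOCKER",
    ]
    if force_batch:
        # Assumed spelling of the BATCH_FORCED workaround mentioned above.
        args.append("--execution_mode_for_batch=BATCH_FORCED")
    return args


if __name__ == "__main__":
    for flag in flink_runner_args("localhost:8081"):
        print(flag)
```

The resulting list could then be passed to
apache_beam.options.pipeline_options.PipelineOptions when constructing the
pipeline.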

But perhaps we're getting a bit off topic here. I think "not as mature"
explains things the best. I see no reason it shouldn't run at scale, but I
would want regular benchmarking set up before promising anything.


>>>> I am curious what extra support Kubeflow is "missing" (or, conversely,
>>>> what extra support it has for Dataflow that goes beyond just specifying a
>>>> different runner) to the point that these runners are declared
>>>> "unsupported." Or is it literally a matter of not providing user support?

>>> So the Kubeflow TFX components (in
>>> https://github.com/kubeflow/pipelines/tree/master/components) are
>>> limited to local mode.
>>>
>>
>> So in that sense it's not less supported than Dataflow?
>>
> From the component side it’s the same. But if someone wanted to do it “by
> hand”, Dataflow offers better support.
>

Ack.



Re: Reference to Beam in upcoming Kubeflow Book

2020-04-17 Thread Holden Karau
On Fri, Apr 17, 2020 at 3:52 PM Robert Bradshaw  wrote:

> On Fri, Apr 17, 2020 at 2:56 PM Holden Karau  wrote:
>
>>
>> On Fri, Apr 17, 2020 at 2:45 PM Robert Bradshaw 
>> wrote:
>>
>>> Hi Holden!
>>>
>>> I agree with Kyle that it makes sense to have some caveat about Flink
>>> and Spark, though at this point they're not /that/ new (at least not
>>> Flink).
>>>
>> True, maybe "early-stage" would be better wording?  The TFX PyBeam Flink
>> support isn't yet mature enough (although there is interest in integrating
>> it in Kubeflow I believe, it hasn't happened yet).
>>
>
> I might just say "not as mature." Most of the work being done now is
> fit-n-finish. There's also some extra flags that need to be passed to work
> around bugs in Flink itself encountered when running TFX jobs.
>
Does this currently work at scale? The last time I tried to use TFX on Beam
on Flink it had difficulty with data above ~10 MB.

> (There's the separate question of using Kubernetes to deploy/manage the
> Flink cluster itself, but the mode where Flink workers invoke docker to
> start up the Python binaries is pretty stable at this point.)
>
So we would say maybe the OSS path would be to run TFX on Beam on Flink on
YARN (like EMR)?

>
>
>>> I am curious what extra support Kubeflow is "missing" (or, conversely,
>>> what extra support it has for Dataflow that goes beyond just specifying a
>>> different runner) to the point that these runners are declared
>>> "unsupported." Or is it literally a matter of not providing user support?
>>>
>> So the Kubeflow TFX components (in
>> https://github.com/kubeflow/pipelines/tree/master/components) are
>> limited to local mode.
>>
>
> So in that sense it's not less supported than Dataflow?
>
From the component side it’s the same. But if someone wanted to do it “by
hand”, Dataflow offers better support.

--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


Re: Reference to Beam in upcoming Kubeflow Book

2020-04-17 Thread Robert Bradshaw
On Fri, Apr 17, 2020 at 2:56 PM Holden Karau  wrote:

>
> On Fri, Apr 17, 2020 at 2:45 PM Robert Bradshaw 
> wrote:
>
>> Hi Holden!
>>
>> I agree with Kyle that it makes sense to have some caveat about Flink and
>> Spark, though at this point they're not /that/ new (at least not Flink).
>>
> True, maybe "early-stage" would be better wording?  The TFX PyBeam Flink
> support isn't yet mature enough (although there is interest in integrating
> it in Kubeflow I believe, it hasn't happened yet).
>

I might just say "not as mature." Most of the work being done now is
fit-n-finish. There's also some extra flags that need to be passed to work
around bugs in Flink itself encountered when running TFX jobs. (There's the
separate question of using Kubernetes to deploy/manage the Flink cluster
itself, but the mode where Flink workers invoke docker to start up the
Python binaries is pretty stable at this point.)


>> I am curious what extra support Kubeflow is "missing" (or, conversely,
>> what extra support it has for Dataflow that goes beyond just specifying a
>> different runner) to the point that these runners are declared
>> "unsupported." Or is it literally a matter of not providing user support?
>>
> So the Kubeflow TFX components (in
> https://github.com/kubeflow/pipelines/tree/master/components) are limited
> to local mode.
>

So in that sense it's not less supported than Dataflow?



Re: Reference to Beam in upcoming Kubeflow Book

2020-04-17 Thread Holden Karau
On Fri, Apr 17, 2020 at 2:45 PM Robert Bradshaw  wrote:

> Hi Holden!
>
> I agree with Kyle that it makes sense to have some caveat about Flink and
> Spark, though at this point they're not /that/ new (at least not Flink).
>
True, maybe "early-stage" would be better wording?  The TFX PyBeam Flink
support isn't yet mature enough (although there is interest in integrating
it in Kubeflow I believe, it hasn't happened yet).

>
> I am curious what extra support Kubeflow is "missing" (or, conversely,
> what extra support it has for Dataflow that goes beyond just specifying a
> different runner) to the point that these runners are declared
> "unsupported." Or is it literally a matter of not providing user support?
>
So the Kubeflow TFX components (in
https://github.com/kubeflow/pipelines/tree/master/components) are limited
to local mode.



Re: Reference to Beam in upcoming Kubeflow Book

2020-04-17 Thread Holden Karau
On Fri, Apr 17, 2020 at 2:32 PM Ahmet Altay  wrote:

> Hi Holden, nice to hear from you. Thanks a lot for this email. Adding some
> TFX folks as well. +Robert Crowe +Irene Giannoumis +Zhitao Li
> +Anusha Ramesh
>
> Would it be possible for TFX folks to review the TFX section of your book?
>
Sure. Currently we only cover TFT and TFDV, and I can share the draft of
that chapter with the TFX folks, but we might cover more later.

>
> On Fri, Apr 17, 2020 at 12:27 PM Kyle Weaver  wrote:
>
>> Hi Holden,
>>
>> The note on Flink & Spark support sounds reasonable to me. I am
>> optimistic about getting Flink + TFX + Kubeflow working fairly soon, but I
>> agree that we don't want to over-promise.
>>
>> I'm not so sure about the status of Dataflow here, perhaps someone else
>> can comment on that.
>>
>
> I believe TFX/KFP works on Dataflow with the same pipeline. They have an
> example of this (step 8) at
> https://github.com/tensorflow/tfx/blob/master/docs/tutorials/tfx/template.ipynb
>
>
So that is only the TFX pipeline; if you want to use Kubeflow Pipelines
with the TFX components, that’s not supported.



Re: Reference to Beam in upcoming Kubeflow Book

2020-04-17 Thread Robert Bradshaw
Hi Holden!

I agree with Kyle that it makes sense to have some caveat about Flink and
Spark, though at this point they're not /that/ new (at least not Flink).

I am curious what extra support Kubeflow is "missing" (or, conversely, what
extra support it has for Dataflow that goes beyond just specifying a
different runner) to the point that these runners are declared
"unsupported." Or is it literally a matter of not providing user support?



Re: Reference to Beam in upcoming Kubeflow Book

2020-04-17 Thread Ahmet Altay
Hi Holden, nice to hear from you. Thanks a lot for this email. Adding some
TFX folks as well. +Robert Crowe +Irene Giannoumis +Zhitao Li
+Anusha Ramesh


Would it be possible for TFX folks to review the TFX section of your book?

On Fri, Apr 17, 2020 at 12:27 PM Kyle Weaver  wrote:

> Hi Holden,
>
> The note on Flink & Spark support sounds reasonable to me. I am optimistic
> about getting Flink + TFX + Kubeflow working fairly soon, but I agree that
> we don't want to over-promise.
>
> I'm not so sure about the status of Dataflow here, perhaps someone else
> can comment on that.
>

I believe TFX/KFP works on Dataflow with the same pipeline. They have an
example of this (step 8) at
https://github.com/tensorflow/tfx/blob/master/docs/tutorials/tfx/template.ipynb




Re: Reference to Beam in upcoming Kubeflow Book

2020-04-17 Thread Kyle Weaver
Hi Holden,

The note on Flink & Spark support sounds reasonable to me. I am optimistic
about getting Flink + TFX + Kubeflow working fairly soon, but I agree that
we don't want to over-promise.

I'm not so sure about the status of Dataflow here; perhaps someone else can
comment on that.

Looking forward to the book :)

Kyle

On Fri, Apr 17, 2020 at 1:14 PM Holden Karau  wrote:

> Hi Apache Beam Developers,
>
> I'm working on a book about Kubeflow, which naturally has a section on
> TFX. I want to set users' expectations correctly, so I wanted to know what
> y'all thought of this NOTE we were thinking of including in the early
> release:
>
> Apache Beam’s Python support outside of Google Cloud's Dataflow is
> relatively new. TFX is a Python tool, so scaling it depends on Apache
> Beam's Python support. You can scale your job by using the non-portable
> Dataflow component, but this requires changing your pipeline code and isn't
> supported by Kubeflow's current TFX components. As Apache Beam's support
> for Apache Flink & Spark improves, support may be added for scaling the TFX
> components in a portable manner.
>
> Does this sound reasonable to folks? I don't want to over-promise but I
> also don't want to scare people away given all of the progress that is
> being made in supporting the open-source runners with language portability.
>
> Cheers,
>
> Holden :)
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>