Re: Python friendly API for Spark 3.0

2018-09-29 Thread Stavros Kontopoulos
Regarding the Python 3.x upgrade referenced earlier: some people have
already gone down that path:

https://blogs.dropbox.com/tech/2018/09/how-we-rolled-out-one-of-the-largest-python-3-migrations-ever

They describe some good reasons.

Stavros


Re: Python friendly API for Spark 3.0

2018-09-18 Thread Erik Erlandson
I like the notion of empowering cross platform bindings.

The trend of computing frameworks seems to be that all APIs gradually
converge on a stable attractor which could be described as "data frames and
SQL." Spark's early API design was RDD-focused, but these days the center
of gravity is all about DataFrame (Python's prevalence, combined with its
lack of a static type system, substantially dilutes the benefits of Dataset
for any library development that aspires to both JVM and python support).

I can imagine optimizing the developer layers of Spark APIs so that cross
platform support, and also 3rd-party support for new and existing Spark
bindings, would be maximized for "parallelizable dataframe+SQL." Another of
Spark's strengths is its ability to federate heterogeneous data sources,
and making cross platform bindings easy for that is desirable.



Re: Python friendly API for Spark 3.0

2018-09-17 Thread Leif Walsh
I agree with Reynold: at some point you’re going to run into the parts of
the pandas API that aren’t distributable. More feature parity will be good,
but users are still eventually going to hit a feature cliff. Moreover, it’s
not just the pandas API that people want to use, but also the set of
libraries built around the pandas DataFrame structure.

I think rather than similarity to pandas, we should target smoother
interoperability with pandas, to ease the pain of hitting this cliff.

We’ve been working on part of this problem with the pandas UDF stuff, but
there’s a lot more to do.
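As a minimal sketch of that interop path as it stands today (a Spark 2.3+
scalar pandas UDF; it requires PyArrow, and the column name "v" is purely
illustrative):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["v"])

    @pandas_udf("double", PandasUDFType.SCALAR)
    def plus_one(v):
        # v arrives as a pandas Series per Arrow batch, so ordinary
        # pandas/numpy code runs here instead of row-at-a-time py4j calls.
        return v + 1

    df.select(plus_one(df.v).alias("v_plus_one")).show()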


Re: Python friendly API for Spark 3.0

2018-09-16 Thread Mark Hamstra
>
> difficult to reconcile
>

That's a big chunk of what I'm getting at: To what extent is it even
possible to do this kind of reconciliation from the underlying
implementation to a more normal/expected/friendly API for a given
programming environment? How much more work is it for us to maintain
multiple such reconciliations, one for each environment? Do we even need to
do it at all, or can we push such higher-level reconciliations off to
3rd-party efforts like Frameless?



Re: Python friendly API for Spark 3.0

2018-09-16 Thread Reynold Xin
Most of those are pretty difficult to add though, because they are
fundamentally difficult to do in a distributed setting and with lazy
execution.

We should add some, but at some point there are fundamental differences
between the underlying execution engines that are pretty difficult to
reconcile.
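One illustrative example of the kind of thing that's hard to reconcile
(pandas only; these operations assume an eager, totally ordered local
frame):

    import pandas as pd

    pdf = pd.DataFrame({"x": [3, 1, 2]})
    pdf["prev"] = pdf["x"].shift(1)  # depends on physical row order
    first_row = pdf.iloc[0]          # positional indexing, evaluated eagerly

A distributed, lazily evaluated DataFrame has no inherent row order, so the
closest Spark analogue to shift() is lag() over an explicitly ordered
Window, and there is no general iloc at all.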


Re: Python friendly API for Spark 3.0

2018-09-16 Thread Matei Zaharia
My 2 cents on this is that the biggest room for improvement in Python is 
similarity to Pandas. We already made the Python DataFrame API different from 
Scala/Java in some respects, but if there’s anything we can do to make it more 
obvious to Pandas users, that will help the most. The other issue though is 
that a bunch of Pandas functions are just missing in Spark — it would be 
awesome to set up an umbrella JIRA to just track those and let people fill them 
in.
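As one illustrative example of such a gap (the column name is made up):
pandas has value_counts(), while in Spark today you reach for the
groupBy/count idiom:

    # pandas
    pdf["color"].value_counts()

    # closest PySpark idiom today
    df.groupBy("color").count().orderBy("count", ascending=False).show()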

Matei


Re: Python friendly API for Spark 3.0

2018-09-16 Thread Mark Hamstra
It's not splitting hairs, Erik. It's actually very close to something that
I think deserves some discussion (perhaps on a separate thread). What I've
been thinking about also concerns API "friendliness" or style. The original
RDD API was very intentionally modeled on the Scala parallel collections
API. That made it quite friendly for some Scala programmers, but not as
much so for users of the other language APIs when they eventually came
about. Similarly, the Dataframe API drew a lot from pandas and R, so it is
relatively friendly for those used to those abstractions. Of course, the
Spark SQL API is modeled closely on HiveQL and standard SQL. The new
barrier scheduling draws inspiration from MPI. With all of these models and
sources of inspiration, as well as multiple language targets, there isn't
really a strong sense of coherence across Spark -- I mean, even though one
of the key advantages of Spark is the ability to do within a single
framework things that would otherwise require multiple frameworks, actually
doing that requires more programming styles and design abstractions than
are strictly necessary, even when writing Spark code in just a single
language.

For me, that raises questions over whether we want to start designing,
implementing and supporting APIs that are designed to be more consistent,
friendly and idiomatic for particular languages and abstractions -- e.g. an
API covering all of Spark that is designed to look and feel as much as
possible like "normal" code to a Python programmer, another that looks and
feels more like "normal" Java code, another for Scala, etc. That's a lot
more work and support burden than the current approach, where sometimes it
feels like you are writing "normal" code for your preferred programming
environment and sometimes it feels like you are trying to interface with
something foreign, but where underneath it hopefully isn't too hard for
those writing the implementation code below the APIs, and it is not too
hard to maintain multiple language bindings that are each fairly
lightweight.

It's a cost-benefit judgement, of course, whether APIs that are heavier (in
terms of implementing and maintaining) and friendlier (for end users) are
worth doing, and maybe some of these "friendlier" APIs can be done outside
of Spark itself (imo, Frameless is doing a very nice job for the parts of
Spark that it is currently covering --
https://github.com/typelevel/frameless); but what we have currently is a
bit too ad hoc and fragmentary for my taste.


Re: Python friendly API for Spark 3.0

2018-09-15 Thread Jules Damji
+1
I think phasing out any feature or supported language that has reached EOL
is a better strategy, where possible, than a quick drop. With enough
advance warning, it can be dropped gradually over 3.x; of course, there are
exceptions.

Cheers
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)



Re: Python friendly API for Spark 3.0

2018-09-15 Thread Reynold Xin
we can also declare python 2 as deprecated and drop it in 3.x, not
necessarily 3.0.

--
excuse the brevity and lower case due to wrist injury




Re: Python friendly API for Spark 3.0

2018-09-15 Thread Erik Erlandson
I am probably splitting hairs too finely, but I was considering the
difference between improvements to the jvm-side (py4j and the scala/java
code) that would make it easier to write the python layer ("python-friendly
api"), and actual improvements to the python layers ("friendly python api").

They're not mutually exclusive of course, and both worth working on. But
it's *possible* to improve either without the other.

Stub files look like a great solution for type annotations, maybe even if
only python 3 is supported.

I definitely agree that any decision to drop python 2 should not be taken
lightly. Anecdotally, I'm seeing an increase in python developers
announcing that they are dropping support for python 2 (and loving it). As
people have already pointed out, if we don't drop python 2 for spark 3.0,
we're stuck with it until 4.0, which would place spark in a
possibly-awkward position of supporting python 2 for some time after it
goes EOL.

Under the current release cadence, spark 3.0 will land some time in early
2019, which at that point will be mere months until EOL for py2.



Re: Python friendly API for Spark 3.0

2018-09-15 Thread Leif Walsh
Hey there,

Here’s something I proposed recently that’s in this space.
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-24258

It’s motivated by working with a user who wanted to compute some custom
statistics: they could write the numpy code and knew along which dimensions
it could be parallelized, but when it came to actually getting it running,
the type system really got in the way.
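A hypothetical sketch of the shape of that workload under the current API
(it assumes a DataFrame df with made-up "group" and "value" columns, and
the statistic is arbitrary; the point is that each parallelizable slice
arrives as a plain pandas DataFrame, so the user's numpy code runs on it
unchanged):

    import numpy as np
    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("group string, stat double", PandasUDFType.GROUPED_MAP)
    def custom_stat(pdf):
        # One group's rows as an ordinary pandas frame; numpy runs locally.
        return pd.DataFrame({"group": [pdf["group"].iloc[0]],
                             "stat": [float(np.median(pdf["value"].values))]})

    result = df.groupBy("group").apply(custom_stat)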


--
Cheers,
Leif


Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
For reference, I raised the question of Python 2 support before:
http://apache-spark-developers-list.1001551.n3.nabble.com/Future-of-the-Python-2-support-td20094.html





Re: Python friendly API for Spark 3.0

2018-09-15 Thread Alexander Shorin
When is the release of Apache Spark 3.0 due? Will it be tomorrow, or
somewhere in the middle of 2019?

I think we shouldn't care much about Python 2.x today, since quite soon its
support turns into a pumpkin. For today's projects I hope nobody takes 2.7
support into account unless there is some legacy still to carry, but do we
want to take that baggage into the Apache Spark 3.x era? The next chance to
drop it would be the 4.0 release, because that would be the next breaking
change.

--
,,,^..^,,,



Re: Python friendly API for Spark 3.0

2018-09-15 Thread Maciej Szymkiewicz
There is no need to ditch Python 2. There are basically two options:

   - Use stub files and limit yourself to Python 3 support only. Python 3
   users benefit from type hints, Python 2 users don't, but no core
   functionality is affected. This is the approach I've used with
   https://github.com/zero323/pyspark-stubs/ (see the sketch after this
   list).
   - Use comment-based inline syntax or stub files, and don't use backward
   incompatible features (primarily the typing module -
   https://docs.python.org/3/library/typing.html). Both Python 2 and 3 are
   supported, but more advanced components are not. Small win for Python 2
   users, moderate loss for Python 3 users.
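For the first option, a stub is just a parallel .pyi file that Python 3
type checkers read and the runtime never imports. A simplified, made-up
fragment (not the actual pyspark-stubs signatures):

    # pyspark/sql/dataframe.pyi -- consumed by the type checker only
    from typing import List
    from pyspark.sql.column import Column

    class DataFrame:
        columns: List[str]
        def filter(self, condition: Column) -> DataFrame: ...
        def limit(self, num: int) -> DataFrame: ...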





Re: Python friendly API for Spark 3.0

2018-09-14 Thread Nicholas Chammas
Do we need to ditch Python 2 support to provide type hints? I don’t think
so.

Python lets you specify typing stubs that provide the same benefit without
forcing Python 3.



Re: Python friendly API for Spark 3.0

2018-09-14 Thread Holden Karau
On Fri, Sep 14, 2018, 3:26 PM Erik Erlandson  wrote:

> To be clear, is this about "python-friendly API" or "friendly python API" ?
>
Well what would you consider to be different between those two statements?
I think it would be good to be a bit more explicit, but I don't think we
should necessarily limit ourselves.

>
> On the python side, it might be nice to take advantage of static typing.
> That requires python 3.6, but with python 2 going EOL, spark-3.0 might be
> a good opportunity to jump on the python-3-only train.
>
I think we can make types sort of work without ditching 2 (the types only
would work in 3, but it would still function in 2). Ditching 2 entirely
would be a big thing to consider; I honestly hadn't been considering that,
but it could be from just spending so much time maintaining a 2/3 code
base. I'd suggest reaching out to user@ before making that kind of
change.
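One shape this could take (an illustrative sketch, not a committed design)
is PEP 484 comment annotations, which are runtime no-ops under both Python
2 and 3 but are visible to a checker like mypy; the function and column
here are made up:

    from pyspark.sql import DataFrame

    def adults_only(df):
        # type: (DataFrame) -> DataFrame
        return df.filter(df.age >= 18)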



Re: Python friendly API for Spark 3.0

2018-09-14 Thread Erik Erlandson
To be clear, is this about "python-friendly API" or "friendly python API" ?

On the python side, it might be nice to take advantage of static typing.
That requires python 3.6, but with python 2 going EOL, spark-3.0 might be a
good opportunity to jump on the python-3-only train.
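(The 3.6 floor presumably comes from PEP 526 variable annotations;
function annotation syntax alone only needs 3.0. A made-up illustration of
both, assuming a DataFrame-returning helper:)

    from typing import Optional
    from pyspark.sql import DataFrame

    sample_fraction: Optional[float] = None     # PEP 526 syntax, 3.6+ only

    def head_df(df: DataFrame, n: int = 5) -> DataFrame:   # 3.0+ syntax
        return df.limit(n)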

On Fri, Sep 14, 2018 at 12:15 PM, Holden Karau  wrote:

> Since we're talking about Spark 3.0 in the near future (and since some
> recent conversation on a proposed change reminded me) I wanted to open up
> the floor and see if folks have any ideas on how we could make a more
> Python friendly API for 3.0? I'm planning on taking some time to look at
> other systems in the solution space and see what we might want to learn
> from them but I'd love to hear what other folks are thinking too.
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>