Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-10 Thread Jean-Baptiste Onofré

I think so ;)

Regards
JB

On 11/10/2017 09:29 AM, Reuven Lax wrote:

Sounds good. I doubt we will have much opposition from users, in which case
Beam 2.3.0 can deprecate Spark 1.x

On Thu, Nov 9, 2017 at 11:54 PM, Jean-Baptiste Onofré 
wrote:


Hi all,

thanks a lot for all your feedback.

The trend is to upgrade to Spark 2.x and drop Spark 1.x support.

However, some of you (especially Reuven and Robert) commented that users
have to be pinged as well. It makes perfect sense, and it was my intention.

I propose the following action plan:
- on the technical front, I currently have two private branches ready:
one with Spark 1.x & Spark 2.x support (with a common module and three
artifacts), and another with an upgrade to Spark 2.x (dropping 1.x). I will
merge the latter on the PR.
- I will forward the vote e-mail to the user mailing list; hopefully we
will get user feedback.

Thanks again,
Regards
JB


On 11/08/2017 08:27 AM, Jean-Baptiste Onofré wrote:


Hi all,

as you might know, we are working on Spark 2.x support in the Spark
runner.

I'm working on a PR about that:

https://github.com/apache/beam/pull/3808

Today, we have something working with both Spark 1.x and 2.x from a code
standpoint, but I still have to deal with dependencies. This is the first step of
the update, as I'm still using RDDs; the second step would be to support
DataFrames (but for that, I would need PCollection elements with schemas,
which is another topic Eugene, Reuven and I are discussing).
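
As a purely illustrative aside (not part of the PR; the schema, values, and class
name below are made up), what schema-aware elements would enable is handing data to
Spark 2.x as a Dataset<Row> through the standard Spark SQL API, roughly like this:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SchemaToDatasetSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[2]").appName("schema-sketch").getOrCreate();

    // Hypothetical schema; the idea above is that it would be derived from the
    // PCollection's element schema instead of being written by hand.
    StructType schema = new StructType(new StructField[] {
        new StructField("user", DataTypes.StringType, false, Metadata.empty()),
        new StructField("score", DataTypes.LongType, false, Metadata.empty())
    });

    // Hypothetical elements; a runner would get these from the pipeline.
    List<Row> rows = Arrays.asList(
        RowFactory.create("alice", 3L),
        RowFactory.create("bob", 5L));

    // Once a schema is available, the Spark 2.x DataFrame API can be used directly.
    Dataset<Row> dataset = spark.createDataFrame(rows, schema);
    dataset.groupBy("user").sum("score").show();

    spark.stop();
  }
}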

However, as all major distributions now ship Spark 2.x, I don't think
it's required anymore to support Spark 1.x.

If we agree, I will update and clean up the PR to support and focus only
on Spark 2.x.

So, that's why I'm calling for a vote:

[ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
[ ] 0 (I don't care ;))
[ ] -1, I would like to still support Spark 1.x, and so having support
of both Spark 1.x and 2.x (please provide specific comment)

This vote is open for 48 hours (I have the commits ready, just waiting for
the end of the vote to push to the PR).

Thanks !
Regards
JB



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-10 Thread Reuven Lax
Sounds good. I doubt we will have much opposition from users, in which case
Beam 2.3.0 can deprecate Spark 1.x

On Thu, Nov 9, 2017 at 11:54 PM, Jean-Baptiste Onofré 
wrote:

> Hi all,
>
> thanks a lot for all your feedback.
>
> The trend is to upgrade to Spark 2.x and drop Spark 1.x support.
>
> However, some of you (especially Reuven and Robert) commented that users
> have to be pinged as well. It makes perfect sense, and it was my intention.
>
> I propose the following action plan:
> - on the technical front, I currently have two private branches ready:
> one with Spark 1.x & Spark 2.x support (with a common module and three
> artifacts), and another with an upgrade to Spark 2.x (dropping 1.x). I will
> merge the latter on the PR.
> - I will forward the vote e-mail to the user mailing list; hopefully we
> will get user feedback.
>
> Thanks again,
> Regards
> JB
>
>
> On 11/08/2017 08:27 AM, Jean-Baptiste Onofré wrote:
>
>> Hi all,
>>
>> as you might know, we are working on Spark 2.x support in the Spark
>> runner.
>>
>> I'm working on a PR about that:
>>
>> https://github.com/apache/beam/pull/3808
>>
>> Today, we have something working with both Spark 1.x and 2.x from a code
>> standpoint, but I have to deal with dependencies. It's the first step of
>> the update as I'm still using RDD, the second step would be to support
>> dataframe (but for that, I would need PCollection elements with schemas,
>> that's another topic on which Eugene, Reuven and I are discussing).
>>
>> However, as all major distributions now ship Spark 2.x, I don't think
>> it's required anymore to support Spark 1.x.
>>
>> If we agree, I will update and cleanup the PR to only support and focus
>> on Spark 2.x.
>>
>> So, that's why I'm calling for a vote:
>>
>>[ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>>[ ] 0 (I don't care ;))
>>[ ] -1, I would like to still support Spark 1.x, and so having support
>> of both Spark 1.x and 2.x (please provide specific comment)
>>
>> This vote is open for 48 hours (I have the commits ready, just waiting
>> the end of the vote to push on the PR).
>>
>> Thanks !
>> Regards
>> JB
>>
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-09 Thread Jean-Baptiste Onofré

Hi all,

thanks a lot for all your feedback.

The trend is to upgrade to Spark 2.x and drop Spark 1.x support.

However, some of you (especially Reuven and Robert) commented that users have to 
be pinged as well. It makes perfect sense, and it was my intention.


I propose the following action plan:
- on the technical front, I currently have two private branches ready: one 
with Spark 1.x & Spark 2.x support (with a common module and three artifacts), 
and another with an upgrade to Spark 2.x (dropping 1.x). I will merge the latter 
on the PR. In either case the pipeline code itself does not change, only the 
dependencies do (see the sketch below).
- I will forward the vote e-mail to the user mailing list; hopefully we will 
get user feedback.
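
For context, a minimal sketch (hypothetical class name and pipeline content; the Beam
and Spark runner APIs shown are the standard ones) of why the pipeline code is
unaffected: the Spark version never appears in the pipeline itself, only in the runner
module and the provided Spark dependency on the classpath.

import org.apache.beam.runners.spark.SparkPipelineOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Create;

public class SparkRunnerSketch {
  public static void main(String[] args) {
    // The Spark master (local[*], yarn, ...) comes from options, not from pipeline code.
    SparkPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkPipelineOptions.class);
    options.setRunner(SparkRunner.class);

    Pipeline p = Pipeline.create(options);
    p.apply(Create.of("a", "b", "a"))
     .apply(Count.perElement());

    // Which Spark version actually executes this depends only on the runner
    // module and the Spark dependency on the classpath, not on this code.
    p.run().waitUntilFinish();
  }
}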


Thanks again,
Regards
JB

On 11/08/2017 08:27 AM, Jean-Baptiste Onofré wrote:

Hi all,

as you might know, we are working on Spark 2.x support in the Spark runner.

I'm working on a PR about that:

https://github.com/apache/beam/pull/3808

Today, we have something working with both Spark 1.x and 2.x from a code 
standpoint, but I still have to deal with dependencies. This is the first step of the 
update, as I'm still using RDDs; the second step would be to support DataFrames 
(but for that, I would need PCollection elements with schemas, which is another 
topic Eugene, Reuven and I are discussing).


However, as all major distributions now ship Spark 2.x, I don't think it's 
required anymore to support Spark 1.x.


If we agree, I will update and clean up the PR to support and focus only on Spark 
2.x.


So, that's why I'm calling for a vote:

   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
   [ ] 0 (I don't care ;))
   [ ] -1, I would like to still support Spark 1.x, and so having support of 
both Spark 1.x and 2.x (please provide specific comment)


This vote is open for 48 hours (I have the commits ready, just waiting for the end 
of the vote to push to the PR).


Thanks !
Regards
JB


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-09 Thread Robert Bradshaw
On Thu, Nov 9, 2017 at 11:05 AM, Kenneth Knowles  
wrote:
> I think it makes sense to communicate with email to users@ and in the
> release notes of 2.2.0.

Totally agree.

> That communication should be specific and indicate
> whether we are planning to merely not work on it anymore or actually remove
> it in 2.3.0.

There seems to be some ambiguity in this vote about which of these two
options we're actually considering. I'm certainly +1 on relegating it
to maintenance mode at least. I don't have a good sense of the burden
of keeping it around, nor of the number of potential (current?) users
we'd be alienating, which seem to be the driving factors. The fact
that all major distributions ship 2.x is very different from the
question of whether most users have migrated to 2.x.

> On Thu, Nov 9, 2017 at 6:35 AM, Amit Sela  wrote:
>
>> +1 for dropping Spark 1 support.
>> I don't think we have enough users to justify supporting both, and its been
>> a long time since this idea originally came-up (when Spark2 wasn't stable)
>> and now Spark 2 is standard in all Hadoop distros.
>> As for switching to the Dataframe API, as long as Spark 2 doesn't support
>> scanning through the state periodically (even if no data for a key),
>> watermarks won't fire keys that didn't see updates.
>>
>> On Thu, Nov 9, 2017 at 9:12 AM Thomas Weise  wrote:
>>
>> > +1 (non-binding) for dropping 1.x support
>> >
>> > I don't have the impression that there is significant adoption for Beam
>> on
>> > Spark 1.x ? A stronger Spark runner that works well on 2.x will be better
>> > for Beam adoption than a runner that has to compromise due to 1.x
>> baggage.
>> > Development efforts can go into improving the runner.
>> >
>> > Thanks,
>> > Thomas
>> >
>> >
>> > On Thu, Nov 9, 2017 at 4:08 AM, Srinivas Reddy <
>> srinivas96all...@gmail.com
>> > >
>> > wrote:
>> >
>> > > +1
>> > >
>> > >
>> > >
>> > > --
>> > > Srinivas Reddy
>> > >
>> > > http://mrsrinivas.com/
>> > >
>> > >
>> > > (Sent via gmail web)
>> > >
>> > > On 8 November 2017 at 14:27, Jean-Baptiste Onofré 
>> > wrote:
>> > >
>> > > > Hi all,
>> > > >
>> > > > as you might know, we are working on Spark 2.x support in the Spark
>> > > runner.
>> > > >
>> > > > I'm working on a PR about that:
>> > > >
>> > > > https://github.com/apache/beam/pull/3808
>> > > >
>> > > > Today, we have something working with both Spark 1.x and 2.x from a
>> > code
>> > > > standpoint, but I have to deal with dependencies. It's the first step
>> > of
>> > > > the update as I'm still using RDD, the second step would be to
>> support
>> > > > dataframe (but for that, I would need PCollection elements with
>> > schemas,
>> > > > that's another topic on which Eugene, Reuven and I are discussing).
>> > > >
>> > > > However, as all major distributions now ship Spark 2.x, I don't think
>> > > it's
>> > > > required anymore to support Spark 1.x.
>> > > >
>> > > > If we agree, I will update and cleanup the PR to only support and
>> focus
>> > > on
>> > > > Spark 2.x.
>> > > >
>> > > > So, that's why I'm calling for a vote:
>> > > >
>> > > >   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>> > > >   [ ] 0 (I don't care ;))
>> > > >   [ ] -1, I would like to still support Spark 1.x, and so having
>> > support
>> > > > of both Spark 1.x and 2.x (please provide specific comment)
>> > > >
>> > > > This vote is open for 48 hours (I have the commits ready, just
>> waiting
>> > > the
>> > > > end of the vote to push on the PR).
>> > > >
>> > > > Thanks !
>> > > > Regards
>> > > > JB
>> > > > --
>> > > > Jean-Baptiste Onofré
>> > > > jbono...@apache.org
>> > > > http://blog.nanthrax.net
>> > > > Talend - http://www.talend.com
>> > > >
>> > >
>> >
>>


Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-09 Thread Reuven Lax
+1 from me. However let's notify users@ first. If we do get a lot of
pushback from users (which I doubt we will), we might reconsider dropping
Spark 1 support.

On Thu, Nov 9, 2017 at 11:05 AM, Kenneth Knowles 
wrote:

> +1 from me, with a friendly deprecation process
>
> I am convinced by the following:
>
>  - We don't have the resources to make both great, and anyhow it isn't
> worth it
>  - People keeping up with Beam releases are likely to be keeping up with
> Spark as well
>  - Spark 1 users already have a Spark 1 runner for Beam and can keep using
> it (and we don't actually lose the ability to update it in a pinch)
>  - Key features like portability (hence Python) will take some time, so we
> should definitely not waste effort building those features with Spark 1 in
> mind
>
> I think it makes sense to communicate with email to users@ and in the
> release notes of 2.2.0. That communication should be specific and indicate
> whether we are planning to merely not work on it anymore or actually remove
> it in 2.3.0.
>
> Kenn
>
> On Thu, Nov 9, 2017 at 6:35 AM, Amit Sela  wrote:
>
> > +1 for dropping Spark 1 support.
> > I don't think we have enough users to justify supporting both, and its
> been
> > a long time since this idea originally came-up (when Spark2 wasn't
> stable)
> > and now Spark 2 is standard in all Hadoop distros.
> > As for switching to the Dataframe API, as long as Spark 2 doesn't support
> > scanning through the state periodically (even if no data for a key),
> > watermarks won't fire keys that didn't see updates.
> >
> > On Thu, Nov 9, 2017 at 9:12 AM Thomas Weise  wrote:
> >
> > > +1 (non-binding) for dropping 1.x support
> > >
> > > I don't have the impression that there is significant adoption for Beam
> > on
> > > Spark 1.x ? A stronger Spark runner that works well on 2.x will be
> better
> > > for Beam adoption than a runner that has to compromise due to 1.x
> > baggage.
> > > Development efforts can go into improving the runner.
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > On Thu, Nov 9, 2017 at 4:08 AM, Srinivas Reddy <
> > srinivas96all...@gmail.com
> > > >
> > > wrote:
> > >
> > > > +1
> > > >
> > > >
> > > >
> > > > --
> > > > Srinivas Reddy
> > > >
> > > > http://mrsrinivas.com/
> > > >
> > > >
> > > > (Sent via gmail web)
> > > >
> > > > On 8 November 2017 at 14:27, Jean-Baptiste Onofré 
> > > wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > as you might know, we are working on Spark 2.x support in the Spark
> > > > runner.
> > > > >
> > > > > I'm working on a PR about that:
> > > > >
> > > > > https://github.com/apache/beam/pull/3808
> > > > >
> > > > > Today, we have something working with both Spark 1.x and 2.x from a
> > > code
> > > > > standpoint, but I have to deal with dependencies. It's the first
> step
> > > of
> > > > > the update as I'm still using RDD, the second step would be to
> > support
> > > > > dataframe (but for that, I would need PCollection elements with
> > > schemas,
> > > > > that's another topic on which Eugene, Reuven and I are discussing).
> > > > >
> > > > > However, as all major distributions now ship Spark 2.x, I don't
> think
> > > > it's
> > > > > required anymore to support Spark 1.x.
> > > > >
> > > > > If we agree, I will update and cleanup the PR to only support and
> > focus
> > > > on
> > > > > Spark 2.x.
> > > > >
> > > > > So, that's why I'm calling for a vote:
> > > > >
> > > > >   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
> > > > >   [ ] 0 (I don't care ;))
> > > > >   [ ] -1, I would like to still support Spark 1.x, and so having
> > > support
> > > > > of both Spark 1.x and 2.x (please provide specific comment)
> > > > >
> > > > > This vote is open for 48 hours (I have the commits ready, just
> > waiting
> > > > the
> > > > > end of the vote to push on the PR).
> > > > >
> > > > > Thanks !
> > > > > Regards
> > > > > JB
> > > > > --
> > > > > Jean-Baptiste Onofré
> > > > > jbono...@apache.org
> > > > > http://blog.nanthrax.net
> > > > > Talend - http://www.talend.com
> > > > >
> > > >
> > >
> >
>


Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-09 Thread Kenneth Knowles
+1 from me, with a friendly deprecation process

I am convinced by the following:

 - We don't have the resources to make both great, and anyhow it isn't
worth it
 - People keeping up with Beam releases are likely to be keeping up with
Spark as well
 - Spark 1 users already have a Spark 1 runner for Beam and can keep using
it (and we don't actually lose the ability to update it in a pinch)
 - Key features like portability (hence Python) will take some time, so we
should definitely not waste effort building those features with Spark 1 in
mind

I think it makes sense to communicate with email to users@ and in the
release notes of 2.2.0. That communication should be specific and indicate
whether we are planning to merely not work on it anymore or actually remove
it in 2.3.0.

Kenn

On Thu, Nov 9, 2017 at 6:35 AM, Amit Sela  wrote:

> +1 for dropping Spark 1 support.
> I don't think we have enough users to justify supporting both, and it's been
> a long time since this idea originally came up (when Spark 2 wasn't stable)
> and now Spark 2 is standard in all Hadoop distros.
> As for switching to the Dataframe API, as long as Spark 2 doesn't support
> scanning through the state periodically (even if no data for a key),
> watermarks won't fire keys that didn't see updates.
>
> On Thu, Nov 9, 2017 at 9:12 AM Thomas Weise  wrote:
>
> > +1 (non-binding) for dropping 1.x support
> >
> > I don't have the impression that there is significant adoption for Beam
> on
> > Spark 1.x ? A stronger Spark runner that works well on 2.x will be better
> > for Beam adoption than a runner that has to compromise due to 1.x
> baggage.
> > Development efforts can go into improving the runner.
> >
> > Thanks,
> > Thomas
> >
> >
> > On Thu, Nov 9, 2017 at 4:08 AM, Srinivas Reddy <
> srinivas96all...@gmail.com
> > >
> > wrote:
> >
> > > +1
> > >
> > >
> > >
> > > --
> > > Srinivas Reddy
> > >
> > > http://mrsrinivas.com/
> > >
> > >
> > > (Sent via gmail web)
> > >
> > > On 8 November 2017 at 14:27, Jean-Baptiste Onofré 
> > wrote:
> > >
> > > > Hi all,
> > > >
> > > > as you might know, we are working on Spark 2.x support in the Spark
> > > runner.
> > > >
> > > > I'm working on a PR about that:
> > > >
> > > > https://github.com/apache/beam/pull/3808
> > > >
> > > > Today, we have something working with both Spark 1.x and 2.x from a
> > code
> > > > standpoint, but I have to deal with dependencies. It's the first step
> > of
> > > > the update as I'm still using RDD, the second step would be to
> support
> > > > dataframe (but for that, I would need PCollection elements with
> > schemas,
> > > > that's another topic on which Eugene, Reuven and I are discussing).
> > > >
> > > > However, as all major distributions now ship Spark 2.x, I don't think
> > > it's
> > > > required anymore to support Spark 1.x.
> > > >
> > > > If we agree, I will update and cleanup the PR to only support and
> focus
> > > on
> > > > Spark 2.x.
> > > >
> > > > So, that's why I'm calling for a vote:
> > > >
> > > >   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
> > > >   [ ] 0 (I don't care ;))
> > > >   [ ] -1, I would like to still support Spark 1.x, and so having
> > support
> > > > of both Spark 1.x and 2.x (please provide specific comment)
> > > >
> > > > This vote is open for 48 hours (I have the commits ready, just
> waiting
> > > the
> > > > end of the vote to push on the PR).
> > > >
> > > > Thanks !
> > > > Regards
> > > > JB
> > > > --
> > > > Jean-Baptiste Onofré
> > > > jbono...@apache.org
> > > > http://blog.nanthrax.net
> > > > Talend - http://www.talend.com
> > > >
> > >
> >
>


Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-09 Thread Amit Sela
+1 for dropping Spark 1 support.
I don't think we have enough users to justify supporting both, and it's been
a long time since this idea originally came up (when Spark 2 wasn't stable);
now Spark 2 is standard in all Hadoop distros.
As for switching to the DataFrame API: as long as Spark 2 doesn't support
scanning through the state periodically (even if there is no data for a key),
watermarks won't fire for keys that didn't see updates (see the sketch below).
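
A hedged illustration of that concern (not from this thread; the class and identifiers
are made up, but the state/timer API is the standard Beam one): the buffered sum below
is emitted only from the event-time timer callback, so the runner has to fire timers
when the watermark advances even for keys that receive no further elements.

import org.apache.beam.sdk.state.StateSpec;
import org.apache.beam.sdk.state.StateSpecs;
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.state.ValueState;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

class SumUntilWatermark extends DoFn<KV<String, Long>, KV<String, Long>> {

  @StateId("key")
  private final StateSpec<ValueState<String>> keySpec = StateSpecs.value();

  @StateId("sum")
  private final StateSpec<ValueState<Long>> sumSpec = StateSpecs.value();

  @TimerId("flush")
  private final TimerSpec flushSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(ProcessContext c,
                      @StateId("key") ValueState<String> key,
                      @StateId("sum") ValueState<Long> sum,
                      @TimerId("flush") Timer flush) {
    key.write(c.element().getKey());
    long current = sum.read() == null ? 0L : sum.read();
    sum.write(current + c.element().getValue());
    // Ask to be called back once the watermark passes this timestamp + 1 minute.
    flush.set(c.timestamp().plus(Duration.standardMinutes(1)));
  }

  @OnTimer("flush")
  public void onFlush(OnTimerContext c,
                      @StateId("key") ValueState<String> key,
                      @StateId("sum") ValueState<Long> sum) {
    // Runs only because the watermark advanced; no new element for this key arrives.
    // If the runner never checks timers for idle keys, this output never happens.
    c.output(KV.of(key.read(), sum.read()));
    sum.clear();
  }
}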

On Thu, Nov 9, 2017 at 9:12 AM Thomas Weise  wrote:

> +1 (non-binding) for dropping 1.x support
>
> I don't have the impression that there is significant adoption for Beam on
> Spark 1.x ? A stronger Spark runner that works well on 2.x will be better
> for Beam adoption than a runner that has to compromise due to 1.x baggage.
> Development efforts can go into improving the runner.
>
> Thanks,
> Thomas
>
>
> On Thu, Nov 9, 2017 at 4:08 AM, Srinivas Reddy  >
> wrote:
>
> > +1
> >
> >
> >
> > --
> > Srinivas Reddy
> >
> > http://mrsrinivas.com/
> >
> >
> > (Sent via gmail web)
> >
> > On 8 November 2017 at 14:27, Jean-Baptiste Onofré 
> wrote:
> >
> > > Hi all,
> > >
> > > as you might know, we are working on Spark 2.x support in the Spark
> > runner.
> > >
> > > I'm working on a PR about that:
> > >
> > > https://github.com/apache/beam/pull/3808
> > >
> > > Today, we have something working with both Spark 1.x and 2.x from a
> code
> > > standpoint, but I have to deal with dependencies. It's the first step
> of
> > > the update as I'm still using RDD, the second step would be to support
> > > dataframe (but for that, I would need PCollection elements with
> schemas,
> > > that's another topic on which Eugene, Reuven and I are discussing).
> > >
> > > However, as all major distributions now ship Spark 2.x, I don't think
> > it's
> > > required anymore to support Spark 1.x.
> > >
> > > If we agree, I will update and cleanup the PR to only support and focus
> > on
> > > Spark 2.x.
> > >
> > > So, that's why I'm calling for a vote:
> > >
> > >   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
> > >   [ ] 0 (I don't care ;))
> > >   [ ] -1, I would like to still support Spark 1.x, and so having
> support
> > > of both Spark 1.x and 2.x (please provide specific comment)
> > >
> > > This vote is open for 48 hours (I have the commits ready, just waiting
> > the
> > > end of the vote to push on the PR).
> > >
> > > Thanks !
> > > Regards
> > > JB
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>


Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-09 Thread Ismaël Mejía
+1 for the move to Spark 2, modulo notifying users and deciding on support:

I agree that having compatibility with both versions of Spark is
desirable, but I am not sure it is worth the effort. Apart from the
reasons mentioned by Holden and Pei, I will add that the burden of
simultaneous maintenance could be bigger than the return, and also
that most Big Data/Cloud distributions have already moved to Spark 2,
so it makes sense to prioritize new users over legacy
ones, in particular if we consider that Beam is a ‘recent’ project.

We can announce the end of support for Spark 1 in the release
notes of Beam 2.2 and decide whether to support it in maintenance
mode; in that case we would backport or fix any reported issue related
to the Spark 1 runner on the 2.2.x branch for, say, a year, but we
won’t add new functionality. Or we can just decide not to support it
anymore and encourage users to move to Spark 2.

On Thu, Nov 9, 2017 at 6:59 AM, Pei HE  wrote:
> +1 on moving forward with Spark 2.x only.
> Spark 1 users can still use already released Spark runners, and we can
> support them with minor version releases for future bug fixes.
>
> I don't see how important it is to make future Beam releases available to
> Spark 1 users. If they choose not to upgrade Spark clusters, maybe they
> don't need the newest Beam releases as well.
>
> I think it is more important to 1) be able to leverage new features in
> Spark 2.x, and 2) extend the user base to Spark 2.
> --
> Pei
>
>
> On Thu, Nov 9, 2017 at 1:45 PM, Holden Karau  wrote:
>
>> That's a good point about Oozie does only supporting only Spark 1 or 2 at a
>> time on a cluster -- but do we know people using Oozie and Spark 1 that
>> would still be using Spark 1 by the time of the next BEAM release? The last
>> Spark 1 release was a year ago (and last non-maintenance release almost 20
>> months ago).
>>
>> On Wed, Nov 8, 2017 at 9:30 PM, NerdyNick  wrote:
>>
>> > I don't know if ditching Spark 1 out right right now would be a great
>> move
>> > given that a lot of the main support applications around spark haven't
>> yet
>> > fully moved to Spark 2 yet. Yet alone have support for having a cluster
>> > with both. Oozie for example is still pre stable release for their Spark
>> 1
>> > and can't support a cluster with mixed Spark version. I think maybe doing
>> > as suggested above with the common, spark1, spark2 packaging might be
>> best
>> > during this carry over phase. Maybe even just flag spark 1 as deprecated
>> > and just being maintained might be enough.
>> >
>> > On Wed, Nov 8, 2017 at 10:25 PM, Holden Karau 
>> > wrote:
>> >
>> > > Also, upgrading Spark 1 to 2 is generally easier than changing JVM
>> > > versions. For folks using YARN or the hosted environments it pretty
>> much
>> > > trivial since you can effectively have distinct Spark clusters for each
>> > > job.
>> > >
>> > > On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau 
>> > wrote:
>> > >
>> > > > I'm +1 on dropping Spark 1. There are a lot of exciting improvements
>> in
>> > > > Spark 2, and trying to write efficient code that runs between Spark 1
>> > and
>> > > > Spark 2 is super painful in the long term. It would be one thing if
>> > there
>> > > > were a lot of people available to work on the Spark runners, but it
>> > seems
>> > > > like we'd be better spent focusing our energy on the future.
>> > > >
>> > > > I don't know a lot of folks who are stuck on Spark 1, and the few
>> that
>> > I
>> > > > know are planning to migrate in the next few months anyways.
>> > > >
>> > > > Note: this is a non-binding vote as I'm not a committer or PMC
>> member.
>> > > >
>> > > > On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu  wrote:
>> > > >
>> > > >> Having both Spark1 and Spark2 modules would benefit wider user base.
>> > > >>
>> > > >> I would vote for that.
>> > > >>
>> > > >> Cheers
>> > > >>
>> > > >> On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré <
>> > j...@nanthrax.net>
>> > > >> wrote:
>> > > >>
>> > > >> > Hi Robert,
>> > > >> >
>> > > >> > Thanks for your feedback !
>> > > >> >
>> > > >> > From an user perspective, with the current state of the PR, the
>> same
>> > > >> > pipelines can run on both Spark 1.x and 2.x: the only difference
>> is
>> > > the
>> > > >> > dependencies set.
>> > > >> >
>> > > >> > I'm calling the vote to get such kind of feedback: if we consider
>> > > Spark
>> > > >> > 1.x still need to be supported, no problem, I will improve the PR
>> to
>> > > >> have
>> > > >> > three modules (common, spark1, spark2) and let users pick the
>> > desired
>> > > >> > version.
>> > > >> >
>> > > >> > Let's wait a bit other feedbacks, I will update the PR
>> accordingly.
>> > > >> >
>> > > >> > Regards
>> > > >> > JB
>> > > >> >
>> > > >> >
>> > > >> > On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
>> > > >> >
>> > > >> >> I'm generally a -0.5 on 

Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-08 Thread Pei HE
+1 on moving forward with Spark 2.x only.
Spark 1 users can still use already released Spark runners, and we can
support them with minor version releases for future bug fixes.

I don't see how important it is to make future Beam releases available to
Spark 1 users. If they choose not to upgrade Spark clusters, maybe they
don't need the newest Beam releases as well.

I think it is more important to 1) be able to leverage new features in
Spark 2.x, and 2) extend the user base to Spark 2.
--
Pei


On Thu, Nov 9, 2017 at 1:45 PM, Holden Karau  wrote:

> That's a good point about Oozie only supporting Spark 1 or 2 at a
> time on a cluster -- but do we know of people using Oozie and Spark 1 who
> would still be using Spark 1 by the time of the next Beam release? The last
> Spark 1 release was a year ago (and the last non-maintenance release almost 20
> months ago).
>
> On Wed, Nov 8, 2017 at 9:30 PM, NerdyNick  wrote:
>
> > I don't know if ditching Spark 1 out right right now would be a great
> move
> > given that a lot of the main support applications around spark haven't
> yet
> > fully moved to Spark 2 yet. Yet alone have support for having a cluster
> > with both. Oozie for example is still pre stable release for their Spark
> 1
> > and can't support a cluster with mixed Spark version. I think maybe doing
> > as suggested above with the common, spark1, spark2 packaging might be
> best
> > during this carry over phase. Maybe even just flag spark 1 as deprecated
> > and just being maintained might be enough.
> >
> > On Wed, Nov 8, 2017 at 10:25 PM, Holden Karau 
> > wrote:
> >
> > > Also, upgrading Spark 1 to 2 is generally easier than changing JVM
> > > versions. For folks using YARN or the hosted environments it pretty
> much
> > > trivial since you can effectively have distinct Spark clusters for each
> > > job.
> > >
> > > On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau 
> > wrote:
> > >
> > > > I'm +1 on dropping Spark 1. There are a lot of exciting improvements
> in
> > > > Spark 2, and trying to write efficient code that runs between Spark 1
> > and
> > > > Spark 2 is super painful in the long term. It would be one thing if
> > there
> > > > were a lot of people available to work on the Spark runners, but it
> > seems
> > > > like we'd be better spent focusing our energy on the future.
> > > >
> > > > I don't know a lot of folks who are stuck on Spark 1, and the few
> that
> > I
> > > > know are planning to migrate in the next few months anyways.
> > > >
> > > > Note: this is a non-binding vote as I'm not a committer or PMC
> member.
> > > >
> > > > On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu  wrote:
> > > >
> > > >> Having both Spark1 and Spark2 modules would benefit wider user base.
> > > >>
> > > >> I would vote for that.
> > > >>
> > > >> Cheers
> > > >>
> > > >> On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré <
> > j...@nanthrax.net>
> > > >> wrote:
> > > >>
> > > >> > Hi Robert,
> > > >> >
> > > >> > Thanks for your feedback !
> > > >> >
> > > >> > From an user perspective, with the current state of the PR, the
> same
> > > >> > pipelines can run on both Spark 1.x and 2.x: the only difference
> is
> > > the
> > > >> > dependencies set.
> > > >> >
> > > >> > I'm calling the vote to get such kind of feedback: if we consider
> > > Spark
> > > >> > 1.x still need to be supported, no problem, I will improve the PR
> to
> > > >> have
> > > >> > three modules (common, spark1, spark2) and let users pick the
> > desired
> > > >> > version.
> > > >> >
> > > >> > Let's wait a bit other feedbacks, I will update the PR
> accordingly.
> > > >> >
> > > >> > Regards
> > > >> > JB
> > > >> >
> > > >> >
> > > >> > On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
> > > >> >
> > > >> >> I'm generally a -0.5 on this change, or at least doing so
> hastily.
> > > >> >>
> > > >> >> As with dropping Java 7 support, I think this should at least be
> > > >> >> announced in release notes that we're considering dropping
> support
> > in
> > > >> >> the subsequent release, as this dev list likely does not reach a
> > > >> >> substantial portion of the userbase.
> > > >> >>
> > > >> >> How much work is it to move from a Spark 1.x cluster to a Spark
> 2.x
> > > >> >> cluster? I get the feeling it's not nearly as transparent as
> > > upgrading
> > > >> >> Java versions. Can Spark 1.x pipelines be run on Spark 2.x
> > clusters,
> > > >> >> or is a new cluster (and/or upgrading all pipelines) required
> (e.g.
> > > >> >> for those who operate spark clusters shared among their many
> > users)?
> > > >> >>
> > > >> >> Looks like the latest release of Spark 1.x was about a year ago,
> > > >> >> overlapping a bit with the 2.x series which is coming up on 1.5
> > years
> > > >> >> old, so I could see a lot of people still using 1.x even if 2.x
> is
> > > >> >> clearly the future. But it sure doesn't seem very backwards
> > > >> >> 

Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-08 Thread Holden Karau
That's a good point about Oozie only supporting Spark 1 or 2 at a
time on a cluster -- but do we know of people using Oozie and Spark 1 who
would still be using Spark 1 by the time of the next Beam release? The last
Spark 1 release was a year ago (and the last non-maintenance release almost 20
months ago).

On Wed, Nov 8, 2017 at 9:30 PM, NerdyNick  wrote:

> I don't know if ditching Spark 1 outright right now would be a great move,
> given that a lot of the main supporting applications around Spark haven't
> fully moved to Spark 2 yet, let alone support having a cluster
> with both. Oozie, for example, is still pre-stable release for its Spark 1
> support and can't handle a cluster with mixed Spark versions. I think maybe doing
> as suggested above with the common, spark1, spark2 packaging might be best
> during this carry-over phase. Maybe even just flagging Spark 1 as deprecated
> and maintenance-only might be enough.
>
> On Wed, Nov 8, 2017 at 10:25 PM, Holden Karau 
> wrote:
>
> > Also, upgrading Spark 1 to 2 is generally easier than changing JVM
> > versions. For folks using YARN or the hosted environments it pretty much
> > trivial since you can effectively have distinct Spark clusters for each
> > job.
> >
> > On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau 
> wrote:
> >
> > > I'm +1 on dropping Spark 1. There are a lot of exciting improvements in
> > > Spark 2, and trying to write efficient code that runs between Spark 1
> and
> > > Spark 2 is super painful in the long term. It would be one thing if
> there
> > > were a lot of people available to work on the Spark runners, but it
> seems
> > > like we'd be better spent focusing our energy on the future.
> > >
> > > I don't know a lot of folks who are stuck on Spark 1, and the few that
> I
> > > know are planning to migrate in the next few months anyways.
> > >
> > > Note: this is a non-binding vote as I'm not a committer or PMC member.
> > >
> > > On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu  wrote:
> > >
> > >> Having both Spark1 and Spark2 modules would benefit wider user base.
> > >>
> > >> I would vote for that.
> > >>
> > >> Cheers
> > >>
> > >> On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré <
> j...@nanthrax.net>
> > >> wrote:
> > >>
> > >> > Hi Robert,
> > >> >
> > >> > Thanks for your feedback !
> > >> >
> > >> > From an user perspective, with the current state of the PR, the same
> > >> > pipelines can run on both Spark 1.x and 2.x: the only difference is
> > the
> > >> > dependencies set.
> > >> >
> > >> > I'm calling the vote to get such kind of feedback: if we consider
> > Spark
> > >> > 1.x still need to be supported, no problem, I will improve the PR to
> > >> have
> > >> > three modules (common, spark1, spark2) and let users pick the
> desired
> > >> > version.
> > >> >
> > >> > Let's wait a bit other feedbacks, I will update the PR accordingly.
> > >> >
> > >> > Regards
> > >> > JB
> > >> >
> > >> >
> > >> > On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
> > >> >
> > >> >> I'm generally a -0.5 on this change, or at least doing so hastily.
> > >> >>
> > >> >> As with dropping Java 7 support, I think this should at least be
> > >> >> announced in release notes that we're considering dropping support
> in
> > >> >> the subsequent release, as this dev list likely does not reach a
> > >> >> substantial portion of the userbase.
> > >> >>
> > >> >> How much work is it to move from a Spark 1.x cluster to a Spark 2.x
> > >> >> cluster? I get the feeling it's not nearly as transparent as
> > upgrading
> > >> >> Java versions. Can Spark 1.x pipelines be run on Spark 2.x
> clusters,
> > >> >> or is a new cluster (and/or upgrading all pipelines) required (e.g.
> > >> >> for those who operate spark clusters shared among their many
> users)?
> > >> >>
> > >> >> Looks like the latest release of Spark 1.x was about a year ago,
> > >> >> overlapping a bit with the 2.x series which is coming up on 1.5
> years
> > >> >> old, so I could see a lot of people still using 1.x even if 2.x is
> > >> >> clearly the future. But it sure doesn't seem very backwards
> > >> >> compatible.
> > >> >>
> > >> >> Mostly I'm not comfortable with dropping 1.x in the same release as
> > >> >> adding support for 2.x, giving no transition period, but could be
> > >> >> convinced if this transition is mostly a no-op or no one's still
> > using
> > >> >> 1.x. If there's non-trivial code complexity issues, I would perhaps
> > >> >> revisit the issue of having a single Spark Runner that does chooses
> > >> >> the backend implicitly in favor of simply having two runners which
> > >> >> share the code that's easy to share and diverge otherwise (which
> > seems
> > >> >> it would be much simpler both to implement and explain to users). I
> > >> >> would be OK with even letting the Spark 1.x runner be somewhat
> > >> >> stagnant (e.g. few or no new features) until we decide we can kill
> it

Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-08 Thread NerdyNick
I don't know if ditching Spark 1 outright right now would be a great move,
given that a lot of the main supporting applications around Spark haven't
fully moved to Spark 2 yet, let alone support having a cluster
with both. Oozie, for example, is still pre-stable release for its Spark 1
support and can't handle a cluster with mixed Spark versions. I think maybe doing
as suggested above with the common, spark1, spark2 packaging might be best
during this carry-over phase. Maybe even just flagging Spark 1 as deprecated
and maintenance-only might be enough.

On Wed, Nov 8, 2017 at 10:25 PM, Holden Karau  wrote:

> Also, upgrading Spark 1 to 2 is generally easier than changing JVM
> versions. For folks using YARN or the hosted environments it pretty much
> trivial since you can effectively have distinct Spark clusters for each
> job.
>
> On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau  wrote:
>
> > I'm +1 on dropping Spark 1. There are a lot of exciting improvements in
> > Spark 2, and trying to write efficient code that runs between Spark 1 and
> > Spark 2 is super painful in the long term. It would be one thing if there
> > were a lot of people available to work on the Spark runners, but it seems
> > like we'd be better spent focusing our energy on the future.
> >
> > I don't know a lot of folks who are stuck on Spark 1, and the few that I
> > know are planning to migrate in the next few months anyways.
> >
> > Note: this is a non-binding vote as I'm not a committer or PMC member.
> >
> > On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu  wrote:
> >
> >> Having both Spark1 and Spark2 modules would benefit wider user base.
> >>
> >> I would vote for that.
> >>
> >> Cheers
> >>
> >> On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré 
> >> wrote:
> >>
> >> > Hi Robert,
> >> >
> >> > Thanks for your feedback !
> >> >
> >> > From an user perspective, with the current state of the PR, the same
> >> > pipelines can run on both Spark 1.x and 2.x: the only difference is
> the
> >> > dependencies set.
> >> >
> >> > I'm calling the vote to get such kind of feedback: if we consider
> Spark
> >> > 1.x still need to be supported, no problem, I will improve the PR to
> >> have
> >> > three modules (common, spark1, spark2) and let users pick the desired
> >> > version.
> >> >
> >> > Let's wait a bit other feedbacks, I will update the PR accordingly.
> >> >
> >> > Regards
> >> > JB
> >> >
> >> >
> >> > On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
> >> >
> >> >> I'm generally a -0.5 on this change, or at least doing so hastily.
> >> >>
> >> >> As with dropping Java 7 support, I think this should at least be
> >> >> announced in release notes that we're considering dropping support in
> >> >> the subsequent release, as this dev list likely does not reach a
> >> >> substantial portion of the userbase.
> >> >>
> >> >> How much work is it to move from a Spark 1.x cluster to a Spark 2.x
> >> >> cluster? I get the feeling it's not nearly as transparent as
> upgrading
> >> >> Java versions. Can Spark 1.x pipelines be run on Spark 2.x clusters,
> >> >> or is a new cluster (and/or upgrading all pipelines) required (e.g.
> >> >> for those who operate spark clusters shared among their many users)?
> >> >>
> >> >> Looks like the latest release of Spark 1.x was about a year ago,
> >> >> overlapping a bit with the 2.x series which is coming up on 1.5 years
> >> >> old, so I could see a lot of people still using 1.x even if 2.x is
> >> >> clearly the future. But it sure doesn't seem very backwards
> >> >> compatible.
> >> >>
> >> >> Mostly I'm not comfortable with dropping 1.x in the same release as
> >> >> adding support for 2.x, giving no transition period, but could be
> >> >> convinced if this transition is mostly a no-op or no one's still
> using
> >> >> 1.x. If there's non-trivial code complexity issues, I would perhaps
> >> >> revisit the issue of having a single Spark Runner that does chooses
> >> >> the backend implicitly in favor of simply having two runners which
> >> >> share the code that's easy to share and diverge otherwise (which
> seems
> >> >> it would be much simpler both to implement and explain to users). I
> >> >> would be OK with even letting the Spark 1.x runner be somewhat
> >> >> stagnant (e.g. few or no new features) until we decide we can kill it
> >> >> off.
> >> >>
> >> >> On Tue, Nov 7, 2017 at 11:27 PM, Jean-Baptiste Onofré <
> j...@nanthrax.net
> >> >
> >> >> wrote:
> >> >>
> >> >>> Hi all,
> >> >>>
> >> >>> as you might know, we are working on Spark 2.x support in the Spark
> >> >>> runner.
> >> >>>
> >> >>> I'm working on a PR about that:
> >> >>>
> >> >>> https://github.com/apache/beam/pull/3808
> >> >>>
> >> >>> Today, we have something working with both Spark 1.x and 2.x from a
> >> code
> >> >>> standpoint, but I have to deal with dependencies. It's the first
> step
> >> of
> >> >>> the
> >> >>> update as I'm still using RDD, 

Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-08 Thread Holden Karau
Also, upgrading Spark 1 to 2 is generally easier than changing JVM
versions. For folks using YARN or the hosted environments it's pretty much
trivial, since you can effectively have distinct Spark clusters for each job.

On Wed, Nov 8, 2017 at 9:19 PM, Holden Karau  wrote:

> I'm +1 on dropping Spark 1. There are a lot of exciting improvements in
> Spark 2, and trying to write efficient code that runs between Spark 1 and
> Spark 2 is super painful in the long term. It would be one thing if there
> were a lot of people available to work on the Spark runners, but it seems
> like our energy would be better spent focusing on the future.
>
> I don't know a lot of folks who are stuck on Spark 1, and the few that I
> know are planning to migrate in the next few months anyways.
>
> Note: this is a non-binding vote as I'm not a committer or PMC member.
>
> On Wed, Nov 8, 2017 at 3:43 AM, Ted Yu  wrote:
>
>> Having both Spark1 and Spark2 modules would benefit wider user base.
>>
>> I would vote for that.
>>
>> Cheers
>>
>> On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>> > Hi Robert,
>> >
>> > Thanks for your feedback !
>> >
>> > From an user perspective, with the current state of the PR, the same
>> > pipelines can run on both Spark 1.x and 2.x: the only difference is the
>> > dependencies set.
>> >
>> > I'm calling the vote to get such kind of feedback: if we consider Spark
>> > 1.x still need to be supported, no problem, I will improve the PR to
>> have
>> > three modules (common, spark1, spark2) and let users pick the desired
>> > version.
>> >
>> > Let's wait a bit other feedbacks, I will update the PR accordingly.
>> >
>> > Regards
>> > JB
>> >
>> >
>> > On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
>> >
>> >> I'm generally a -0.5 on this change, or at least doing so hastily.
>> >>
>> >> As with dropping Java 7 support, I think this should at least be
>> >> announced in release notes that we're considering dropping support in
>> >> the subsequent release, as this dev list likely does not reach a
>> >> substantial portion of the userbase.
>> >>
>> >> How much work is it to move from a Spark 1.x cluster to a Spark 2.x
>> >> cluster? I get the feeling it's not nearly as transparent as upgrading
>> >> Java versions. Can Spark 1.x pipelines be run on Spark 2.x clusters,
>> >> or is a new cluster (and/or upgrading all pipelines) required (e.g.
>> >> for those who operate spark clusters shared among their many users)?
>> >>
>> >> Looks like the latest release of Spark 1.x was about a year ago,
>> >> overlapping a bit with the 2.x series which is coming up on 1.5 years
>> >> old, so I could see a lot of people still using 1.x even if 2.x is
>> >> clearly the future. But it sure doesn't seem very backwards
>> >> compatible.
>> >>
>> >> Mostly I'm not comfortable with dropping 1.x in the same release as
>> >> adding support for 2.x, giving no transition period, but could be
>> >> convinced if this transition is mostly a no-op or no one's still using
>> >> 1.x. If there's non-trivial code complexity issues, I would perhaps
>> >> revisit the issue of having a single Spark Runner that does chooses
>> >> the backend implicitly in favor of simply having two runners which
>> >> share the code that's easy to share and diverge otherwise (which seems
>> >> it would be much simpler both to implement and explain to users). I
>> >> would be OK with even letting the Spark 1.x runner be somewhat
>> >> stagnant (e.g. few or no new features) until we decide we can kill it
>> >> off.
>> >>
>> >> On Tue, Nov 7, 2017 at 11:27 PM, Jean-Baptiste Onofré > >
>> >> wrote:
>> >>
>> >>> Hi all,
>> >>>
>> >>> as you might know, we are working on Spark 2.x support in the Spark
>> >>> runner.
>> >>>
>> >>> I'm working on a PR about that:
>> >>>
>> >>> https://github.com/apache/beam/pull/3808
>> >>>
>> >>> Today, we have something working with both Spark 1.x and 2.x from a
>> code
>> >>> standpoint, but I have to deal with dependencies. It's the first step
>> of
>> >>> the
>> >>> update as I'm still using RDD, the second step would be to support
>> >>> dataframe
>> >>> (but for that, I would need PCollection elements with schemas, that's
>> >>> another topic on which Eugene, Reuven and I are discussing).
>> >>>
>> >>> However, as all major distributions now ship Spark 2.x, I don't think
>> >>> it's
>> >>> required anymore to support Spark 1.x.
>> >>>
>> >>> If we agree, I will update and cleanup the PR to only support and
>> focus
>> >>> on
>> >>> Spark 2.x.
>> >>>
>> >>> So, that's why I'm calling for a vote:
>> >>>
>> >>>[ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>> >>>[ ] 0 (I don't care ;))
>> >>>[ ] -1, I would like to still support Spark 1.x, and so having
>> >>> support of
>> >>> both Spark 1.x and 2.x (please provide specific comment)
>> >>>
>> >>> This vote is open for 48 hours (I have the commits ready, just 

Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-08 Thread Ted Yu
Having both Spark1 and Spark2 modules would benefit a wider user base.

I would vote for that.

Cheers

On Wed, Nov 8, 2017 at 12:51 AM, Jean-Baptiste Onofré 
wrote:

> Hi Robert,
>
> Thanks for your feedback !
>
> From a user perspective, with the current state of the PR, the same
> pipelines can run on both Spark 1.x and 2.x: the only difference is the
> set of dependencies.
>
> I'm calling the vote to get such feedback: if we consider that Spark
> 1.x still needs to be supported, no problem, I will improve the PR to have
> three modules (common, spark1, spark2) and let users pick the desired
> version.
>
> Let's wait a bit for other feedback; I will update the PR accordingly.
>
> Regards
> JB
>
>
> On 11/08/2017 09:47 AM, Robert Bradshaw wrote:
>
>> I'm generally a -0.5 on this change, or at least doing so hastily.
>>
>> As with dropping Java 7 support, I think this should at least be
>> announced in release notes that we're considering dropping support in
>> the subsequent release, as this dev list likely does not reach a
>> substantial portion of the userbase.
>>
>> How much work is it to move from a Spark 1.x cluster to a Spark 2.x
>> cluster? I get the feeling it's not nearly as transparent as upgrading
>> Java versions. Can Spark 1.x pipelines be run on Spark 2.x clusters,
>> or is a new cluster (and/or upgrading all pipelines) required (e.g.
>> for those who operate spark clusters shared among their many users)?
>>
>> Looks like the latest release of Spark 1.x was about a year ago,
>> overlapping a bit with the 2.x series which is coming up on 1.5 years
>> old, so I could see a lot of people still using 1.x even if 2.x is
>> clearly the future. But it sure doesn't seem very backwards
>> compatible.
>>
>> Mostly I'm not comfortable with dropping 1.x in the same release as
>> adding support for 2.x, giving no transition period, but could be
>> convinced if this transition is mostly a no-op or no one's still using
>> 1.x. If there's non-trivial code complexity issues, I would perhaps
>> revisit the issue of having a single Spark Runner that does chooses
>> the backend implicitly in favor of simply having two runners which
>> share the code that's easy to share and diverge otherwise (which seems
>> it would be much simpler both to implement and explain to users). I
>> would be OK with even letting the Spark 1.x runner be somewhat
>> stagnant (e.g. few or no new features) until we decide we can kill it
>> off.
>>
>> On Tue, Nov 7, 2017 at 11:27 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi all,
>>>
>>> as you might know, we are working on Spark 2.x support in the Spark
>>> runner.
>>>
>>> I'm working on a PR about that:
>>>
>>> https://github.com/apache/beam/pull/3808
>>>
>>> Today, we have something working with both Spark 1.x and 2.x from a code
>>> standpoint, but I have to deal with dependencies. It's the first step of
>>> the
>>> update as I'm still using RDD, the second step would be to support
>>> dataframe
>>> (but for that, I would need PCollection elements with schemas, that's
>>> another topic on which Eugene, Reuven and I are discussing).
>>>
>>> However, as all major distributions now ship Spark 2.x, I don't think
>>> it's
>>> required anymore to support Spark 1.x.
>>>
>>> If we agree, I will update and cleanup the PR to only support and focus
>>> on
>>> Spark 2.x.
>>>
>>> So, that's why I'm calling for a vote:
>>>
>>>[ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>>>[ ] 0 (I don't care ;))
>>>[ ] -1, I would like to still support Spark 1.x, and so having
>>> support of
>>> both Spark 1.x and 2.x (please provide specific comment)
>>>
>>> This vote is open for 48 hours (I have the commits ready, just waiting
>>> the
>>> end of the vote to push on the PR).
>>>
>>> Thanks !
>>> Regards
>>> JB
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [VOTE] Drop Spark 1.x support to focus on Spark 2.x

2017-11-08 Thread Robert Bradshaw
I'm generally a -0.5 on this change, or at least doing so hastily.

As with dropping Java 7 support, I think we should at least announce
in the release notes that we're considering dropping support in
the subsequent release, as this dev list likely does not reach a
substantial portion of the userbase.

How much work is it to move from a Spark 1.x cluster to a Spark 2.x
cluster? I get the feeling it's not nearly as transparent as upgrading
Java versions. Can Spark 1.x pipelines be run on Spark 2.x clusters,
or is a new cluster (and/or upgrading all pipelines) required (e.g.
for those who operate spark clusters shared among their many users)?

Looks like the latest release of Spark 1.x was about a year ago,
overlapping a bit with the 2.x series which is coming up on 1.5 years
old, so I could see a lot of people still using 1.x even if 2.x is
clearly the future. But it sure doesn't seem very backwards
compatible.

Mostly I'm not comfortable with dropping 1.x in the same release as
adding support for 2.x, giving no transition period, but could be
convinced if this transition is mostly a no-op or no one's still using
1.x. If there are non-trivial code complexity issues, I would perhaps
revisit the idea of having a single Spark runner that chooses
the backend implicitly, in favor of simply having two runners which
share the code that's easy to share and diverge otherwise (which seems like
it would be much simpler both to implement and to explain to users). I
would be OK with even letting the Spark 1.x runner be somewhat
stagnant (e.g. few or no new features) until we decide we can kill it
off.

On Tue, Nov 7, 2017 at 11:27 PM, Jean-Baptiste Onofré  wrote:
> Hi all,
>
> as you might know, we are working on Spark 2.x support in the Spark runner.
>
> I'm working on a PR about that:
>
> https://github.com/apache/beam/pull/3808
>
> Today, we have something working with both Spark 1.x and 2.x from a code
> standpoint, but I have to deal with dependencies. It's the first step of the
> update as I'm still using RDD, the second step would be to support dataframe
> (but for that, I would need PCollection elements with schemas, that's
> another topic on which Eugene, Reuven and I are discussing).
>
> However, as all major distributions now ship Spark 2.x, I don't think it's
> required anymore to support Spark 1.x.
>
> If we agree, I will update and cleanup the PR to only support and focus on
> Spark 2.x.
>
> So, that's why I'm calling for a vote:
>
>   [ ] +1 to drop Spark 1.x support and upgrade to Spark 2.x only
>   [ ] 0 (I don't care ;))
>   [ ] -1, I would like to still support Spark 1.x, and so having support of
> both Spark 1.x and 2.x (please provide specific comment)
>
> This vote is open for 48 hours (I have the commits ready, just waiting the
> end of the vote to push on the PR).
>
> Thanks !
> Regards
> JB
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com