Re: code freeze and branch cut for Apache Spark 2.4

2018-09-05 Thread Hyukjin Kwon
Oops, one more - https://github.com/apache/spark/pull/6. I just read
this thread.

On Thu, Sep 6, 2018 at 12:12 PM, Sean Owen wrote:

> (I slipped https://github.com/apache/spark/pull/22340 in for Scala 2.12.
> Maybe it really is the last one. In any event, yes go ahead with a 2.4 RC)
>
> On Wed, Sep 5, 2018 at 8:14 PM Wenchen Fan  wrote:
>
>> The repartition correctness bug fix is merged. The Scala 2.12 PRs
>> mentioned in this thread are all merged. The Kryo upgrade is done.
>>
>> I'm going to cut the branch 2.4 since all the major blockers are now
>> resolved.
>>
>> Thanks,
>> Wenchen
>>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-09-05 Thread Sean Owen
(I slipped https://github.com/apache/spark/pull/22340 in for Scala 2.12.
Maybe it really is the last one. In any event, yes go ahead with a 2.4 RC)

On Wed, Sep 5, 2018 at 8:14 PM Wenchen Fan  wrote:

> The repartition correctness bug fix is merged. The Scala 2.12 PRs
> mentioned in this thread are all merged. The Kryo upgrade is done.
>
> I'm going to cut the branch 2.4 since all the major blockers are now
> resolved.
>
> Thanks,
> Wenchen
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-09-05 Thread Wenchen Fan
The repartition correctness bug fix is merged. The Scala 2.12 PRs mentioned
in this thread are all merged. The Kryo upgrade is done.

I'm going to cut the branch 2.4 since all the major blockers are now
resolved.

Thanks,
Wenchen

On Sun, Sep 2, 2018 at 12:07 AM sadhen  wrote:

> https://github.com/apache/spark/pull/22308
>
> https://github.com/apache/spark/pull/22310
>
>
> These two might be the last fixes for Scala 2.12 :)
>
>
> Please review.
>
>  Original message
> *From:* Sean Owen
> *To:* antonkulaga
> *Cc:* dev
> *Sent:* Friday, August 31, 2018, 05:00
> *Subject:* Re: code freeze and branch cut for Apache Spark 2.4
>
> I know it's famous last words, but we really might be down to the last
> fix: https://github.com/apache/spark/pull/22264 More a question of making
> tests happy at this point I think than fundamental problems. My goal is to
> make sure we can release a usable, but beta-quality, 2.12 release of Spark
> in 2.4.
>
> On Thu, Aug 30, 2018 at 3:56 PM antonkulaga  wrote:
>
>> >There are a few PRs to fix Scala 2.12 issues. I think they will keep
>> coming
>> up and we don't need to block Spark 2.4 on this.
>>
>> I think it would be better to wait a bit for Scala 2.12 support in 2.4 than
>> to suffer for many months until a Spark 2.5 with 2.12 support is released.
>> Scala 2.12 is not only about Spark but also about the many Scala libraries
>> that have stopped supporting Scala 2.11; if Spark 2.4 does not support
>> Scala 2.12, people will not be able to use them in their Zeppelin, Jupyter,
>> and other notebooks together with Spark.
>>
>>


Re: code freeze and branch cut for Apache Spark 2.4

2018-09-01 Thread sadhen
https://github.com/apache/spark/pull/22308
https://github.com/apache/spark/pull/22310


These two might be the last fixes for Scala 2.12 :)


Please review.


Original message
From: Sean Owen <sro...@apache.org>
To: antonkulaga <antonkul...@gmail.com>
Cc: dev <dev...@spark.apache.org>
Sent: Friday, August 31, 2018, 05:00
Subject: Re: code freeze and branch cut for Apache Spark 2.4


I know it's famous last words, but we really might be down to the last
fix: https://github.com/apache/spark/pull/22264 More a question of making tests
happy at this point I think than fundamental problems. My goal is to make sure
we can release a usable, but beta-quality, 2.12 release of Spark in 2.4.


On Thu, Aug 30, 2018 at 3:56 PM antonkulaga <antonkul...@gmail.com> wrote:

There are a few PRs to fix Scala 2.12 issues. I think they will keep coming
 up and we don't need to block Spark 2.4 on this.
 
 I think it would be better to wait a bit for Scala 2.12 support in 2.4 than
 to suffer for many months until a Spark 2.5 with 2.12 support is released.
 Scala 2.12 is not only about Spark but also about the many Scala libraries
 that have stopped supporting Scala 2.11; if Spark 2.4 does not support
 Scala 2.12, people will not be able to use them in their Zeppelin, Jupyter,
 and other notebooks together with Spark.

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread shane knapp
+1 on beta support for scala 2.12

On Thu, Aug 30, 2018 at 2:33 PM, Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> +1 that would be great Sean, also you put a lot of effort in there, would
> make sense to wait a bit.
>
> Stavros
>
> On Fri, Aug 31, 2018 at 12:00 AM, Sean Owen  wrote:
>
>> I know it's famous last words, but we really might be down to the last
>> fix: https://github.com/apache/spark/pull/22264 More a question of
>> making tests happy at this point I think than fundamental problems. My goal
>> is to make sure we can release a usable, but beta-quality, 2.12 release of
>> Spark in 2.4.
>>
>> On Thu, Aug 30, 2018 at 3:56 PM antonkulaga 
>> wrote:
>>
>>> >There are a few PRs to fix Scala 2.12 issues. I think they will keep
>>> coming
>>> up and we don't need to block Spark 2.4 on this.
>>>
>>> I think it would be better to wait a bit for Scala 2.12 support in 2.4
>>> than to suffer for many months until a Spark 2.5 with 2.12 support is
>>> released. Scala 2.12 is not only about Spark but also about the many Scala
>>> libraries that have stopped supporting Scala 2.11; if Spark 2.4 does not
>>> support Scala 2.12, people will not be able to use them in their Zeppelin,
>>> Jupyter, and other notebooks together with Spark.
>>>
>>>
>
>
>
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Stavros Kontopoulos
+1 that would be great Sean, also you put a lot of effort in there, would
make sense to wait a bit.

Stavros

On Fri, Aug 31, 2018 at 12:00 AM, Sean Owen  wrote:

> I know it's famous last words, but we really might be down to the last
> fix: https://github.com/apache/spark/pull/22264 More a question of making
> tests happy at this point I think than fundamental problems. My goal is to
> make sure we can release a usable, but beta-quality, 2.12 release of Spark
> in 2.4.
>
> On Thu, Aug 30, 2018 at 3:56 PM antonkulaga  wrote:
>
>> >There are a few PRs to fix Scala 2.12 issues. I think they will keep
>> coming
>> up and we don't need to block Spark 2.4 on this.
>>
>> I think it would be better to wait a bit for Scala 2.12 support in 2.4 than
>> to suffer for many months until a Spark 2.5 with 2.12 support is released.
>> Scala 2.12 is not only about Spark but also about the many Scala libraries
>> that have stopped supporting Scala 2.11; if Spark 2.4 does not support
>> Scala 2.12, people will not be able to use them in their Zeppelin, Jupyter,
>> and other notebooks together with Spark.
>>
>>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Sean Owen
I know it's famous last words, but we really might be down to the last fix:
https://github.com/apache/spark/pull/22264 More a question of making tests
happy at this point I think than fundamental problems. My goal is to make
sure we can release a usable, but beta-quality, 2.12 release of Spark in
2.4.

On Thu, Aug 30, 2018 at 3:56 PM antonkulaga  wrote:

> >There are a few PRs to fix Scala 2.12 issues. I think they will keep
> coming
> up and we don't need to block Spark 2.4 on this.
>
> I think it would be better to wait a bit for Scala 2.12 support in 2.4 than
> to suffer for many months until a Spark 2.5 with 2.12 support is released.
> Scala 2.12 is not only about Spark but also about the many Scala libraries
> that have stopped supporting Scala 2.11; if Spark 2.4 does not support
> Scala 2.12, people will not be able to use them in their Zeppelin, Jupyter,
> and other notebooks together with Spark.
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Reynold Xin
Let's see how they go. At some point we do need to cut the release. That
argument can be made on every feature, and different people place different
value / importance on different features, so we could just end up never
making a release.



On Thu, Aug 30, 2018 at 1:56 PM antonkulaga  wrote:

> >There are a few PRs to fix Scala 2.12 issues. I think they will keep
> coming
> up and we don't need to block Spark 2.4 on this.
>
> I think it would be better to wait a bit for Scala 2.12 support in 2.4 than
> to suffer for many months until a Spark 2.5 with 2.12 support is released.
> Scala 2.12 is not only about Spark but also about the many Scala libraries
> that have stopped supporting Scala 2.11; if Spark 2.4 does not support
> Scala 2.12, people will not be able to use them in their Zeppelin, Jupyter,
> and other notebooks together with Spark.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-29 Thread Wenchen Fan
A few updates on this thread:

We still have a blocking issue, the repartition correctness bug:
https://github.com/apache/spark/pull/22112
It's close to merging.

There are a few PRs to fix Scala 2.12 issues. I think they will keep coming
up and we don't need to block Spark 2.4 on this.

All other features/issues mentioned in this thread are either finished or
retargeted to the next release, hopefully we can cut the branch this week.

Thanks to everyone for your contributions! Please reply to this email if
you think something should be done before Spark 2.4.

Thanks,
Wenchen

On Tue, Aug 14, 2018 at 12:57 AM Xingbo Jiang  wrote:

> I'm working on the fix for SPARK-23243 and should be able to push another
> commit in 1~2 days. More detailed discussion can happen on the PR.
> Thanks for pushing this issue forward! I really appreciate everyone's
> efforts in submitting PRs and joining the discussions!
>
> 2018-08-13 22:50 GMT+08:00 Tom Graves :
>
>> I agree with Imran, we need to fix SPARK-23243 and any correctness
>> issues for that matter.
>>
>> Tom
>>
>> On Wednesday, August 8, 2018, 9:06:43 AM CDT, Imran Rashid
>>  wrote:
>>
>>
>> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>>
>> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
>> It turns out to be a very complicated issue, there is no consensus about
>> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
>> long-standing issue, not a regression.
>>
>>
>> This is a really serious data loss bug.  Yes, it's very complex, but we
>> absolutely have to fix this; I really think it should be in 2.4.
>> Has work on it stopped?
>>
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-13 Thread Xingbo Jiang
I'm working on the fix for SPARK-23243 and should be able to push another
commit in 1~2 days. More detailed discussion can happen on the PR.
Thanks for pushing this issue forward! I really appreciate everyone's
efforts in submitting PRs and joining the discussions!

2018-08-13 22:50 GMT+08:00 Tom Graves :

> I agree with Imran, we need to fix SPARK-23243 and any correctness
> issues for that matter.
>
> Tom
>
> On Wednesday, August 8, 2018, 9:06:43 AM CDT, Imran Rashid
>  wrote:
>
>
> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>
> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
> It turns out to be a very complicated issue, there is no consensus about
> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
> long-standing issue, not a regression.
>
>
> This is a really serious data loss bug.  Yes, it's very complex, but we
> absolutely have to fix this; I really think it should be in 2.4.
> Has work on it stopped?
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-13 Thread Tom Graves
I agree with Imran, we need to fix SPARK-23243 and any correctness issues for
that matter.

Tom

On Wednesday, August 8, 2018, 9:06:43 AM CDT, Imran Rashid  wrote:

On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
It turns out to be a very complicated issue, there is no consensus about what 
is the right fix yet. Likely to miss it in Spark 2.4 because it's a 
long-standing issue, not a regression.

This is a really serious data loss bug. Yes, it's very complex, but we
absolutely have to fix this; I really think it should be in 2.4. Has work on
it stopped?

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-11 Thread Petar Zečević


Hi, I made some changes to SPARK-24020 
(https://github.com/apache/spark/pull/21109) and implemented spill-over to 
disk. I believe there are no objections to the implementation left and that 
this can now be merged.

Please take a look.

Thanks,

Petar Zečević


Wenchen Fan wrote:

> Some updates for the JIRA tickets that we want to resolve before Spark 2.4.
>
> green: merged
> orange: in progress
> red: likely to miss
>
> SPARK-24374: Support Barrier Execution Mode in Apache Spark
> The core functionality is finished, but we still need to add Python API. 
> Tracked by SPARK-24822
>
> SPARK-23899: Built-in SQL Function Improvement
> I think it's ready to go. Although some functions are still in progress, the
> common ones are all merged.
>
> SPARK-14220: Build and test Spark against Scala 2.12
> It's close, just one last piece. Tracked by SPARK-25029
>
> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
> Being reviewed.
>
> SPARK-24882: data source v2 API improvement
> PR is out, being reviewed.
>
> SPARK-24252: Add catalog support in Data Source V2
> Being reviewed.
>
> SPARK-24768: Have a built-in AVRO data source implementation
> It's close, just one last piece: the decimal type support
>
> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
> It turns out to be a very complicated issue, there is no consensus about what 
> is the right fix yet. Likely to miss it in Spark 2.4 because it's a 
> long-standing issue, not a regression.
>
> SPARK-24598: Datatype overflow conditions gives incorrect result
> We decided to keep the current behavior in Spark 2.4 and add some
> documentation (already done). We will reconsider this change in Spark 3.0.
>
> SPARK-24020: Sort-merge join inner range optimization
> There are some discussions about the design, I don't think we can get to a 
> consensus within Spark 2.4.
>
> SPARK-24296: replicating large blocks over 2GB
> Being reviewed.
>
> SPARK-23874: upgrade to Apache Arrow 0.10.0
> Apache Arrow 0.10.0 has some critical bug fixes and is being voted on; we
> should wait a few days.
>
> According to the status, I think we should wait a few more days. Any 
> objections?
>
> Thanks,
> Wenchen
>
> On Tue, Aug 7, 2018 at 3:39 AM Sean Owen  wrote:
>
>  ... and we still have a few snags with Scala 2.12 support at 
> https://issues.apache.org/jira/browse/SPARK-25029 
>
>  There is some hope of resolving it on the order of a week, so for the 
> moment, seems worth holding 2.4 for.
>
>  On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler  wrote:
>
>  Hi All,
>
>  I'd like to request a few days extension to the code freeze to complete the 
> upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes several 
> key improvements and bug fixes.  The RC vote just passed this morning and code
>  changes are complete in https://github.com/apache/spark/pull/21939. We just 
> need some time for the release artifacts to be available. Thoughts?
>
>  Thanks,
>  Bryan


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
> I also think it's a good idea to test against newer Python versions. But I
> don't know how difficult it is and whether or not it's feasible to resolve
> that between branch cut and RC cut.

unless someone pops in to this thread and tells me w/o a doubt that all
spark branches will happily pass against 3.5, it will not happen until
after the 2.4 cut.  :)

however, from my (limited) testing, it does look like that's the case.
still not gonna pull the trigger on it until after the cut.

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Li Jin
I agree with Bryan. If it's acceptable to have another job to test with
Python 3.5 and pyarrow 0.10.0, I am leaning towards upgrading Arrow.

Arrow 0.10.0 has tons of bug fixes and improvements over 0.8.0, including
important memory leak fixes such as
https://issues.apache.org/jira/browse/ARROW-1973. I think releasing with
0.10.0 will improve the overall experience of Arrow-related features quite
a bit.

I also think it's a good idea to test against newer Python versions. But I
don't know how difficult it is and whether or not it's feasible to resolve
that between branch cut and RC cut.
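
As background for the pyarrow 0.10.0 discussion: Arrow-based features are
typically gated on a minimum installed pyarrow version. Below is a hedged,
plain-Python sketch of such a version gate; the names (parse_version,
require_minimum_pyarrow, MIN_PYARROW) are illustrative assumptions, not
Spark's actual API.

```python
# Illustrative sketch of a minimum-version gate for Arrow-based features.
# All names here are hypothetical; real PySpark performs a similar check
# before enabling its pandas/Arrow code paths.

MIN_PYARROW = (0, 10, 0)

def parse_version(version_str):
    """Parse 'X.Y.Z' into a comparable tuple of ints, ignoring suffixes."""
    parts = []
    for piece in version_str.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def require_minimum_pyarrow(installed_version):
    """Raise ImportError if pyarrow is older than the supported minimum."""
    if parse_version(installed_version) < MIN_PYARROW:
        raise ImportError(
            "pyarrow >= %s required for Arrow features; found %s"
            % (".".join(map(str, MIN_PYARROW)), installed_version))

require_minimum_pyarrow("0.10.0")   # passes silently
try:
    require_minimum_pyarrow("0.8.0")
except ImportError as e:
    print("rejected:", e)
```

The point of gating centrally (rather than at each call site) is that a too-old
pyarrow fails fast with one clear message instead of obscure runtime errors.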

On Fri, Aug 10, 2018 at 5:44 PM, shane knapp  wrote:

> see:  https://github.com/apache/spark/pull/21939#issuecomment-412154343
>
> yes, i can set up a build.  have some Qs in the PR about building the
> spark package before running the python tests.
>
> On Fri, Aug 10, 2018 at 10:41 AM, Bryan Cutler  wrote:
>
>> I agree that we should hold off on the Arrow upgrade if it requires major
>> changes to our testing. I did have another thought that maybe we could just
>> add another job to test against Python 3.5 and pyarrow 0.10.0 and keep all
>> current testing the same? I'm not sure how doable that is right now and
>> don't want to make a ton of extra work, so no objections from me to hold
>> off on things for now.
>>
>> On Fri, Aug 10, 2018 at 9:48 AM, shane knapp  wrote:
>>
>>> On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan 
>>> wrote:
>>>
 It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave
 it to Spark 3.0, so that we have more time to test. Any objections?

>>>
>>> none here.
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
python 3.5/pyarrow 0.10.0 build:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.6-python-3.5-arrow-0.10.0-ubuntu-testing/

On Fri, Aug 10, 2018 at 10:44 AM, shane knapp  wrote:

> see:  https://github.com/apache/spark/pull/21939#issuecomment-412154343
>
> yes, i can set up a build.  have some Qs in the PR about building the
> spark package before running the python tests.
>
> On Fri, Aug 10, 2018 at 10:41 AM, Bryan Cutler  wrote:
>
>> I agree that we should hold off on the Arrow upgrade if it requires major
>> changes to our testing. I did have another thought that maybe we could just
>> add another job to test against Python 3.5 and pyarrow 0.10.0 and keep all
>> current testing the same? I'm not sure how doable that is right now and
>> don't want to make a ton of extra work, so no objections from me to hold
>> off on things for now.
>>
>> On Fri, Aug 10, 2018 at 9:48 AM, shane knapp  wrote:
>>
>>> On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan 
>>> wrote:
>>>
 It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave
 it to Spark 3.0, so that we have more time to test. Any objections?

>>>
>>> none here.
>>>
>>> --
>>> Shane Knapp
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>



-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
see:  https://github.com/apache/spark/pull/21939#issuecomment-412154343

yes, i can set up a build.  have some Qs in the PR about building the spark
package before running the python tests.

On Fri, Aug 10, 2018 at 10:41 AM, Bryan Cutler  wrote:

> I agree that we should hold off on the Arrow upgrade if it requires major
> changes to our testing. I did have another thought that maybe we could just
> add another job to test against Python 3.5 and pyarrow 0.10.0 and keep all
> current testing the same? I'm not sure how doable that is right now and
> don't want to make a ton of extra work, so no objections from me to hold
> off on things for now.
>
> On Fri, Aug 10, 2018 at 9:48 AM, shane knapp  wrote:
>
>> On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan  wrote:
>>
>>> It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave
>>> it to Spark 3.0, so that we have more time to test. Any objections?
>>>
>>
>> none here.
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Bryan Cutler
I agree that we should hold off on the Arrow upgrade if it requires major
changes to our testing. I did have another thought that maybe we could just
add another job to test against Python 3.5 and pyarrow 0.10.0 and keep all
current testing the same? I'm not sure how doable that is right now and
don't want to make a ton of extra work, so no objections from me to hold
off on things for now.

On Fri, Aug 10, 2018 at 9:48 AM, shane knapp  wrote:

> On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan  wrote:
>
>> It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave
>> it to Spark 3.0, so that we have more time to test. Any objections?
>>
>
> none here.
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
On Fri, Aug 10, 2018 at 9:47 AM, Wenchen Fan  wrote:

> It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave it
> to Spark 3.0, so that we have more time to test. Any objections?
>

none here.

-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread Wenchen Fan
It seems safer to skip the arrow 0.10.0 upgrade for Spark 2.4 and leave it
to Spark 3.0, so that we have more time to test. Any objections?

On Fri, Aug 10, 2018 at 11:53 PM shane knapp  wrote:

> quick update from my end:
>
> SPARK-24433 (SparkR/k8s) depends on SPARK-25087 (move builds to ubuntu)
>
> SPARK-23874 (arrow -> 0.10.0) now depends on SPARK-25079 (python 3.5
> upgrade)
>
> both SPARK-25087 and SPARK-25079 are in progress and i'm very very
> hesitant to do these upgrades before the code freeze/branch cut.  i've done
> a TON of testing, but even as of yesterday afternoon, i'm still uncovering
> bugs and things that need fixing both on the infrastructure side and spark
> itself.
>
> h/t sean owen for helping out on SPARK-24950
>
> On Wed, Aug 8, 2018 at 10:51 AM, Mark Hamstra 
> wrote:
>
>> I'm inclined to agree. Just saying that it is not a regression doesn't
>> really cut it when it is a now known data correctness issue. We need
>> something a lot more than nothing before releasing 2.4.0. At a barest
>> minimum, that has to be much more complete and publicly highlighted
>> documentation of the issue so that users are less likely to stumble into
>> this unaware; but really we need to fix at least the most common cases of
>> this bug. Backports to maintenance branches are also probably in order.
>>
>> On Wed, Aug 8, 2018 at 7:06 AM Imran Rashid 
>> wrote:
>>
>>> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:

 SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
 It turns out to be a very complicated issue, there is no consensus
 about what is the right fix yet. Likely to miss it in Spark 2.4 because
 it's a long-standing issue, not a regression.

>>>
>>> This is a really serious data loss bug.  Yes, it's very complex, but we
>>> absolutely have to fix this; I really think it should be in 2.4.
>>> Has work on it stopped?
>>>
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-10 Thread shane knapp
quick update from my end:

SPARK-24433 (SparkR/k8s) depends on SPARK-25087 (move builds to ubuntu)

SPARK-23874 (arrow -> 0.10.0) now depends on SPARK-25079 (python 3.5
upgrade)

both SPARK-25087 and SPARK-25079 are in progress and i'm very very hesitant
to do these upgrades before the code freeze/branch cut.  i've done a TON of
testing, but even as of yesterday afternoon, i'm still uncovering bugs and
things that need fixing both on the infrastructure side and spark itself.

h/t sean owen for helping out on SPARK-24950

On Wed, Aug 8, 2018 at 10:51 AM, Mark Hamstra 
wrote:

> I'm inclined to agree. Just saying that it is not a regression doesn't
> really cut it when it is a now known data correctness issue. We need
> something a lot more than nothing before releasing 2.4.0. At a barest
> minimum, that has to be much more complete and publicly highlighted
> documentation of the issue so that users are less likely to stumble into
> this unaware; but really we need to fix at least the most common cases of
> this bug. Backports to maintenance branches are also probably in order.
>
> On Wed, Aug 8, 2018 at 7:06 AM Imran Rashid 
> wrote:
>
>> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>>>
>>> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
>>> It turns out to be a very complicated issue, there is no consensus about
>>> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
>>> long-standing issue, not a regression.
>>>
>>
>> This is a really serious data loss bug.  Yes, it's very complex, but we
>> absolutely have to fix this; I really think it should be in 2.4.
>> Has work on it stopped?
>>
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-08 Thread Mark Hamstra
I'm inclined to agree. Just saying that it is not a regression doesn't
really cut it when it is a now known data correctness issue. We need
something a lot more than nothing before releasing 2.4.0. At a barest
minimum, that has to be much more complete and publicly highlighted
documentation of the issue so that users are less likely to stumble into
this unaware; but really we need to fix at least the most common cases of
this bug. Backports to maintenance branches are also probably in order.

On Wed, Aug 8, 2018 at 7:06 AM Imran Rashid 
wrote:

> On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>>
>> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
>> It turns out to be a very complicated issue, there is no consensus about
>> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
>> long-standing issue, not a regression.
>>
>
> This is a really serious data loss bug.  Yes, it's very complex, but we
> absolutely have to fix this; I really think it should be in 2.4.
> Has work on it stopped?
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-08 Thread Imran Rashid
On Tue, Aug 7, 2018 at 8:39 AM, Wenchen Fan  wrote:
>
> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
> It turns out to be a very complicated issue, there is no consensus about
> what is the right fix yet. Likely to miss it in Spark 2.4 because it's a
> long-standing issue, not a regression.
>

This is a really serious data loss bug. Yes, it's very complex, but we
absolutely have to fix this; I really think it should be in 2.4.
Has work on it stopped?
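

For readers following along: SPARK-23243 arises because repartition
distributes records round-robin, while a non-deterministic upstream stage can
present records in a different order on a task retry. The plain-Python sketch
below (illustrative only, not Spark code; round_robin_partition is a
hypothetical helper mimicking RDD.repartition's distribution) shows how mixing
output from two attempts can both drop and duplicate records.

```python
# Illustrative sketch of the SPARK-23243 failure mode in plain Python.
# The two input lists simulate the record order an upstream stage produced
# on the first attempt vs. after a retry.

def round_robin_partition(records, num_partitions):
    """Assign records to partitions round-robin, like RDD.repartition."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts

records_attempt1 = [1, 2, 3, 4, 5, 6]   # order seen by the first attempt
records_attempt2 = [2, 1, 3, 4, 6, 5]   # different order seen after a retry

parts1 = round_robin_partition(records_attempt1, 2)  # [[1, 3, 5], [2, 4, 6]]
parts2 = round_robin_partition(records_attempt2, 2)  # [[2, 3, 6], [1, 4, 5]]

# Suppose partition 0's task succeeded on attempt 1, but partition 1's task
# was lost and recomputed from the attempt-2 ordering. The final output mixes
# the two attempts:
final = parts1[0] + parts2[1]

print(sorted(parts1[0] + parts1[1]))  # attempt 1 alone: [1, 2, 3, 4, 5, 6]
print(sorted(final))                  # mixed: [1, 1, 3, 4, 5, 5]
# Records 2 and 6 are lost; records 1 and 5 are duplicated.
```

This is why the discussed fixes revolve around either making the input order
deterministic (e.g. sorting before round-robin) or recomputing the whole stage
on failure rather than a single partition.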


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread John Zhuge
+1 on SPARK-25004. We have found it quite useful to diagnose PySpark OOM.

On Tue, Aug 7, 2018 at 1:21 PM Holden Karau  wrote:

> I'd like to suggest we consider SPARK-25004 (hopefully it goes in soon);
> solving some of the consistent Python memory issues we've had for years
> would be really amazing to get in.
>
> On Tue, Aug 7, 2018 at 1:07 PM, Tom Graves 
> wrote:
>
>> I would like to get clarification on our avro compatibility story before
>> the release.  anyone interested please look at -
>> https://issues.apache.org/jira/browse/SPARK-24924 . I probably should
>> have filed a separate jira and can if we don't resolve via discussion there.
>>
>> Tom
>>
>> On Tuesday, August 7, 2018, 11:46:31 AM CDT, shane knapp <
>> skn...@berkeley.edu> wrote:
>>
>>
>> According to the status, I think we should wait a few more days. Any
>> objections?
>>
>>
>> none here.
>>
>> i'm also pretty certain that waiting until after the code freeze to start
>> testing the GHPRB on ubuntu is the wisest course of action for us.
>>
>> shane
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>>
>
>
> --
> Twitter: https://twitter.com/holdenkarau
>


-- 
John


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Holden Karau
I'd like to suggest we consider SPARK-25004 (hopefully it goes in soon);
solving some of the consistent Python memory issues we've had for years
would be really amazing to get in.

On Tue, Aug 7, 2018 at 1:07 PM, Tom Graves 
wrote:

> I would like to get clarification on our avro compatibility story before
> the release.  anyone interested please look at -
> https://issues.apache.org/jira/browse/SPARK-24924 . I probably should
> have filed a separate jira and can if we don't resolve via discussion there.
>
> Tom
>
> On Tuesday, August 7, 2018, 11:46:31 AM CDT, shane knapp <
> skn...@berkeley.edu> wrote:
>
>
> According to the status, I think we should wait a few more days. Any
> objections?
>
>
> none here.
>
> i'm also pretty certain that waiting until after the code freeze to start
> testing the GHPRB on ubuntu is the wisest course of action for us.
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
>


-- 
Twitter: https://twitter.com/holdenkarau


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Tom Graves
 I would like to get clarification on our avro compatibility story before the 
release.  anyone interested please look at - 
https://issues.apache.org/jira/browse/SPARK-24924 . I probably should have 
filed a separate jira and can if we don't resolve via discussion there.
Tom 
On Tuesday, August 7, 2018, 11:46:31 AM CDT, shane knapp 
 wrote:  
 
 
According to the status, I think we should wait a few more days. Any objections?

none here.

i'm also pretty certain that waiting until after the code freeze to start
testing the GHPRB on ubuntu is the wisest course of action for us.

shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread shane knapp
> According to the status, I think we should wait a few more days. Any
> objections?

none here.

i'm also pretty certain that waiting until after the code freeze to start
testing the GHPRB on ubuntu is the wisest course of action for us.

shane
-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-07 Thread Wenchen Fan
Some updates for the JIRA tickets that we want to resolve before Spark 2.4.

green: merged
orange: in progress
red: likely to miss

SPARK-24374: Support Barrier Execution Mode in Apache Spark
The core functionality is finished, but we still need to add the Python API.
Tracked by SPARK-24822.

SPARK-23899: Built-in SQL Function Improvement
I think it's ready to go. Although there are still some functions in
progress, the common ones are all merged.

SPARK-14220: Build and test Spark against Scala 2.12
It's close, just one last piece. Tracked by SPARK-25029.

SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
Being reviewed.

SPARK-24882: data source v2 API improvement
PR is out, being reviewed.

SPARK-24252: Add catalog support in Data Source V2
Being reviewed.

SPARK-24768: Have a built-in AVRO data source implementation
It's close, just one last piece: the decimal type support.

SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
It turns out to be a very complicated issue; there is no consensus yet about
what the right fix is. Likely to miss Spark 2.4 because it's a
long-standing issue, not a regression.

SPARK-24598: Datatype overflow conditions gives incorrect result
We decided to keep the current behavior in Spark 2.4 and add some
documentation (already done). We will reconsider this change in Spark 3.0.

SPARK-24020: Sort-merge join inner range optimization
There are some discussions about the design; I don't think we can get to a
consensus within Spark 2.4.

SPARK-24296: replicating large blocks over 2GB
Being reviewed.

SPARK-23874: upgrade to Apache Arrow 0.10.0
Apache Arrow 0.10.0 has some critical bug fixes and is being voted on; we
should wait a few days.
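As an aside on why SPARK-23243 is so hard: `repartition` distributes rows
round-robin by their position in the input, so when a retried task replays its
input in a nondeterministic order, rows land in different output partitions
than in the first attempt; mixing surviving outputs from one attempt with
recomputed outputs from another can then lose or duplicate rows. A minimal,
Spark-free Python sketch of that hazard (the function and data are
illustrative, not Spark code):

```python
def round_robin_partition(rows, num_partitions):
    # Assign each row to a partition by its position in the input,
    # mimicking how RDD.repartition() distributes rows round-robin.
    out = {p: [] for p in range(num_partitions)}
    for i, row in enumerate(rows):
        out[i % num_partitions].append(row)
    return out

rows_attempt_1 = [1, 2, 3, 4]  # order seen by the first task attempt
rows_attempt_2 = [2, 1, 3, 4]  # a retry may replay the same rows reordered

a1 = round_robin_partition(rows_attempt_1, 2)  # {0: [1, 3], 1: [2, 4]}
a2 = round_robin_partition(rows_attempt_2, 2)  # {0: [2, 3], 1: [1, 4]}

# If partition 0's output from attempt 1 survives but partition 1 is
# recomputed from attempt 2, row 1 is duplicated and row 2 is lost.
combined = a1[0] + a2[1]
print(sorted(combined))  # [1, 1, 3, 4] instead of [1, 2, 3, 4]
```

Candidate fixes along these lines (sorting inputs before distributing them, or
recomputing every downstream partition whenever any input is retried) each
carry real costs, which is part of why consensus has been hard to reach.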


According to the status, I think we should wait a few more days. Any
objections?

Thanks,
Wenchen


On Tue, Aug 7, 2018 at 3:39 AM Sean Owen  wrote:

> ... and we still have a few snags with Scala 2.12 support at
> https://issues.apache.org/jira/browse/SPARK-25029
>
> There is some hope of resolving it on the order of a week, so for the
> moment, seems worth holding 2.4 for.
>
> On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler  wrote:
>
>> Hi All,
>>
>> I'd like to request a few days extension to the code freeze to complete
>> the upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes
>> several key improvements and bug fixes.  The RC vote just passed this
>> morning and code changes are complete in
>> https://github.com/apache/spark/pull/21939. We just need some time for
>> the release artifacts to be available. Thoughts?
>>
>> Thanks,
>> Bryan
>>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-06 Thread Sean Owen
... and we still have a few snags with Scala 2.12 support at
https://issues.apache.org/jira/browse/SPARK-25029

There is some hope of resolving it on the order of a week, so for the
moment, seems worth holding 2.4 for.

On Mon, Aug 6, 2018 at 2:37 PM Bryan Cutler  wrote:

> Hi All,
>
> I'd like to request a few days extension to the code freeze to complete
> the upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes
> several key improvements and bug fixes.  The RC vote just passed this
> morning and code changes are complete in
> https://github.com/apache/spark/pull/21939. We just need some time for
> the release artifacts to be available. Thoughts?
>
> Thanks,
> Bryan
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-06 Thread Bryan Cutler
Hi All,

I'd like to request a few days extension to the code freeze to complete the
upgrade to Apache Arrow 0.10.0, SPARK-23874. This upgrade includes several
key improvements and bug fixes.  The RC vote just passed this morning and
code changes are complete in https://github.com/apache/spark/pull/21939. We
just need some time for the release artifacts to be available. Thoughts?

Thanks,
Bryan

On Wed, Aug 1, 2018, 5:34 PM shane knapp  wrote:

> ++ssuchter (who kindly set up the initial k8s builds while i hammered on
> the backend)
>
> while i'm pretty confident (read: 99%) that the pull request builds will
> work on the new ubuntu workers:
>
> 1) i'd like to do more stress testing of other spark builds (in progress)
> 2) i'd like to reimage more centos workers before moving the PRB due to
> potential executor starvation, and my lead sysadmin is out until next monday
> 3) we will need to get rid of the ubuntu-specific k8s builds and merge
> that functionality in to the existing PRB job.  after that:  testing and
> babysitting
>
> regarding (1):  if these damn builds didn't take 4+ hours, it would be
> going a lot quicker.  ;)
> regarding (2):  adding two more ubuntu workers would make me comfortable
> WRT number of available executors, and i will guarantee that can happen by
> EOD on the 7th.
> regarding (3):  this should take about a day, and realistically the
> earliest we can get this started is the 8th.  i haven't even had a chance
> to start looking at this stuff yet, either.
>
> if we push release by a week, i think i can get things sorted w/o
> impacting the release schedule.  there will still be a bunch of stuff to
> clean up from the old centos builds (specifically docs, packaging and
> release), but i'll leave the existing and working infrastructure in place
> for now.
>
> shane
>
> On Wed, Aug 1, 2018 at 4:39 PM, Erik Erlandson 
> wrote:
>
>> The PR for SparkR support on the kube back-end is completed, but waiting
>> for Shane to make some tweaks to the CI machinery for full testing support.
>> If the code freeze is being delayed, this PR could be merged as well.
>>
>> On Fri, Jul 6, 2018 at 9:47 AM, Reynold Xin  wrote:
>>
>>> FYI 6 mo is coming up soon since the last release. We will cut the
>>> branch and code freeze on Aug 1st in order to get 2.4 out on time.
>>>
>>>
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread shane knapp
++ssuchter (who kindly set up the initial k8s builds while i hammered on
the backend)

while i'm pretty confident (read: 99%) that the pull request builds will
work on the new ubuntu workers:

1) i'd like to do more stress testing of other spark builds (in progress)
2) i'd like to reimage more centos workers before moving the PRB due to
potential executor starvation, and my lead sysadmin is out until next monday
3) we will need to get rid of the ubuntu-specific k8s builds and merge that
functionality in to the existing PRB job.  after that:  testing and
babysitting

regarding (1):  if these damn builds didn't take 4+ hours, it would be
going a lot quicker.  ;)
regarding (2):  adding two more ubuntu workers would make me comfortable
WRT number of available executors, and i will guarantee that can happen by
EOD on the 7th.
regarding (3):  this should take about a day, and realistically the
earliest we can get this started is the 8th.  i haven't even had a chance
to start looking at this stuff yet, either.

if we push release by a week, i think i can get things sorted w/o impacting
the release schedule.  there will still be a bunch of stuff to clean up
from the old centos builds (specifically docs, packaging and release), but
i'll leave the existing and working infrastructure in place for now.

shane

On Wed, Aug 1, 2018 at 4:39 PM, Erik Erlandson  wrote:

> The PR for SparkR support on the kube back-end is completed, but waiting
> for Shane to make some tweaks to the CI machinery for full testing support.
> If the code freeze is being delayed, this PR could be merged as well.
>
> On Fri, Jul 6, 2018 at 9:47 AM, Reynold Xin  wrote:
>
>> FYI 6 mo is coming up soon since the last release. We will cut the branch
>> and code freeze on Aug 1st in order to get 2.4 out on time.
>>
>>
>


-- 
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Erik Erlandson
The PR for SparkR support on the kube back-end is completed, but waiting
for Shane to make some tweaks to the CI machinery for full testing support.
If the code freeze is being delayed, this PR could be merged as well.

On Fri, Jul 6, 2018 at 9:47 AM, Reynold Xin  wrote:

> FYI 6 mo is coming up soon since the last release. We will cut the branch
> and code freeze on Aug 1st in order to get 2.4 out on time.
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Erik Erlandson
I agree that looking at it from the pov of "code paths where isBarrier
tests were introduced" seems right.

From pr-21758 (the one already merged) there are 13 files touched under
core/src/main/scala/org/apache/spark/scheduler/, although most of those
appear to be relatively small edits. The "big" modifications are
concentrated on Task.scala and TaskSchedulerImpl.scala. The followup
pr-21898  touches a
subset of those.

The project-hydrogen epic for "barrier execution", SPARK-24374, contains 22
sub-issues, most of which are still open. Some are marked for future release
cycles; is there a specific set being proposed for 2.4? The various back-end
supports
look tagged for subsequent release cycles: is the 2.4 scope standalone
clusters?

CI will obviously exercise standard task scheduling code paths, which
indicates some level of stability.  Folks on the k8s big data SIG today
were interested in building test distributions for the barrier-related
features. I was reflecting that although the spark-on-kube fork was awkward
in some ways, it did provide a unified distribution that interested
community members could build, download and/or run. Project hydrogen is
currently incarnated as a set of PRs, but a unified test build that
included pr-21758  and
pr-21898  (and others?)
would be cool. I've never seen an ideal workflow for handling multi-PR
development efforts.


On Wed, Aug 1, 2018 at 1:43 PM, Imran Rashid  wrote:

> I still would like to do more review on barrier mode changes, but from
> what I've seen so far I agree. I dunno if it'll really be ready for use,
> but it should not pose much risk for code which doesn't touch the new
> features.  of course, every change has some risk, especially in the
> scheduler which has proven to be very brittle (I've written plenty of
> scheduler bugs while fixing other things myself).
>
> On Wed, Aug 1, 2018 at 1:13 PM, Xingbo Jiang 
> wrote:
>
>> Speaking of the code from hydrogen PRs, actually we didn't remove any of
>> the existing logic, and I tried my best to hide almost all of the newly
>> added logic behind a `isBarrier` tag (or something similar). I have to add
>> some new variables and new methods to the core code paths, but I think they
>> shall not be hit if you are not running barrier workloads.
>>
>> The only significant change I can think of is I swapped the sequence of
>> failure handling in DAGScheduler, moving the `case FetchFailed` block to
>> before the `case Resubmitted` block, but again I don't think this shall
>> affect a regular workload because anyway you can only have one failure type.
>>
>> Actually I also reviewed the previous PRs adding Spark on K8s support,
>> and I feel it's a good example of how to add new features to a project
>> without breaking existing workloads, I'm trying to follow that way in
>> adding barrier execution mode support.
>>
>> I really appreciate any notice on hydrogen PRs and welcome comments to
>> help improve the feature, thanks!
>>
>> 2018-08-01 4:19 GMT+08:00 Reynold Xin :
>>
>>> I actually totally agree that we should make sure it should have no
>>> impact on existing code if the feature is not used.
>>>
>>>
>>> On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson 
>>> wrote:
>>>
 I don't have a comprehensive knowledge of the project hydrogen PRs,
 however I've perused them, and they make substantial modifications to
 Spark's core DAG scheduler code.

 What I'm wondering is: how high is the confidence level that the
 "traditional" code paths are still stable. Put another way, is it even
 possible to "turn off" or "opt out" of this experimental feature? This
 analogy isn't perfect, but for example the k8s back-end is a major body of
 code, but it has a very small impact on any *core* code paths, and so if
 you opt out of it, it is well understood that you aren't running any
 experimental code.

 Looking at the project hydrogen code, I'm less sure the same is true.
 However, maybe there is a clear way to show how it is true.


 On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra >>> > wrote:

> No reasonable amount of time is likely going to be sufficient to fully
> vet the code as a PR. I'm not entirely happy with the design and code as
> they currently are (and I'm still trying to find the time to more publicly
> express my thoughts and concerns), but I'm fine with them going into 2.4
> much as they are as long as they go in with proper stability annotations
> and are understood not to be cast-in-stone final implementations, but
> rather as a way to get people using them and generating the feedback that
> is necessary to get us to something more like a final design and
> 

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Imran Rashid
I still would like to do more review on barrier mode changes, but from what
I've seen so far I agree. I dunno if it'll really be ready for use, but it
should not pose much risk for code which doesn't touch the new features.
of course, every change has some risk, especially in the scheduler which
has proven to be very brittle (I've written plenty of scheduler bugs while
fixing other things myself).

On Wed, Aug 1, 2018 at 1:13 PM, Xingbo Jiang  wrote:

> Speaking of the code from hydrogen PRs, actually we didn't remove any of
> the existing logic, and I tried my best to hide almost all of the newly
> added logic behind a `isBarrier` tag (or something similar). I have to add
> some new variables and new methods to the core code paths, but I think they
> shall not be hit if you are not running barrier workloads.
>
> The only significant change I can think of is I swapped the sequence of
> failure handling in DAGScheduler, moving the `case FetchFailed` block to
> before the `case Resubmitted` block, but again I don't think this shall
> affect a regular workload because anyway you can only have one failure type.
>
> Actually I also reviewed the previous PRs adding Spark on K8s support, and
> I feel it's a good example of how to add new features to a project without
> breaking existing workloads, I'm trying to follow that way in adding
> barrier execution mode support.
>
> I really appreciate any notice on hydrogen PRs and welcome comments to
> help improve the feature, thanks!
>
> 2018-08-01 4:19 GMT+08:00 Reynold Xin :
>
>> I actually totally agree that we should make sure it should have no
>> impact on existing code if the feature is not used.
>>
>>
>> On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson 
>> wrote:
>>
>>> I don't have a comprehensive knowledge of the project hydrogen PRs,
>>> however I've perused them, and they make substantial modifications to
>>> Spark's core DAG scheduler code.
>>>
>>> What I'm wondering is: how high is the confidence level that the
>>> "traditional" code paths are still stable. Put another way, is it even
>>> possible to "turn off" or "opt out" of this experimental feature? This
>>> analogy isn't perfect, but for example the k8s back-end is a major body of
>>> code, but it has a very small impact on any *core* code paths, and so if
>>> you opt out of it, it is well understood that you aren't running any
>>> experimental code.
>>>
>>> Looking at the project hydrogen code, I'm less sure the same is true.
>>> However, maybe there is a clear way to show how it is true.
>>>
>>>
>>> On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra 
>>> wrote:
>>>
 No reasonable amount of time is likely going to be sufficient to fully
 vet the code as a PR. I'm not entirely happy with the design and code as
 they currently are (and I'm still trying to find the time to more publicly
 express my thoughts and concerns), but I'm fine with them going into 2.4
 much as they are as long as they go in with proper stability annotations
 and are understood not to be cast-in-stone final implementations, but
 rather as a way to get people using them and generating the feedback that
 is necessary to get us to something more like a final design and
 implementation.

 On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson 
 wrote:

>
> Barrier mode seems like a high impact feature on Spark's core code: is
> one additional week enough time to properly vet this feature?
>
> On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
> joseph.tor...@databricks.com> wrote:
>
>> Full continuous processing aggregation support ran into unanticipated
>> scalability and scheduling problems. We’re planning to overcome those by
>> using some of the barrier execution machinery, but since barrier 
>> execution
>> itself is still in progress the full support isn’t going to make it into
>> 2.4.
>>
>> Jose
>>
>> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda <
>> tomasz.gaw...@outlook.com> wrote:
>>
>>> Hi,
>>>
>>> what is the status of Continuous Processing + Aggregations? As far
>>> as I
>>> remember, Jose Torres said it should  be easy to perform
>>> aggregations if
>>> coalesce(1) work. IIRC it's already merged to master.
>>>
>>> Is this work in progress? If yes, it would be great to have full
>>> aggregation/join support in Spark 2.4 in CP.
>>>
>>> Pozdrawiam / Best regards,
>>>
>>> Tomek
>>>
>>>
>>> On 2018-07-31 10:43, Petar Zečević wrote:
>>> > This one is important to us:
>>> > https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join
>>> > inner range optimization) but I think it could be useful to others too.
>>> >
>>> > It is finished and is ready to be merged (was ready a month ago at
>>> least).
>>> >
>>> > Do you think you could consider including it in 2.4?
>>> >
>>> > Petar
>>> >
>>> >

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Xingbo Jiang
Speaking of the code from hydrogen PRs, actually we didn't remove any of
the existing logic, and I tried my best to hide almost all of the newly
added logic behind a `isBarrier` tag (or something similar). I have to add
some new variables and new methods to the core code paths, but I think they
shall not be hit if you are not running barrier workloads.

The only significant change I can think of is I swapped the sequence of
failure handling in DAGScheduler, moving the `case FetchFailed` block to
before the `case Resubmitted` block, but again I don't think this shall
affect a regular workload because anyway you can only have one failure type.
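For readers skimming the thread, the flag-gating pattern described above can
be sketched outside of Spark like this (a Python sketch, not the actual Scala
scheduler code; `handle_task_failure` and the fail-the-whole-stage behavior
are simplified for illustration):

```python
class Task:
    def __init__(self, task_id, is_barrier=False):
        self.task_id = task_id
        self.is_barrier = is_barrier

def handle_task_failure(task, running_tasks):
    # All barrier-specific handling sits behind an is_barrier check, so a
    # regular (non-barrier) task never enters the new code path.
    if task.is_barrier:
        # New path: barrier tasks must all succeed together, so one failure
        # fails every task in the stage attempt.
        return [t.task_id for t in running_tasks if t.is_barrier]
    # Old path, untouched for regular workloads: retry only the failed task.
    return [task.task_id]

regular = [Task(i) for i in range(3)]
barrier = [Task(i, is_barrier=True) for i in range(3)]

print(handle_task_failure(regular[0], regular))  # [0] -- old behavior
print(handle_task_failure(barrier[0], barrier))  # [0, 1, 2] -- whole stage
```

The point of structuring the change this way is that jobs which never set the
barrier flag exercise exactly the pre-existing branch.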

Actually I also reviewed the previous PRs adding Spark on K8s support, and
I feel it's a good example of how to add new features to a project without
breaking existing workloads, I'm trying to follow that way in adding
barrier execution mode support.

I really appreciate any notice on hydrogen PRs and welcome comments to help
improve the feature, thanks!

2018-08-01 4:19 GMT+08:00 Reynold Xin :

> I actually totally agree that we should make sure it should have no impact
> on existing code if the feature is not used.
>
>
> On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson 
> wrote:
>
>> I don't have a comprehensive knowledge of the project hydrogen PRs,
>> however I've perused them, and they make substantial modifications to
>> Spark's core DAG scheduler code.
>>
>> What I'm wondering is: how high is the confidence level that the
>> "traditional" code paths are still stable. Put another way, is it even
>> possible to "turn off" or "opt out" of this experimental feature? This
>> analogy isn't perfect, but for example the k8s back-end is a major body of
>> code, but it has a very small impact on any *core* code paths, and so if
>> you opt out of it, it is well understood that you aren't running any
>> experimental code.
>>
>> Looking at the project hydrogen code, I'm less sure the same is true.
>> However, maybe there is a clear way to show how it is true.
>>
>>
>> On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra 
>> wrote:
>>
>>> No reasonable amount of time is likely going to be sufficient to fully
>>> vet the code as a PR. I'm not entirely happy with the design and code as
>>> they currently are (and I'm still trying to find the time to more publicly
>>> express my thoughts and concerns), but I'm fine with them going into 2.4
>>> much as they are as long as they go in with proper stability annotations
>>> and are understood not to be cast-in-stone final implementations, but
>>> rather as a way to get people using them and generating the feedback that
>>> is necessary to get us to something more like a final design and
>>> implementation.
>>>
>>> On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson 
>>> wrote:
>>>

 Barrier mode seems like a high impact feature on Spark's core code: is
 one additional week enough time to properly vet this feature?

 On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
 joseph.tor...@databricks.com> wrote:

> Full continuous processing aggregation support ran into unanticipated
> scalability and scheduling problems. We’re planning to overcome those by
> using some of the barrier execution machinery, but since barrier execution
> itself is still in progress the full support isn’t going to make it into
> 2.4.
>
> Jose
>
> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda <
> tomasz.gaw...@outlook.com> wrote:
>
>> Hi,
>>
>> what is the status of Continuous Processing + Aggregations? As far as
>> I
>> remember, Jose Torres said it should  be easy to perform aggregations
>> if
>> coalesce(1) work. IIRC it's already merged to master.
>>
>> Is this work in progress? If yes, it would be great to have full
>> aggregation/join support in Spark 2.4 in CP.
>>
>> Pozdrawiam / Best regards,
>>
>> Tomek
>>
>>
>> On 2018-07-31 10:43, Petar Zečević wrote:
>> > This one is important to us:
>> > https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join
>> > inner range optimization) but I think it could be useful to others too.
>> >
>> > It is finished and is ready to be merged (was ready a month ago at
>> least).
>> >
>> > Do you think you could consider including it in 2.4?
>> >
>> > Petar
>> >
>> >
>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>> >
>> >> I went through the open JIRA tickets and here is a list that we
>> should consider for Spark 2.4:
>> >>
>> >> High Priority:
>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>> >> This one is critical to the Spark ecosystem for deep learning. It
>> only has a few remaining works and I think we should have it in Spark 
>> 2.4.
>> >>
>> >> Middle Priority:
>> >> SPARK-23899: Built-in SQL Function Improvement
>> >> We've already added a lot of built-in functions 

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-01 Thread Xiangrui Meng
Sorry for late response on Hydrogen discussions! I was traveling last week.

On Tue, Jul 31, 2018 at 1:20 PM Reynold Xin  wrote:

> I actually totally agree that we should make sure it should have no impact
> on existing code if the feature is not used.
>
>
> On Tue, Jul 31, 2018 at 1:18 PM Erik Erlandson 
> wrote:
>
>> I don't have a comprehensive knowledge of the project hydrogen PRs,
>> however I've perused them, and they make substantial modifications to
>> Spark's core DAG scheduler code.
>>
>> What I'm wondering is: how high is the confidence level that the
>> "traditional" code paths are still stable. Put another way, is it even
>> possible to "turn off" or "opt out" of this experimental feature? This
>> analogy isn't perfect, but for example the k8s back-end is a major body of
>> code, but it has a very small impact on any *core* code paths, and so if
>> you opt out of it, it is well understood that you aren't running any
>> experimental code.
>>
>> Looking at the project hydrogen code, I'm less sure the same is true.
>> However, maybe there is a clear way to show how it is true.
>>
>>
Totally agree that the barrier execution mode must not change any existing
behaviors if barriers are not used. Most code added to DAGScheduler and
TaskSetManager only applies to the barrier mode and we paid special
attention to the rest during review. That being said, I won't say the risk
is zero. We will do comprehensive QA after feature freeze and it would be
great if more community members can help.

Btw, I don't think a feature flag would help reduce the risk. This is a
brand new feature, not an alternative to an existing one. So turning it off
is basically "do not call barrier()".


>
>> On Tue, Jul 31, 2018 at 12:03 PM, Mark Hamstra 
>> wrote:
>>
>>> No reasonable amount of time is likely going to be sufficient to fully
>>> vet the code as a PR. I'm not entirely happy with the design and code as
>>> they currently are (and I'm still trying to find the time to more publicly
>>> express my thoughts and concerns), but I'm fine with them going into 2.4
>>> much as they are as long as they go in with proper stability annotations
>>> and are understood not to be cast-in-stone final implementations, but
>>> rather as a way to get people using them and generating the feedback that
>>> is necessary to get us to something more like a final design and
>>> implementation.
>>>
>>>
All barrier execution mode features will be marked experimental in 2.4. As
you mentioned, the goal is to get some usage and collect feedback so we
have a robust stable version in 3.0. Mark, it would be great if you can
provide input and help the final design. Your time would be greatly
appreciated!


> On Tue, Jul 31, 2018 at 11:54 AM Erik Erlandson 
>>> wrote:
>>>

 Barrier mode seems like a high impact feature on Spark's core code: is
 one additional week enough time to properly vet this feature?

 On Tue, Jul 31, 2018 at 7:10 AM, Joseph Torres <
 joseph.tor...@databricks.com> wrote:

> Full continuous processing aggregation support ran into unanticipated
> scalability and scheduling problems. We’re planning to overcome those by
> using some of the barrier execution machinery, but since barrier execution
> itself is still in progress the full support isn’t going to make it into
> 2.4.
>
> Jose
>
> On Tue, Jul 31, 2018 at 6:07 AM Tomasz Gawęda <
> tomasz.gaw...@outlook.com> wrote:
>
>> Hi,
>>
>> what is the status of Continuous Processing + Aggregations? As far as
>> I
>> remember, Jose Torres said it should  be easy to perform aggregations
>> if
>> coalesce(1) work. IIRC it's already merged to master.
>>
>> Is this work in progress? If yes, it would be great to have full
>> aggregation/join support in Spark 2.4 in CP.
>>
>> Pozdrawiam / Best regards,
>>
>> Tomek
>>
>>
>> On 2018-07-31 10:43, Petar Zečević wrote:
>> > This one is important to us:
>> https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join
>> inner range optimization) but I think it could be useful to others too.
>> >
>> > It is finished and is ready to be merged (was ready a month ago at
>> least).
>> >
>> > Do you think you could consider including it in 2.4?
>> >
>> > Petar
>> >
>> >
>> > Wenchen Fan @ 1970-01-01 01:00 CET:
>> >
>> >> I went through the open JIRA tickets and here is a list that we
>> should consider for Spark 2.4:
>> >>
>> >> High Priority:
>> >> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>> >> This one is critical to the Spark ecosystem for deep learning. It
>> only has a few remaining works and I think we should have it in Spark 
>> 2.4.
>> >>
>> >> Middle Priority:
>> >> SPARK-23899: Built-in SQL Function Improvement
>> >> We've already added a lot of built-in functions in this release,

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Imran Rashid
I'd like to add SPARK-24296, replicating large blocks over 2GB. It's been
up for review for a while, and would end the 2GB block limit (well...
subject to a couple of caveats on SPARK-6235).
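For context on where the 2GB limit comes from: JVM byte arrays and ByteBuffers
are indexed with a 32-bit int, so any code path that materializes a block as a
single buffer caps out at Int.MaxValue (about 2GB). Replicating a bigger block
therefore means streaming it as a sequence of smaller chunks. A rough Python
sketch of the chunking arithmetic (the 512 MB chunk size is an arbitrary
illustrative value, not what the PR uses):

```python
JVM_MAX_ARRAY_BYTES = 2**31 - 1  # Int.MaxValue: the per-buffer ceiling

def chunk_ranges(block_size, chunk_size):
    # Yield (offset, length) pairs covering a block, each at most chunk_size,
    # so no single transfer buffer needs to exceed the JVM array limit.
    offset = 0
    while offset < block_size:
        length = min(chunk_size, block_size - offset)
        yield (offset, length)
        offset += length

# A 5 GB block streamed in 512 MB chunks: ten chunks, none anywhere near 2 GB.
block = 5 * 1024**3
chunks = list(chunk_ranges(block, 512 * 1024**2))
assert all(length <= JVM_MAX_ARRAY_BYTES for _, length in chunks)
print(len(chunks))  # 10
```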

On Mon, Jul 30, 2018 at 9:01 PM, Wenchen Fan  wrote:

> I went through the open JIRA tickets and here is a list that we should
> consider for Spark 2.4:
>
> *High Priority*:
> SPARK-24374: Support Barrier Execution Mode in Apache Spark
> This one is critical to the Spark ecosystem for deep learning. It only has
> a few remaining works and I think we should have it in Spark 2.4.
>
> *Middle Priority*:
> SPARK-23899: Built-in SQL Function Improvement
> We've already added a lot of built-in functions in this release, but there
> are a few useful higher-order functions in progress, like `array_except`,
> `transform`, etc. It would be great if we can get them in Spark 2.4.
>
> SPARK-14220: Build and test Spark against Scala 2.12
> Very close to finishing, great to have it in Spark 2.4.
>
> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
> This one is there for years (thanks for your patience Michael!), and is
> also close to finishing. Great to have it in 2.4.
>
> SPARK-24882: data source v2 API improvement
> This is to improve the data source v2 API based on what we learned during
> this release. From the migration of existing sources and design of new
> features, we found some problems in the API and want to address them. I
> believe this should be the last significant API change to data source
> v2, so great to have in Spark 2.4. I'll send a discuss email about it later.
>
> SPARK-24252: Add catalog support in Data Source V2
> This is a very important feature for data source v2, and is currently
> being discussed in the dev list.
>
> SPARK-24768: Have a built-in AVRO data source implementation
> Most of it is done, but date/timestamp support is still missing. Great to
> have in 2.4.
>
> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
> This is a long-standing correctness bug, great to have in 2.4.
>
> There are some other important features like the adaptive execution,
> streaming SQL, etc., not in the list, since I think we are not able to
> finish them before 2.4.
>
> Feel free to add more things if you think they are important to Spark 2.4
> by replying to this email.
>
> Thanks,
> Wenchen
>
> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen  wrote:
>
>> In theory releases happen on a time-based cadence, so it's pretty much
>> wrap up what's ready by the code freeze and ship it. In practice, the
>> cadence slips frequently, and it's very much a negotiation about what
>> features should push the code freeze out a few weeks every time. So, kind
>> of a hybrid approach here that works OK.
>>
>> Certainly speak up if you think there's something that really needs to
>> get into 2.4. This is that discuss thread.
>>
>> (BTW I updated the page you mention just yesterday, to reflect the plan
>> suggested in this thread.)
>>
>> On Mon, Jul 30, 2018 at 9:51 AM Tom Graves 
>> wrote:
>>
>>> Shouldn't this be a discuss thread?
>>>
>>> I'm also happy to see more release managers and agree the time is
>>> getting close, but we should see what features are in progress and see how
>>> close things are and propose a date based on that.  Cutting a branch to
>>> soon just creates more work for committers to push to more branches.
>>>
>>>  http://spark.apache.org/versioning-policy.html mentioned the code
>>> freeze and release branch cut mid-august.
>>>
>>>
>>> Tom
>>>
>>>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Reynold Xin
I actually totally agree that we should make sure it should have no impact
on existing code if the feature is not used.



Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Erik Erlandson
I don't have a comprehensive knowledge of the project hydrogen PRs, however
I've perused them, and they make substantial modifications to Spark's core
DAG scheduler code.

What I'm wondering is: how high is the confidence level that the
"traditional" code paths are still stable? Put another way, is it even
possible to "turn off" or "opt out" of this experimental feature? This
analogy isn't perfect, but for example the k8s back-end is a major body of
code, but it has a very small impact on any *core* code paths, and so if
you opt out of it, it is well understood that you aren't running any
experimental code.

Looking at the project hydrogen code, I'm less sure the same is true.
However, maybe there is a clear way to show how it is true.



Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Mark Hamstra
No reasonable amount of time is likely going to be sufficient to fully vet
the code as a PR. I'm not entirely happy with the design and code as they
currently are (and I'm still trying to find the time to more publicly
express my thoughts and concerns), but I'm fine with them going into 2.4
much as they are as long as they go in with proper stability annotations
and are understood not to be cast-in-stone final implementations, but
rather as a way to get people using them and generating the feedback that
is necessary to get us to something more like a final design and
implementation.


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Erik Erlandson
Barrier mode seems like a high impact feature on Spark's core code: is one
additional week enough time to properly vet this feature?
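[For context on what "barrier mode" asks of the scheduler: all tasks in a barrier stage are launched together and can wait for each other at a synchronization point. The plain-Python sketch below approximates that semantics with `threading.Barrier`; it is an illustration of the concept only, not Spark's scheduler code or the SPARK-24374 API.]

```python
import threading

def run_barrier_stage(num_tasks):
    """Simulate a barrier stage: every task must reach the sync
    point before any task is allowed to proceed past it."""
    barrier = threading.Barrier(num_tasks)
    events = []
    lock = threading.Lock()

    def task(task_id):
        with lock:
            events.append(("start", task_id))
        barrier.wait()  # block until all tasks have arrived
        with lock:
            events.append(("resume", task_id))

    threads = [threading.Thread(target=task, args=(i,)) for i in range(num_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events

events = run_barrier_stage(4)
starts = [i for i, (kind, _) in enumerate(events) if kind == "start"]
resumes = [i for i, (kind, _) in enumerate(events) if kind == "resume"]
# Every "start" is recorded before any "resume": the barrier holds.
print(max(starts) < min(resumes))  # → True
```

The vetting concern above is that, unlike this toy, Spark must also gang-schedule the tasks and fail the whole stage together, which touches the core DAG scheduler paths.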


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Joseph Torres
Full continuous processing aggregation support ran into unanticipated
scalability and scheduling problems. We’re planning to overcome those by
using some of the barrier execution machinery, but since barrier execution
itself is still in progress the full support isn’t going to make it into
2.4.

Jose



Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Stavros Kontopoulos
I have a PR out for SPARK-14540 (Support Scala 2.12 closures and Java 8
lambdas in ClosureCleaner).
This should allow us to add support for Scala 2.12; I think we can resolve
this long-standing issue in 2.4.

Best,
Stavros



Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Tomasz Gawęda
Hi,

what is the status of Continuous Processing + Aggregations? As far as I
remember, Jose Torres said it should be easy to perform aggregations if
coalesce(1) works. IIRC it's already merged to master.

Is this work in progress? If yes, it would be great to have full 
aggregation/join support in Spark 2.4 in CP.

Pozdrawiam / Best regards,

Tomek
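
[A rough sketch of why coalesce(1) makes continuous aggregation tractable: coalescing merges partitions without a shuffle, so a single long-running task sees every row and can hold the aggregate state itself. The plain-Python model below mimics the semantics only — it is not Spark code, and `coalesce`/`running_counts` are illustrative names.]

```python
def coalesce(partitions, n):
    """Merge partitions into n buckets without a shuffle,
    preserving row order within each input partition."""
    buckets = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        buckets[i % n].extend(part)
    return buckets

def running_counts(single_partition):
    """With one partition, one task can maintain the aggregate
    state directly and emit an updated result per input row --
    the case continuous processing can support."""
    counts = {}
    out = []
    for key in single_partition:
        counts[key] = counts.get(key, 0) + 1
        out.append(dict(counts))
    return out

parts = [["a", "b"], ["a"], ["c", "a"]]
[merged] = coalesce(parts, 1)
print(running_counts(merged)[-1])  # → {'a': 3, 'b': 1, 'c': 1}
```

The scalability problem Jose mentions is visible here too: one task has to process everything, which is what the barrier machinery is meant to relax.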


On 2018-07-31 10:43, Petar Zečević wrote:
> This one is important to us: 
> https://issues.apache.org/jira/browse/SPARK-24020 (Sort-merge join inner 
> range optimization) but I think it could be useful to others too.
>
> It is finished and is ready to be merged (was ready a month ago at least).
>
> Do you think you could consider including it in 2.4?
>
> Petar
>
>
> Wenchen Fan @ 1970-01-01 01:00 CET:
>
>> I went through the open JIRA tickets and here is a list that we should 
>> consider for Spark 2.4:
>>
>> High Priority:
>> SPARK-24374: Support Barrier Execution Mode in Apache Spark
>> This one is critical to the Spark ecosystem for deep learning. It only has a 
>> few remaining work items and I think we should have it in Spark 2.4.
>>
>> Middle Priority:
>> SPARK-23899: Built-in SQL Function Improvement
>> We've already added a lot of built-in functions in this release, but there 
>> are a few useful higher-order functions in progress, like `array_except`, 
>> `transform`, etc. It would be great if we can get them in Spark 2.4.
>>
>> SPARK-14220: Build and test Spark against Scala 2.12
>> Very close to finishing, great to have it in Spark 2.4.
>>
>> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
>> This one is there for years (thanks for your patience Michael!), and is also 
>> close to finishing. Great to have it in 2.4.
>>
>> SPARK-24882: data source v2 API improvement
>> This is to improve the data source v2 API based on what we learned during 
>> this release. From the migration of existing sources and design of new 
>> features, we found some problems in the API and want to address them. I 
>> believe this should be
>> the last significant API change to data source v2, so great to have in Spark 
>> 2.4. I'll send a discuss email about it later.
>>
>> SPARK-24252: Add catalog support in Data Source V2
>> This is a very important feature for data source v2, and is currently being 
>> discussed in the dev list.
>>
>> SPARK-24768: Have a built-in AVRO data source implementation
>> Most of it is done, but date/timestamp support is still missing. Great to 
>> have in 2.4.
>>
>> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
>> This is a long-standing correctness bug, great to have in 2.4.
>>
>> There are some other important features like the adaptive execution, 
>> streaming SQL, etc., not in the list, since I think we are not able to 
>> finish them before 2.4.
>>
>> Feel free to add more things if you think they are important to Spark 2.4 by 
>> replying to this email.
>>
>> Thanks,
>> Wenchen
>>
>> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen  wrote:
>>
>>   In theory releases happen on a time-based cadence, so it's pretty much 
>> wrap up what's ready by the code freeze and ship it. In practice, the 
>> cadence slips frequently, and it's very much a negotiation about what 
>> features should push the
>>   code freeze out a few weeks every time. So, kind of a hybrid approach here 
>> that works OK.
>>
>>   Certainly speak up if you think there's something that really needs to get 
>> into 2.4. This is that discuss thread.
>>
>>   (BTW I updated the page you mention just yesterday, to reflect the plan 
>> suggested in this thread.)
>>
>>   On Mon, Jul 30, 2018 at 9:51 AM Tom Graves  
>> wrote:
>>
>>   Shouldn't this be a discuss thread?
>>
>>   I'm also happy to see more release managers and agree the time is getting 
>> close, but we should see what features are in progress and see how close 
>> things are and propose a date based on that.  Cutting a branch too soon just 
>> creates
>>   more work for committers to push to more branches.
>>
>>http://spark.apache.org/versioning-policy.html mentioned the code freeze 
>> and release branch cut mid-August.
>>
>>   Tom
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>



Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Petar Zečević


This one is important to us: https://issues.apache.org/jira/browse/SPARK-24020 
(Sort-merge join inner range optimization), but I think it could be useful to 
others too. 
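For readers who haven't followed the JIRA, the core idea can be sketched in a
few lines (my own rough illustration with made-up inputs, not the actual
SPARK-24020 patch): when both join sides are sorted on the join value, a band
condition such as a - d <= b <= a + d can be checked with a sliding window over
the right side instead of re-scanning it for every left row.

```python
# Illustrative sketch only (not the SPARK-24020 implementation): an inner
# join with a range ("band") condition over two inputs that are already
# sorted on the join value.

def band_join(left, right, d):
    """Emit pairs (a, b) with |a - b| <= d; both inputs must be sorted."""
    out, start = [], 0
    for a in left:                  # left side is scanned once, in order
        while start < len(right) and right[start] < a - d:
            start += 1              # the window start never moves backwards
        j = start
        while j < len(right) and right[j] <= a + d:
            out.append((a, right[j]))
            j += 1
    return out

print(band_join([1, 5, 10], [2, 4, 11], 2))  # [(1, 2), (5, 4), (10, 11)]
```

Because the window start only ever advances, the merge stays linear in the
input sizes plus the output size, which is the point of the optimization.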

It is finished and is ready to be merged (was ready a month ago at least).

Do you think you could consider including it in 2.4?

Petar


Wenchen Fan @ 2018-07-31 04:01 GMT+02:00:

> I went through the open JIRA tickets and here is a list that we should 
> consider for Spark 2.4:
>
> High Priority:
> SPARK-24374: Support Barrier Execution Mode in Apache Spark
> This one is critical to the Spark ecosystem for deep learning. It only has a 
> few remaining works and I think we should have it in Spark 2.4.
>
> Middle Priority:
> SPARK-23899: Built-in SQL Function Improvement
> We've already added a lot of built-in functions in this release, but there 
> are a few useful higher-order functions in progress, like `array_except`, 
> `transform`, etc. It would be great if we can get them in Spark 2.4.
>
> SPARK-14220: Build and test Spark against Scala 2.12
> Very close to finishing, great to have it in Spark 2.4.
>
> SPARK-4502: Spark SQL reads unnecessary nested fields from Parquet
> This one is there for years (thanks for your patience Michael!), and is also 
> close to finishing. Great to have it in 2.4.
>
> SPARK-24882: data source v2 API improvement
> This is to improve the data source v2 API based on what we learned during 
> this release. From the migration of existing sources and design of new 
> features, we found some problems in the API and want to address them. I 
> believe this should be
> the last significant API change to data source v2, so great to have in Spark 
> 2.4. I'll send a discuss email about it later.
>
> SPARK-24252: Add catalog support in Data Source V2
> This is a very important feature for data source v2, and is currently being 
> discussed in the dev list.
>
> SPARK-24768: Have a built-in AVRO data source implementation
> Most of it is done, but date/timestamp support is still missing. Great to 
> have in 2.4.
>
> SPARK-23243: Shuffle+Repartition on an RDD could lead to incorrect answers
> This is a long-standing correctness bug, great to have in 2.4.
>
> There are some other important features like the adaptive execution, 
> streaming SQL, etc., not in the list, since I think we are not able to finish 
> them before 2.4.
>
> Feel free to add more things if you think they are important to Spark 2.4 by 
> replying to this email.
>
> Thanks,
> Wenchen
>
> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen  wrote:
>
>  In theory releases happen on a time-based cadence, so it's pretty much wrap 
> up what's ready by the code freeze and ship it. In practice, the cadence 
> slips frequently, and it's very much a negotiation about what features should 
> push the
>  code freeze out a few weeks every time. So, kind of a hybrid approach here 
> that works OK. 
>
>  Certainly speak up if you think there's something that really needs to get 
> into 2.4. This is that discuss thread.
>
>  (BTW I updated the page you mention just yesterday, to reflect the plan 
> suggested in this thread.)
>
>  On Mon, Jul 30, 2018 at 9:51 AM Tom Graves  
> wrote:
>
>  Shouldn't this be a discuss thread?  
>
>  I'm also happy to see more release managers and agree the time is getting 
> close, but we should see what features are in progress and see how close 
> things are and propose a date based on that.  Cutting a branch too soon just 
> creates
>  more work for committers to push to more branches. 
>
>   http://spark.apache.org/versioning-policy.html mentioned the code freeze 
> and release branch cut mid-August.
>
>  Tom





Re: code freeze and branch cut for Apache Spark 2.4

2018-07-31 Thread Marco Gaido
Hi Wenchen,

I think it would be great to consider also
 - SPARK-24598: Datatype overflow conditions gives incorrect result

It is a correctness bug, so it seems worth including. What do you think?
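For anyone skimming: the class of bug here is silent wraparound in fixed-width
arithmetic. A toy illustration (mine, not Spark code) of Java-style 64-bit
two's-complement addition:

```python
# Hypothetical illustration of the bug class: arithmetic on a fixed-width
# 64-bit type silently wraps instead of raising an error or widening.

LONG_MIN, LONG_MAX = -2**63, 2**63 - 1

def add_long(a, b):
    """Add two values with Java-style 64-bit two's-complement wraparound."""
    result = (a + b) & (2**64 - 1)   # keep only the low 64 bits
    if result > LONG_MAX:            # reinterpret the bit pattern as signed
        result -= 2**64
    return result

print(add_long(LONG_MAX, 1))  # -9223372036854775808 : wraps around to LONG_MIN
print(LONG_MAX + 1)           # Python itself widens: 9223372036854775808
```

The "incorrect result" in the ticket is this kind of wrapped value flowing
through a query unnoticed instead of being reported.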

Thanks,
Marco

2018-07-31 4:01 GMT+02:00 Wenchen Fan :

> I went through the open JIRA tickets and here is a list that we should
> consider for Spark 2.4:
>
> *High Priority*:
> SPARK-24374 : Support
> Barrier Execution Mode in Apache Spark
> This one is critical to the Spark ecosystem for deep learning. It only has
> a few remaining works and I think we should have it in Spark 2.4.
>
> *Middle Priority*:
> SPARK-23899 : Built-in
> SQL Function Improvement
> We've already added a lot of built-in functions in this release, but there
> are a few useful higher-order functions in progress, like `array_except`,
> `transform`, etc. It would be great if we can get them in Spark 2.4.
>
> SPARK-14220 : Build
> and test Spark against Scala 2.12
> Very close to finishing, great to have it in Spark 2.4.
>
> SPARK-4502 : Spark SQL
> reads unnecessary nested fields from Parquet
> This one is there for years (thanks for your patience Michael!), and is
> also close to finishing. Great to have it in 2.4.
>
> SPARK-24882 : data
> source v2 API improvement
> This is to improve the data source v2 API based on what we learned during
> this release. From the migration of existing sources and design of new
> features, we found some problems in the API and want to address them. I
> believe this should be the last significant API change to data source
> v2, so great to have in Spark 2.4. I'll send a discuss email about it later.
>
> SPARK-24252 : Add
> catalog support in Data Source V2
> This is a very important feature for data source v2, and is currently
> being discussed in the dev list.
>
> SPARK-24768 : Have a
> built-in AVRO data source implementation
> Most of it is done, but date/timestamp support is still missing. Great to
> have in 2.4.
>
> SPARK-23243 :
> Shuffle+Repartition on an RDD could lead to incorrect answers
> This is a long-standing correctness bug, great to have in 2.4.
>
> There are some other important features like the adaptive execution,
> streaming SQL, etc., not in the list, since I think we are not able to
> finish them before 2.4.
>
> Feel free to add more things if you think they are important to Spark 2.4
> by replying to this email.
>
> Thanks,
> Wenchen
>
> On Mon, Jul 30, 2018 at 11:00 PM Sean Owen  wrote:
>
>> In theory releases happen on a time-based cadence, so it's pretty much
>> wrap up what's ready by the code freeze and ship it. In practice, the
>> cadence slips frequently, and it's very much a negotiation about what
>> features should push the code freeze out a few weeks every time. So, kind
>> of a hybrid approach here that works OK.
>>
>> Certainly speak up if you think there's something that really needs to
>> get into 2.4. This is that discuss thread.
>>
>> (BTW I updated the page you mention just yesterday, to reflect the plan
>> suggested in this thread.)
>>
>> On Mon, Jul 30, 2018 at 9:51 AM Tom Graves 
>> wrote:
>>
>>> Shouldn't this be a discuss thread?
>>>
>>> I'm also happy to see more release managers and agree the time is
>>> getting close, but we should see what features are in progress and see how
>>> close things are and propose a date based on that.  Cutting a branch too
>>> soon just creates more work for committers to push to more branches.
>>>
>>>  http://spark.apache.org/versioning-policy.html mentioned the code
>>> freeze and release branch cut mid-August.
>>>
>>>
>>> Tom
>>>
>>>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-30 Thread Wenchen Fan
I went through the open JIRA tickets and here is a list that we should
consider for Spark 2.4:

*High Priority*:
SPARK-24374: Support
Barrier Execution Mode in Apache Spark
This one is critical to the Spark ecosystem for deep learning. It only has
a few remaining work items, and I think we should have it in Spark 2.4.

*Medium Priority*:
SPARK-23899: Built-in
SQL Function Improvement
We've already added a lot of built-in functions in this release, but there
are a few useful higher-order functions in progress, like `array_except`,
`transform`, etc. It would be great if we can get them in Spark 2.4.
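For context on what these functions do, here is a pure-Python sketch of their
documented semantics (illustrative only, not Spark's implementation; note that
`array_except` removes duplicates, and Spark does not promise a particular
output order, whereas this sketch keeps first-seen order):

```python
def array_except(a, b):
    """Elements of a that are not in b, deduplicated, in first-seen order.
    Mirrors the intended SQL semantics; illustrative only."""
    seen, out = set(b), []
    for x in a:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def transform(arr, fn):
    """Apply fn to each element, like the SQL higher-order function."""
    return [fn(x) for x in arr]

print(array_except([1, 2, 2, 3], [2, 4]))     # [1, 3]
print(transform([1, 2, 3], lambda x: x + 1))  # [2, 3, 4]
```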

SPARK-14220: Build and
test Spark against Scala 2.12
Very close to finishing, great to have it in Spark 2.4.

SPARK-4502: Spark SQL
reads unnecessary nested fields from Parquet
This one is there for years (thanks for your patience Michael!), and is
also close to finishing. Great to have it in 2.4.
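The idea behind the nested-field pruning, roughly (a toy sketch with a
hypothetical `prune` helper and a made-up record, not Spark's implementation):
given the dotted field paths a query actually touches, keep only those branches
of a nested record so a columnar reader can skip everything else.

```python
# Toy sketch of nested-field pruning. Names and record shape are made up.

def prune(record, paths):
    """Keep only the fields named by dotted paths like "contact.name"."""
    out = {}
    for path in paths:
        head, _, rest = path.partition(".")
        if head not in record:
            continue
        if rest:  # descend into the nested struct and keep only that branch
            sub = prune(record[head], [rest])
            out.setdefault(head, {}).update(sub)
        else:     # leaf field: keep it whole
            out[head] = record[head]
    return out

row = {"contact": {"name": "Ada", "phones": ["555-0100"]}, "id": 7}
print(prune(row, ["contact.name"]))  # {'contact': {'name': 'Ada'}}
```

In Parquet the win is that the untouched sibling columns (here,
`contact.phones`) never need to be decoded at all.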

SPARK-24882: data
source v2 API improvement
This is to improve the data source v2 API based on what we learned during
this release. From the migration of existing sources and design of new
features, we found some problems in the API and want to address them. I
believe this should be the last significant API change to data source
v2, so great to have in Spark 2.4. I'll send a discuss email about it later.

SPARK-24252: Add
catalog support in Data Source V2
This is a very important feature for data source v2, and is currently being
discussed in the dev list.

SPARK-24768: Have a
built-in AVRO data source implementation
Most of it is done, but date/timestamp support is still missing. Great to
have in 2.4.

SPARK-23243:
Shuffle+Repartition on an RDD could lead to incorrect answers
This is a long-standing correctness bug, great to have in 2.4.
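For those wondering why this is a correctness bug rather than a performance
issue, here is a minimal simulation (not Spark code; the input orders are made
up): round-robin repartitioning assigns records to partitions by their position
in the input, so if a retried task sees its shuffle input in a different order,
a partial retry can combine partitions from two inconsistent runs, duplicating
some records and silently dropping others.

```python
# Simulation (not Spark code) of the SPARK-23243 failure mode.

def round_robin_partition(records, num_partitions):
    """Assign each record to a partition purely by its position in the input."""
    parts = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        parts[i % num_partitions].append(rec)
    return parts

data = [1, 2, 3, 4, 5, 6]
first_run = round_robin_partition(data, 2)
# A retried task may see the same records in a different, shuffle-dependent order.
retry_run = round_robin_partition([2, 1, 3, 4, 6, 5], 2)

print(first_run)  # [[1, 3, 5], [2, 4, 6]]
print(retry_run)  # [[2, 3, 6], [1, 4, 5]]

# Mixing partition 0 of the first run with partition 1 of the retry
# duplicates record 1 and loses records 2 and 6 entirely.
mixed = sorted(first_run[0] + retry_run[1])
print(mixed)  # [1, 1, 3, 4, 5, 5] -- not the original data
```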

There are some other important features like the adaptive execution,
streaming SQL, etc., not in the list, since I think we are not able to
finish them before 2.4.

Feel free to add more things if you think they are important to Spark 2.4
by replying to this email.

Thanks,
Wenchen

On Mon, Jul 30, 2018 at 11:00 PM Sean Owen  wrote:

> In theory releases happen on a time-based cadence, so it's pretty much
> wrap up what's ready by the code freeze and ship it. In practice, the
> cadence slips frequently, and it's very much a negotiation about what
> features should push the code freeze out a few weeks every time. So, kind
> of a hybrid approach here that works OK.
>
> Certainly speak up if you think there's something that really needs to get
> into 2.4. This is that discuss thread.
>
> (BTW I updated the page you mention just yesterday, to reflect the plan
> suggested in this thread.)
>
> On Mon, Jul 30, 2018 at 9:51 AM Tom Graves 
> wrote:
>
>> Shouldn't this be a discuss thread?
>>
>> I'm also happy to see more release managers and agree the time is getting
>> close, but we should see what features are in progress and see how close
>> things are and propose a date based on that.  Cutting a branch too soon just
>> creates more work for committers to push to more branches.
>>
>>  http://spark.apache.org/versioning-policy.html mentioned the code
>> freeze and release branch cut mid-August.
>>
>>
>> Tom
>>
>>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-30 Thread Sean Owen
In theory releases happen on a time-based cadence, so it's pretty much wrap
up what's ready by the code freeze and ship it. In practice, the cadence
slips frequently, and it's very much a negotiation about what features
should push the code freeze out a few weeks every time. So, kind of a
hybrid approach here that works OK.

Certainly speak up if you think there's something that really needs to get
into 2.4. This is that discuss thread.

(BTW I updated the page you mention just yesterday, to reflect the plan
suggested in this thread.)

On Mon, Jul 30, 2018 at 9:51 AM Tom Graves 
wrote:

> Shouldn't this be a discuss thread?
>
> I'm also happy to see more release managers and agree the time is getting
> close, but we should see what features are in progress and see how close
> things are and propose a date based on that.  Cutting a branch too soon just
> creates more work for committers to push to more branches.
>
>  http://spark.apache.org/versioning-policy.html mentioned the code freeze
> and release branch cut mid-August.
>
>
> Tom
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-30 Thread Tom Graves
Shouldn't this be a discuss thread?

I'm also happy to see more release managers and agree the time is getting
close, but we should see what features are in progress, see how close things
are, and propose a date based on that. Cutting a branch too soon just creates
more work for committers to push to more branches.

http://spark.apache.org/versioning-policy.html mentioned the code freeze and
release branch cut mid-August.

Tom
On Friday, July 6, 2018, 11:47:35 AM CDT, Reynold Xin wrote:

 FYI 6 mo is coming up soon since the last release. We will cut the branch and
 code freeze on Aug 1st in order to get 2.4 out on time.

Re: code freeze and branch cut for Apache Spark 2.4

2018-07-29 Thread Holden Karau
I’m excited to have more folks rotate through release manager :)

On Sun, Jul 29, 2018 at 3:57 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> +1. That would be great!
>
> Thanks,
> Stavros
>
> On Sun, Jul 29, 2018 at 5:05 PM, Wenchen Fan  wrote:
>
>> If no one objects, how about we make the code freeze one week later (Aug
>> 8th)?
>>
>> BTW I'd like to volunteer to serve as the release manager for Spark 2.4.
>> I'm familiar with most of the major features targeted for the 2.4 release.
>> I also have a lot of free time during this release timeframe and should be
>> able to figure out problems that may appear during the release.
>>
>> Thanks,
>> Wenchen
>>
>> On Fri, Jul 27, 2018 at 11:27 PM Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> Extending code freeze date would be great for me too, I am working on a
>>> PR for supporting scala 2.12, I am close but need some more time.
>>> We could get it into 2.4.
>>>
>>> Stavros
>>>
>>> On Fri, Jul 27, 2018 at 9:27 AM, Wenchen Fan 
>>> wrote:
>>>
 This seems fine to me.

 BTW Ryan Blue and I are working on some data source v2 stuff and
 hopefully we can get more things done with one more week.

 Thanks,
 Wenchen

 On Thu, Jul 26, 2018 at 1:14 PM Xingbo Jiang 
 wrote:

> Xiangrui and I are leading an effort to implement a highly desirable
> feature, Barrier Execution Mode.
> https://issues.apache.org/jira/browse/SPARK-24374. This introduces a
> new scheduling model to Apache Spark so users can properly embed
> distributed DL training as a Spark stage to simplify the distributed
> training workflow. The prototype has been demoed in the Spark Summit
> keynote. This new feature got very positive feedback from the whole
> community. The design doc and pull requests got more comments than we
> initially anticipated. We want to finish this feature in the upcoming
> release, Spark 2.4. Would it be possible to have an extension of code
> freeze for a week?
>
> Thanks,
>
> Xingbo
>
> 2018-07-07 0:47 GMT+08:00 Reynold Xin :
>
>> FYI 6 mo is coming up soon since the last release. We will cut the
>> branch and code freeze on Aug 1st in order to get 2.4 out on time.
>>
>>
>
>>>
>>>
>>> --
>>> Stavros Kontopoulos
>>>
>>> *Senior Software Engineer*
>>> *Lightbend, Inc.*
>>>
>>> *p: +30 6977967274*
>>> *e: stavros.kontopou...@lightbend.com* 
>>>
>>>
>>>
>
>
> --
> Stavros Kontopoulos
>
> *Senior Software Engineer*
> *Lightbend, Inc.*
>
> *p: +30 6977967274*
> *e: stavros.kontopou...@lightbend.com* 
>
>
> --
Twitter: https://twitter.com/holdenkarau


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-29 Thread Stavros Kontopoulos
+1. That would be great!

Thanks,
Stavros

On Sun, Jul 29, 2018 at 5:05 PM, Wenchen Fan  wrote:

> If no one objects, how about we make the code freeze one week later (Aug
> 8th)?
>
> BTW I'd like to volunteer to serve as the release manager for Spark 2.4.
> I'm familiar with most of the major features targeted for the 2.4 release.
> I also have a lot of free time during this release timeframe and should be
> able to figure out problems that may appear during the release.
>
> Thanks,
> Wenchen
>
> On Fri, Jul 27, 2018 at 11:27 PM Stavros Kontopoulos <stavros.kontopou...@lightbend.com> wrote:
>
>> Extending code freeze date would be great for me too, I am working on a
>> PR for supporting scala 2.12, I am close but need some more time.
>> We could get it into 2.4.
>>
>> Stavros
>>
>> On Fri, Jul 27, 2018 at 9:27 AM, Wenchen Fan  wrote:
>>
>>> This seems fine to me.
>>>
>>> BTW Ryan Blue and I are working on some data source v2 stuff and
>>> hopefully we can get more things done with one more week.
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> On Thu, Jul 26, 2018 at 1:14 PM Xingbo Jiang 
>>> wrote:
>>>
 Xiangrui and I are leading an effort to implement a highly desirable
 feature, Barrier Execution Mode.
 https://issues.apache.org/jira/browse/SPARK-24374. This introduces a new scheduling model to
 Apache Spark so users can properly embed distributed DL training as a Spark
 stage to simplify the distributed training workflow. The prototype has been
 demoed in the Spark Summit keynote. This new feature got very positive
 feedback from the whole community. The design doc and pull requests got
 more comments than we initially anticipated. We want to finish this feature
 in the upcoming release, Spark 2.4. Would it be possible to have an
 extension of code freeze for a week?

 Thanks,

 Xingbo

 2018-07-07 0:47 GMT+08:00 Reynold Xin :

> FYI 6 mo is coming up soon since the last release. We will cut the
> branch and code freeze on Aug 1st in order to get 2.4 out on time.
>
>

>>
>>
>> --
>> Stavros Kontopoulos
>>
>> *Senior Software Engineer*
>> *Lightbend, Inc.*
>>
>> *p: +30 6977967274*
>> *e: stavros.kontopou...@lightbend.com* 
>>
>>
>>


-- 
Stavros Kontopoulos

*Senior Software Engineer*
*Lightbend, Inc.*

*p: +30 6977967274*
*e: stavros.kontopou...@lightbend.com* 


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-29 Thread Wenchen Fan
If no one objects, how about we make the code freeze one week later (Aug
8th)?

BTW I'd like to volunteer to serve as the release manager for Spark 2.4.
I'm familiar with most of the major features targeted for the 2.4 release.
I also have a lot of free time during this release timeframe and should be
able to figure out problems that may appear during the release.

Thanks,
Wenchen

On Fri, Jul 27, 2018 at 11:27 PM Stavros Kontopoulos <
stavros.kontopou...@lightbend.com> wrote:

> Extending code freeze date would be great for me too, I am working on a PR
> for supporting scala 2.12, I am close but need some more time.
> We could get it into 2.4.
>
> Stavros
>
> On Fri, Jul 27, 2018 at 9:27 AM, Wenchen Fan  wrote:
>
>> This seems fine to me.
>>
>> BTW Ryan Blue and I are working on some data source v2 stuff and
>> hopefully we can get more things done with one more week.
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Jul 26, 2018 at 1:14 PM Xingbo Jiang 
>> wrote:
>>
>>> Xiangrui and I are leading an effort to implement a highly desirable
>>> feature, Barrier Execution Mode.
>>> https://issues.apache.org/jira/browse/SPARK-24374. This introduces a
>>> new scheduling model to Apache Spark so users can properly embed
>>> distributed DL training as a Spark stage to simplify the distributed
>>> training workflow. The prototype has been demoed in the Spark Summit
>>> keynote. This new feature got very positive feedback from the whole
>>> community. The design doc and pull requests got more comments than we
>>> initially anticipated. We want to finish this feature in the upcoming
>>> release, Spark 2.4. Would it be possible to have an extension of code
>>> freeze for a week?
>>>
>>> Thanks,
>>>
>>> Xingbo
>>>
>>> 2018-07-07 0:47 GMT+08:00 Reynold Xin :
>>>
 FYI 6 mo is coming up soon since the last release. We will cut the
 branch and code freeze on Aug 1st in order to get 2.4 out on time.


>>>
>
>
> --
> Stavros Kontopoulos
>
> *Senior Software Engineer*
> *Lightbend, Inc.*
>
> *p: +30 6977967274*
> *e: stavros.kontopou...@lightbend.com* 
>
>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-27 Thread Stavros Kontopoulos
Extending the code freeze date would be great for me too; I am working on a PR
for supporting Scala 2.12. I am close but need some more time.
We could get it into 2.4.

Stavros

On Fri, Jul 27, 2018 at 9:27 AM, Wenchen Fan  wrote:

> This seems fine to me.
>
> BTW Ryan Blue and I are working on some data source v2 stuff and hopefully
> we can get more things done with one more week.
>
> Thanks,
> Wenchen
>
> On Thu, Jul 26, 2018 at 1:14 PM Xingbo Jiang 
> wrote:
>
>> Xiangrui and I are leading an effort to implement a highly desirable
>> feature, Barrier Execution Mode.
>> https://issues.apache.org/jira/browse/SPARK-24374. This introduces a new scheduling model to
>> Apache Spark so users can properly embed distributed DL training as a Spark
>> stage to simplify the distributed training workflow. The prototype has been
>> demoed in the Spark Summit keynote. This new feature got very positive
>> feedback from the whole community. The design doc and pull requests got
>> more comments than we initially anticipated. We want to finish this feature
>> in the upcoming release, Spark 2.4. Would it be possible to have an
>> extension of code freeze for a week?
>>
>> Thanks,
>>
>> Xingbo
>>
>> 2018-07-07 0:47 GMT+08:00 Reynold Xin :
>>
>>> FYI 6 mo is coming up soon since the last release. We will cut the
>>> branch and code freeze on Aug 1st in order to get 2.4 out on time.
>>>
>>>
>>


-- 
Stavros Kontopoulos

*Senior Software Engineer*
*Lightbend, Inc.*

*p: +30 6977967274*
*e: stavros.kontopou...@lightbend.com* 


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-27 Thread Wenchen Fan
This seems fine to me.

BTW Ryan Blue and I are working on some data source v2 stuff and hopefully
we can get more things done with one more week.

Thanks,
Wenchen

On Thu, Jul 26, 2018 at 1:14 PM Xingbo Jiang  wrote:

> Xiangrui and I are leading an effort to implement a highly desirable
> feature, Barrier Execution Mode.
> https://issues.apache.org/jira/browse/SPARK-24374. This introduces a new
> scheduling model to Apache Spark so users can properly embed distributed DL
> training as a Spark stage to simplify the distributed training workflow.
> The prototype has been demoed in the Spark Summit keynote. This new feature
> got very positive feedback from the whole community. The design doc and
> pull requests got more comments than we initially anticipated. We want to
> finish this feature in the upcoming release, Spark 2.4. Would it be
> possible to have an extension of code freeze for a week?
>
> Thanks,
>
> Xingbo
>
> 2018-07-07 0:47 GMT+08:00 Reynold Xin :
>
>> FYI 6 mo is coming up soon since the last release. We will cut the branch
>> and code freeze on Aug 1st in order to get 2.4 out on time.
>>
>>
>


Re: code freeze and branch cut for Apache Spark 2.4

2018-07-25 Thread Xingbo Jiang
Xiangrui and I are leading an effort to implement a highly desirable
feature, Barrier Execution Mode.
https://issues.apache.org/jira/browse/SPARK-24374. This introduces a new
scheduling model to Apache Spark so users can properly embed distributed DL
training as a Spark stage to simplify the distributed training workflow.
The prototype has been demoed in the Spark Summit keynote. This new feature
got very positive feedback from the whole community. The design doc and
pull requests got more comments than we initially anticipated. We want to
finish this feature in the upcoming release, Spark 2.4. Would it be
possible to have an extension of code freeze for a week?
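The gang-synchronization idea can be sketched with Python's threading.Barrier
(an analogy of mine, not Spark's actual barrier API): no task enters the
training phase until every task in the stage has finished its setup phase.

```python
import threading

# Analogy only (not Spark's API): every task blocks at the barrier until all
# peers in the stage arrive, mimicking the synchronization step distributed
# training needs before workers start exchanging gradients.
NUM_TASKS = 4
barrier = threading.Barrier(NUM_TASKS)
order = []  # appended to under the GIL; good enough for this sketch

def task(i):
    order.append(("setup", i))  # phase 1: local setup
    barrier.wait()              # nobody proceeds until all NUM_TASKS arrive
    order.append(("train", i))  # phase 2: starts only once everyone is ready

threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All "setup" records necessarily precede all "train" records.
assert all(phase == "setup" for phase, _ in order[:NUM_TASKS])
print("all tasks synchronized at the barrier")
```

The scheduling-side point of SPARK-24374 is the part this sketch glosses over:
all such tasks must be launched together (or not at all), rather than whenever
individual slots free up.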

Thanks,

Xingbo

2018-07-07 0:47 GMT+08:00 Reynold Xin :

> FYI 6 mo is coming up soon since the last release. We will cut the branch
> and code freeze on Aug 1st in order to get 2.4 out on time.
>
>


code freeze and branch cut for Apache Spark 2.4

2018-07-06 Thread Reynold Xin
FYI 6 mo is coming up soon since the last release. We will cut the branch
and code freeze on Aug 1st in order to get 2.4 out on time.