Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-09 Thread Jungtaek Lim
Thanks for sharing the blockers, Wenchen. SPARK-31404 has sub-tasks; do I
understand correctly that all of its sub-tasks are blockers for this
release?

Xiao, I sincerely respect the practices the Spark community has
established, so please treat this as my 2 cents. I would just like to see
how the community can stay focused on such a huge release: counting only
bugs, improvements, and new features, nearly 2000 issues have been resolved
in Spark 3.0.0 alone. That volume is quite different from the usual bugfix
and minor releases, which makes me feel special care is needed.


On Fri, Apr 10, 2020 at 1:22 PM Wenchen Fan  wrote:

> The ongoing critical issues I'm aware of are:
> SPARK-31257: Fix ambiguous two different CREATE TABLE syntaxes
> SPARK-31404: backward compatibility issues after switching to the
> Proleptic Gregorian calendar
> SPARK-31399: closure cleaner is broken in Spark 3.0
> SPARK-28067: Incorrect results in decimal aggregation with whole-stage
> codegen enabled
>
> That said, I'm -1 (binding) on RC1.
>
> Please reply to this thread if you know of more critical issues that
> should be fixed before 3.0.
>
> Thanks,
> Wenchen
>
>
> On Fri, Apr 10, 2020 at 10:01 AM Xiao Li  wrote:
>
>> Only low-risk or high-value bug fixes and documentation changes are
>> allowed to be merged to branch-3.0. I expect all committers to follow the
>> same rules as in previous releases.
>>
>> Xiao
>>
>> On Thu, Apr 9, 2020 at 6:13 PM Jungtaek Lim 
>> wrote:
>>
>>> Looks like around 80 commits have landed on branch-3.0 since we cut RC1
>>> (I know many of them are for versioning the configs and adding docs).
>>> Shall we announce a blocker-only phase and maintain a list of blockers to
>>> restrict changes on the branch? This churn makes everyone hesitant to test
>>> RC1 (see how many people have tested RC1 in this thread), as they will
>>> probably need to repeat the same tests on RC2.
>>>
>>> On Thu, Apr 9, 2020 at 5:50 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 I went through some manual tests of the new Structured Streaming features
 in Spark 3.0.0. (Please let me know if there are more features we'd like
 to test manually.)

 * file source cleanup - both "archive" and "delete" work. The query fails
 as expected when the input directory is the output directory of a file sink.
 * kafka source/sink - "header" works for both source and sink, "group id
 prefix" and "static group id" work, and start offset by timestamp is
 confirmed to work for the streaming case.
 * event log support for streaming queries - enabled it, confirmed that
 compaction works, that SHS can read compacted event logs, and that
 downloading an event log in SHS works by zipping the event log directory.
 The original functionality with a single event log file works as well.

 Looks good, though there are still plenty of commits being pushed to
 branch-3.0 after RC1, which makes me feel it may not be safe to carry the
 test results for RC1 over to RC2.

 On Sat, Apr 4, 2020 at 12:49 AM Sean Owen  wrote:

> Aside from the other issues mentioned here, which probably do require
> another RC, this looks pretty good to me.
>
> I built on Ubuntu 19 and ran with Java 11, -Pspark-ganglia-lgpl
> -Pkinesis-asl -Phadoop-3.2 -Phive-2.3 -Pyarn -Pmesos -Pkubernetes
> -Phive-thriftserver -Djava.version=11
>
> I did see the following test failures, but as usual, I'm not sure
> whether it's specific to me. Anyone else see these, particularly the R
> warnings?
>
>
> PythonUDFSuite:
> org.apache.spark.sql.execution.python.PythonUDFSuite *** ABORTED ***
>   java.lang.RuntimeException: Unable to load a Suite class that was
> discovered in the runpath:
> org.apache.spark.sql.execution.python.PythonUDFSuite
>   at
> org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:81)
>   at
> org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38)
>   at
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>
>
> - SPARK-25158: Executor accidentally exit because
> ScriptTransformationWriterThread throw 

Re: DSv2 & DataSourceRegister

2020-04-09 Thread Andrew Melo
Hi all,

I've opened a WIP PR here: https://github.com/apache/spark/pull/28159
I'm a novice at Scala, so I'm sure the code isn't idiomatic, but it
behaves functionally as I'd expect. I've added unit tests to the PR,
but if you would like to verify the intended functionality, I've
uploaded a fat jar with my data source to
http://mirror.accre.vanderbilt.edu/spark/laurelin-both.jar and an
example input file to
https://github.com/spark-root/laurelin/raw/master/testdata/stdvector.root.
The following in spark-shell successfully chooses the proper plugin
implementation based on the Spark version:

spark.read.format("root").option("tree", "tvec").load("stdvector.root")

Additionally, I did a very rough POC for Spark 2.4, which you can find at
https://github.com/PerilousApricot/spark/tree/feature/registerv2-24.
The same jar/input file works there as well.
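
(In case it helps review, here is roughly the shape of the dispatch I have
in mind. This is only an illustrative sketch; the trait, class, and package
names below are made up and may not match what's actually in the PR:)

// Illustrative sketch only -- names are hypothetical, not the PR's API.
// The idea: a stable registration trait that never references the evolving
// DSv2 types directly, and instead defers to a version-specific
// implementation class chosen at runtime.
trait DataSourceRegisterV2 {
  def shortName(): String
  // the concrete data source class for the running Spark version
  def getImplementation(): Class[_]
}

class RootRegister extends DataSourceRegisterV2 {
  override def shortName(): String = "root"
  override def getImplementation(): Class[_] =
    if (org.apache.spark.SPARK_VERSION.startsWith("2."))
      Class.forName("laurelin.Root_v24") // hypothetical class names
    else
      Class.forName("laurelin.Root_v30")
}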

Thanks again,
Andrew

On Wed, Apr 8, 2020 at 10:27 AM Andrew Melo  wrote:
>
> On Wed, Apr 8, 2020 at 8:35 AM Wenchen Fan  wrote:
> >
> > It would be good to support your use case, but I'm not sure how to
> > accomplish that. Can you open a PR so that we can discuss it in detail?
> > How can `public Class<? extends DataSourceV2> getImplementation();` be
> > possible in 3.0 as there is no `DataSourceV2`?
>
> You're right, that was a typo. Since the whole point is to separate
> the (stable) registration interface from the (evolving) DSv2 API, it
> defeats the purpose to then directly reference the DSv2 API within the
> registration interface.
>
> I'll put together a PR.
>
> Thanks again,
> Andrew
>
> >
> > On Wed, Apr 8, 2020 at 1:12 PM Andrew Melo  wrote:
> >>
> >> Hello
> >>
> >> On Tue, Apr 7, 2020 at 23:16 Wenchen Fan  wrote:
> >>>
> >>> Are you going to provide a single artifact for Spark 2.4 and 3.0? I'm not 
> >>> sure this is possible as the DS V2 API is very different in 3.0, e.g. 
> >>> there is no `DataSourceV2` anymore, and you should implement 
> >>> `TableProvider` (if you don't have database/table).
> >>
> >>
> >> Correct, I've got a single jar for both Spark 2.4 and 3.0, with a top-level
> >> Root_v24 (implements DataSourceV2) and Root_v30 (implements
> >> TableProvider). I can load this jar in both PySpark 2.4 and 3.0 and it
> >> works well -- as long as I remove the registration from META-INF and pass
> >> in the full class name to the DataFrameReader.
> >>
> >> Thanks
> >> Andrew
> >>
> >>>
> >>> On Wed, Apr 8, 2020 at 6:58 AM Andrew Melo  wrote:
> 
>  Hi Ryan,
> 
>  On Tue, Apr 7, 2020 at 5:21 PM Ryan Blue  wrote:
>  >
>  > Hi Andrew,
>  >
>  > With DataSourceV2, I recommend plugging in a catalog instead of using 
>  > DataSource. As you've noticed, the way that you plug in data sources 
>  > isn't very flexible. That's one of the reasons why we changed the 
>  > plugin system and made it possible to use named catalogs that load 
>  > implementations based on configuration properties.
>  >
>  > I think it's fine to consider how to patch the registration trait, but 
>  > I really don't recommend continuing to identify table implementations 
>  > directly by name.
> 
 Can you be a bit more concrete about what you mean by plugging in a
 catalog instead of a DataSource? We have been using
 spark.read.format("root").load([list of paths]), which works well. Since
 we don't have databases or tables, I don't fully understand what's
 different between the two interfaces that would make us prefer one over
 the other.
> 
>  That being said, WRT the registration trait, if I'm not misreading
>  createTable() and friends, the "source" parameter is resolved the same
>  way as DataFrameReader.format(), so a solution that helps out
>  registration should help both interfaces.
> 
>  Thanks again,
>  Andrew
> 
>  >
>  > On Tue, Apr 7, 2020 at 12:26 PM Andrew Melo  
>  > wrote:
>  >>
>  >> Hi all,
>  >>
>  >> I posted an improvement ticket in JIRA and Hyukjin Kwon requested I
>  >> send an email to the dev list for discussion.
>  >>
>  >> As the DSv2 API evolves, some breaking changes are occasionally made
>  >> to the API. It's possible to split a plugin into a "common" part and
>  >> multiple version-specific parts, and this works well enough to ship a
>  >> single artifact to users, as long as they write out the fully qualified
>  >> class name as the DataFrame format(). The one part that currently can't
>  >> be worked around is the DataSourceRegister trait. Since classes that
>  >> implement DataSourceRegister must also implement DataSourceV2
>  >> (and its mixins), changes to those interfaces cause the ServiceLoader
>  >> to fail when it attempts to load the "wrong" DataSourceV2 class.
>  >> (There's also an additional prohibition against multiple
>  >> implementations having the same shortName in
>  >> org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource.)
> 

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-09 Thread Wenchen Fan
The ongoing critical issues I'm aware of are:
SPARK-31257: Fix ambiguous two different CREATE TABLE syntaxes
SPARK-31404: backward compatibility issues after switching to the
Proleptic Gregorian calendar
SPARK-31399: closure cleaner is broken in Spark 3.0
SPARK-28067: Incorrect results in decimal aggregation with whole-stage
codegen enabled

That said, I'm -1 (binding) on RC1.

Please reply to this thread if you know of more critical issues that
should be fixed before 3.0.

Thanks,
Wenchen


On Fri, Apr 10, 2020 at 10:01 AM Xiao Li  wrote:

> Only low-risk or high-value bug fixes and documentation changes are
> allowed to be merged to branch-3.0. I expect all committers to follow the
> same rules as in previous releases.
>
> Xiao
>
> On Thu, Apr 9, 2020 at 6:13 PM Jungtaek Lim 
> wrote:
>
>> Looks like around 80 commits have landed on branch-3.0 since we cut RC1
>> (I know many of them are for versioning the configs and adding docs).
>> Shall we announce a blocker-only phase and maintain a list of blockers to
>> restrict changes on the branch? This churn makes everyone hesitant to test
>> RC1 (see how many people have tested RC1 in this thread), as they will
>> probably need to repeat the same tests on RC2.
>>
>> On Thu, Apr 9, 2020 at 5:50 PM Jungtaek Lim 
>> wrote:
>>
>>> I went through some manual tests of the new Structured Streaming features
>>> in Spark 3.0.0. (Please let me know if there are more features we'd like
>>> to test manually.)
>>>
>>> * file source cleanup - both "archive" and "delete" work. The query fails
>>> as expected when the input directory is the output directory of a file sink.
>>> * kafka source/sink - "header" works for both source and sink, "group id
>>> prefix" and "static group id" work, and start offset by timestamp is
>>> confirmed to work for the streaming case.
>>> * event log support for streaming queries - enabled it, confirmed that
>>> compaction works, that SHS can read compacted event logs, and that
>>> downloading an event log in SHS works by zipping the event log directory.
>>> The original functionality with a single event log file works as well.
>>>
>>> Looks good, though there are still plenty of commits being pushed to
>>> branch-3.0 after RC1, which makes me feel it may not be safe to carry the
>>> test results for RC1 over to RC2.
>>>
>>> On Sat, Apr 4, 2020 at 12:49 AM Sean Owen  wrote:
>>>
 Aside from the other issues mentioned here, which probably do require
 another RC, this looks pretty good to me.

 I built on Ubuntu 19 and ran with Java 11, -Pspark-ganglia-lgpl
 -Pkinesis-asl -Phadoop-3.2 -Phive-2.3 -Pyarn -Pmesos -Pkubernetes
 -Phive-thriftserver -Djava.version=11

 I did see the following test failures, but as usual, I'm not sure
 whether it's specific to me. Anyone else see these, particularly the R
 warnings?


 PythonUDFSuite:
 org.apache.spark.sql.execution.python.PythonUDFSuite *** ABORTED ***
   java.lang.RuntimeException: Unable to load a Suite class that was
 discovered in the runpath:
 org.apache.spark.sql.execution.python.PythonUDFSuite
   at
 org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:81)
   at
 org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38)
   at
 scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
   at scala.collection.Iterator.foreach(Iterator.scala:941)
   at scala.collection.Iterator.foreach$(Iterator.scala:941)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
   at scala.collection.TraversableLike.map(TraversableLike.scala:238)


 - SPARK-25158: Executor accidentally exit because
 ScriptTransformationWriterThread throw Exception *** FAILED ***
   Expected exception org.apache.spark.SparkException to be thrown, but
 no exception was thrown (SQLQuerySuite.scala:2384)


 * checking for missing documentation entries ... WARNING
 Undocumented code objects:
   ‘%<=>%’ ‘add_months’ ‘agg’ ‘approxCountDistinct’ ‘approxQuantile’
   ‘approx_count_distinct’ ‘arrange’ ‘array_contains’ ‘array_distinct’
 ...
  WARNING
 ‘qpdf’ is needed for checks on size reduction of PDFs

 On Tue, Mar 31, 2020 at 10:04 PM Reynold Xin 
 wrote:
 >
 > Please vote on releasing the following candidate as Apache Spark
 version 3.0.0.
 >
 > The vote is open until 11:59pm Pacific time Fri Apr 3, and passes if
 a majority +1 PMC votes are cast, with a 

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-09 Thread Xiao Li
Only low-risk or high-value bug fixes and documentation changes are
allowed to be merged to branch-3.0. I expect all committers to follow the
same rules as in previous releases.

Xiao

On Thu, Apr 9, 2020 at 6:13 PM Jungtaek Lim 
wrote:

> Looks like around 80 commits have landed on branch-3.0 since we cut RC1
> (I know many of them are for versioning the configs and adding docs).
> Shall we announce a blocker-only phase and maintain a list of blockers to
> restrict changes on the branch? This churn makes everyone hesitant to test
> RC1 (see how many people have tested RC1 in this thread), as they will
> probably need to repeat the same tests on RC2.
>
> On Thu, Apr 9, 2020 at 5:50 PM Jungtaek Lim 
> wrote:
>
>> I went through some manual tests of the new Structured Streaming features
>> in Spark 3.0.0. (Please let me know if there are more features we'd like
>> to test manually.)
>>
>> * file source cleanup - both "archive" and "delete" work. The query fails
>> as expected when the input directory is the output directory of a file sink.
>> * kafka source/sink - "header" works for both source and sink, "group id
>> prefix" and "static group id" work, and start offset by timestamp is
>> confirmed to work for the streaming case.
>> * event log support for streaming queries - enabled it, confirmed that
>> compaction works, that SHS can read compacted event logs, and that
>> downloading an event log in SHS works by zipping the event log directory.
>> The original functionality with a single event log file works as well.
>>
>> Looks good, though there are still plenty of commits being pushed to
>> branch-3.0 after RC1, which makes me feel it may not be safe to carry the
>> test results for RC1 over to RC2.
>>
>> On Sat, Apr 4, 2020 at 12:49 AM Sean Owen  wrote:
>>
>>> Aside from the other issues mentioned here, which probably do require
>>> another RC, this looks pretty good to me.
>>>
>>> I built on Ubuntu 19 and ran with Java 11, -Pspark-ganglia-lgpl
>>> -Pkinesis-asl -Phadoop-3.2 -Phive-2.3 -Pyarn -Pmesos -Pkubernetes
>>> -Phive-thriftserver -Djava.version=11
>>>
>>> I did see the following test failures, but as usual, I'm not sure
>>> whether it's specific to me. Anyone else see these, particularly the R
>>> warnings?
>>>
>>>
>>> PythonUDFSuite:
>>> org.apache.spark.sql.execution.python.PythonUDFSuite *** ABORTED ***
>>>   java.lang.RuntimeException: Unable to load a Suite class that was
>>> discovered in the runpath:
>>> org.apache.spark.sql.execution.python.PythonUDFSuite
>>>   at
>>> org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:81)
>>>   at
>>> org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38)
>>>   at
>>> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>>>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>>>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>>>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>>>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>>>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>>>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>>>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>>>
>>>
>>> - SPARK-25158: Executor accidentally exit because
>>> ScriptTransformationWriterThread throw Exception *** FAILED ***
>>>   Expected exception org.apache.spark.SparkException to be thrown, but
>>> no exception was thrown (SQLQuerySuite.scala:2384)
>>>
>>>
>>> * checking for missing documentation entries ... WARNING
>>> Undocumented code objects:
>>>   ‘%<=>%’ ‘add_months’ ‘agg’ ‘approxCountDistinct’ ‘approxQuantile’
>>>   ‘approx_count_distinct’ ‘arrange’ ‘array_contains’ ‘array_distinct’
>>> ...
>>>  WARNING
>>> ‘qpdf’ is needed for checks on size reduction of PDFs
>>>
>>> On Tue, Mar 31, 2020 at 10:04 PM Reynold Xin 
>>> wrote:
>>> >
>>> > Please vote on releasing the following candidate as Apache Spark
>>> version 3.0.0.
>>> >
>>> > The vote is open until 11:59pm Pacific time Fri Apr 3, and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>> >
>>> > [ ] +1 Release this package as Apache Spark 3.0.0
>>> > [ ] -1 Do not release this package because ...
>>> >
>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>> >
>>> > The tag to be voted on is v3.0.0-rc1 (commit
>>> 6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1):
>>> > https://github.com/apache/spark/tree/v3.0.0-rc1
>>> >
>>> > The release files, including signatures, digests, etc. can be found at:
>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/
>>> >
>>> > Signatures used for Spark RCs can be found in this file:
>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>> >
>>> > The staging repository for this release can be found at:
>>> >
>>> https://repository.apache.org/content/repositories/orgapachespark-1341/
>>> >
>>> > The documentation 

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-09 Thread Jungtaek Lim
Looks like around 80 commits have landed on branch-3.0 since we cut RC1
(I know many of them are for versioning the configs and adding docs).
Shall we announce a blocker-only phase and maintain a list of blockers to
restrict changes on the branch? This churn makes everyone hesitant to test
RC1 (see how many people have tested RC1 in this thread), as they will
probably need to repeat the same tests on RC2.

On Thu, Apr 9, 2020 at 5:50 PM Jungtaek Lim 
wrote:

> I went through some manual tests of the new Structured Streaming features
> in Spark 3.0.0. (Please let me know if there are more features we'd like
> to test manually.)
>
> * file source cleanup - both "archive" and "delete" work. The query fails
> as expected when the input directory is the output directory of a file sink.
> * kafka source/sink - "header" works for both source and sink, "group id
> prefix" and "static group id" work, and start offset by timestamp is
> confirmed to work for the streaming case.
> * event log support for streaming queries - enabled it, confirmed that
> compaction works, that SHS can read compacted event logs, and that
> downloading an event log in SHS works by zipping the event log directory.
> The original functionality with a single event log file works as well.
>
> Looks good, though there are still plenty of commits being pushed to
> branch-3.0 after RC1, which makes me feel it may not be safe to carry the
> test results for RC1 over to RC2.
>
> On Sat, Apr 4, 2020 at 12:49 AM Sean Owen  wrote:
>
>> Aside from the other issues mentioned here, which probably do require
>> another RC, this looks pretty good to me.
>>
>> I built on Ubuntu 19 and ran with Java 11, -Pspark-ganglia-lgpl
>> -Pkinesis-asl -Phadoop-3.2 -Phive-2.3 -Pyarn -Pmesos -Pkubernetes
>> -Phive-thriftserver -Djava.version=11
>>
>> I did see the following test failures, but as usual, I'm not sure
>> whether it's specific to me. Anyone else see these, particularly the R
>> warnings?
>>
>>
>> PythonUDFSuite:
>> org.apache.spark.sql.execution.python.PythonUDFSuite *** ABORTED ***
>>   java.lang.RuntimeException: Unable to load a Suite class that was
>> discovered in the runpath:
>> org.apache.spark.sql.execution.python.PythonUDFSuite
>>   at
>> org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:81)
>>   at
>> org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38)
>>   at
>> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>>
>>
>> - SPARK-25158: Executor accidentally exit because
>> ScriptTransformationWriterThread throw Exception *** FAILED ***
>>   Expected exception org.apache.spark.SparkException to be thrown, but
>> no exception was thrown (SQLQuerySuite.scala:2384)
>>
>>
>> * checking for missing documentation entries ... WARNING
>> Undocumented code objects:
>>   ‘%<=>%’ ‘add_months’ ‘agg’ ‘approxCountDistinct’ ‘approxQuantile’
>>   ‘approx_count_distinct’ ‘arrange’ ‘array_contains’ ‘array_distinct’
>> ...
>>  WARNING
>> ‘qpdf’ is needed for checks on size reduction of PDFs
>>
>> On Tue, Mar 31, 2020 at 10:04 PM Reynold Xin  wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 3.0.0.
>> >
>> > The vote is open until 11:59pm Pacific time Fri Apr 3, and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.0.0
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v3.0.0-rc1 (commit
>> 6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1):
>> > https://github.com/apache/spark/tree/v3.0.0-rc1
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1341/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-docs/
>> >
>> > The list of bug fixes going into 2.4.5 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
>> >
>> > This release is using the release script of the tag v3.0.0-rc1.
>> >
>> >
>> > FAQ
>> >
>> > =

Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-09 Thread Jungtaek Lim
I went through some manual tests of the new Structured Streaming features
in Spark 3.0.0. (Please let me know if there are more features we'd like
to test manually.)

* file source cleanup - both "archive" and "delete" work. The query fails
as expected when the input directory is the output directory of a file sink.
* kafka source/sink - "header" works for both source and sink, "group id
prefix" and "static group id" work, and start offset by timestamp is
confirmed to work for the streaming case.
* event log support for streaming queries - enabled it, confirmed that
compaction works, that SHS can read compacted event logs, and that
downloading an event log in SHS works by zipping the event log directory.
The original functionality with a single event log file works as well.
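
(For reference, in case anyone wants to reproduce these checks, here is a
rough sketch of how I set the options in spark-shell. This is only a sketch
and assumes I'm recalling the 3.0 option and config names correctly:)

// file source cleanup: archive (or delete) completed input files
val files = spark.readStream.format("text")
  .option("cleanSource", "archive")            // or "delete"
  .option("sourceArchiveDir", "/tmp/archived") // required for "archive"
  .load("/tmp/input")

// kafka source: headers, group id prefix / static group id, start by timestamp
// (needs the spark-sql-kafka-0-10 connector on the classpath)
val kafka = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .option("groupIdPrefix", "my-prefix") // or .option("kafka.group.id", "my-group")
  .option("startingOffsetsByTimestamp", """{"topic1": {"0": 1586400000000}}""")
  .load()

// rolling event logs (which SHS can compact and read), set via Spark confs:
//   spark.eventLog.enabled=true
//   spark.eventLog.rolling.enabled=true
//   spark.eventLog.rolling.maxFileSize=10m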

Looks good, though there are still plenty of commits being pushed to
branch-3.0 after RC1, which makes me feel it may not be safe to carry the
test results for RC1 over to RC2.

On Sat, Apr 4, 2020 at 12:49 AM Sean Owen  wrote:

> Aside from the other issues mentioned here, which probably do require
> another RC, this looks pretty good to me.
>
> I built on Ubuntu 19 and ran with Java 11, -Pspark-ganglia-lgpl
> -Pkinesis-asl -Phadoop-3.2 -Phive-2.3 -Pyarn -Pmesos -Pkubernetes
> -Phive-thriftserver -Djava.version=11
>
> I did see the following test failures, but as usual, I'm not sure
> whether it's specific to me. Anyone else see these, particularly the R
> warnings?
>
>
> PythonUDFSuite:
> org.apache.spark.sql.execution.python.PythonUDFSuite *** ABORTED ***
>   java.lang.RuntimeException: Unable to load a Suite class that was
> discovered in the runpath:
> org.apache.spark.sql.execution.python.PythonUDFSuite
>   at
> org.scalatest.tools.DiscoverySuite$.getSuiteInstance(DiscoverySuite.scala:81)
>   at
> org.scalatest.tools.DiscoverySuite.$anonfun$nestedSuites$1(DiscoverySuite.scala:38)
>   at
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.Iterator.foreach(Iterator.scala:941)
>   at scala.collection.Iterator.foreach$(Iterator.scala:941)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
>   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
>   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>
>
> - SPARK-25158: Executor accidentally exit because
> ScriptTransformationWriterThread throw Exception *** FAILED ***
>   Expected exception org.apache.spark.SparkException to be thrown, but
> no exception was thrown (SQLQuerySuite.scala:2384)
>
>
> * checking for missing documentation entries ... WARNING
> Undocumented code objects:
>   ‘%<=>%’ ‘add_months’ ‘agg’ ‘approxCountDistinct’ ‘approxQuantile’
>   ‘approx_count_distinct’ ‘arrange’ ‘array_contains’ ‘array_distinct’
> ...
>  WARNING
> ‘qpdf’ is needed for checks on size reduction of PDFs
>
> On Tue, Mar 31, 2020 at 10:04 PM Reynold Xin  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 3.0.0.
> >
> > The vote is open until 11:59pm Pacific time Fri Apr 3, and passes if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.0.0
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v3.0.0-rc1 (commit
> 6550d0d5283efdbbd838f3aeaf0476c7f52a0fb1):
> > https://github.com/apache/spark/tree/v3.0.0-rc1
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1341/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc1-docs/
> >
> > The list of bug fixes going into 2.4.5 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
> >
> > This release is using the release script of the tag v3.0.0-rc1.
> >
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running it on this release candidate,
> > then reporting any regressions.
> >
> > If you're working in PySpark, you can set up a virtual env and install
> > the current RC to see if anything important breaks. In Java/Scala, you
> > can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC
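
(Side note for anyone testing the RC from a JVM project: a minimal sbt
sketch for pulling the staged artifacts. The coordinates and version below
are assumed, not taken from the vote email:)

// build.sbt -- minimal sketch, coordinates assumed
scalaVersion := "2.12.10"
resolvers += ("spark-3.0.0-rc1-staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1341/")
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.0.0"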