ASF board report for February

2019-02-09 Thread Matei Zaharia
It’s time to submit Spark's quarterly ASF board report, due February 13th, so I 
wanted to run the draft by everyone first. Let me know if I missed anything:



Apache Spark is a fast and general engine for large-scale data processing. It 
offers high-level APIs in Java, Scala, Python and R as well as a rich set of 
libraries including stream processing, machine learning, and graph analytics. 

Project status:

- We released Apache Spark 2.2.3 on January 11th to fix bugs in the 2.2 branch. 
The community is also currently voting on a 2.3.3 release to bring recent fixes 
to the Spark 2.3 branch.

- Discussions are under way on our dev and user mailing lists about the next 
feature release, which will likely be Spark 3.0. Key questions include whether 
to remove various deprecated APIs and which minimum versions of Java, Python, 
Scala, etc. to support. A number of new features are also targeting this 
release. We encourage everyone in the community to give feedback on these 
discussions through our mailing lists or issue tracker.

Trademarks:

- We are continuing engagement with various organizations.

Latest releases:

- Jan 11th, 2019: Spark 2.2.3
- Nov 2nd, 2018: Spark 2.4.0
- Sept 24th, 2018: Spark 2.3.2

Committers and PMC:

- The latest committer was added on January 29th, 2019 (Jose Torres).
- The latest PMC member was added on Jan 12th, 2018 (Xiao Li).


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-09 Thread John Zhuge
Not me. I am running zulu8, maven, and hadoop-2.7.

On Sat, Feb 9, 2019 at 5:42 PM Felix Cheung 
wrote:

> One test in SparkSubmitSuite is consistently failing for me. Anyone seeing
> that?
>
>
> --
> *From:* Takeshi Yamamuro 
> *Sent:* Saturday, February 9, 2019 5:25 AM
> *To:* Spark dev list
> *Subject:* Re: [VOTE] Release Apache Spark 2.3.3 (RC2)
>
> Sorry, I forgot to enable `-Pdocker-integration-tests` for the JDBC
> integration tests.
> I have now run these tests and confirmed that they pass.
>
> On Sat, Feb 9, 2019 at 5:26 PM Herman van Hovell 
> wrote:
>
>> I count 2 binding votes :)...
>>
>> Op vr 8 feb. 2019 om 22:36 schreef Felix Cheung <
>> felixcheun...@hotmail.com>
>>
>>> Nope, still only 1 binding vote ;)
>>>
>>>
>>> --
>>> *From:* Mark Hamstra 
>>> *Sent:* Friday, February 8, 2019 7:30 PM
>>> *To:* Marcelo Vanzin
>>> *Cc:* Takeshi Yamamuro; Spark dev list
>>> *Subject:* Re: [VOTE] Release Apache Spark 2.3.3 (RC2)
>>>
>>> There are 2. C'mon Marcelo, you can make it 3!
>>>
>>> On Fri, Feb 8, 2019 at 5:03 PM Marcelo Vanzin
>>>  wrote:
>>>
 Hi Takeshi,

 Since we only really have one +1 binding vote, do you want to extend
 this vote a bit?

 I've been stuck on a few things but plan to test this (setting things
 up now), but it probably won't happen before the deadline.

 On Tue, Feb 5, 2019 at 5:07 PM Takeshi Yamamuro 
 wrote:
 >
 > Please vote on releasing the following candidate as Apache Spark
 version 2.3.3.
 >
 > The vote is open until February 8 6:00PM (PST) and passes if a
 majority +1 PMC votes are cast, with
 > a minimum of 3 +1 votes.
 >
 > [ ] +1 Release this package as Apache Spark 2.3.3
 > [ ] -1 Do not release this package because ...
 >
 > To learn more about Apache Spark, please see http://spark.apache.org/
 >
 > The tag to be voted on is v2.3.3-rc2 (commit
 66fd9c34bf406a4b5f86605d06c9607752bd637a):
 > https://github.com/apache/spark/tree/v2.3.3-rc2
 >
 > The release files, including signatures, digests, etc. can be found
 at:
 > https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc2-bin/
 >
 > Signatures used for Spark RCs can be found in this file:
 > https://dist.apache.org/repos/dist/dev/spark/KEYS
 >
 > The staging repository for this release can be found at:
 >
 https://repository.apache.org/content/repositories/orgapachespark-1298/
 >
 > The documentation corresponding to this release can be found at:
 > https://dist.apache.org/repos/dist/dev/spark/v2.3.3-rc2-docs/
 >
 > The list of bug fixes going into 2.3.3 can be found at the following
 URL:
 > https://issues.apache.org/jira/projects/SPARK/versions/12343759
 >
 > FAQ
 >
 > =
 > How can I help test this release?
 > =
 >
 > If you are a Spark user, you can help us test this release by taking
 > an existing Spark workload, running it on this release candidate, and
 > reporting any regressions.
 >
 > If you're working in PySpark, you can set up a virtual env and install
 > the current RC to see if anything important breaks; in Java/Scala,
 > you can add the staging repository to your project's resolvers and test
 > with the RC (make sure to clean up the artifact cache before/after so
 > you don't end up building with an out-of-date RC going forward).
 >
 > ===
 > What should happen to JIRA tickets still targeting 2.3.3?
 > ===
 >
 > The current list of open tickets targeted at 2.3.3 can be found at:
 > https://issues.apache.org/jira/projects/SPARK and search for "Target
 Version/s" = 2.3.3
 >
 > Committers should look at those and triage. Extremely important bug
 > fixes, documentation, and API tweaks that impact compatibility should
 > be worked on immediately. Everything else please retarget to an
 > appropriate release.
 >
 > ==
 > But my bug isn't fixed?
 > ==
 >
 > In order to make timely releases, we will typically not hold the
 > release unless the bug in question is a regression from the previous
 > release. That being said, if there is something which is a regression
 > that has not been correctly targeted please ping me or a committer to
 > help target the issue.
 >
 > P.S.
 > I checked that all the tests passed on the Amazon Linux 2 AMI:
 > $ java -version
 > openjdk version "1.8.0_191"
 > OpenJDK Runtime Environment (build 1.8.0_191-b12)
 > OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)
 > $ ./build/mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos
 -Psparkr test
 >
 > --
 > ---
 > Takeshi Yamamuro

Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-09 Thread Felix Cheung
One test in SparkSubmitSuite is consistently failing for me. Anyone seeing that?





Re: Vectorized R gapply[Collect]() implementation

2019-02-09 Thread Shivaram Venkataraman
Those speedups look awesome! Great work Hyukjin!

Thanks
Shivaram





Re: [DISCUSS] Change default executor log URLs for YARN

2019-02-09 Thread Sean Owen
If many people find the current behavior OK, then honestly just don't
make this change. It's been there a while and the logs are available
for anyone who wants to browse through YARN.
While I think the change is fine, I can't see it being worth a flag to
toggle between two pretty trivially different UI behaviors.

On Sat, Feb 9, 2019 at 12:49 AM Felix Cheung  wrote:
>
> For this case I’d agree with Ryan. I haven’t followed this thread and the 
> details of the change since it’s way too much for me to consume “in my free 
> time” (which is 0 nowadays) but I’m pretty sure the existing behavior works 
> for us and very likely we don’t want it to change because of some proxy magic 
> we do behind the scenes.
>
> I’d also agree config flag is not always the best way but in this case the 
> existing established behavior doesn’t seem broken...
>
> I could be wrong though.
>
>
> 
> From: Ryan Blue 
> Sent: Friday, February 8, 2019 4:39 PM
> To: Sean Owen
> Cc: Jungtaek Lim; dev
> Subject: Re: [DISCUSS] Change default executor log URLs for YARN
>
> I'm not sure that many people need this, so it is hard to make a decision. 
> I'm reluctant to change the current behavior if the result is a new papercut 
> to 99% of users and a win for 1%. The suggested change will work for 100% of 
> users, so if we don't want a flag then we should go with that. But I would 
> certainly want to turn it off in our environment because it doesn't provide 
> any value for us and would annoy our users.
>
> On Fri, Feb 8, 2019 at 4:18 PM Sean Owen  wrote:
>>
>> Is a flag needed? You know me, I think flags are often failures of
>> design, or disagreement punted to the user. I can understand retaining
>> old behavior under a flag where the behavior change could be
>> problematic for some users or facilitate migration, but this is just a
>> change to some UI links no? the underlying links don't change.
>> On Fri, Feb 8, 2019 at 5:41 PM Ryan Blue  wrote:
>> >
>> > I suggest using the current behavior as the default and add a flag to 
>> > implement the behavior you're suggesting: to link to the logs path in YARN 
>> > instead of directly to stderr and stdout.
>> >
>> > On Fri, Feb 8, 2019 at 3:33 PM Jungtaek Lim  wrote:
>> >>
>> >> Ryan,
>> >>
 >> >> actually I'm not clear about your suggestion. As I see it, there are 
 >> >> three possible options here:
 >> >>
 >> >> 1. If we want to let users completely rewrite log URLs, that's 
 >> >> SPARK-26792. For SHS we already addressed it.
 >> >> 2. We could add a flag letting users choose between a single log URL and 
 >> >> the default two stdout/stderr URLs.
 >> >> 3. We could let users enumerate the file names they want to link, and 
 >> >> create a log link for each file.
>> >>
>> >> Which one do you suggest?
>> >
>
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix




Vectorized R gapply[Collect]() implementation

2019-02-09 Thread Hyukjin Kwon
Guys, as a continuation of the Arrow optimization for converting R DataFrames
to Spark DataFrames,

I am experimenting with a vectorized gapply[Collect]() implementation, similar
to vectorized Pandas UDFs.

It brought an 820%+ performance improvement. See
https://github.com/apache/spark/pull/23746

Please come and take a look if you're interested in R APIs :D. I have already
cc'ed some people I know, but please come review and discuss both the Spark
side and the Arrow side.

This Arrow optimization work is being tracked under
https://issues.apache.org/jira/browse/SPARK-26759 . Please feel free to take a
task if any of you are interested.

Thanks.


Re: [VOTE] Release Apache Spark 2.3.3 (RC2)

2019-02-09 Thread Takeshi Yamamuro
Sorry, I forgot to enable `-Pdocker-integration-tests` for the JDBC
integration tests.
I have now run these tests and confirmed that they pass.
