Re: Docker images for Spark 3.3.0 release are now available

2022-07-03 Thread Hyukjin Kwon
Thanks Gengliang.

On Tue, 28 Jun 2022 at 11:13, Gengliang Wang  wrote:

> Hi all,
>
> The official Docker images for Spark 3.3.0 release are now available!
>
>- To run Spark with Scala/Java API only:
>https://hub.docker.com/r/apache/spark
>- To run Python on Spark: https://hub.docker.com/r/apache/spark-py
>- To run R on Spark: https://hub.docker.com/r/apache/spark-r
>
>
> Gengliang
>
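
For reference, pulling and trying out these images typically looks something like
the sketch below. The tag name, the in-image Spark location, and the examples jar
path are assumptions here; check the Docker Hub pages above for the published tags.

    # Pull the Scala/Java image and the Python image (tags are illustrative)
    docker pull apache/spark:v3.3.0
    docker pull apache/spark-py:v3.3.0

    # Run the SparkPi example inside the container (paths are assumptions)
    docker run --rm apache/spark:v3.3.0 \
      /opt/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi \
      /opt/spark/examples/jars/spark-examples_2.12-3.3.0.jar 100

    # Start an interactive PySpark shell from the Python image
    docker run -it --rm apache/spark-py:v3.3.0 /opt/spark/bin/pyspark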


Re: [SPARK-39515] Improve scheduled jobs in GitHub Actions

2022-06-23 Thread Hyukjin Kwon
Alright, I'll be there after Holden's talk Thursday
https://databricks.com/dataaisummit/session/tools-assisted-apache-spark-version-migrations-21-32
w/ Dongjoon (since he manages OSS Jenkins too).
Let's have a quickie chat :-).

On Thu, 23 Jun 2022 at 06:16, Hyukjin Kwon  wrote:

> Oops, I was confused about the time and distance in the US. I won't make
> it either.
> Let me find another time slot that works for more ppl.
>
> On Thu, 23 Jun 2022 at 00:19, Dongjoon Hyun 
> wrote:
>
>> Thank you, Hyukjin! :)
>>
>> BTW, unfortunately, it seems that I cannot join that quick meeting.
>> I have another schedule at South Bay around 7PM and need to leave San
>> Francisco at least 5PM.
>>
>> Dongjoon.
>>
>>
>> On Wed, Jun 22, 2022 at 3:39 AM Hyukjin Kwon  wrote:
>>
>>> (cc @Yikun Jiang  @Gengliang Wang
>>>  @Maxim Gekk 
>>> @Yang,Jie(INF)  FYI)
>>>
>>> On Wed, 22 Jun 2022 at 19:34, Hyukjin Kwon  wrote:
>>>
>>>> Couple of updates:
>>>>
>>>>-
>>>>
>>>>All builds passed now with all combinations we defined in the
>>>>GitHub Actions (e.g., branch-3.2, branch-3.3, JDK 11,
>>>>JDK 17 and Scala 2.13), see https://github.com/apache/spark/actions
>>>>cc @Tom Graves  @Dongjoon Hyun
>>>> FYI
>>>>-
>>>>
>>>>except one test that is failing due to OOM. That’s being fixed
>>>>at https://github.com/apache/spark/pull/36954, see
>>>>also
>>>>https://github.com/apache/spark/pull/36787#discussion_r901190636
>>>>-
>>>>
>>>>I am now adding PySpark, SparkR jobs to the scheduled builds at
>>>>https://github.com/apache/spark/pull/36940
>>>>to see if they pass. We might need a couple more fixes there.
>>>>-
>>>>
>>>>There’s one last task: caching the Docker image (
>>>>https://issues.apache.org/jira/browse/SPARK-39522).
>>>>I will have to be less active for this week and next week because
>>>>of the Spark Summit. I would appreciate it if somebody
>>>>finds some time to take a stab.
>>>>
>>>> About a quick hallway meetup, I will be there after Holden’s talk at
>>>> least to say hello to her :-).
>>>> Let’s have a quick chat about our CI. We still have some general
>>>> problems to cope with like the lack of resources in
>>>> GitHub Actions.
>>>>
>>>>
>>>>
>>>> On Tue, 21 Jun 2022 at 11:49, Hyukjin Kwon  wrote:
>>>>
>>>>> Just chatted offline - both I and Holden have multiple sessions :-).
>>>>> Probably let's meet up for a quick chat after your talk
>>>>> https://databricks.com/dataaisummit/session/what-do-when-your-job-goes-oom-night-flowcharts
>>>>> ?
>>>>>
>>>>>
>>>>> On Mon, 20 Jun 2022 at 22:23, Holden Karau 
>>>>> wrote:
>>>>>
>>>>>> How about a hallway meet up at Data AI summit to talk about build CI
>>>>>> if folks are
>>>>>> interested?
>>>>>>
>>>>>> On Sun, Jun 19, 2022 at 7:50 PM Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>>> Increased the priority to a blocker - I don't think we can release
>>>>>>> with these build failures and poor CI
>>>>>>>
>>>>>>> On Mon, 20 Jun 2022 at 10:39, Hyukjin Kwon 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> There are too many test failures here. I pinged in some PRs I could
>>>>>>>> identify from a cursory look, but it would be great for you to take a look
>>>>>>>> if you haven't tested your changes against other environments like JDK 11
>>>>>>>> or Scala 2.13.
>>>>>>>>
>>>>>>>> On Mon, 20 Jun 2022 at 10:04, Hyukjin Kwon 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> I am trying to rework GitHub Actions CI at
>>>>>>>>> https://issues.apache.org/jira/browse/SPARK-39515. Any help would
>>>>>>>>> be very appreciated.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>
>>>>>


Re: [SPARK-39515] Improve scheduled jobs in GitHub Actions

2022-06-22 Thread Hyukjin Kwon
Oops, I was confused about the time and distance in the US. I won't make it
either.
Let me find another time slot that works for more ppl.

On Thu, 23 Jun 2022 at 00:19, Dongjoon Hyun  wrote:

> Thank you, Hyukjin! :)
>
> BTW, unfortunately, it seems that I cannot join that quick meeting.
> I have another schedule at South Bay around 7PM and need to leave San
> Francisco at least 5PM.
>
> Dongjoon.
>
>
> On Wed, Jun 22, 2022 at 3:39 AM Hyukjin Kwon  wrote:
>
>> (cc @Yikun Jiang  @Gengliang Wang
>>  @Maxim Gekk 
>> @Yang,Jie(INF)  FYI)
>>
>> On Wed, 22 Jun 2022 at 19:34, Hyukjin Kwon  wrote:
>>
>>> Couple of updates:
>>>
>>>-
>>>
>>>All builds passed now with all combinations we defined in the GitHub
>>>Actions (e.g., branch-3.2, branch-3.3, JDK 11,
>>>JDK 17 and Scala 2.13), see https://github.com/apache/spark/actions
>>>cc @Tom Graves  @Dongjoon Hyun
>>> FYI
>>>-
>>>
>>>except one test that is failing due to OOM. That’s being fixed
>>>at https://github.com/apache/spark/pull/36954, see
>>>also https://github.com/apache/spark/pull/36787#discussion_r901190636
>>>-
>>>
>>>I am now adding PySpark, SparkR jobs to the scheduled builds at
>>>https://github.com/apache/spark/pull/36940
>>>to see if they pass. We might need a couple more fixes there.
>>>-
>>>
>>>There’s one last task: caching the Docker image (
>>>https://issues.apache.org/jira/browse/SPARK-39522).
>>>I will have to be less active for this week and next week because of
>>>the Spark Summit. I would appreciate it if somebody
>>>finds some time to take a stab.
>>>
>>> About a quick hallway meetup, I will be there after Holden’s talk at
>>> least to say hello to her :-).
>>> Let’s have a quick chat about our CI. We still have some general
>>> problems to cope with like the lack of resources in
>>> GitHub Actions.
>>>
>>>
>>>
>>> On Tue, 21 Jun 2022 at 11:49, Hyukjin Kwon  wrote:
>>>
>>>> Just chatted offline - both I and Holden have multiple sessions :-).
>>>> Probably let's meet up for a quick chat after your talk
>>>> https://databricks.com/dataaisummit/session/what-do-when-your-job-goes-oom-night-flowcharts
>>>> ?
>>>>
>>>>
>>>> On Mon, 20 Jun 2022 at 22:23, Holden Karau 
>>>> wrote:
>>>>
>>>>> How about a hallway meet up at Data AI summit to talk about build CI
>>>>> if folks are
>>>>> interested?
>>>>>
>>>>> On Sun, Jun 19, 2022 at 7:50 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> Increased the priority to a blocker - I don't think we can release
>>>>>> with these build failures and poor CI
>>>>>>
>>>>>> On Mon, 20 Jun 2022 at 10:39, Hyukjin Kwon 
>>>>>> wrote:
>>>>>>
>>>>>>> There are too many test failures here. I pinged in some PRs I could
>>>>>>> identify from a cursory look, but it would be great for you to take a look
>>>>>>> if you haven't tested your changes against other environments like JDK 11
>>>>>>> or Scala 2.13.
>>>>>>>
>>>>>>> On Mon, 20 Jun 2022 at 10:04, Hyukjin Kwon 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I am trying to rework GitHub Actions CI at
>>>>>>>> https://issues.apache.org/jira/browse/SPARK-39515. Any help would
>>>>>>>> be very appreciated.
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>> Twitter: https://twitter.com/holdenkarau
>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>


Re: [SPARK-39515] Improve scheduled jobs in GitHub Actions

2022-06-22 Thread Hyukjin Kwon
(cc @Yikun Jiang  @Gengliang Wang
 @Maxim Gekk 
@Yang,Jie(INF)  FYI)

On Wed, 22 Jun 2022 at 19:34, Hyukjin Kwon  wrote:

> Couple of updates:
>
>-
>
>All builds passed now with all combinations we defined in the GitHub
>Actions (e.g., branch-3.2, branch-3.3, JDK 11,
>JDK 17 and Scala 2.13), see https://github.com/apache/spark/actions cc @Tom
>Graves  @Dongjoon Hyun 
> FYI
>-
>
>except one test that is failing due to OOM. That’s being fixed at
>https://github.com/apache/spark/pull/36954, see
>also https://github.com/apache/spark/pull/36787#discussion_r901190636
>-
>
>I am now adding PySpark, SparkR jobs to the scheduled builds at
>https://github.com/apache/spark/pull/36940
>to see if they pass. We might need a couple more fixes there.
>-
>
>There’s one last task: caching the Docker image (
>https://issues.apache.org/jira/browse/SPARK-39522).
>I will have to be less active for this week and next week because of
>the Spark Summit. I would appreciate it if somebody
>finds some time to take a stab.
>
> About a quick hallway meetup, I will be there after Holden’s talk at least
> to say hello to her :-).
> Let’s have a quick chat about our CI. We still have some general problems
> to cope with like the lack of resources in
> GitHub Actions.
>
>
>
> On Tue, 21 Jun 2022 at 11:49, Hyukjin Kwon  wrote:
>
>> Just chatted offline - both I and Holden have multiple sessions :-).
>> Probably let's meet up for a quick chat after your talk
>> https://databricks.com/dataaisummit/session/what-do-when-your-job-goes-oom-night-flowcharts
>> ?
>>
>>
>> On Mon, 20 Jun 2022 at 22:23, Holden Karau  wrote:
>>
>>> How about a hallway meet up at Data AI summit to talk about build CI if
>>> folks are
>>> interested?
>>>
>>> On Sun, Jun 19, 2022 at 7:50 PM Hyukjin Kwon 
>>> wrote:
>>>
>>>> Increased the priority to a blocker - I don't think we can release with
>>>> these build failures and poor CI
>>>>
>>>> On Mon, 20 Jun 2022 at 10:39, Hyukjin Kwon  wrote:
>>>>
>>>>> There are too many test failures here. I pinged in some PRs I could
>>>>> identify from a cursory look, but it would be great for you to take a look
>>>>> if you haven't tested your changes against other environments like JDK 11
>>>>> or Scala 2.13.
>>>>>
>>>>> On Mon, 20 Jun 2022 at 10:04, Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I am trying to rework GitHub Actions CI at
>>>>>> https://issues.apache.org/jira/browse/SPARK-39515. Any help would be
>>>>>> very appreciated.
>>>>>>
>>>>>>
>>>>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>


Re: [SPARK-39515] Improve scheduled jobs in GitHub Actions

2022-06-22 Thread Hyukjin Kwon
Couple of updates:

   -

   All builds passed now with all combinations we defined in the GitHub
   Actions (e.g., branch-3.2, branch-3.3, JDK 11,
   JDK 17 and Scala 2.13), see https://github.com/apache/spark/actions cc @Tom
   Graves  @Dongjoon Hyun 
FYI
   -

   except one test that is failing due to OOM. That’s being fixed at
   https://github.com/apache/spark/pull/36954, see
   also https://github.com/apache/spark/pull/36787#discussion_r901190636
   -

   I am now adding PySpark, SparkR jobs to the scheduled builds at
   https://github.com/apache/spark/pull/36940
   to see if they pass. We might need a couple more fixes there.
   -

   There’s one last task: caching the Docker image (
   https://issues.apache.org/jira/browse/SPARK-39522).
   I will have to be less active for this week and next week because of the
   Spark Summit. I would appreciate it if somebody
   finds some time to take a stab.

About a quick hallway meetup, I will be there after Holden’s talk at least
to say hello to her :-).
Let’s have a quick chat about our CI. We still have some general problems
to cope with like the lack of resources in
GitHub Actions.



On Tue, 21 Jun 2022 at 11:49, Hyukjin Kwon  wrote:

> Just chatted offline - both I and Holden have multiple sessions :-).
> Probably let's meet up for a quick chat after your talk
> https://databricks.com/dataaisummit/session/what-do-when-your-job-goes-oom-night-flowcharts
> ?
>
>
> On Mon, 20 Jun 2022 at 22:23, Holden Karau  wrote:
>
>> How about a hallway meet up at Data AI summit to talk about build CI if
>> folks are
>> interested?
>>
>> On Sun, Jun 19, 2022 at 7:50 PM Hyukjin Kwon  wrote:
>>
>>> Increased the priority to a blocker - I don't think we can release with
>>> these build failures and poor CI
>>>
>>> On Mon, 20 Jun 2022 at 10:39, Hyukjin Kwon  wrote:
>>>
>>>> There are too many test failures here. I pinged in some PRs I could
>>>> identify from a cursory look, but it would be great for you to take a look
>>>> if you haven't tested your changes against other environments like JDK 11
>>>> or Scala 2.13.
>>>>
>>>> On Mon, 20 Jun 2022 at 10:04, Hyukjin Kwon  wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I am trying to rework GitHub Actions CI at
>>>>> https://issues.apache.org/jira/browse/SPARK-39515. Any help would be
>>>>> very appreciated.
>>>>>
>>>>>
>>>>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: [SPARK-39515] Improve scheduled jobs in GitHub Actions

2022-06-20 Thread Hyukjin Kwon
Just chatted offline - both I and Holden have multiple sessions :-).
Probably let's meet up for a quick chat after your talk
https://databricks.com/dataaisummit/session/what-do-when-your-job-goes-oom-night-flowcharts
?


On Mon, 20 Jun 2022 at 22:23, Holden Karau  wrote:

> How about a hallway meet up at Data AI summit to talk about build CI if
> folks are
> interested?
>
> On Sun, Jun 19, 2022 at 7:50 PM Hyukjin Kwon  wrote:
>
>> Increased the priority to a blocker - I don't think we can release with
>> these build failures and poor CI
>>
>> On Mon, 20 Jun 2022 at 10:39, Hyukjin Kwon  wrote:
>>
>>> There are too many test failures here. I pinged in some PRs I could
>>> identify from a cursory look, but it would be great for you to take a look
>>> if you haven't tested your changes against other environments like JDK 11
>>> or Scala 2.13.
>>>
>>> On Mon, 20 Jun 2022 at 10:04, Hyukjin Kwon  wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am trying to rework GitHub Actions CI at
>>>> https://issues.apache.org/jira/browse/SPARK-39515. Any help would be
>>>> very appreciated.
>>>>
>>>>
>>>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


[PSA] Please rebase and sync your master branch in your forked repository

2022-06-20 Thread Hyukjin Kwon
After https://github.com/apache/spark/pull/36922 gets merged, your fork's
master branch needs to be synced to the latest master branch in
Apache Spark. Otherwise, builds will not be triggered in your PR.
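
For anyone who has not done this before, syncing a fork's master with apache/spark
usually looks like the sketch below (the "apache" and "origin" remote names are
assumptions; adjust them to however your clone is set up). GitHub's "Sync fork"
button on the fork's page achieves the same thing.

    # Add the upstream remote once, if it is not configured yet
    git remote add apache https://github.com/apache/spark.git

    # Bring the fork's master up to date with upstream and push it back
    git fetch apache
    git checkout master
    git merge --ff-only apache/master
    git push origin master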


Re: [SPARK-39515] Improve scheduled jobs in GitHub Actions

2022-06-19 Thread Hyukjin Kwon
Increased the priority to a blocker - I don't think we can release with
these build failures and poor CI

On Mon, 20 Jun 2022 at 10:39, Hyukjin Kwon  wrote:

> There are too many test failures here. I pinged in some PRs I could
> identify from a cursory look, but it would be great for you to take a look
> if you haven't tested your changes against other environments like JDK 11
> or Scala 2.13.
>
> On Mon, 20 Jun 2022 at 10:04, Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I am trying to rework GitHub Actions CI at
>> https://issues.apache.org/jira/browse/SPARK-39515. Any help would be
>> very appreciated.
>>
>>
>>


Re: [SPARK-39515] Improve scheduled jobs in GitHub Actions

2022-06-19 Thread Hyukjin Kwon
There are too many test failures here. I pinged in some PRs I could
identify from a cursory look, but it would be great for you to take a look
if you haven't tested your changes against other environments like JDK 11
or Scala 2.13.

On Mon, 20 Jun 2022 at 10:04, Hyukjin Kwon  wrote:

> Hi all,
>
> I am trying to rework GitHub Actions CI at
> https://issues.apache.org/jira/browse/SPARK-39515. Any help would be very
> appreciated.
>
>
>


[SPARK-39515] Improve scheduled jobs in GitHub Actions

2022-06-19 Thread Hyukjin Kwon
Hi all,

I am trying to rework GitHub Actions CI at
https://issues.apache.org/jira/browse/SPARK-39515. Any help would be very
appreciated.


Re: [VOTE][RESULT] SPIP: Spark Connect

2022-06-16 Thread Hyukjin Kwon
Awesome, I am excited to see this in Apache Spark.

On Fri, 17 Jun 2022 at 08:37, Herman van Hovell
 wrote:

> The vote passes with 17 +1s (10 binding +1s).
> +1:
> Herman van Hovell*
> Matei Zaharia*
> Yuming Wang
> Hyukjin Kwon*
> Chao Sun
> L.C. Hsieh*
> Huaxin Gao
> Ruifeng Zheng
> Wenchen Fan*
> Believer
> Xiao Li*
> Reynold Xin*
> Dongjoon Hyun*
> Gengliang Wang
> Yikun Jiang
> Tom Graves *
> Holden Karau *
>
> 0: None
> (Tom has voiced some architectural concerns)
>
> -1: None
>
> (* = binding)
>
> The next step is that we are going to create a high level design doc,
> which will give clarity on the design and should (hopefully) take away any
> remaining concerns.
>
> Thank you all for chiming in and your votes!
>
> Cheers,
> Herman
>


Re: Stickers and Swag

2022-06-14 Thread Hyukjin Kwon
Woohoo

On Tue, 14 Jun 2022 at 15:04, Xiao Li  wrote:

> Hi, all,
>
> The ASF has an official store at RedBubble
>  that Apache Community
> Development (ComDev) runs. If you are interested in buying Spark Swag, 70
> products featuring the Spark logo are available:
> https://www.redbubble.com/shop/ap/113203780
>
> Go Spark!
>
> Xiao
>


Re: [VOTE][SPIP] Spark Connect

2022-06-13 Thread Hyukjin Kwon
+1

On Tue, 14 Jun 2022 at 08:50, Yuming Wang  wrote:

> +1.
>
> On Tue, Jun 14, 2022 at 2:20 AM Matei Zaharia 
> wrote:
>
>> +1, very excited about this direction.
>>
>> Matei
>>
>> On Jun 13, 2022, at 11:07 AM, Herman van Hovell <
>> her...@databricks.com.INVALID> wrote:
>>
>> Let me kick off the voting...
>>
>> +1
>>
>> On Mon, Jun 13, 2022 at 2:02 PM Herman van Hovell 
>> wrote:
>>
>>> Hi all,
>>>
>>> I’d like to start a vote for SPIP: "Spark Connect"
>>>
>>> The goal of the SPIP is to introduce a Dataframe based client/server
>>> API for Spark
>>>
>>> Please also refer to:
>>>
>>> - Previous discussion in dev mailing list: [DISCUSS] SPIP: Spark
>>> Connect - A client and server interface for Apache Spark.
>>> 
>>> - Design doc: Spark Connect - A client and server interface for Apache
>>> Spark.
>>> 
>>> - JIRA: SPARK-39375 
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Kind Regards,
>>> Herman
>>>
>>
>>


Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-08 Thread Hyukjin Kwon
Okay. Thankfully the binary release is fine per
https://github.com/apache/spark/blob/v3.3.0-rc5/dev/create-release/release-build.sh#L268
.
The source package (and GitHub tag) has 3.3.0.dev0, and the binary package
has 3.3.0. Technically this is not a blocker now because the PyPI upload
can still be made correctly. I lowered the priority to critical. I switched
my -1 to 0.

On Wed, 8 Jun 2022 at 15:17, Hyukjin Kwon  wrote:

> Arrrgh... I am very sorry that I found this problem late.
> RC 5 does not have the correct version of PySpark, see
> https://github.com/apache/spark/blob/v3.3.0-rc5/python/pyspark/version.py#L19
> I think the release script was broken because the version now has 'str'
> type, see
> https://github.com/apache/spark/blob/v3.3.0-rc5/dev/create-release/release-tag.sh#L88
> I filed a JIRA at https://issues.apache.org/jira/browse/SPARK-39411
>
> -1 from me
>
>
>
> On Wed, 8 Jun 2022 at 13:16, Cheng Pan  wrote:
>
>> +1 (non-binding)
>>
>> * Verified SPARK-39313 has been address[1]
>> * Passed integration test w/ Apache Kyuubi (Incubating)[2]
>>
>> [1] https://github.com/housepower/spark-clickhouse-connector/pull/123
>> [2] https://github.com/apache/incubator-kyuubi/pull/2817
>>
>> Thanks,
>> Cheng Pan
>>
>> On Wed, Jun 8, 2022 at 7:04 AM Chris Nauroth  wrote:
>> >
>> > +1 (non-binding)
>> >
>> > * Verified all checksums.
>> > * Verified all signatures.
>> > * Built from source, with multiple profiles, to full success, for Java
>> 11 and Scala 2.13:
>> > * build/mvn -Phadoop-3 -Phadoop-cloud -Phive-thriftserver
>> -Pkubernetes -Pscala-2.13 -Psparkr -Pyarn -DskipTests clean package
>> > * Tests passed.
>> > * Ran several examples successfully:
>> > * bin/spark-submit --class org.apache.spark.examples.SparkPi
>> examples/jars/spark-examples_2.12-3.3.0.jar
>> > * bin/spark-submit --class
>> org.apache.spark.examples.sql.hive.SparkHiveExample
>> examples/jars/spark-examples_2.12-3.3.0.jar
>> > * bin/spark-submit
>> examples/src/main/python/streaming/network_wordcount.py localhost 
>> > * Tested some of the issues that blocked prior release candidates:
>> > * bin/spark-sql -e 'SELECT (SELECT IF(x, 1, 0)) AS a FROM (SELECT
>> true) t(x) UNION SELECT 1 AS a;'
>> > * bin/spark-sql -e "select date '2018-11-17' > 1"
>> > * SPARK-39293 ArrayAggregate fix
>> >
>> > Chris Nauroth
>> >
>> >
>> > On Tue, Jun 7, 2022 at 1:30 PM Cheng Su  wrote:
>> >>
>> >> +1 (non-binding). Built and ran some internal test for Spark SQL.
>> >>
>> >>
>> >>
>> >> Thanks,
>> >>
>> >> Cheng Su
>> >>
>> >>
>> >>
>> >> From: L. C. Hsieh 
>> >> Date: Tuesday, June 7, 2022 at 1:23 PM
>> >> To: dev 
>> >> Subject: Re: [VOTE] Release Spark 3.3.0 (RC5)
>> >>
>> >> +1
>> >>
>> >> Liang-Chi
>> >>
>> >> On Tue, Jun 7, 2022 at 1:03 PM Gengliang Wang 
>> wrote:
>> >> >
>> >> > +1 (non-binding)
>> >> >
>> >> > Gengliang
>> >> >
>> >> > On Tue, Jun 7, 2022 at 12:24 PM Thomas Graves 
>> wrote:
>> >> >>
>> >> >> +1
>> >> >>
>> >> >> Tom Graves
>> >> >>
>> >> >> On Sat, Jun 4, 2022 at 9:50 AM Maxim Gekk
>> >> >>  wrote:
>> >> >> >
>> >> >> > Please vote on releasing the following candidate as Apache Spark
>> version 3.3.0.
>> >> >> >
>> >> >> > The vote is open until 11:59pm Pacific time June 8th and passes
>> if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >> >> >
>> >> >> > [ ] +1 Release this package as Apache Spark 3.3.0
>> >> >> > [ ] -1 Do not release this package because ...
>> >> >> >
>> >> >> > To learn more about Apache Spark, please see
>> http://spark.apache.org/
>> >> >> >
>> >> >> > The tag to be voted on is v3.3.0-rc5 (commit
>> 7cf29705272ab8e8c70e8885a3664ad8ae3cd5e9):
>> >> >> > https://github.com/apache/spark/tree/v3.3.0-rc5
>> >> >> >
>> >> >> > The release files, including signatures, digests, etc. can be
>> found at:
>&

Re: [VOTE] Release Spark 3.3.0 (RC5)

2022-06-08 Thread Hyukjin Kwon
Arrrgh... I am very sorry that I found this problem late.
RC 5 does not have the correct version of PySpark, see
https://github.com/apache/spark/blob/v3.3.0-rc5/python/pyspark/version.py#L19
I think the release script was broken because the version now has 'str'
type, see
https://github.com/apache/spark/blob/v3.3.0-rc5/dev/create-release/release-tag.sh#L88
I filed a JIRA at https://issues.apache.org/jira/browse/SPARK-39411

-1 from me



On Wed, 8 Jun 2022 at 13:16, Cheng Pan  wrote:

> +1 (non-binding)
>
> * Verified SPARK-39313 has been address[1]
> * Passed integration test w/ Apache Kyuubi (Incubating)[2]
>
> [1] https://github.com/housepower/spark-clickhouse-connector/pull/123
> [2] https://github.com/apache/incubator-kyuubi/pull/2817
>
> Thanks,
> Cheng Pan
>
> On Wed, Jun 8, 2022 at 7:04 AM Chris Nauroth  wrote:
> >
> > +1 (non-binding)
> >
> > * Verified all checksums.
> > * Verified all signatures.
> > * Built from source, with multiple profiles, to full success, for Java
> 11 and Scala 2.13:
> > * build/mvn -Phadoop-3 -Phadoop-cloud -Phive-thriftserver
> -Pkubernetes -Pscala-2.13 -Psparkr -Pyarn -DskipTests clean package
> > * Tests passed.
> > * Ran several examples successfully:
> > * bin/spark-submit --class org.apache.spark.examples.SparkPi
> examples/jars/spark-examples_2.12-3.3.0.jar
> > * bin/spark-submit --class
> org.apache.spark.examples.sql.hive.SparkHiveExample
> examples/jars/spark-examples_2.12-3.3.0.jar
> > * bin/spark-submit
> examples/src/main/python/streaming/network_wordcount.py localhost 
> > * Tested some of the issues that blocked prior release candidates:
> > * bin/spark-sql -e 'SELECT (SELECT IF(x, 1, 0)) AS a FROM (SELECT
> true) t(x) UNION SELECT 1 AS a;'
> > * bin/spark-sql -e "select date '2018-11-17' > 1"
> > * SPARK-39293 ArrayAggregate fix
> >
> > Chris Nauroth
> >
> >
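
The checksum and signature checks mentioned in the verification list above are
typically done along these lines (a sketch; the file names are illustrative, so
use the actual artifacts from the RC bin directory and the KEYS file linked in
the vote email):

    # Import the release signing keys, then verify the signature and the digest
    gpg --import KEYS
    gpg --verify spark-3.3.0-bin-hadoop3.tgz.asc spark-3.3.0-bin-hadoop3.tgz
    sha512sum spark-3.3.0-bin-hadoop3.tgz   # compare with the published .sha512 file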
> > On Tue, Jun 7, 2022 at 1:30 PM Cheng Su  wrote:
> >>
> >> +1 (non-binding). Built and ran some internal test for Spark SQL.
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Cheng Su
> >>
> >>
> >>
> >> From: L. C. Hsieh 
> >> Date: Tuesday, June 7, 2022 at 1:23 PM
> >> To: dev 
> >> Subject: Re: [VOTE] Release Spark 3.3.0 (RC5)
> >>
> >> +1
> >>
> >> Liang-Chi
> >>
> >> On Tue, Jun 7, 2022 at 1:03 PM Gengliang Wang  wrote:
> >> >
> >> > +1 (non-binding)
> >> >
> >> > Gengliang
> >> >
> >> > On Tue, Jun 7, 2022 at 12:24 PM Thomas Graves 
> wrote:
> >> >>
> >> >> +1
> >> >>
> >> >> Tom Graves
> >> >>
> >> >> On Sat, Jun 4, 2022 at 9:50 AM Maxim Gekk
> >> >>  wrote:
> >> >> >
> >> >> > Please vote on releasing the following candidate as Apache Spark
> version 3.3.0.
> >> >> >
> >> >> > The vote is open until 11:59pm Pacific time June 8th and passes if
> a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >> >> >
> >> >> > [ ] +1 Release this package as Apache Spark 3.3.0
> >> >> > [ ] -1 Do not release this package because ...
> >> >> >
> >> >> > To learn more about Apache Spark, please see
> http://spark.apache.org/
> >> >> >
> >> >> > The tag to be voted on is v3.3.0-rc5 (commit
> 7cf29705272ab8e8c70e8885a3664ad8ae3cd5e9):
> >> >> > https://github.com/apache/spark/tree/v3.3.0-rc5
> >> >> >
> >> >> > The release files, including signatures, digests, etc. can be
> found at:
> >> >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-bin/
> >> >> >
> >> >> > Signatures used for Spark RCs can be found in this file:
> >> >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >> >> >
> >> >> > The staging repository for this release can be found at:
> >> >> >
> https://repository.apache.org/content/repositories/orgapachespark-1406
> >> >> >
> >> >> > The documentation corresponding to this release can be found at:
> >> >> > https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc5-docs/
> >> >> >
> >> >> > The list of bug fixes going into 3.3.0 can be found at the
> following URL:
> >> >> > https://issues.apache.org/jira/projects/SPARK/versions/12350369
> >> >> >
> >> >> > This release is using the release script of the tag v3.3.0-rc5.
> >> >> >
> >> >> >
> >> >> > FAQ
> >> >> >
> >> >> > =
> >> >> > How can I help test this release?
> >> >> > =
> >> >> > If you are a Spark user, you can help us test this release by
> taking
> >> >> > an existing Spark workload and running on this release candidate,
> then
> >> >> > reporting any regressions.
> >> >> >
> >> >> > If you're working in PySpark you can set up a virtual env and
> install
> >> >> > the current RC and see if anything important breaks, in the
> Java/Scala
> >> >> > you can add the staging repository to your projects resolvers and
> test
> >> >> > with the RC (make sure to clean up the artifact cache before/after
> so
> >> >> > you don't end up building with a out of date RC going forward).
> >> >> >
> >> >> > ===
> >> >> > What should happen to JIRA tickets still targeting 3.3.0?
> >> >> > 

Please stop creating new JIRA version such as 3.4

2022-06-06 Thread Hyukjin Kwon
Hi all,

I see some people repeatedly create new versions such as "3.4" (it has to
be "3.4.0") in JIRA.
[image: Screen Shot 2022-06-07 at 2.29.02 PM.png]

I manually check, remove and reassign them but I think it's the fifth time
IIRC.

Please avoid creating a new version such as 3.4 without the maintenance
version specified.


Re: [DISCUSS] SPIP: Spark Connect - A client and server interface for Apache Spark.

2022-06-06 Thread Hyukjin Kwon
What I like most about this SPIP:
1. We could leverage it to dispatch the driver to the cluster (e.g.,
yarn-cluster or K8S cluster mode) with an interactive shell, which Spark
currently doesn't support.
2. It makes it easier to support other languages, especially given that we
talked about languages like Go or .NET in the past.

While for 1. I don't think we can (or should) implement the whole API, and for
2. the details would have to be discussed thoroughly, I think it is a good idea
to have this layer.



On Mon, 6 Jun 2022 at 17:47, Martin Grund
 wrote:

> Hi Mich,
>
> I think I must have been not clear enough in the document. The proposal is
> not for connecting Spark to other engines but to connect to Spark from
> other clients remotely (without using SQL)
>
> Please let me know if that clarifies things or if I can provide additional
> context.
>
> Thanks
> Martin
>
> On Sun 5. Jun 2022 at 16:38 Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Whilst I concur that there is a need for a client-server architecture, that
>> technology has been around for over 30 years. Moreover, the current Spark has
>> very efficient connections via JDBC to various databases. In some cases the
>> API to various databases, for example Google BigQuery, is very efficient. I
>> am not sure what this proposal is trying to address?
>>
>> HTH
>>
>> On Fri, 3 Jun 2022 at 18:46, Martin Grund  wrote:
>>
>>> Hi Everyone,
>>>
>>> We would like to start a discussion on the "Spark Connect" proposal.
>>> Please find the links below:
>>>
>>> *JIRA* - https://issues.apache.org/jira/browse/SPARK-39375
>>> *SPIP Document* -
>>> https://docs.google.com/document/d/1Mnl6jmGszixLW4KcJU5j9IgpG9-UabS0dcM6PM2XGDc/edit#heading=h.wmsrrfealhrj
>>>
>>> *Excerpt from the document: *
>>>
>>> We propose to extend Apache Spark by building on the DataFrame API and
>>> the underlying unresolved logical plans. The DataFrame API is widely used
>>> and makes it very easy to iteratively express complex logic. We will
>>> introduce Spark Connect, a remote option of the DataFrame API that
>>> separates the client from the Spark server. With Spark Connect, Spark will
>>> become decoupled, allowing for built-in remote connectivity: The decoupled
>>> client SDK can be used to run interactive data exploration and connect to
>>> the server for DataFrame operations.
>>>
>>> Spark Connect will benefit Spark developers in different ways: The
>>> decoupled architecture will result in improved stability, as clients are
>>> separated from the driver. From the Spark Connect client perspective, Spark
>>> will be (almost) versionless, and thus enable seamless upgradability, as
>>> server APIs can evolve without affecting the client API. The decoupled
>>> client-server architecture can be leveraged to build close integrations
>>> with local developer tooling. Finally, separating the client process from
>>> the Spark server process will improve Spark’s overall security posture by
>>> avoiding the tight coupling of the client inside the Spark runtime
>>> environment.
>>>
>>> Spark Connect will strengthen Spark’s position as the modern unified
>>> engine for large-scale data analytics and expand applicability to use cases
>>> and developers we could not reach with the current setup: Spark will become
>>> ubiquitously usable as the DataFrame API can be used with (almost) any
>>> programming language.
>>>
>>> We would like to start a discussion on the document and any feedback is
>>> welcome!
>>>
>>> Thanks a lot in advance,
>>> Martin
>>>
>> --
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>


Re: Re: Re: Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread Hyukjin Kwon
There might be other blockers. Let's wait and see.

On Tue, May 17, 2022 at 8:59 PM beliefer  wrote:

> OK. Let it go into 3.3.1
>
>
> On 2022-05-17 18:59:13, "Hyukjin Kwon"  wrote:
>
> I think most users won't be affected since aggregate pushdown is disabled
> by default.
>
> On Tue, 17 May 2022 at 19:53, beliefer  wrote:
>
>> If we do not include https://github.com/apache/spark/pull/36556, we will
>> introduce a breaking change when we merge it into 3.3.1
>>
>> At 2022-05-17 18:26:12, "Hyukjin Kwon"  wrote:
>>
>> We need to add https://github.com/apache/spark/pull/36556 to RC2.
>>
>> We will likely have to change the version being added if RC2 passes.
>> Since this is a new API/improvement, I would prefer to not block the
>> release by that.
>>
>> On Tue, 17 May 2022 at 19:19, beliefer  wrote:
>>
>>> We need to add https://github.com/apache/spark/pull/36556 to RC2.
>>>
>>>
>>> On 2022-05-17 17:37:13, "Hyukjin Kwon"  wrote:
>>>
>>> That seems like a test-only issue. I made a quick followup at
>>> https://github.com/apache/spark/pull/36576.
>>>
>>> On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:
>>>
>>>> I'm still seeing failures related to the function registry, like:
>>>>
>>>> ExpressionsSchemaSuite:
>>>> - Check schemas for expression examples *** FAILED ***
>>>>   396 did not equal 398 Expected 396 blocks in result file but got 398.
>>>> Try regenerating the result files. (ExpressionsSchemaSuite.scala:161)
>>>>
>>>> - SPARK-14415: All functions should have own descriptions *** FAILED ***
>>>>   "Function: bloom_filter_aggClass:
>>>> org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage:
>>>> N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in
>>>> the result) (QueryTest.scala:54)
>>>>
>>>> There seems to be consistently a difference of 2 in the list of
>>>> expected functions and actual. I haven't looked closely, don't know this
>>>> code. I'm on Ubuntu 22.04. Anyone else seeing something like this?
>>>> Wondering if it's something weird to do with case sensitivity, hidden files
>>>> lurking somewhere, etc.
>>>>
>>>> I suspect it's not a 'real' error as the Linux-based testers work fine,
>>>> but I also can't think of why this is failing.
>>>>
>>>>
>>>>
>>>> On Mon, May 16, 2022 at 7:44 AM Maxim Gekk
>>>>  wrote:
>>>>
>>>>> Please vote on releasing the following candidate as
>>>>> Apache Spark version 3.3.0.
>>>>>
>>>>> The vote is open until 11:59pm Pacific time May 19th and passes if a
>>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 3.3.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v3.3.0-rc2 (commit
>>>>> c8c657b922ac8fd8dcf9553113e11a80079db059):
>>>>> https://github.com/apache/spark/tree/v3.3.0-rc2
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1403
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
>>>>>
>>>>> The list of bug fixes going into 3.3.0 can be found at the following
>>>>> URL:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>>>
>>>>> This release is using the release script of the tag v3.3.0-rc2.
>>>>>
>>>>>
>>>>> FAQ
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>&

Re: Re: Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread Hyukjin Kwon
And it seems like it won't break anything, because adding a new method won't break
binary compatibility.

On Tue, 17 May 2022 at 19:59, Hyukjin Kwon  wrote:

> I think most users won't be affected since aggregate pushdown is disabled
> by default.
>
> On Tue, 17 May 2022 at 19:53, beliefer  wrote:
>
>> If we do not include https://github.com/apache/spark/pull/36556, we will
>> introduce a breaking change when we merge it into 3.3.1
>>
>> At 2022-05-17 18:26:12, "Hyukjin Kwon"  wrote:
>>
>> We need to add https://github.com/apache/spark/pull/36556 to RC2.
>>
>> We will likely have to change the version being added if RC2 passes.
>> Since this is a new API/improvement, I would prefer to not block the
>> release by that.
>>
>> On Tue, 17 May 2022 at 19:19, beliefer  wrote:
>>
>>> We need to add https://github.com/apache/spark/pull/36556 to RC2.
>>>
>>>
>>> On 2022-05-17 17:37:13, "Hyukjin Kwon"  wrote:
>>>
>>> That seems like a test-only issue. I made a quick followup at
>>> https://github.com/apache/spark/pull/36576.
>>>
>>> On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:
>>>
>>>> I'm still seeing failures related to the function registry, like:
>>>>
>>>> ExpressionsSchemaSuite:
>>>> - Check schemas for expression examples *** FAILED ***
>>>>   396 did not equal 398 Expected 396 blocks in result file but got 398.
>>>> Try regenerating the result files. (ExpressionsSchemaSuite.scala:161)
>>>>
>>>> - SPARK-14415: All functions should have own descriptions *** FAILED ***
>>>>   "Function: bloom_filter_aggClass:
>>>> org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage:
>>>> N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in
>>>> the result) (QueryTest.scala:54)
>>>>
>>>> There seems to be consistently a difference of 2 in the list of
>>>> expected functions and actual. I haven't looked closely, don't know this
>>>> code. I'm on Ubuntu 22.04. Anyone else seeing something like this?
>>>> Wondering if it's something weird to do with case sensitivity, hidden files
>>>> lurking somewhere, etc.
>>>>
>>>> I suspect it's not a 'real' error as the Linux-based testers work fine,
>>>> but I also can't think of why this is failing.
>>>>
>>>>
>>>>
>>>> On Mon, May 16, 2022 at 7:44 AM Maxim Gekk
>>>>  wrote:
>>>>
>>>>> Please vote on releasing the following candidate as
>>>>> Apache Spark version 3.3.0.
>>>>>
>>>>> The vote is open until 11:59pm Pacific time May 19th and passes if a
>>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 3.3.0
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v3.3.0-rc2 (commit
>>>>> c8c657b922ac8fd8dcf9553113e11a80079db059):
>>>>> https://github.com/apache/spark/tree/v3.3.0-rc2
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1403
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
>>>>>
>>>>> The list of bug fixes going into 3.3.0 can be found at the following
>>>>> URL:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>>>
>>>>> This release is using the release script of the tag v3.3.0-rc2.
>>>>>
>>>>>
>>>>> FAQ
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>> If you are a Spark user, you can help us test this release by ta

Re: Re: Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread Hyukjin Kwon
I think most users won't be affected since aggregate pushdown is disabled
by default.

On Tue, 17 May 2022 at 19:53, beliefer  wrote:

> If we do not include https://github.com/apache/spark/pull/36556, we will
> introduce a breaking change when we merge it into 3.3.1
>
> At 2022-05-17 18:26:12, "Hyukjin Kwon"  wrote:
>
> We need to add https://github.com/apache/spark/pull/36556 to RC2.
>
> We will likely have to change the version being added if RC2 passes.
> Since this is a new API/improvement, I would prefer to not block the
> release by that.
>
> On Tue, 17 May 2022 at 19:19, beliefer  wrote:
>
>> We need to add https://github.com/apache/spark/pull/36556 to RC2.
>>
>>
>> On 2022-05-17 17:37:13, "Hyukjin Kwon"  wrote:
>>
>> That seems like a test-only issue. I made a quick followup at
>> https://github.com/apache/spark/pull/36576.
>>
>> On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:
>>
>>> I'm still seeing failures related to the function registry, like:
>>>
>>> ExpressionsSchemaSuite:
>>> - Check schemas for expression examples *** FAILED ***
>>>   396 did not equal 398 Expected 396 blocks in result file but got 398.
>>> Try regenerating the result files. (ExpressionsSchemaSuite.scala:161)
>>>
>>> - SPARK-14415: All functions should have own descriptions *** FAILED ***
>>>   "Function: bloom_filter_aggClass:
>>> org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage:
>>> N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in
>>> the result) (QueryTest.scala:54)
>>>
>>> There seems to be consistently a difference of 2 in the list of expected
>>> functions and actual. I haven't looked closely, don't know this code. I'm
>>> on Ubuntu 22.04. Anyone else seeing something like this? Wondering if it's
>>> something weird to do with case sensitivity, hidden files lurking
>>> somewhere, etc.
>>>
>>> I suspect it's not a 'real' error as the Linux-based testers work fine,
>>> but I also can't think of why this is failing.
>>>
>>>
>>>
>>> On Mon, May 16, 2022 at 7:44 AM Maxim Gekk
>>>  wrote:
>>>
>>>> Please vote on releasing the following candidate as
>>>> Apache Spark version 3.3.0.
>>>>
>>>> The vote is open until 11:59pm Pacific time May 19th and passes if a
>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>
>>>> [ ] +1 Release this package as Apache Spark 3.3.0
>>>> [ ] -1 Do not release this package because ...
>>>>
>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>
>>>> The tag to be voted on is v3.3.0-rc2 (commit
>>>> c8c657b922ac8fd8dcf9553113e11a80079db059):
>>>> https://github.com/apache/spark/tree/v3.3.0-rc2
>>>>
>>>> The release files, including signatures, digests, etc. can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
>>>>
>>>> Signatures used for Spark RCs can be found in this file:
>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>
>>>> The staging repository for this release can be found at:
>>>> https://repository.apache.org/content/repositories/orgapachespark-1403
>>>>
>>>> The documentation corresponding to this release can be found at:
>>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
>>>>
>>>> The list of bug fixes going into 3.3.0 can be found at the following
>>>> URL:
>>>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>>
>>>> This release is using the release script of the tag v3.3.0-rc2.
>>>>
>>>>
>>>> FAQ
>>>>
>>>> =
>>>> How can I help test this release?
>>>> =
>>>> If you are a Spark user, you can help us test this release by taking
>>>> an existing Spark workload and running on this release candidate, then
>>>> reporting any regressions.
>>>>
>>>> If you're working in PySpark you can set up a virtual env and install
>>>> the current RC and see if anything important breaks, in the Java/Scala
>>>> you can add the staging repository to your projects resolvers and test
>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>> you d

Re: Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread Hyukjin Kwon
We need to add https://github.com/apache/spark/pull/36556 to RC2.

We will likely have to change the version being added if RC2 passes.
Since this is a new API/improvement, I would prefer to not block the
release by that.

On Tue, 17 May 2022 at 19:19, beliefer  wrote:

> We need to add https://github.com/apache/spark/pull/36556 to RC2.
>
>
> On 2022-05-17 17:37:13, "Hyukjin Kwon"  wrote:
>
> That seems like a test-only issue. I made a quick followup at
> https://github.com/apache/spark/pull/36576.
>
> On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:
>
>> I'm still seeing failures related to the function registry, like:
>>
>> ExpressionsSchemaSuite:
>> - Check schemas for expression examples *** FAILED ***
>>   396 did not equal 398 Expected 396 blocks in result file but got 398.
>> Try regenerating the result files. (ExpressionsSchemaSuite.scala:161)
>>
>> - SPARK-14415: All functions should have own descriptions *** FAILED ***
>>   "Function: bloom_filter_aggClass:
>> org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage:
>> N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in
>> the result) (QueryTest.scala:54)
>>
>> There seems to be consistently a difference of 2 in the list of expected
>> functions and actual. I haven't looked closely, don't know this code. I'm
>> on Ubuntu 22.04. Anyone else seeing something like this? Wondering if it's
>> something weird to do with case sensitivity, hidden files lurking
>> somewhere, etc.
>>
>> I suspect it's not a 'real' error as the Linux-based testers work fine,
>> but I also can't think of why this is failing.
>>
>>
>>
>> On Mon, May 16, 2022 at 7:44 AM Maxim Gekk
>>  wrote:
>>
>>> Please vote on releasing the following candidate as
>>> Apache Spark version 3.3.0.
>>>
>>> The vote is open until 11:59pm Pacific time May 19th and passes if a
>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.3.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.3.0-rc2 (commit
>>> c8c657b922ac8fd8dcf9553113e11a80079db059):
>>> https://github.com/apache/spark/tree/v3.3.0-rc2
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1403
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
>>>
>>> The list of bug fixes going into 3.3.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>>
>>> This release is using the release script of the tag v3.3.0-rc2.
>>>
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with a out of date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.3.0?
>>> ===
>>> The current list of open tickets targeted at 3.3.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.3.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>> Maxim Gekk
>>>
>>> Software Engineer
>>>
>>> Databricks, Inc.
>>>
>>
>
>
>


Re: [VOTE] Release Spark 3.3.0 (RC2)

2022-05-17 Thread Hyukjin Kwon
That seems like a test-only issue. I made a quick followup at
https://github.com/apache/spark/pull/36576.
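
For anyone hitting the "Try regenerating the result files" failure quoted below,
the golden files are normally regenerated by rerunning the affected suite with the
golden-file flag set, roughly as follows (the suite name here is illustrative):

    SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *ExpressionsSchemaSuite"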

On Tue, 17 May 2022 at 03:56, Sean Owen  wrote:

> I'm still seeing failures related to the function registry, like:
>
> ExpressionsSchemaSuite:
> - Check schemas for expression examples *** FAILED ***
>   396 did not equal 398 Expected 396 blocks in result file but got 398.
> Try regenerating the result files. (ExpressionsSchemaSuite.scala:161)
>
> - SPARK-14415: All functions should have own descriptions *** FAILED ***
>   "Function: bloom_filter_aggClass:
> org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregateUsage:
> N/A." contained "N/A." Failed for [function_desc: string] (N/A. existed in
> the result) (QueryTest.scala:54)
>
> There seems to be consistently a difference of 2 in the list of expected
> functions and actual. I haven't looked closely, don't know this code. I'm
> on Ubuntu 22.04. Anyone else seeing something like this? Wondering if it's
> something weird to do with case sensitivity, hidden files lurking
> somewhere, etc.
>
> I suspect it's not a 'real' error as the Linux-based testers work fine,
> but I also can't think of why this is failing.
>
>
>
> On Mon, May 16, 2022 at 7:44 AM Maxim Gekk
>  wrote:
>
>> Please vote on releasing the following candidate as
>> Apache Spark version 3.3.0.
>>
>> The vote is open until 11:59pm Pacific time May 19th and passes if a
>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.0
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.3.0-rc2 (commit
>> c8c657b922ac8fd8dcf9553113e11a80079db059):
>> https://github.com/apache/spark/tree/v3.3.0-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1403
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc2-docs/
>>
>> The list of bug fixes going into 3.3.0 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>>
>> This release is using the release script of the tag v3.3.0-rc2.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with a out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.3.0?
>> ===
>> The current list of open tickets targeted at 3.3.0 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.3.0
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>> Maxim Gekk
>>
>> Software Engineer
>>
>> Databricks, Inc.
>>
>
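
The PySpark testing guidance quoted in the vote email above ("set up a virtual env
and install the current RC") usually amounts to something like the sketch below;
the tarball name is an assumption, so use the actual pyspark artifact from the RC
bin directory:

    # Create a clean environment and install the RC's PySpark package
    python -m venv /tmp/spark-rc-test
    source /tmp/spark-rc-test/bin/activate
    pip install pyspark-3.3.0.tar.gz

    # Minimal smoke test against the installed RC
    python -c "from pyspark.sql import SparkSession; spark = SparkSession.builder.master('local[2]').getOrCreate(); print(spark.range(10).count()); spark.stop()"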


Re: Introducing "Pandas API on Spark" component in JIRA, and use "PS" PR title component

2022-05-16 Thread Hyukjin Kwon
Thanks Ruifeng.

I added "Pandas API on Spark" component JIRA (and archived "jenkins"
component since we don't have the legacy Jenkins anymore).
Let me know if you guys have other opinions.

On Tue, 17 May 2022 at 12:59, Ruifeng Zheng  wrote:

> +1, I think it is a good idea
>
>
> -- Original Message --
> *From:* "Hyukjin Kwon" ;
> *Sent:* Tuesday, May 17, 2022, 11:26 AM
> *To:* "dev";
> *Cc:* "Yikun Jiang";"Xinrong Meng"<
> xinrong.m...@databricks.com>;"Xiao Li";"Takuya
> Ueshin";"Haejoon 
> Lee";"Ruifeng
> Zheng";
> *Subject:* Introducing "Pandas API on Spark" component in JIRA, and use "PS"
> PR title component
>
> Hi all,
>
> How about we introduce a component in JIRA, "Pandas API on Spark", and use
> "PS" (pandas-on-Spark) in PR titles? We already use "ps" in many places
> when we do: import pyspark.pandas as ps.
> This is similar to "Structured Streaming" in JIRA, and "SS" in PR title.
>
> I think it'd be easier to track the changes here with that. Currently it's
> a bit difficult to identify it from pure PySpark changes.
>
>


Introducing "Pandas API on Spark" component in JIRA, and use "PS" PR title component

2022-05-16 Thread Hyukjin Kwon
Hi all,

How about we introduce a component in JIRA, "Pandas API on Spark", and use
"PS" (pandas-on-Spark) in PR titles? We already use "ps" in many places
when we do: import pyspark.pandas as ps.
This is similar to "Structured Streaming" in JIRA, and "SS" in PR title.

I think it'd be easier to track the changes here with that. Currently it's
a bit difficult to identify it from pure PySpark changes.


Re: SIGMOD System Award for Apache Spark

2022-05-12 Thread Hyukjin Kwon
Awesome!

On Fri, May 13, 2022 at 5:29 AM Mosharaf Chowdhury 
wrote:

> Wow! Congratulations to everyone indeed.
>
> On Thu, May 12, 2022 at 3:44 PM Matei Zaharia 
> wrote:
>
>> Hi all,
>>
>> We recently found out that Apache Spark received
>>  the SIGMOD System Award this
>> year, given by SIGMOD (the ACM’s data management research organization) to
>> impactful real-world and research systems. This puts Spark in good company
>> with some very impressive previous recipients
>> . This award is
>> really an achievement by the whole community, so I wanted to say congrats
>> to everyone who contributes to Spark, whether through code, issue reports,
>> docs, or other means.
>>
>> Matei
>>
>


Re: Contributor data in github-page no longer updated after May 1

2022-05-11 Thread Hyukjin Kwon
It's very likely a GitHub issue

On Wed, 11 May 2022 at 18:01, Yang,Jie(INF)  wrote:

> Hi, teams
>
>
>
> The contributor data on the following page seems to have stopped updating after
> May 1. Can anyone fix it?
>
>
>
>
> https://github.com/apache/spark/graphs/contributors?from=2022-05-01&to=2022-05-11&type=c
>
>
>
> Warm regards,
>
> YangJie
>
>
>


Re: [VOTE] Release Spark 3.3.0 (RC1)

2022-05-10 Thread Hyukjin Kwon
I expect to see RC2 too. I guess he just sticks to the standard, leaving
the vote open till the end.
It hasn't got enough +1s anyway :-).

On Wed, 11 May 2022 at 10:17, Holden Karau  wrote:

> Technically releases don't follow vetoes (see
> https://www.apache.org/foundation/voting.html); it's up to the RM if they
> get the minimum number of binding +1s (although they are encouraged to
> cancel the release if any serious issues are raised).
>
> That being said I'll add my -1 based on the issues reported in this thread.
>
> On Tue, May 10, 2022 at 6:07 PM Sean Owen  wrote:
>
>> There's a -1 vote here, so I think this RC fails anyway.
>>
>> On Fri, May 6, 2022 at 10:30 AM Gengliang Wang  wrote:
>>
>>> Hi Maxim,
>>>
>>> Thanks for the work!
>>> There is a bug fix from Bruce merged on branch-3.3 right after the RC1
>>> is cut:
>>> SPARK-39093: Dividing interval by integral can result in codegen
>>> compilation error
>>> 
>>>
>>> So -1 from me. We should have RC2 to include the fix.
>>>
>>> Thanks
>>> Gengliang
>>>
>>> On Fri, May 6, 2022 at 6:15 PM Maxim Gekk
>>>  wrote:
>>>
 Hi Dongjoon,

  > https://issues.apache.org/jira/projects/SPARK/versions/12350369
 > Since RC1 is started, could you move them out from the 3.3.0
 milestone?

 I have removed the 3.3.0 label from Fix version(s). Thank you, Dongjoon.

 Maxim Gekk

 Software Engineer

 Databricks, Inc.


 On Fri, May 6, 2022 at 11:06 AM Dongjoon Hyun 
 wrote:

> Hi, Sean.
> It's interesting. I didn't see those failures from my side.
>
> Hi, Maxim.
> In the following link, there are 17 in-progress and 6 to-do JIRA
> issues which look irrelevant to this RC1 vote.
>
> https://issues.apache.org/jira/projects/SPARK/versions/12350369
>
> Since RC1 is started, could you move them out from the 3.3.0 milestone?
> Otherwise, we cannot distinguish new real blocker issues from those
> obsolete JIRA issues.
>
> Thanks,
> Dongjoon.
>
>
> On Thu, May 5, 2022 at 11:46 AM Adam Binford 
> wrote:
>
>> I looked back at the first one (SPARK-37618), it expects/assumes a
>> 0022 umask to correctly test the behavior. I'm not sure how to get that 
>> to
>> not fail or be ignored with a more open umask.
>>
>> On Thu, May 5, 2022 at 1:56 PM Sean Owen  wrote:
>>
>>> I'm seeing test failures; is anyone seeing ones like this? This is
>>> Java 8 / Scala 2.12 / Ubuntu 22.04:
>>>
>>> - SPARK-37618: Sub dirs are group writable when removing from
>>> shuffle service enabled *** FAILED ***
>>>   [OWNER_WRITE, GROUP_READ, GROUP_WRITE, GROUP_EXECUTE, OTHERS_READ,
>>> OWNER_READ, OTHERS_EXECUTE, OWNER_EXECUTE] contained GROUP_WRITE
>>> (DiskBlockManagerSuite.scala:155)
>>>
>>> - Check schemas for expression examples *** FAILED ***
>>>   396 did not equal 398 Expected 396 blocks in result file but got
>>> 398. Try regenerating the result files. 
>>> (ExpressionsSchemaSuite.scala:161)
>>>
>>>  Function 'bloom_filter_agg', Expression class
>>> 'org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregate'
>>> "" did not start with "
>>>   Examples:
>>>   " (ExpressionInfoSuite.scala:142)
>>>
>>> On Thu, May 5, 2022 at 6:01 AM Maxim Gekk
>>>  wrote:
>>>
 Please vote on releasing the following candidate as Apache Spark
  version 3.3.0.

 The vote is open until 11:59pm Pacific time May 10th and passes if
 a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.

 [ ] +1 Release this package as Apache Spark 3.3.0
 [ ] -1 Do not release this package because ...

 To learn more about Apache Spark, please see http://spark.apache.org/

 The tag to be voted on is v3.3.0-rc1 (commit
 482b7d54b522c4d1e25f3e84eabbc78126f22a3d):
 https://github.com/apache/spark/tree/v3.3.0-rc1

 The release files, including signatures, digests, etc. can be
 found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-bin/

 Signatures used for Spark RCs can be found in this file:
 https://dist.apache.org/repos/dist/dev/spark/KEYS

 The staging repository for this release can be found at:

 https://repository.apache.org/content/repositories/orgapachespark-1402

 The documentation corresponding to this release can be found at:
 https://dist.apache.org/repos/dist/dev/spark/v3.3.0-rc1-docs/

 The list of bug fixes going into 3.3.0 can be found at the
 following URL:
 https://issues.apache.org/jira/projects/SPARK/versions/12350369

 This release is using the release script of the tag 

Re: PR builder not working now

2022-04-19 Thread Hyukjin Kwon
It's fixed now.

On Tue, 19 Apr 2022 at 08:33, Hyukjin Kwon  wrote:

> It's still persistent. I will send an email to GitHub support today
>
> On Wed, 13 Apr 2022 at 11:04, Dongjoon Hyun 
> wrote:
>
>> Thank you for sharing that information!
>>
>> Bests
>> Dongjoon.
>>
>>
>> On Mon, Apr 11, 2022 at 10:29 PM Hyukjin Kwon 
>> wrote:
>>
>>> Hi all,
>>>
>>> There is a bug in GitHub Actions' RESTful API (see
>>> https://github.com/HyukjinKwon/spark/actions?query=branch%3Adebug-ga-detection
>>> as an example).
>>> So, currently OSS PR builder doesn't work properly with showing a screen
>>> such as
>>> https://github.com/apache/spark/pull/36157/checks?check_run_id=5984075130
>>> because we rely on that.
>>>
>>> To check the PR builder's status, we should manually find the workflow
>>> run in PR author's repository for now by going to:
>>> https://github.com/[PR AUTHOR
>>> ID]/spark/actions/workflows/build_and_test.yml
>>>
>>


Re: PR builder not working now

2022-04-18 Thread Hyukjin Kwon
It's still persistent. I will send an email to GitHub support today

On Wed, 13 Apr 2022 at 11:04, Dongjoon Hyun  wrote:

> Thank you for sharing that information!
>
> Bests
> Dongjoon.
>
>
> On Mon, Apr 11, 2022 at 10:29 PM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> There is a bug in GitHub Actions' RESTful API (see
>> https://github.com/HyukjinKwon/spark/actions?query=branch%3Adebug-ga-detection
>> as an example).
>> So, currently OSS PR builder doesn't work properly with showing a screen
>> such as
>> https://github.com/apache/spark/pull/36157/checks?check_run_id=5984075130
>> because we rely on that.
>>
>> To check the PR builder's status, we should manually find the workflow
>> run in PR author's repository for now by going to: https://github.com/[PR
>> AUTHOR ID]/spark/actions/workflows/build_and_test.yml
>>
>


PR builder not working now

2022-04-11 Thread Hyukjin Kwon
Hi all,

There is a bug in GitHub Actions' RESTful API (see
https://github.com/HyukjinKwon/spark/actions?query=branch%3Adebug-ga-detection
as an example).
So, currently OSS PR builder doesn't work properly with showing a screen
such as
https://github.com/apache/spark/pull/36157/checks?check_run_id=5984075130
because we rely on that.

To check the PR builder's status, we should manually find the workflow run
in PR author's repository for now by going to: https://github.com/[PR
AUTHOR ID]/spark/actions/workflows/build_and_test.yml


[DISCUSS] Rename 'SQL' to 'SQL / DataFrame', and 'Query' to 'Execution' in SQL UI page

2022-03-27 Thread Hyukjin Kwon
Hi all,

I have been investigating the improvements for Pandas API on Spark
specifically in UI.
I chatted with a couple of people, and decided to send an email here to
discuss more.

Currently, both SQL and DataFrame API are shown in “SQL” tab as below:

[image: Screen Shot 2022-03-25 at 12.18.14 PM.png]

which makes sense to developers because DataFrame API shares the same SQL
core but
I do believe this makes less sense to end users. Please consider two more
points:

   - Spark ML users will run DataFrame-based MLlib API, but they will have
   to check the "SQL" tab.
   - Pandas API on Spark arguably has no link to SQL itself conceptually.
   It makes less sense to users of pandas API.


So I would like to propose to rename:

   - "SQL" to "SQL/DataFrame"
   - "Query" to "Execution"


There's a PR open at https://github.com/apache/spark/pull/35973. Please
let me know your thoughts on this.

Thanks.


Re: Conda Python Env in K8S

2021-12-24 Thread Hyukjin Kwon
Can you share the logs, settings, environment, etc. and file a JIRA? There
are integration test cases for K8S support, and I myself also tested it
before.
It would be helpful if you try what I did at
https://databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html
and see if it works.
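
For reference, a minimal sketch of what that blog post describes (the file and
environment names here are made up, the tarball is assumed to have been produced
beforehand with conda-pack, and K8s cluster mode may need extra env-var plumbing
on the driver pod):

import os
from pyspark.sql import SparkSession

# Assumes "conda pack -f -o pyspark_conda_env.tar.gz" was run inside the conda env.
# Point the Python workers at the environment that will be unpacked on each node...
os.environ["PYSPARK_PYTHON"] = "./environment/bin/python"

# ...and ship the packed environment with the job. spark.archives is the
# programmatic equivalent of spark-submit --archives (SPARK-33530).
spark = (
    SparkSession.builder
    .config("spark.archives", "pyspark_conda_env.tar.gz#environment")
    .getOrCreate()
)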

On Mon, 6 Dec 2021 at 17:22, Bode, Meikel, NMA-CFD <
meikel.b...@bertelsmann.de> wrote:

> Hi Mich,
>
>
>
> Thanks for your response. Yes –py-files options works. I also tested it.
>
> The question is why the –archives option doesn’t?
>
>
>
> From Jira I can see that it should be available since 3.1.0:
>
>
>
> https://issues.apache.org/jira/browse/SPARK-33530
>
> https://issues.apache.org/jira/browse/SPARK-33615
>
>
>
> Best,
>
> Meikel
>
>
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* Samstag, 4. Dezember 2021 18:36
> *To:* Bode, Meikel, NMA-CFD 
> *Cc:* dev ; u...@spark.apache.org
> *Subject:* Re: Conda Python Env in K8S
>
>
>
>
> Hi Meikel
>
>
>
> In the past I tried with
>
>
>
>--py-files
> hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip \
>
>--archives
> hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/pyspark_venv.zip#pyspark_venv \
>
>
>
> which is basically what you are doing. the first line --py-files works but
> the second one fails
>
>
>
> It tries to unpack them:
>
>
>
> Unpacking an archive
> hdfs://50.140.197.220:9000/minikube/codes/pyspark_venv.zip#pyspark_venv
> from /tmp/spark-502a5b57-0fe6-45bd-867d-9738e678e9a3/pyspark_venv.zip
> to /opt/spark/work-dir/./pyspark_venv
>
>
>
> But it failed.
>
>
>
> This could be due to creating the virtual environment inside the docker in
> the work-dir, or sometimes when there is not enough available memory to
> gunzip and untar the file, especially if your executors are built on
> cluster nodes with less memory than the driver node.
>
>
>
> However, the most convenient way to add additional packages to the Docker
> image is to add them directly at the time of building the image. So external
> packages are bundled as a part of my Docker image because it is fixed, and if
> an application requires that set of dependencies every time, they are already
> there. Also note that every RUN statement creates an intermediate container
> and hence increases the build time, so it is advisable to install all
> packages in a single line, as follows:
>
> RUN pip install pyyaml numpy cx_Oracle --no-cache-dir
>
> The --no-cache-dir option to pip prevents the downloaded packages from being
> cached in the image, reducing the image size.
>
> Log in to the docker image and check for Python packages installed
>
> docker run -u 0 -it 
> spark/spark-py:3.1.1-scala_2.12-8-jre-slim-buster_java8PlusPackages bash
>
> root@5bc049af7278:/opt/spark/work-dir# pip list
>
> PackageVersion
>
> -- ---
>
> cx-Oracle  8.3.0
>
> numpy  1.21.4
>
> pip21.3.1
>
> PyYAML 6.0
>
> setuptools 59.4.0
>
> wheel  0.34.2
>
> HTH
>
>
>
>view my Linkedin profile
> 
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
>
>
> On Sat, 4 Dec 2021 at 07:52, Bode, Meikel, NMA-CFD <
> meikel.b...@bertelsmann.de> wrote:
>
> Hi Mich,
>
>
>
> sure, that's possible. But distributing the complete env would be more
> practical.
>
> A workaround at the moment is, that we build different environments and
> store them in a pv and then we mount it into the pods and refer from the
> SparkApplication resource to the desired env..
>
>
>
> But actually these options exist and I want to understand what the issue
> is…
>
> Any hints on that?
>
>
>
> Best,
>
> Meikel
>
>
>
> *From:* Mich Talebzadeh 
> *Sent:* 

Re: Hadoop profile change to hadoop-2 and hadoop-3 since Spark 3.3

2021-12-11 Thread Hyukjin Kwon
and @tgra...@apache.org  too

On Sat, 11 Dec 2021 at 21:38, Hyukjin Kwon  wrote:

> cc @Holden Karau  @DB Tsai  @Imran
> Rashid  @Mridul Muralidharan  FYI
>
> On Thu, 9 Dec 2021 at 14:07, angers zhu  wrote:
>
>> Hi all,
>>
>> Since Spark 3.2, we have supported Hadoop 3.3.1, but its profile name
>> is *hadoop-3.2* (and *hadoop-2.7*), which is no longer accurate.
>> So we made a change in https://github.com/apache/spark/pull/34715
>> Starting from Spark 3.3, we use hadoop profile *hadoop-2* and *hadoop-3 *,
>> and default hadoop profile is hadoop-3.
>> Profile changes
>>
>> *hadoop-2.7* changed to *hadoop-2*
>> *hadoop-3.2* changed to *hadoop-3*
>> Release tar file
>>
>> Spark-3.3.0 with profile hadoop-3: *spark-3.3.0-bin-hadoop3.tgz*
>> Spark-3.3.0 with profile hadoop-2: *spark-3.3.0-bin-hadoop2.tgz*
>>
>> For Spark 3.2.0, the release tar file was, for example,
>> *spark-3.2.0-bin-hadoop3.2.tgz*.
>> Pip install option changes
>>
>> For PySpark with/without a specific Hadoop version, you can install it by
>> using PYSPARK_HADOOP_VERSION environment variables as below (Hadoop 3):
>>
>> PYSPARK_HADOOP_VERSION=3 pip install pyspark
>>
>> For Hadoop 2:
>>
>> PYSPARK_HADOOP_VERSION=2 pip install pyspark
>>
>> Supported values in PYSPARK_HADOOP_VERSION are now:
>>
>>- without: Spark pre-built with user-provided Apache Hadoop
>>- 2: Spark pre-built for Apache Hadoop 2.
>>- 3: Spark pre-built for Apache Hadoop 3.3 and later (default)
>>
>> Building Spark and Specifying the Hadoop Version
>> <https://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version-and-enabling-yarn>
>>
>> You can specify the exact version of Hadoop to compile against through
>> the hadoop.version property.
>> Example:
>>
>> ./build/mvn -Pyarn -Dhadoop.version=3.3.0 -DskipTests clean package
>>
>> or you can specify *hadoop-3* profile
>>
>> ./build/mvn -Pyarn -Phadoop-3 -Dhadoop.version=3.3.0 -DskipTests clean 
>> package
>>
>> If you want to build with Hadoop 2.x, enable *hadoop-2* profile:
>>
>> ./build/mvn -Phadoop-2 -Pyarn -Dhadoop.version=2.8.5 -DskipTests clean 
>> package
>>
>> Notes
>>
>> In the current master, it will use the default Hadoop 3 if you continue
>> to use -Phadoop-2.7 and -Phadoop-3.2 to build Spark
>> because Maven or SBT will just warn and ignore these non-existent
>> profiles.
>> Please change profiles to -Phadoop-2 or -Phadoop-3.
>>
>


Re: Hadoop profile change to hadoop-2 and hadoop-3 since Spark 3.3

2021-12-11 Thread Hyukjin Kwon
cc @Holden Karau  @DB Tsai  @Imran
Rashid  @Mridul Muralidharan  FYI

On Thu, 9 Dec 2021 at 14:07, angers zhu  wrote:

> Hi all,
>
> Since Spark 3.2, we have supported Hadoop 3.3.1, but its profile name
> is *hadoop-3.2* (and *hadoop-2.7*), which is no longer accurate.
> So we made a change in https://github.com/apache/spark/pull/34715
> Starting from Spark 3.3, we use hadoop profile *hadoop-2* and *hadoop-3 *,
> and default hadoop profile is hadoop-3.
> Profile changes
>
> *hadoop-2.7* changed to *hadoop-2*
> *hadoop-3.2* changed to *hadoop-3*
> Release tar file
>
> Spark-3.3.0 with profile hadoop-3: *spark-3.3.0-bin-hadoop3.tgz*
> Spark-3.3.0 with profile hadoop-2: *spark-3.3.0-bin-hadoop2.tgz*
>
> For Spark 3.2.0, the release tar file was, for example,
> *spark-3.2.0-bin-hadoop3.2.tgz*.
> Pip install option changes
>
> For PySpark with/without a specific Hadoop version, you can install it by
> using PYSPARK_HADOOP_VERSION environment variables as below (Hadoop 3):
>
> PYSPARK_HADOOP_VERSION=3 pip install pyspark
>
> For Hadoop 2:
>
> PYSPARK_HADOOP_VERSION=2 pip install pyspark
>
> Supported values in PYSPARK_HADOOP_VERSION are now:
>
>- without: Spark pre-built with user-provided Apache Hadoop
>- 2: Spark pre-built for Apache Hadoop 2.
>- 3: Spark pre-built for Apache Hadoop 3.3 and later (default)
>
> Building Spark and Specifying the Hadoop Version
> 
>
> You can specify the exact version of Hadoop to compile against through the
> hadoop.version property.
> Example:
>
> ./build/mvn -Pyarn -Dhadoop.version=3.3.0 -DskipTests clean package
>
> or you can specify *hadoop-3* profile
>
> ./build/mvn -Pyarn -Phadoop-3 -Dhadoop.version=3.3.0 -DskipTests clean package
>
> If you want to build with Hadoop 2.x, enable *hadoop-2* profile:
>
> ./build/mvn -Phadoop-2 -Pyarn -Dhadoop.version=2.8.5 -DskipTests clean package
>
> Notes
>
> In the current master, it will use the default Hadoop 3 if you continue to
> use -Phadoop-2.7 and -Phadoop-3.2 to build Spark
> because Maven or SBT will just warn and ignore these non-existent profiles.
> Please change profiles to -Phadoop-2 or -Phadoop-3.
>


Re: Time for Spark 3.2.1?

2021-12-07 Thread Hyukjin Kwon
SGTM!

On Wed, 8 Dec 2021 at 09:07, huaxin gao  wrote:

> I prefer to start rolling the release in January if there is no need to
> publish it sooner :)
>
> On Tue, Dec 7, 2021 at 3:59 PM Hyukjin Kwon  wrote:
>
>> Oh BTW, I realised that it's a holiday season soon this month including
>> Christmas and new year.
>> Shall we maybe start rolling the release around next January? I would
>> leave it to @huaxin gao  :-).
>>
>> On Wed, 8 Dec 2021 at 06:19, Dongjoon Hyun 
>> wrote:
>>
>>> +1 for new releases.
>>>
>>> Dongjoon.
>>>
>>> On Mon, Dec 6, 2021 at 8:51 PM Wenchen Fan  wrote:
>>>
>>>> +1 to make new maintenance releases for all 3.x branches.
>>>>
>>>> On Tue, Dec 7, 2021 at 8:57 AM Sean Owen  wrote:
>>>>
>>>>> Always fine by me if someone wants to roll a release.
>>>>>
>>>>> It's been ~6 months since the last 3.0.x and 3.1.x releases, too; a
>>>>> new release of those wouldn't hurt either, if any of our release managers
>>>>> have the time or inclination. 3.0.x is reaching unofficial end-of-life
>>>>> around now anyway.
>>>>>
>>>>>
>>>>> On Mon, Dec 6, 2021 at 6:55 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> It's been two months since Spark 3.2.0 release, and we have resolved
>>>>>> many bug fixes and regressions. What do you guys think about rolling 
>>>>>> Spark
>>>>>> 3.2.1 release?
>>>>>>
>>>>>> cc @huaxin gao  FYI who I happened to
>>>>>> overhear that is interested in rolling the maintenance release :-).
>>>>>>
>>>>>


Re: Time for Spark 3.2.1?

2021-12-07 Thread Hyukjin Kwon
Oh BTW, I realised that it's a holiday season soon this month including
Christmas and new year.
Shall we maybe start rolling the release around next January? I would leave
it to @huaxin gao  :-).

On Wed, 8 Dec 2021 at 06:19, Dongjoon Hyun  wrote:

> +1 for new releases.
>
> Dongjoon.
>
> On Mon, Dec 6, 2021 at 8:51 PM Wenchen Fan  wrote:
>
>> +1 to make new maintenance releases for all 3.x branches.
>>
>> On Tue, Dec 7, 2021 at 8:57 AM Sean Owen  wrote:
>>
>>> Always fine by me if someone wants to roll a release.
>>>
>>> It's been ~6 months since the last 3.0.x and 3.1.x releases, too; a new
>>> release of those wouldn't hurt either, if any of our release managers have
>>> the time or inclination. 3.0.x is reaching unofficial end-of-life around
>>> now anyway.
>>>
>>>
>>> On Mon, Dec 6, 2021 at 6:55 PM Hyukjin Kwon  wrote:
>>>
>>>> Hi all,
>>>>
>>>> It's been two months since Spark 3.2.0 release, and we have resolved
>>>> many bug fixes and regressions. What do you guys think about rolling Spark
>>>> 3.2.1 release?
>>>>
>>>> cc @huaxin gao  FYI who I happened to overhear
>>>> that is interested in rolling the maintenance release :-).
>>>>
>>>


Time for Spark 3.2.1?

2021-12-06 Thread Hyukjin Kwon
Hi all,

It's been two months since Spark 3.2.0 release, and we have resolved many
bug fixes and regressions. What do you guys think about rolling Spark 3.2.1
release?

cc @huaxin gao  FYI who I happened to overhear that
is interested in rolling the maintenance release :-).


Re: [Apache Spark Jenkins] build system shutting down Dec 23th, 2021

2021-12-06 Thread Hyukjin Kwon
Thanks, Shane.

On Tue, 7 Dec 2021 at 09:19, Dongjoon Hyun  wrote:

> I really want to thank you for all your help.
> You've done so many things for the Apache Spark community.
>
> Sincerely,
> Dongjoon
>
>
> On Mon, Dec 6, 2021 at 12:02 PM shane knapp ☠  wrote:
>
>> hey everyone!
>>
>> after a marathon run of nearly a decade, we're finally going to be
>> shutting down {amp|rise}lab jenkins at the end of this month...
>>
>> the earliest snapshot i could find is from 2013 with builds for spark 0.7:
>>
>> https://web.archive.org/web/20130426155726/https://amplab.cs.berkeley.edu/jenkins/
>>
>> it's been a hell of a run, and i'm gonna miss randomly tweaking the build
>> system, but technology has moved on and running a dedicated set of servers
>> for just one open source project is just too expensive for us here at uc
>> berkeley.
>>
>> if there's interest, i'll fire up a zoom session and all y'alls can watch
>> me type the final command:
>>
>> systemctl stop jenkins
>>
>> feeling bittersweet,
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>


Re: [DISCUSSION] SPIP: Support Volcano/Alternative Schedulers Proposal

2021-11-30 Thread Hyukjin Kwon
Adding @Holden Karau  @Dongjoon Hyun
 @wuyi  FYI

On Tue, 30 Nov 2021 at 17:46, Yikun Jiang  wrote:

> Hey everyone,
>
> I'd like to start a discussion on "Support Volcano/Alternative Schedulers
> Proposal".
>
> This SPIP is proposed to make spark k8s schedulers provide more YARN like
> features (such as queues and minimum resources before scheduling jobs) that
> many folks want on Kubernetes.
>
> The goal of this SPIP is to improve current spark k8s scheduler
> implementations, add the ability of batch scheduling and support volcano as
> one of implementations.
>
> Design doc:
> https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg
> JIRA: https://issues.apache.org/jira/browse/SPARK-36057
> Part of PRs:
> Ability to create resources https://github.com/apache/spark/pull/34599
> Add PodGroupFeatureStep: https://github.com/apache/spark/pull/34456
>
> Regards,
> Yikun
>


Re: Jira components cleanup

2021-11-28 Thread Hyukjin Kwon
Thanks Nicholas for raising this, and Sean for updating it!

On Tue, 16 Nov 2021 at 03:27, Sean Owen  wrote:

> Done. Now let's see if that generated 86 update emails!
>
> On Mon, Nov 15, 2021 at 11:03 AM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>>
>> https://issues.apache.org/jira/projects/SPARK?selectedItem=com.atlassian.jira.jira-projects-plugin:components-page
>>
>> I think the "docs" component should be merged into "Documentation".
>>
>> Likewise, the "k8" component should be merged into "Kubernetes".
>>
>> I think anyone can technically update tags, but I think mass retagging
>> should be limited to admins (or at least, to someone who got prior approval
>> from an admin).
>>
>> Nick
>>
>>


Re: Supports Dynamic Table Options for Spark SQL

2021-11-15 Thread Hyukjin Kwon
My biggest concern with the syntax in hints is that Spark SQL's options can
change results (e.g., CSV's header options) whereas hints are generally not
designed to affect the external results if I am not mistaken. This is
counterintuitive.
I left a comment in the PR, but what's the real benefit over leveraging
SET conf and RESET conf? We can extract options from runtime session
configurations, e.g., via SessionConfigSupport.
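
To make that alternative concrete, a rough sketch (the source name "mysource"
and the fetchSize key are hypothetical; this only applies to sources that
implement SessionConfigSupport, whose options are picked up from
spark.datasource.<keyPrefix>.* session confs):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A hypothetical DSv2 source "mysource" implementing SessionConfigSupport
# would see fetchSize=1000 in its options map for reads in this session.
spark.conf.set("spark.datasource.mysource.fetchSize", "1000")
df = spark.read.format("mysource").load("/path/to/data")

# Equivalent of SQL RESET: drop the session conf afterwards.
spark.conf.unset("spark.datasource.mysource.fetchSize")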

On Tue, 16 Nov 2021 at 04:30, Nicholas Chammas 
wrote:

> Side note about time travel: There is a PR
>  to add VERSION/TIMESTAMP AS
> OF syntax to Spark SQL.
>
> On Mon, Nov 15, 2021 at 2:23 PM Ryan Blue  wrote:
>
>> I want to note that I wouldn't recommend time traveling this way by using
>> the hint for `snapshot-id`. Instead, we want to add the standard SQL syntax
>> for that in a separate PR. This is useful for other options that help a
>> table scan perform better, like specifying the target split size.
>>
>> You're right that this isn't a typical optimizer hint, but I'm not sure
>> what other syntax is possible for this use case. How else would we send
>> custom properties through to the scan?
>>
>> On Mon, Nov 15, 2021 at 9:25 AM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> I am looking at the hint and it appears to me (I stand corrected), it is
>>> a single table hint as below:
>>>
>>> -- time travel
>>> SELECT * FROM t /*+ OPTIONS('snapshot-id'='10963874102873L') */
>>>
>>> My assumption is that any view on this table will also benefit from this
>>> hint. This is not a hint to optimizer in a classical sense. Only a snapshot
>>> hint. Normally, a hint is an instruction to the optimizer. When writing
>>> SQL, one may know information about the data unknown to the optimizer.
>>> Hints enable one to make decisions normally made by the optimizer,
>>> sometimes causing the optimizer to select a plan that it sees as higher
>>> cost.
>>>
>>>
>>> So far as this case is concerned, it looks OK and I concur it should be
>>> extended to write as well.
>>>
>>>
>>> HTH
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Mon, 15 Nov 2021 at 17:02, Russell Spitzer 
>>> wrote:
>>>
 I think since we probably will end up using this same syntax on write,
 this makes a lot of sense. Unless there is another good way to express a
 similar concept during a write operation I think going forward with this
 would be ok.

 On Mon, Nov 15, 2021 at 10:44 AM Ryan Blue  wrote:

> The proposed feature is to be able to pass options through SQL like
> you would when using the DataFrameReader API, so it would work for
> all sources that support read options. Read options are part of the DSv2
> API, there just isn’t a way to pass options when using SQL. The PR also 
> has
> a non-Iceberg example, which is being able to customize some JDBC source
> behaviors per query (e.g., fetchSize), rather than globally in the table’s
> options.
>
> The proposed syntax is odd, but I think that's an artifact of Spark
> introducing read options that aren't a normal part of SQL. Seems 
> reasonable
> to me to pass them through a hint.
>
> On Mon, Nov 15, 2021 at 2:18 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Interesting.
>>
>> What is this going to add on top of support for Apache Iceberg
>> . Will it be in
>> line with support for Hive ACID tables or Delta Lake?
>>
>> HTH
>>
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility
>> for any loss, damage or destruction of data or any other property which 
>> may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Mon, 15 Nov 2021 at 01:56, Zhun Wang 
>> wrote:
>>
>>> Hi dev,
>>>
>>> We are discussing Support Dynamic Table Options for Spark SQL (
>>> https://github.com/apache/spark/pull/34072). It is currently not
>>> sure if the syntax makes sense, and would like to know if there is other
>>> feedback or opinion on this.
>>>
>>> I would appreciate any feedback on this.
>>>
>>> Thanks.
>>>
>>

Re: [FYI] Build and run tests on Java 17 for Apache Spark 3.3

2021-11-12 Thread Hyukjin Kwon
Awesome!

On Sat, Nov 13, 2021 at 12:04 PM Xiao Li  wrote:

> Thank you! Great job!
>
> Xiao
>
>
> On Fri, Nov 12, 2021 at 7:02 PM Mridul Muralidharan 
> wrote:
>
>>
>> Nice job !
>> There are some nice API's which should be interesting to explore with JDK
>> 17 :-)
>>
>> Regards.
>> Mridul
>>
>> On Fri, Nov 12, 2021 at 7:08 PM Yuming Wang  wrote:
>>
>>> Cool, thank you Dongjoon.
>>>
>>> On Sat, Nov 13, 2021 at 4:09 AM shane knapp ☠ 
>>> wrote:
>>>
 woot!  nice work everyone!  :)

 On Fri, Nov 12, 2021 at 11:37 AM Dongjoon Hyun 
 wrote:

> Hi, All.
>
> Apache Spark community has been working on Java 17 support under the
> following JIRA.
>
> https://issues.apache.org/jira/browse/SPARK-33772
>
> As of today, Apache Spark starts to have daily Java 17 test coverage
> via GitHub Action jobs for Apache Spark 3.3.
>
>
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L38-L39
>
> Today's successful run is here.
>
> https://github.com/apache/spark/actions/runs/1453788012
>
> Please note that we are still working on some new Java 17 features
> like
>
> JEP 391: macOS/AArch64 Port
> https://bugs.openjdk.java.net/browse/JDK-8251280
>
> For example, Oracle Java, Azul Zulu, and Eclipse Temurin Java 17
> already support Apple Silicon natively, but some 3rd party libraries like
> RocksDB/LevelDB are not ready yet. Since Mac is one of the popular dev
> environments, we are going to keep monitoring and improving gradually for
> Apache Spark 3.3.
>
> Please test Java 17 and let us know your feedback.
>
> Thanks,
> Dongjoon.
>


 --
 Shane Knapp
 Computer Guy / Voice of Reason
 UC Berkeley EECS Research / RISELab Staff Technical Lead
 https://rise.cs.berkeley.edu

>>>
>
> --
>
>


Re: DataFrame.mapInArrow

2021-11-10 Thread Hyukjin Kwon
Sure, thanks Holden :-).

On Thu, 11 Nov 2021 at 15:53, Holden Karau  wrote:

> Sorry I've been busy, I'll try and take a look tomorrow, excited to see
> this progress though :)
>
> On Wed, Nov 10, 2021 at 9:01 PM Hyukjin Kwon  wrote:
>
>> Last reminder: I plan to merge this in a few more days. Any feedback and
>> review would be very appreciated.
>>
>> On Tue, 9 Nov 2021 at 21:51, Hyukjin Kwon  wrote:
>>
>>> Hi dev,
>>>
>>> I proposed DataFrame.mapInArrow (
>>> https://github.com/apache/spark/pull/34505) which allows users to
>>> directly leverage Arrow batch to plug in other external systems easily.
>>>
>>> I would like to make sure this design of API covers most use cases, and
>>> would like to know if there is other feedback or opinion on this.
>>>
>>> I would appreciate any feedback on this.
>>>
>>> Thanks.
>>>
>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: DataFrame.mapInArrow

2021-11-10 Thread Hyukjin Kwon
Last reminder: I plan to merge this in a few more days. Any feedback and
review would be very appreciated.

On Tue, 9 Nov 2021 at 21:51, Hyukjin Kwon  wrote:

> Hi dev,
>
> I proposed DataFrame.mapInArrow (
> https://github.com/apache/spark/pull/34505) which allows users to
> directly leverage Arrow batch to plug in other external systems easily.
>
> I would like to make sure this design of API covers most use cases, and
> would like to know if there is other feedback or opinion on this.
>
> I would appreciate any feedback on this.
>
> Thanks.
>


DataFrame.mapInArrow

2021-11-09 Thread Hyukjin Kwon
Hi dev,

I proposed DataFrame.mapInArrow (https://github.com/apache/spark/pull/34505)
which allows users to directly leverage Arrow batch to plug in other
external systems easily.
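
For readers of the archive, a small sketch of the proposed API shape (the toy
data is made up, and details may have changed during review of the PR):

import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

def filter_func(iterator):
    # Receives an iterator of pyarrow.RecordBatch and yields RecordBatches back.
    for batch in iterator:
        pdf = batch.to_pandas()
        yield pa.RecordBatch.from_pandas(pdf[pdf.id == 1])

# The output schema is declared up front, similar to mapInPandas.
df.mapInArrow(filter_func, df.schema).show()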

I would like to make sure this design of API covers most use cases, and
would like to know if there is other feedback or opinion on this.

I would appreciate any feedback on this.

Thanks.


Update Spark 3.3 release window?

2021-10-27 Thread Hyukjin Kwon
Hi all,

Spark 3.2 is out. Shall we update the release window
https://spark.apache.org/versioning-policy.html?
I am thinking of Mid March 2022 (5 months after the 3.2 release) for code
freeze and onward.


Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

2021-10-26 Thread Hyukjin Kwon
Seems making sense to me.

Would be great to have some feedback from people such as @Wenchen Fan
 @Cheng Su  @angers zhu
.


On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun  wrote:

> +1 for this SPIP.
>
> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao  wrote:
>
>> +1. Thanks for lifting the current restrictions on bucket join and making
>> this more generalized.
>>
>> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue  wrote:
>>
>>> +1 from me as well. Thanks Chao for doing so much to get it to this
>>> point!
>>>
>>> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai  wrote:
>>>
 +1 on this SPIP.

 This is a more generalized version of bucketed tables and bucketed
 joins which can eliminate very expensive data shuffles when joins, and
 many users in the Apache Spark community have wanted this feature for
 a long time!

 Thank you, Ryan and Chao, for working on this, and I look forward to
 it as a new feature in Spark 3.3

 DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

 On Fri, Oct 22, 2021 at 12:18 PM Chao Sun  wrote:
 >
 > Hi,
 >
 > Ryan and I drafted a design doc to support a new type of join:
 storage partitioned join which covers bucket join support for DataSourceV2
 but is more general. The goal is to let Spark leverage distribution
 properties reported by data sources and eliminate shuffle whenever 
 possible.
 >
 > Design doc:
 https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
 (includes a POC link at the end)
 >
 > We'd like to start a discussion on the doc and any feedback is
 welcome!
 >
 > Thanks,
 > Chao

>>>
>>>
>>> --
>>> Ryan Blue
>>>
>>


Re: Adding Spark 4 to JIRA for targetted versions

2021-09-13 Thread Hyukjin Kwon
BTW, I vaguely remember that adding a new version affects the default
version for the merging script to use for JIRA resolution. e.g., now it's
3.3.0 but it becomes 4.0.0 ...
Maybe it's nicer to double check how it's affected.

On Tue, 14 Sep 2021 at 13:32, Dongjoon Hyun wrote:

> I'm fine to have the version number, but breaking API compatibility should
> be discussed separately in the community.
> We decided to strive to avoid breaking APIs even in major versions and
> made a policy for that.
>
> https://spark.apache.org/versioning-policy.html
> > The Spark project strives to avoid breaking APIs or silently changing
> behavior, even at major versions.
>
>
>
> On Mon, Sep 13, 2021 at 9:00 PM Senthil Kumar  wrote:
>
>> We can have a feature (a new tab) in the Spark UI for data, so that we can use
>> it to display data-related metrics and detect skewness in the data. It will
>> be helpful for users to understand their data in a better/deeper way.
>>
>> On Tue, Sep 14, 2021 at 4:07 AM Sean Owen  wrote:
>>
>>> Sure, doesn't hurt to have a placeholder.
>>>
>>> On Mon, Sep 13, 2021, 5:32 PM Holden Karau  wrote:
>>>
 Hi Folks,

 I'm going through the Spark 3.2 tickets just to make sure we're not
 missing anything important, and I was wondering what folks' thoughts are on
 adding Spark 4 so we can target API-breaking changes to the next major
 version and avoid losing track of the issue.

 Cheers,


 Holden :)

 --
 Twitter: https://twitter.com/holdenkarau
 Books (Learning Spark, High Performance Spark, etc.):
 https://amzn.to/2MaRAG9  
 YouTube Live Streams: https://www.youtube.com/user/holdenkarau

>>>
>>
>> --
>> Senthil kumar
>>
>


Re: CRAN package SparkR

2021-09-01 Thread Hyukjin Kwon
Made a quick fix: https://github.com/apache/spark/pull/33887
I would very much appreciate it if you could double-check and test against my
change, just to be doubly sure.

adding @Shivaram Venkataraman  too FYI

On Wed, 1 Sep 2021 at 11:56, Felix Cheung wrote:

> I think a few lines to add the prompt might be enough. This checks for
> interactive()
>
>
> https://github.com/apache/spark/blob/c6a2021fec5bab9069fbfba33f75d4415ea76e99/R/pkg/R/sparkR.R#L658
>
>
> On Tue, Aug 31, 2021 at 5:55 PM Hyukjin Kwon  wrote:
>
>> Oh, I missed this. Yes, can we simply get the user's confirmation when we
>> run install.spark?
>> IIRC, the auto installation is only triggered by interactive shell so
>> getting user's confirmation should be fine.
>>
>> On Fri, 18 Jun 2021 at 02:54, Felix Cheung wrote:
>>
>>> Any suggestion or comment on this? They are going to remove the package
>>> by 6-28
>>>
>>> Seems to me if we have a switch to opt in to install (and not by default
>>> on), or prompt the user in interactive session, should be good as user
>>> confirmation.
>>>
>>>
>>>
>>> On Sun, Jun 13, 2021 at 11:25 PM Felix Cheung 
>>> wrote:
>>>
>>>> It looks like they would not allow caching the Spark
>>>> Distribution.
>>>>
>>>> I’m not sure what can be done about this.
>>>>
>>>> If I recall, the package should remove this during test. Or maybe
>>>> spark.install() ie optional (hence getting user confirmation?)
>>>>
>>>>
>>>> -- Forwarded message -
>>>> Date: Sun, Jun 13, 2021 at 10:19 PM
>>>> Subject: CRAN package SparkR
>>>> To: Felix Cheung 
>>>> CC: 
>>>>
>>>>
>>>> Dear maintainer,
>>>>
>>>> Checking this apparently creates the default directory as per
>>>>
>>>> #' @param localDir a local directory where Spark is installed. The
>>>> directory contains
>>>> #' version-specific folders of Spark packages. Default
>>>> is path to
>>>> #' the cache directory:
>>>> #' \itemize{
>>>> #'   \item Mac OS X: \file{~/Library/Caches/spark}
>>>> #'   \item Unix: \env{$XDG_CACHE_HOME} if defined,
>>>> otherwise \file{~/.cache/spark}
>>>> #'   \item Windows:
>>>> \file{\%LOCALAPPDATA\%\\Apache\\Spark\\Cache}.
>>>> #' }
>>>>
>>>> However, the CRAN Policy says
>>>>
>>>>   - Packages should not write in the user’s home filespace (including
>>>> clipboards), nor anywhere else on the file system apart from the R
>>>> session’s temporary directory (or during installation in the
>>>> location pointed to by TMPDIR: and such usage should be cleaned
>>>> up). Installing into the system’s R installation (e.g., scripts to
>>>> its bin directory) is not allowed.
>>>>
>>>> Limited exceptions may be allowed in interactive sessions if the
>>>> package obtains confirmation from the user.
>>>>
>>>> For R version 4.0 or later (hence a version dependency is required
>>>> or only conditional use is possible), packages may store
>>>> user-specific data, configuration and cache files in their
>>>> respective user directories obtained from tools::R_user_dir(),
>>>> provided that by default sizes are kept as small as possible and the
>>>> contents are actively managed (including removing outdated
>>>> material).
>>>>
>>>> Can you pls fix as necessary?
>>>>
>>>> Please fix before 2021-06-28 to safely retain your package on CRAN.
>>>>
>>>> Best
>>>> -k
>>>>
>>>


Re: CRAN package SparkR

2021-08-31 Thread Hyukjin Kwon
Oh, I missed this. Yes, can we simply get the user's confirmation when we
run install.spark?
IIRC, the auto installation is only triggered by interactive shell so
getting user's confirmation should be fine.

On Fri, 18 Jun 2021 at 02:54, Felix Cheung wrote:

> Any suggestion or comment on this? They are going to remove the package by
> 6-28
>
> Seems to me if we have a switch to opt in to install (and not by default
> on), or prompt the user in interactive session, should be good as user
> confirmation.
>
>
>
> On Sun, Jun 13, 2021 at 11:25 PM Felix Cheung 
> wrote:
>
>> It looks like they would not allow caching the Spark
>> Distribution.
>>
>> I’m not sure what can be done about this.
>>
>> If I recall, the package should remove this during test. Or maybe
>> spark.install() ie optional (hence getting user confirmation?)
>>
>>
>> -- Forwarded message -
>> Date: Sun, Jun 13, 2021 at 10:19 PM
>> Subject: CRAN package SparkR
>> To: Felix Cheung 
>> CC: 
>>
>>
>> Dear maintainer,
>>
>> Checking this apparently creates the default directory as per
>>
>> #' @param localDir a local directory where Spark is installed. The
>> directory contains
>> #' version-specific folders of Spark packages. Default is
>> path to
>> #' the cache directory:
>> #' \itemize{
>> #'   \item Mac OS X: \file{~/Library/Caches/spark}
>> #'   \item Unix: \env{$XDG_CACHE_HOME} if defined,
>> otherwise \file{~/.cache/spark}
>> #'   \item Windows:
>> \file{\%LOCALAPPDATA\%\\Apache\\Spark\\Cache}.
>> #' }
>>
>> However, the CRAN Policy says
>>
>>   - Packages should not write in the user’s home filespace (including
>> clipboards), nor anywhere else on the file system apart from the R
>> session’s temporary directory (or during installation in the
>> location pointed to by TMPDIR: and such usage should be cleaned
>> up). Installing into the system’s R installation (e.g., scripts to
>> its bin directory) is not allowed.
>>
>> Limited exceptions may be allowed in interactive sessions if the
>> package obtains confirmation from the user.
>>
>> For R version 4.0 or later (hence a version dependency is required
>> or only conditional use is possible), packages may store
>> user-specific data, configuration and cache files in their
>> respective user directories obtained from tools::R_user_dir(),
>> provided that by default sizes are kept as small as possible and the
>> contents are actively managed (including removing outdated
>> material).
>>
>> Can you pls fix as necessary?
>>
>> Please fix before 2021-06-28 to safely retain your package on CRAN.
>>
>> Best
>> -k
>>
>


Re: -1s on committed but not released code?

2021-08-19 Thread Hyukjin Kwon
Yeah, I think we can discuss and revert it (or fix it) based on the veto that was cast.
Often problems are only found after the code is merged.


On Fri, 20 Aug 2021 at 04:08, Mridul Muralidharan wrote:

> Hi Holden,
>
>   In the past, I have seen discussions on the merged pr to thrash out the
> details.
> Usually it would be clear whether to revert and reformulate the change or
> concerns get addressed and possibly result in follow up work.
>
> This is usually helped by the fact that we typically are conservative and
> don’t merge changes too quickly: giving folks sufficient time to review and
> opine.
>
> Regards,
> Mridul
>
> On Thu, Aug 19, 2021 at 1:36 PM Holden Karau  wrote:
>
>> Hi Y'all,
>>
>> This just recently came up but I'm not super sure on how we want to
>> handle this in general. If code was committed under the lazy consensus
>> model and then a committer or PMC -1s it post merge, what do we want to do?
>>
>> I know we had some previous discussion around -1s, but that was largely
>> focused on pre-commit -1s.
>>
>> Cheers,
>>
>> Holden :)
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Time to start publishing Spark Docker Images?

2021-08-12 Thread Hyukjin Kwon
+1, I think we generally agreed upon having it. Thanks Holden for headsup
and driving this.

+@Dongjoon Hyun  FYI

On Thu, 22 Jul 2021 at 12:22, Kent Yao wrote:

> +1
>
> Bests,
>
> *Kent Yao *
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> *a spark enthusiast*
> *kyuubi is a unified multi-tenant JDBC
> interface for large-scale data processing and analytics, built on top
> of Apache Spark .*
> *spark-authorizer A Spark
> SQL extension which provides SQL Standard Authorization for **Apache
> Spark .*
> *spark-postgres  A library for
> reading data from and transferring data to Postgres / Greenplum with Spark
> SQL and DataFrames, 10~100x faster.*
> *itatchi A** library t**hat
> brings useful functions from various modern database management systems to 
> **Apache
> Spark .*
>
>
>
> On 07/22/2021 11:13,Holden Karau
>  wrote:
>
> Hi Folks,
>
> Many other distributed computing projects (https://hub.docker.com/r/rayproject/ray
> https://hub.docker.com/u/daskdev) and ASF projects (
> https://hub.docker.com/u/apache) now publish their images to dockerhub.
>
> We've already got the docker image tooling in place, I think we'd need to
> ask the ASF to grant permissions to the PMC to publish containers and
> update the release steps but I think this could be useful for folks.
>
> Cheers,
>
> Holden
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: ASF board report draft for August

2021-08-09 Thread Hyukjin Kwon
> Are you referring to what version of Koala project? 1.8.1?

Yes, the latest version 1.8.1.

On Tue, 10 Aug 2021 at 11:07, Igor Costa wrote:

> Hi Matei, nice update
>
>
> Just one question, when you mention “ We are working on Spark 3.2.0 as
> our next release, with a release candidate likely to come soon. Spark 3.2
> includes a new Pandas API for Apache Spark based on the Koalas project”
>
>
> Are you referring to what version of Koala project? 1.8.1?
>
>
>
> Cheers
> Igor
>
> On Tue, 10 Aug 2021 at 13:31, Matei Zaharia 
> wrote:
>
>> It’s time for our quarterly report to the ASF board, which we need to
>> send out this Wednesday. I wrote the draft below based on community
>> activity — let me know if you’d like to add or change anything:
>>
>> ==
>>
>> Description:
>>
>> Apache Spark is a fast and general engine for large-scale data
>> processing. It offers high-level APIs in Java, Scala, Python, R and SQL as
>> well as a rich set of libraries including stream processing, machine
>> learning, and graph analytics.
>>
>> Issues for the board:
>>
>> - None
>>
>> Project status:
>>
>> - We made a number of maintenance releases in the past three months. We
>> released Apache Spark 3.1.2 and 3.0.3 in June as maintenance releases for
>> the 3.x branches. We also released Apache Spark 2.4.8 on May 17 as a bug
>> fix release for the Spark 2.x line. This may be the last release on 2.x
>> unless major new bugs are found.
>>
>> - We added three PMC members: Liang-Chi Hsieh, Kousuke Saruta and Takeshi
>> Yamamuro.
>>
>> - We are working on Spark 3.2.0 as our next release, with a release
>> candidate likely to come soon. Spark 3.2 includes a new Pandas API for
>> Apache Spark based on the Koalas project, a RocksDB state store for
>> Structured Streaming, native support for session windows, error message
>> standardization, and significant improvements to Spark SQL, such as the use
>> of adaptive query execution by default.
>>
>> Trademarks:
>>
>> - No changes since the last report.
>>
>> Latest releases:
>>
>> - Spark 3.1.2 was released on June 23rd, 2021.
>> - Spark 3.0.3 was released on June 1st, 2021.
>> - Spark 2.4.8 was released on May 17th, 2021.
>>
>> Committers and PMC:
>>
>> - The latest committers were added on March 11th, 2021 (Atilla Zsolt
>> Piros, Gabor Somogyi, Kent Yao, Maciej Szymkiewicz, Max Gekk, and Yi Wu).
>> - The latest PMC member was added on June 20th, 2021 (Kousuke Saruta).
>>
>>
>>
>>
>>
>>
>> --
> Sent from Gmail Mobile
>


Re: ASF board report draft for August

2021-08-09 Thread Hyukjin Kwon
There is an SPIP passed and ready for Spark 3.2:

pandas API on Spark:
- JIRA: SPIP: Support pandas API layer on PySpark (
https://issues.apache.org/jira/browse/SPARK-34849)
- Vote: [VOTE] SPIP: Support pandas API layer on PySpark (
https://www.mail-archive.com/dev@spark.apache.org/msg27605.html)
- Design documentation: Koalas Internals (
https://docs.google.com/document/d/1tk24aq6FV5Wu2bX_Ym606doLFnrZsh4FdUd52FqojZU
)


On Tue, 10 Aug 2021 at 10:31, Matei Zaharia wrote:

> It’s time for our quarterly report to the ASF board, which we need to send
> out this Wednesday. I wrote the draft below based on community activity —
> let me know if you’d like to add or change anything:
>
> ==
>
> Description:
>
> Apache Spark is a fast and general engine for large-scale data processing.
> It offers high-level APIs in Java, Scala, Python, R and SQL as well as a
> rich set of libraries including stream processing, machine learning, and
> graph analytics.
>
> Issues for the board:
>
> - None
>
> Project status:
>
> - We made a number of maintenance releases in the past three months. We
> released Apache Spark 3.1.2 and 3.0.3 in June as maintenance releases for
> the 3.x branches. We also released Apache Spark 2.4.8 on May 17 as a bug
> fix release for the Spark 2.x line. This may be the last release on 2.x
> unless major new bugs are found.
>
> - We added three PMC members: Liang-Chi Hsieh, Kousuke Saruta and Takeshi
> Yamamuro.
>
> - We are working on Spark 3.2.0 as our next release, with a release
> candidate likely to come soon. Spark 3.2 includes a new Pandas API for
> Apache Spark based on the Koalas project, a RocksDB state store for
> Structured Streaming, native support for session windows, error message
> standardization, and significant improvements to Spark SQL, such as the use
> of adaptive query execution by default.
>
> Trademarks:
>
> - No changes since the last report.
>
> Latest releases:
>
> - Spark 3.1.2 was released on June 23rd, 2021.
> - Spark 3.0.3 was released on June 1st, 2021.
> - Spark 2.4.8 was released on May 17th, 2021.
>
> Committers and PMC:
>
> - The latest committers were added on March 11th, 2021 (Atilla Zsolt
> Piros, Gabor Somogyi, Kent Yao, Maciej Szymkiewicz, Max Gekk, and Yi Wu).
> - The latest PMC member was added on June 20th, 2021 (Kousuke Saruta).
>
>
>
>
>
>
>


Re: Flaky build in GitHub Actions

2021-07-25 Thread Hyukjin Kwon
This is fixed up via Liang-Chi's PR:
https://github.com/apache/spark/pull/33447. The issue is mostly resolved now
and the build is less flaky.
I'm still interacting w/ GitHub Actions: they are still investigating the
issue. Seems like there's no similar ticket reported, so they suspect an
issue specific to the Apache Spark repo.


On Thu, 22 Jul 2021 at 09:40, Hyukjin Kwon wrote:

> FYI, @Liang-Chi Hsieh  is trying to control the memory
> in the test base at https://github.com/apache/spark/pull/33447 which
> looks almost promising now.
> While I don't object to merge things, would need to closely track how
> these tests go at Github Actions in his PR (and in the main Apache repo)
>
> On Thu, 22 Jul 2021 at 03:00, Holden Karau wrote:
>
>> I noticed that the worker decommissioning suite maybe seems to be running
>> up against the memory limits so I'm going to try and see if I can get our
>> memory usage down a bit as well while we wait for GH response. In the
>> meantime, I'm assuming if things pass Jenkins we are OK with merging yes?
>>
>> On Wed, Jul 21, 2021 at 10:03 AM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you, Hyukjin!
>>>
>>> Dongjoon.
>>>
>>> On Tue, Jul 20, 2021 at 8:53 PM Hyukjin Kwon 
>>> wrote:
>>>
>>>> I filed a ticket at GitHub. I will share more details when I get a
>>>> response from them.
>>>>
>>>> On Tue, 20 Jul 2021 at 19:30, Hyukjin Kwon wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Looks like there's something going on in the machines in GitHub
>>>>> Actions.
>>>>> The build is now very flaky and keeps dying with symptoms like I guess
>>>>> out-of-memory (?).
>>>>> I will try to take a closer look tomorrow but it would be great if you
>>>>> guys find some time to take a look into it 
>>>>>
>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Flaky build in GitHub Actions

2021-07-21 Thread Hyukjin Kwon
FYI, @Liang-Chi Hsieh  is trying to control the memory in
the test base at https://github.com/apache/spark/pull/33447 which looks
almost promising now.
While I don't object to merge things, would need to closely track how these
tests go at Github Actions in his PR (and in the main Apache repo)

On Thu, 22 Jul 2021 at 03:00, Holden Karau wrote:

> I noticed that the worker decommissioning suite maybe seems to be running
> up against the memory limits so I'm going to try and see if I can get our
> memory usage down a bit as well while we wait for GH response. In the
> meantime, I'm assuming if things pass Jenkins we are OK with merging yes?
>
> On Wed, Jul 21, 2021 at 10:03 AM Dongjoon Hyun 
> wrote:
>
>> Thank you, Hyukjin!
>>
>> Dongjoon.
>>
>> On Tue, Jul 20, 2021 at 8:53 PM Hyukjin Kwon  wrote:
>>
>>> I filed a ticket at GitHub. I will share more details when I get a
>>> response from them.
>>>
>>> On Tue, 20 Jul 2021 at 19:30, Hyukjin Kwon wrote:
>>>
>>>> Hi all,
>>>>
>>>> Looks like there's something going on in the machines in GitHub Actions.
>>>> The build is now very flaky and keeps dying with symptoms like I guess
>>>> out-of-memory (?).
>>>> I will try to take a closer look tomorrow but it would be great if you
>>>> guys find some time to take a look into it 
>>>>
>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: Flaky build in GitHub Actions

2021-07-20 Thread Hyukjin Kwon
I filed a ticket at GitHub. I will share more details when I get a response
from them.

On Tue, 20 Jul 2021 at 19:30, Hyukjin Kwon wrote:

> Hi all,
>
> Looks like there's something going on in the machines in GitHub Actions.
> The build is now very flaky and keeps dying with symptoms like I guess
> out-of-memory (?).
> I will try to take a closer look tomorrow but it would be great if you
> guys find some time to take a look into it 
>


Flaky build in GitHub Actions

2021-07-20 Thread Hyukjin Kwon
Hi all,

Looks like there's something going on in the machines in GitHub Actions.
The build is now very flaky and keeps dying with symptoms like I guess
out-of-memory (?).
I will try to take a closer look tomorrow but it would be great if you guys
find some time to take a look into it 


Re: [VOTE] Release Spark 3.0.3 (RC1)

2021-06-20 Thread Hyukjin Kwon
+1

On Mon, 21 Jun 2021 at 14:19, Dongjoon Hyun wrote:

> +1
>
> Thank you, Yi.
>
> Bests,
> Dongjoon.
>
>
> On Sat, Jun 19, 2021 at 6:57 PM Yuming Wang  wrote:
>
>> +1
>>
>> Tested a batch of production query with Thrift Server.
>>
>> On Sat, Jun 19, 2021 at 3:04 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> +1
>>>
>>> Signatures, digests, etc check out fine.
>>> Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Pmesos
>>> -Pkubernetes
>>>
>>> Regards,
>>> Mridul
>>>
>>> PS: Might be related to some quirk of my local env - the first test run
>>> (after clean + package) usually fails for me (typically for hive tests) -
>>> with a second run succeeding : this is not specific to this RC though.
>>>
>>> On Fri, Jun 18, 2021 at 6:14 PM Liang-Chi Hsieh 
>>> wrote:
>>>
 +1. Docs looks good. Binary looks good.

 Ran simple test and some tpcds queries.

 Thanks for working on this!


 wuyi wrote
 > Please vote on releasing the following candidate as Apache Spark
 version
 > 3.0.3.
 >
 > The vote is open until Jun 21th 3AM (PST) and passes if a majority +1
 PMC
 > votes are cast, with
 > a minimum of 3 +1 votes.
 >
 > [ ] +1 Release this package as Apache Spark 3.0.3
 > [ ] -1 Do not release this package because ...
 >
 > To learn more about Apache Spark, please see
 https://spark.apache.org/
 >
 > The tag to be voted on is v3.0.3-rc1 (commit
 > 65ac1e75dc468f53fc778cd2ce1ba3f21067aab8):
 > https://github.com/apache/spark/tree/v3.0.3-rc1
 >
 > The release files, including signatures, digests, etc. can be found
 at:
 > https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-bin/
 >
 > Signatures used for Spark RCs can be found in this file:
 > https://dist.apache.org/repos/dist/dev/spark/KEYS
 >
 > The staging repository for this release can be found at:
 >
 https://repository.apache.org/content/repositories/orgapachespark-1386/
 >
 > The documentation corresponding to this release can be found at:
 > https://dist.apache.org/repos/dist/dev/spark/v3.0.3-rc1-docs/
 >
 > The list of bug fixes going into 3.0.3 can be found at the following
 URL:
 > https://issues.apache.org/jira/projects/SPARK/versions/12349723
 >
 > This release is using the release script of the tag v3.0.3-rc1.
 >
 > FAQ
 >
 > =
 > How can I help test this release?
 > =
 >
 > If you are a Spark user, you can help us test this release by taking
 > an existing Spark workload and running on this release candidate, then
 > reporting any regressions.
 >
 > If you're working in PySpark you can set up a virtual env and install
 > the current RC and see if anything important breaks, in the Java/Scala
 > you can add the staging repository to your projects resolvers and test
 > with the RC (make sure to clean up the artifact cache before/after so
 > you don't end up building with a out of date RC going forward).
 >
 > ===
 > What should happen to JIRA tickets still targeting 3.0.3?
 > ===
 >
 > The current list of open tickets targeted at 3.0.3 can be found at:
 > https://issues.apache.org/jira/projects/SPARK and search for "Target
 > Version/s" = 3.0.3
 >
 > Committers should look at those and triage. Extremely important bug
 > fixes, documentation, and API tweaks that impact compatibility should
 > be worked on immediately. Everything else please retarget to an
 > appropriate release.
 >
 > ==
 > But my bug isn't fixed?
 > ==
 >
 > In order to make timely releases, we will typically not hold the
 > release unless the bug in question is a regression from the previous
 > release. That being said, if there is something which is a regression
 > that has not been correctly targeted please ping me or a committer to
 > help target the issue.





 --
 Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/





Re: Apache Spark 3.2 Expectation

2021-06-17 Thread Hyukjin Kwon
*GA -> QA

On Thu, 17 Jun 2021, 15:16 Hyukjin Kwon,  wrote:

> I think we should make sure to treat these items in the list as exceptions
> to the code freeze, and discourage pushing new APIs and features though.
>
> During the GA period, ideally we should focus on bug fixes and polishing.
>
> It would be great if we can speed up on these items in the list too.
>
>
> On Thu, 17 Jun 2021, 15:08 Gengliang Wang,  wrote:
>
>> Thanks for the suggestions from Dongjoon, Liangchi, Min, and Xiao!
>> Now we make it clear that it's a soft cut and we can still merge
>> important code changes to branch-3.2 before RC. Let's keep the branch cut
>> date as July 1st.
>>
>> On Thu, Jun 17, 2021 at 1:41 PM Dongjoon Hyun 
>> wrote:
>>
>>> > First, I think you are saying "branch-3.2";
>>>
>>> To Xiao. Yes, it was a typo of "branch-3.2".
>>>
>>> > We do strongly prefer to cut the release for Spark 3.2.0 including
>>> all the patches under SPARK-30602.
>>> > This way, we can backport the other performance/operability
>>> enhancements tickets under SPARK-33235 into branch-3.2 to be released in
>>> future Spark 3.2.x patch releases.
>>>
>>> To Min, after releasing 3.2.0, only bug fixes are allowed for 3.2.1+ as
>>> Xiao wrote.
>>>
>>>
>>>
>>> On Wed, Jun 16, 2021 at 9:42 PM Xiao Li  wrote:
>>>
>>>> To Liang-Chi, I'm -1 for postponing the branch cut because this is a
>>>>> soft cut and the committers still are able to commit to `branch-3.3`
>>>>> according to their decisions.
>>>>
>>>>
>>>> First, I think you are saying "branch-3.2";
>>>>
>>>> Second, the "so cut" means no "code freeze", although we cut the
>>>> branch. To avoid releasing half-baked and unready features, the release
>>>> manager needs to be very careful when cutting the RC. Based on what is
>>>> proposed here, the RC date is the actual code freeze date.
>>>>
>>>> This way, we can backport the other performance/operability
>>>>> enhancements tickets under SPARK-33235 into branch-3.2 to be released in
>>>>> future Spark 3.2.x patch releases.
>>>>
>>>>
>>>> This is not allowed based on the policy. Only bug fixes can be merged
>>>> to the patch releases. Thus, if we know it will introduce major performance
>>>> regression, we have to turn the feature off by default.
>>>>
>>>> Xiao
>>>>
>>>>
>>>>
>>>> Min Shen  于2021年6月16日周三 下午3:22写道:
>>>>
>>>>> Hi Gengliang,
>>>>>
>>>>> Thanks for volunteering as the release manager for Spark 3.2.0.
>>>>> Regarding the ongoing work of push-based shuffle in SPARK-30602, we
>>>>> are close to having all the patches merged to master to enable push-based
>>>>> shuffle.
>>>>> Currently, there are 2 PRs under SPARK-30602 that are under active
>>>>> review (SPARK-32922 and SPARK-35671), and hopefully can be merged soon.
>>>>> We should be able to post the PRs for the other 2 remaining tickets
>>>>> (SPARK-32923 and SPARK-35546) early next week.
>>>>>
>>>>> The tickets under SPARK-30602 are the minimum set of patches to enable
>>>>> push-based shuffle.
>>>>> We do have other performance/operability enhancements tickets under
>>>>> SPARK-33235 that are needed to fully contribute what we have internally 
>>>>> for
>>>>> push-based shuffle.
>>>>> However, these are optional for enabling push-based shuffle.
>>>>> We do strongly prefer to cut the release for Spark 3.2.0 including all
>>>>> the patches under SPARK-30602.
>>>>> This way, we can backport the other performance/operability
>>>>> enhancements tickets under SPARK-33235 into branch-3.2 to be released in
>>>>> future Spark 3.2.x patch releases.
>>>>> I understand the preference of not postponing the branch cut date.
>>>>> We will check with Dongjoon regarding the soft cut date and the
>>>>> flexibility for including the remaining tickets under SPARK-30602 into
>>>>> branch-3.2.
>>>>>
>>>>> Best,
>>>>> Min
>>>>>
>>>>> On Wed, Jun 16, 2021 at 1:20 PM Liang-Chi Hsieh 
>>>>> wrote:
>>>

Re: Apache Spark 3.2 Expectation

2021-06-17 Thread Hyukjin Kwon
>>>>> > Thank you for volunteering, Gengliang.
>>>>> >
>>>>> > Apache Spark 3.2.0 is the first version enabling AQE by default. I'm
>>>>> also
>>>>> > watching some on-going improvements on that.
>>>>> >
>>>>> > https://issues.apache.org/jira/browse/SPARK-33828 (SQL Adaptive
>>>>> Query
>>>>> > Execution QA)
>>>>> >
>>>>> > To Liang-Chi, I'm -1 for postponing the branch cut because this is a
>>>>> soft
>>>>> > cut and the committers still are able to commit to `branch-3.3`
>>>>> according
>>>>> > to their decisions.
>>>>> >
>>>>> > Given that Apache Spark had 115 commits in a week in various areas
>>>>> > concurrently, we should start QA for Apache Spark 3.2 by creating
>>>>> > branch-3.3 and allowing only limited backporting.
>>>>> >
>>>>> > https://github.com/apache/spark/graphs/commit-activity
>>>>> >
>>>>> > Bests,
>>>>> > Dongjoon.
>>>>> >
>>>>> >
>>>>> > On Wed, Jun 16, 2021 at 9:19 AM Liang-Chi Hsieh 
>>>>>
>>>>> > viirya@
>>>>>
>>>>> >  wrote:
>>>>> >
>>>>> >> First, thanks for being volunteer as the release manager of Spark
>>>>> 3.2.0,
>>>>> >> Gengliang!
>>>>> >>
>>>>> >> And yes, for the two important Structured Streaming features,
>>>>> RocksDB
>>>>> >> StateStore and session window, we're working on them and expect to
>>>>> have
>>>>> >> them
>>>>> >> in the new release.
>>>>> >>
>>>>> >> So I propose to postpone the branch cut date.
>>>>> >>
>>>>> >> Thank you!
>>>>> >>
>>>>> >> Liang-Chi
>>>>> >>
>>>>> >>
>>>>> >> Gengliang Wang-2 wrote
>>>>> >> > Thanks, Hyukjin.
>>>>> >> >
>>>>> >> > The expected target branch cut date of Spark 3.2 is *July 1st* on
>>>>> >> > https://spark.apache.org/versioning-policy.html. However, I
>>>>> notice that
>>>>> >> > there are still multiple important projects in progress now:
>>>>> >> >
>>>>> >> > [Core]
>>>>> >> >
>>>>> >> >- SPIP: Support push-based shuffle to improve shuffle
>>>>> efficiency
>>>>> >> >https://issues.apache.org/jira/browse/SPARK-30602;
>>>>> >> >
>>>>> >> > [SQL]
>>>>> >> >
>>>>> >> >- Support ANSI SQL INTERVAL types
>>>>> >> >https://issues.apache.org/jira/browse/SPARK-27790;
>>>>> >> >- Support Timestamp without time zone data type
>>>>> >> >https://issues.apache.org/jira/browse/SPARK-35662;
>>>>> >> >- Aggregate (Min/Max/Count) push down for Parquet
>>>>> >> >https://issues.apache.org/jira/browse/SPARK-34952;
>>>>> >> >
>>>>> >> > [Streaming]
>>>>> >> >
>>>>> >> >- EventTime based sessionization (session window)
>>>>> >> >https://issues.apache.org/jira/browse/SPARK-10816;
>>>>> >> >- Add RocksDB StateStore as external module
>>>>> >> >https://issues.apache.org/jira/browse/SPARK-34198;
>>>>> >> >
>>>>> >> >
>>>>> >> > I wonder whether we should postpone the branch cut date.
>>>>> >> > cc Min Shen, Yi Wu, Max Gekk, Huaxin Gao, Jungtaek Lim, Yuanjian
>>>>> >> > Li, Liang-Chi Hsieh, who work on the projects above.
>>>>> >> >
>>>>> >> > On Tue, Jun 15, 2021 at 4:34 PM Hyukjin Kwon 
>>>>> >>
>>>>> >> > gurwls223@
>>>>> >>
>>>>> >> >  wrote:
>>>>> >> >
>>>>> >> >> +1, thanks.
>>>>> >> >>
>>>>> >> >> On Tue, 15 Jun 2021, 16:17 Gengliang Wang, 
>>>>> >>
>>>>> >> > ltnwgl@
>>>>> >>
>>>>> >> >  wrote:
>>>>> >> >>
>>>>> >> >>> Hi,
>>>>> >> >>>
>>>>> >> >>> As the expected release date is close,  I would like to
>>>>> volunteer as
>>>>> >> the
>>>>> >> >>> release manager for Apache Spark 3.2.0.
>>>>> >> >>>
>>>>> >> >>> Thanks,
>>>>> >> >>> Gengliang
>>>>> >> >>>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >> --
>>>>> >> Sent from:
>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>> >>
>>>>> >>
>>>>> -
>>>>> >> To unsubscribe e-mail:
>>>>>
>>>>> > dev-unsubscribe@.apache
>>>>>
>>>>> >>
>>>>> >>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>


Re: Apache Spark 3.2 Expectation

2021-06-15 Thread Hyukjin Kwon
+1, thanks.

On Tue, 15 Jun 2021, 16:17 Gengliang Wang,  wrote:

> Hi,
>
> As the expected release date is close,  I would like to volunteer as the
> release manager for Apache Spark 3.2.0.
>
> Thanks,
> Gengliang
>
> On Mon, Apr 12, 2021 at 1:59 PM Wenchen Fan  wrote:
>
>> An update: we found a mistake that we picked the Spark 3.2 release date
>> based on the scheduled release date of 3.1. However, 3.1 was delayed and
>> released on March 2. In order to have a full 6 months development for 3.2,
>> the target release date for 3.2 should be September 2.
>>
>> I'm updating the release dates in
>> https://github.com/apache/spark-website/pull/331
>>
>> Thanks,
>> Wenchen
>>
>> On Thu, Mar 11, 2021 at 11:17 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you, Xiao, Wenchen and Hyukjin.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Thu, Mar 11, 2021 at 2:15 AM Hyukjin Kwon 
>>> wrote:
>>>
>>>> Just for an update, I will send a discussion email about my idea late
>>>> this week or early next week.
>>>>
>>>> 2021년 3월 11일 (목) 오후 7:00, Wenchen Fan 님이 작성:
>>>>
>>>>> There are many projects going on right now, such as new DS v2 APIs,
>>>>> ANSI interval types, join improvement, disaggregated shuffle, etc. I don't
>>>>> think it's realistic to do the branch cut in April.
>>>>>
>>>>> I'm +1 to release 3.2 around July, but it doesn't mean we have to cut
>>>>> the branch 3 months earlier. We should make the release process faster and
>>>>> cut the branch around June probably.
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Mar 11, 2021 at 4:41 AM Xiao Li  wrote:
>>>>>
>>>>>> Below are some nice-to-have features we can work on in Spark 3.2: Lateral
>>>>>> Join support <https://issues.apache.org/jira/browse/SPARK-28379>,
>>>>>> interval data type, timestamp without time zone, un-nesting arbitrary
>>>>>> queries, the returned metrics of DSV2, and error message standardization.
>>>>>> Spark 3.2 will be another exciting release I believe!
>>>>>>
>>>>>> Go Spark!
>>>>>>
>>>>>> Xiao
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dongjoon Hyun  于2021年3月10日周三 下午12:25写道:
>>>>>>
>>>>>>> Hi, Xiao.
>>>>>>>
>>>>>>> This thread started 13 days ago. Since you asked the community about
>>>>>>> major features or timelines at that time, could you share your roadmap 
>>>>>>> or
>>>>>>> expectations if you have something in your mind?
>>>>>>>
>>>>>>> > Thank you, Dongjoon, for initiating this discussion. Let us keep
>>>>>>> it open. It might take 1-2 weeks to collect from the community all the
>>>>>>> features we plan to build and ship in 3.2 since we just finished the 3.1
>>>>>>> voting.
>>>>>>> > TBH, cutting the branch this April does not look good to me. That
>>>>>>> means, we only have one month left for feature development of Spark 
>>>>>>> 3.2. Do
>>>>>>> we have enough features in the current master branch? If not, are we 
>>>>>>> able
>>>>>>> to finish major features we collected here? Do they have a timeline or
>>>>>>> project plan?
>>>>>>>
>>>>>>> Bests,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun <
>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, John.
>>>>>>>>
>>>>>>>> This thread aims to share your expectations and goals (and maybe
>>>>>>>> work progress) to Apache Spark 3.2 because we are making this 
>>>>>>>> together. :)
>>>>>>>>
>>>>>>>> Bests,
>>>>>>>> Dongjoon.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Mar 3, 2021 at 1:59 PM John Zhuge 
>>>>>>&

Re: Apache Spark 3.0.3 Release?

2021-06-08 Thread Hyukjin Kwon
Yeah, +1

2021년 6월 9일 (수) 오후 12:06, Yi Wu 님이 작성:

> Hi, All.
>
> Since Apache Spark 3.0.2 tag creation (Feb 16),
> new 119 patches (92 issues resolved) arrived at branch-3.0.
>
> Shall we make a new release, Apache Spark 3.0.3, as the 3rd release at
> the 3.0 line?
> I'd like to volunteer as the release manager for Apache Spark 3.0.3.
> I'm thinking about starting the first RC at the end of this week.
>
> $ git log --oneline v3.0.2..HEAD | wc -l
>  119
>
> # Known correctness issues
> SPARK-34534  New
> protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or
> correctness
> SPARK-34545 
> PySpark Python UDF return inconsistent results when applying 2 UDFs with
> different return type to 2 columns together
> SPARK-34719  fail
> if the view query has duplicated column names
> SPARK-34794 Nested
> higher-order functions broken in DSL
>
> # Notable user-facing changes
> SPARK-32924  Web
> UI sort on duration is wrong
> SPARK-35405 
>  Submitting Applications documentation has outdated information about K8s
> client mode support
>
> Thanks,
> Yi
>


Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Hyukjin Kwon
awesome!

2021년 6월 2일 (수) 오전 9:59, Dongjoon Hyun 님이 작성:

> We are happy to announce the availability of Spark 3.1.2!
>
> Spark 3.1.2 is a maintenance release containing stability fixes. This
> release is based on the branch-3.1 maintenance branch of Spark. We strongly
> recommend all 3.1 users to upgrade to this stable release.
>
> To download Spark 3.1.2, head over to the download page:
> https://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-1-2.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
> Dongjoon Hyun
>


Re: [VOTE] Release Spark 3.1.2 (RC1)

2021-05-25 Thread Hyukjin Kwon
+1

2021년 5월 26일 (수) 오전 9:00, Cheng Su 님이 작성:

> +1 (non-binding)
>
>
>
> Checked the related commits in commit history manually.
>
>
>
> Thanks!
>
> Cheng Su
>
>
>
> *From: *Takeshi Yamamuro 
> *Date: *Tuesday, May 25, 2021 at 4:47 PM
> *To: *Dongjoon Hyun , dev 
> *Subject: *Re: [VOTE] Release Spark 3.1.2 (RC1)
>
>
>
> +1 (non-binding)
>
>
>
> I ran the tests, checked the related jira tickets, and compared TPCDS
> performance differences between
>
> this v3.1.2 candidate and v3.1.1.
>
> Everything looks fine.
>
>
>
> Thank you, Dongjoon!
>
>
>
>
>
> On Wed, May 26, 2021 at 2:32 AM Gengliang Wang  wrote:
>
> SGTM. Thanks for the work!
>
>
>
> +1 (non-binding)
>
>
>
> On Wed, May 26, 2021 at 1:28 AM Dongjoon Hyun 
> wrote:
>
> Thank you, Sean and Gengliang.
>
>
>
> To Gengliang, it looks not that serious to me because that's a doc-only
> issue which also can be mitigated simply by updating `facetFilters` from
> htmls after release.
>
>
>
> Bests,
>
> Dongjoon.
>
>
>
>
>
> On Tue, May 25, 2021 at 9:45 AM Gengliang Wang  wrote:
>
> Hi Dongjoon,
>
>
>
> After Spark 3.1.1, we need an extra step for updating the DocSearch
> version index in the release process. I didn't expect Spark 3.1.2 to come
> at this time so I haven't updated the release process
>  until yesterday.
>
> I think we should use the latest branch-3.1 to regenerate the Spark
> documentation. See https://github.com/apache/spark/pull/32654 for
> details. I have also enhanced the release process script
>  for this.
>
>
>
> Thanks
>
> Gengliang
>
>
>
>
>
>
>
>
>
> On Tue, May 25, 2021 at 11:31 PM Sean Owen  wrote:
>
> +1 same result as in previous tests
>
>
>
> On Mon, May 24, 2021 at 1:14 AM Dongjoon Hyun 
> wrote:
>
> Please vote on releasing the following candidate as Apache Spark version
> 3.1.2.
>
> The vote is open until May 27th 1AM (PST) and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.1.2
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v3.1.2-rc1 (commit
> de351e30a90dd988b133b3d00fa6218bfcaba8b8):
> https://github.com/apache/spark/tree/v3.1.2-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.1.2-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1384/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.1.2-rc1-docs/
>
> The list of bug fixes going into 3.1.2 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12349602
>
> This release is using the release script of the tag v3.1.2-rc1.
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 3.1.2?
> ===
>
> The current list of open tickets targeted at 3.1.2 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.1.2
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>
>
> --
>
> ---
> Takeshi Yamamuro
>


Re: Resolves too old JIRAs as incomplete

2021-05-24 Thread Hyukjin Kwon
Awesome, thanks Takeshi!

2021년 5월 25일 (화) 오전 10:59, Takeshi Yamamuro 님이 작성:

> FYI:
>
> Thank you for all the comments.
> I closed 754 tickets in bulk a few minutes ago.
> Please let me know if there is any problem.
>
> Bests,
> Takeshi
>
> On Fri, May 21, 2021 at 10:29 AM Kent Yao  wrote:
>
>> +1,thanks Takeshi
>>
>> *Kent Yao *
>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>> *a spark enthusiast*
>> *kyuubi is a
>> unified multi-tenant JDBC interface for large-scale data processing and
>> analytics, built on top of Apache Spark .*
>> *spark-authorizer A Spark
>> SQL extension which provides SQL Standard Authorization for **Apache
>> Spark .*
>> *spark-postgres  A library
>> for reading data from and transferring data to Postgres / Greenplum with
>> Spark SQL and DataFrames, 10~100x faster.*
>> *itatchi A** library t**hat
>> brings useful functions from various modern database management systems
>> to​ **Apache Spark .*
>>
>>
>> On 05/21/2021 07:12, Takeshi Yamamuro  wrote:
>> Thank you, all~
>>
>> okay, so I will close them in bulk next week.
>> If you have more comments, please let me know here.
>>
>> Bests,
>> Takeshi
>>
>> On Fri, May 21, 2021 at 5:05 AM Mridul Muralidharan 
>> wrote:
>>
>>> +1, thanks Takeshi !
>>>
>>> Regards,
>>> Mridul
>>>
>>> On Wed, May 19, 2021 at 8:48 PM Takeshi Yamamuro 
>>> wrote:
>>>
 Hi, dev,

 As you know, we have too many open JIRAs now:
 # of open JIRAs=2698: JQL='project = SPARK AND status in (Open, "In
 Progress", Reopened)'

 We've recently released v2.4.8(EOL), so I'd like to bulk-close too old
 JIRAs
 for making the JIRAs manageable.

 As Hyukjin did the same action two years ago (for details, see:
 http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27838.html),
 I'm planning to use a similar JQL below to close them:

 project = SPARK AND status in (Open, "In Progress", Reopened) AND
 (affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*")))
 AND updated <= -52w

 The total number of matched JIRAs is 741.
 Or, we might be able to close them more aggressively by removing the
 version condition:

 project = SPARK AND status in (Open, "In Progress", Reopened) AND
 updated <= -52w

 The matched number is 1484 (almost half of the current open JIRAs).

 If there is no objection, I'd like to do it next week or later.
 Any thoughts?

 Bests,
 Takeshi
 --
 ---
 Takeshi Yamamuro

>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>>
>
> --
> ---
> Takeshi Yamamuro
>
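
As a rough illustration of the JQL above, the number of matching issues can be
double-checked against ASF JIRA's REST search endpoint before any bulk action.
This sketch is not from the thread; it assumes anonymous read access to the public
SPARK project and a local jq install, and it only counts matches without closing
anything.

  JQL='project = SPARK AND status in (Open, "In Progress", Reopened) AND updated <= -52w'
  curl -s -G 'https://issues.apache.org/jira/rest/api/2/search' \
       --data-urlencode "jql=${JQL}" \
       --data-urlencode 'maxResults=0' | jq '.total'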


Re: Resolves too old JIRAs as incomplete

2021-05-19 Thread Hyukjin Kwon
Yeah, I wanted to discuss this. I agree since 2.4.x became EOL

2021년 5월 20일 (목) 오전 10:54, Sean Owen 님이 작성:

> I agree. Such old JIRAs are 99% obsolete. If anyone objects to a
> particular issue being closed, they can comment and we can reopen. It's a
> very reversible thing. There is value in keeping JIRA up to date with
> reality.
>
> On Wed, May 19, 2021 at 8:47 PM Takeshi Yamamuro 
> wrote:
>
>> Hi, dev,
>>
>> As you know, we have too many open JIRAs now:
>> # of open JIRAs=2698: JQL='project = SPARK AND status in (Open, "In
>> Progress", Reopened)'
>>
>> We've recently released v2.4.8(EOL), so I'd like to bulk-close too old
>> JIRAs
>> for making the JIRAs manageable.
>>
>> As Hyukjin did the same action two years ago (for details, see:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27838.html),
>> I'm planning to use a similar JQL below to close them:
>>
>> project = SPARK AND status in (Open, "In Progress", Reopened) AND
>> (affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*")))
>> AND updated <= -52w
>>
>> The total number of matched JIRAs is 741.
>> Or, we might be able to close them more aggressively by removing the
>> version condition:
>>
>> project = SPARK AND status in (Open, "In Progress", Reopened) AND updated
>> <= -52w
>>
>> The matched number is 1484 (almost half of the current open JIRAs).
>>
>> If there is no objection, I'd like to do it next week or later.
>> Any thoughts?
>>
>> Bests,
>> Takeshi
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: [ANNOUNCE] Apache Spark 2.4.8 released

2021-05-17 Thread Hyukjin Kwon
Yay!

2021년 5월 18일 (화) 오후 12:57, Liang-Chi Hsieh 님이 작성:

> We are happy to announce the availability of Spark 2.4.8!
>
> Spark 2.4.8 is a maintenance release containing stability, correctness, and
> security fixes.
> This release is based on the branch-2.4 maintenance branch of Spark. We
> strongly recommend all 2.4 users to upgrade to this stable release.
>
> To download Spark 2.4.8, head over to the download page:
> http://spark.apache.org/downloads.html
>
> Note that you might need to clear your browser cache or to use
> `Private`/`Incognito` mode according to your browsers.
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-2-4-8.html
>
> We would like to acknowledge all community members for contributing to this
> release. This release would not have been possible without you.
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Apache Spark 3.1.2 Release?

2021-05-17 Thread Hyukjin Kwon
+1 thanks for driving this

On Tue, 18 May 2021, 09:33 Holden Karau,  wrote:

> +1 and thanks for volunteering to be the RM :)
>
> On Mon, May 17, 2021 at 4:09 PM Takeshi Yamamuro 
> wrote:
>
>> Thank you, Dongjoon~ sgtm, too.
>>
>> On Tue, May 18, 2021 at 7:34 AM Cheng Su  wrote:
>>
>>> +1 for a new release, thanks Dongjoon!
>>>
>>> Cheng Su
>>>
>>> On 5/17/21, 2:44 PM, "Liang-Chi Hsieh"  wrote:
>>>
>>> +1 sounds good. Thanks Dongjoon for volunteering on this!
>>>
>>>
>>> Liang-Chi
>>>
>>>
>>> Dongjoon Hyun-2 wrote
>>> > Hi, All.
>>> >
>>> > Since Apache Spark 3.1.1 tag creation (Feb 21),
>>> > new 172 patches including 9 correctness patches and 4 K8s patches
>>> arrived
>>> > at branch-3.1.
>>> >
>>> > Shall we make a new release, Apache Spark 3.1.2, as the second
>>> release at
>>> > 3.1 line?
>>> > I'd like to volunteer for the release manager for Apache Spark
>>> 3.1.2.
>>> > I'm thinking about starting the first RC next week.
>>> >
>>> > $ git log --oneline v3.1.1..HEAD | wc -l
>>> >  172
>>> >
>>> > # Known correctness issues
>>> > SPARK-34534 New protocol FetchShuffleBlocks in
>>> OneForOneBlockFetcher
>>> > lead to data loss or correctness
>>> > SPARK-34545 PySpark Python UDF return inconsistent results when
>>> > applying 2 UDFs with different return type to 2 columns together
>>> > SPARK-34681 Full outer shuffled hash join when building left
>>> side
>>> > produces wrong result
>>> > SPARK-34719 fail if the view query has duplicated column names
>>> > SPARK-34794 Nested higher-order functions broken in DSL
>>> > SPARK-34829 transform_values return identical values when it's
>>> used
>>> > with udf that returns reference type
>>> > SPARK-34833 Apply right-padding correctly for correlated
>>> subqueries
>>> > SPARK-35381 Fix lambda variable name issues in nested DataFrame
>>> > functions in R APIs
>>> > SPARK-35382 Fix lambda variable name issues in nested DataFrame
>>> > functions in Python APIs
>>> >
>>> > # Notable K8s patches since K8s GA
 >>> > SPARK-34674 Close SparkContext after the Main method has
 >>> finished
 >>> > SPARK-34948 Add ownerReference to executor configmap to fix
 >>> leakages
 >>> > SPARK-34820 add apt-update before gnupg install
 >>> > SPARK-34361 In case of downscaling avoid killing of executors
>>> already
>>> > known by the scheduler backend in the pod allocator
>>> >
>>> > Bests,
>>> > Dongjoon.
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Sent from:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-10 Thread Hyukjin Kwon
+1

2021년 5월 10일 (월) 오후 4:45, John Zhuge 님이 작성:

> No, just try to build a Java project with Maven RC repo.
>
> Validated checksum and signature; ran RAT checks; built the source and ran
> unit tests.
>
> +1 (non-binding)
>
> On Sun, May 9, 2021 at 11:10 PM Liang-Chi Hsieh  wrote:
>
>> Yea, I don't know why it happens.
>>
>> I remember RC1 also has the same issue. But RC2 and RC3 don't.
>>
>> Does it affect the RC?
>>
>>
>> John Zhuge wrote
>> > Got this error when browsing the staging repository:
>> >
>> > 404 - Repository "orgapachespark-1383 (staging: open)"
>> > [id=orgapachespark-1383] exists but is not exposed.
>> >
>> > John Zhuge
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>
> --
> John Zhuge
>
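
The checksum/signature validation mentioned above can be reproduced with a few
commands. This is a sketch only: gpg and shasum are assumed to be installed, the
binary package is assumed to have been downloaded from the RC bin directory already,
and the file name below is an assumed example rather than one taken from the thread.

  # Import the release signing keys (KEYS file as published for Spark RCs).
  curl -sO https://dist.apache.org/repos/dist/dev/spark/KEYS
  gpg --import KEYS

  # Verify the detached signature and compute the SHA-512 digest of the downloaded package;
  # compare the digest with the published .sha512 file for the same artifact.
  gpg --verify spark-2.4.8-bin-hadoop2.7.tgz.asc spark-2.4.8-bin-hadoop2.7.tgz
  shasum -a 512 spark-2.4.8-bin-hadoop2.7.tgz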


Re: [VOTE] Release Spark 2.4.8 (RC3)

2021-04-28 Thread Hyukjin Kwon
+1

On Thu, 29 Apr 2021, 07:08 Sean Owen,  wrote:

> +1 from me too, same result as last time.
>
> On Wed, Apr 28, 2021 at 11:33 AM Liang-Chi Hsieh  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.4.8.
>>
>> The vote is open until May 4th at 9AM PST and passes if a majority +1 PMC
>> votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.8
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> There are currently no issues targeting 2.4.8 (try project = SPARK AND
>> "Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In
>> Progress"))
>>
>> The tag to be voted on is v2.4.8-rc3 (commit
>> e89526d2401b3a04719721c923a6f630e555e286):
>> https://github.com/apache/spark/tree/v2.4.8-rc3
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc3-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1377/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc3-docs/
>>
>> The list of bug fixes going into 2.4.8 can be found at the following URL:
>> https://s.apache.org/spark-v2.4.8-rc3
>>
>> This release is using the release script of the tag v2.4.8-rc3.
>>
>> FAQ
>>
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.4.8?
>> ===
>>
>> The current list of open tickets targeted at 2.4.8 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.8
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [PSA] Please read: PR builder now runs test and build in your forked repository

2021-04-14 Thread Hyukjin Kwon
I remember it's turned on by default (?). If not, yeah, we should document it.

2021년 4월 15일 (목) 오후 1:14, Kent Yao 님이 작성:

> Thanks Hyukjin and Yikun,
>
> > 2. New Forks have to turn on GitHub action by the fork owner manually
>
> And we may still need a suitable place to make this note clearer to new
> contributors or someone delete and re-fork their forked repo.
>
> Thanks
>
>
> *Kent Yao *
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> *a spark enthusiast*
> *kyuubi <https://github.com/yaooqinn/kyuubi>is a unified multi-tenant JDBC
> interface for large-scale data processing and analytics, built on top
> of Apache Spark <http://spark.apache.org/>.*
> *spark-authorizer <https://github.com/yaooqinn/spark-authorizer>A Spark
> SQL extension which provides SQL Standard Authorization for **Apache
> Spark <http://spark.apache.org/>.*
> *spark-postgres <https://github.com/yaooqinn/spark-postgres> A library for
> reading data from and transferring data to Postgres / Greenplum with Spark
> SQL and DataFrames, 10~100x faster.*
> *itatchi <https://github.com/yaooqinn/spark-func-extras>A** library t**hat
> brings useful functions from various modern database management systems to 
> **Apache
> Spark <http://spark.apache.org/>.*
>
>
>
> On 04/15/2021 12:09,Hyukjin Kwon
>  wrote:
>
> The issue is fixed now. Please keep monitoring this. Thank you all! The
> spark community is super active and cooperative!
>
> 2021년 4월 15일 (목) 오전 11:01, Hyukjin Kwon 님이 작성:
>
>> The fix will be straightforward. We can either, in Github Actions
>> workflow,:
>> - remove fast forward option and see if it works
>> - or git rebase before merge the branch
>>
>> 2021년 4월 15일 (목) 오전 11:00, Hyukjin Kwon 님이 작성:
>>
>>> I think it works mostly correctly as Dongjoon investigated and shared
>>> (Thanks a lot!).
>>> One problem seems to be syncing to the master seems too strict (
>>> https://github.com/apache/spark/pull/32168#issuecomment-819736508).
>>> Thanks Yikun.
>>> I think we should make it less strict. I can create a PR right away but
>>> would like to encourage Yikun or Kent to do it in order to keep the credits
>>> of their investigation.
>>>
>>> 2021년 4월 15일 (목) 오전 7:21, Dongjoon Hyun 님이 작성:
>>>
>>>> Hi, Kent.
>>>>
>>>> I checked (1) in your PR, but those test result comments look correct
>>>> to me.
>>>> Please note that both Jenkins and GitHub Action leave the same number
>>>> of comments on the same GitHash.
>>>> Given that, there are not fake comments. It looks like a real result of
>>>> your commits on that PR.
>>>>
>>>> GitHash: 23248c3
>>>>  https://github.com/apache/spark/pull/32144#issuecomment-819679970
>>>> (GitHub Action)
>>>>  https://github.com/apache/spark/pull/32144#issuecomment-819647368
>>>> (Jenkins)
>>>>
>>>> GitHash: 8dbed7b
>>>> https://github.com/apache/spark/pull/32144#issuecomment-819684782
>>>> (GitHub Action)
>>>> https://github.com/apache/spark/pull/32144#issuecomment-819578976
>>>> (Jenkins)
>>>>
>>>> GitHash: a3a6c5e
>>>> https://github.com/apache/spark/pull/32144#issuecomment-819690465
>>>> (GitHub Action)
>>>> https://github.com/apache/spark/pull/32144#issuecomment-819793557
>>>> (Jenkins)
>>>>
>>>> GitHash: b6d26b7
>>>> https://github.com/apache/spark/pull/32144#issuecomment-819691416
>>>> (GitHub Action)
>>>> https://github.com/apache/spark/pull/32144#issuecomment-819791485
>>>> (Jenkins)
>>>>
>>>> Could you recheck it?
>>>>
>>>>
>>>> 1. Github-actions notification could be wrong when another PR opened
>>>>> with some same commits, and you will get a lot of fake comments then.
>>>>> Meanwhile, the new PR get no comments, even if it is actually the
>>>>> chosen one.
>>>>>1.1
>>>>> https://github.com/apache/spark/pull/32144#issuecomment-819679970
>>>>>
>>>>
>>>>
>>>> On Wed, Apr 14, 2021 at 10:41 AM Kent Yao  wrote:
>>>>
>>>>> Hi ALL, here is something I notice after this change:
>>>>>
>>>>> 1. Github-actions notification could be wrong when another PR opened
>>>>> with some same commits, and you will get a lot of f

Re: [PSA] Please read: PR builder now runs test and build in your forked repository

2021-04-14 Thread Hyukjin Kwon
The issue is fixed now. Please keep monitoring this. Thank you all! The
spark community is super active and cooperative!

2021년 4월 15일 (목) 오전 11:01, Hyukjin Kwon 님이 작성:

> The fix will be straightforward. We can either, in Github Actions
> workflow,:
> - remove fast forward option and see if it works
> - or git rebase before merge the branch
>
> 2021년 4월 15일 (목) 오전 11:00, Hyukjin Kwon 님이 작성:
>
>> I think it works mostly correctly as Dongjoon investigated and shared
>> (Thanks a lot!).
>> One problem seems to be syncing to the master seems too strict (
>> https://github.com/apache/spark/pull/32168#issuecomment-819736508).
>> Thanks Yikun.
>> I think we should make it less strict. I can create a PR right away but
>> would like to encourage Yikun or Kent to do it in order to keep the credits
>> of their investigation.
>>
>> 2021년 4월 15일 (목) 오전 7:21, Dongjoon Hyun 님이 작성:
>>
>>> Hi, Kent.
>>>
>>> I checked (1) in your PR, but those test result comments look correct to
>>> me.
>>> Please note that both Jenkins and GitHub Action leave the same number of
>>> comments on the same GitHash.
>>> Given that, there are not fake comments. It looks like a real result of
>>> your commits on that PR.
>>>
>>> GitHash: 23248c3
>>>  https://github.com/apache/spark/pull/32144#issuecomment-819679970
>>> (GitHub Action)
>>>  https://github.com/apache/spark/pull/32144#issuecomment-819647368
>>> (Jenkins)
>>>
>>> GitHash: 8dbed7b
>>> https://github.com/apache/spark/pull/32144#issuecomment-819684782
>>> (GitHub Action)
>>> https://github.com/apache/spark/pull/32144#issuecomment-819578976
>>> (Jenkins)
>>>
>>> GitHash: a3a6c5e
>>> https://github.com/apache/spark/pull/32144#issuecomment-819690465
>>> (GitHub Action)
>>> https://github.com/apache/spark/pull/32144#issuecomment-819793557
>>> (Jenkins)
>>>
>>> GitHash: b6d26b7
>>> https://github.com/apache/spark/pull/32144#issuecomment-819691416
>>> (GitHub Action)
>>> https://github.com/apache/spark/pull/32144#issuecomment-819791485
>>> (Jenkins)
>>>
>>> Could you recheck it?
>>>
>>>
>>> 1. Github-actions notification could be wrong when another PR opened
>>>> with some same commits, and you will get a lot of fake comments then.
>>>> Meanwhile, the new PR get no comments, even if it is actually the
>>>> chosen one.
>>>>1.1
>>>> https://github.com/apache/spark/pull/32144#issuecomment-819679970
>>>>
>>>
>>>
>>> On Wed, Apr 14, 2021 at 10:41 AM Kent Yao  wrote:
>>>
>>>> Hi ALL, here is something I notice after this change:
>>>>
>>>> 1. Github-actions notification could be wrong when another PR opened
>>>> with some same commits, and you will get a lot of fake comments then.
>>>> Meanwhile, the new PR get no comments, even if it is actually the
>>>> chosen one.
>>>>1.1
>>>> https://github.com/apache/spark/pull/32144#issuecomment-819679970
>>>> 2. New Forks have to turn on GitHub action by the fork owner manually
>>>> 3. `Notify test workflow` keeps waiting when the build flow canceled
>>>> or the whole fork gone
>>>> 4. After refreshed master or even re-forked :(, I still got failures
>>>> and seems not alone
>>>>4.1. https://github.com/apache/spark/pull/32168 (PR after sync)
>>>>4.2. https://github.com/apache/spark/pull/32172 (PR after re-forked)
>>>>4.3.
>>>> https://github.com/attilapiros/spark/runs/2344911058?check_suite_focus=true
>>>> (some other failures noticed)
>>>>
>>>>
>>>> Bests,
>>>>
>>>> Kent
>>>>
>>>> Dongjoon Hyun  于2021年4月14日周三 下午11:34写道:
>>>> >
>>>> > Thank you again, Hyukjin.
>>>> >
>>>> > Bests,
>>>> > Dongjoon.
>>>> >
>>>> > On Wed, Apr 14, 2021 at 5:25 AM Kent Yao  wrote:
>>>> >>
>>>> >> Cool, thanks!
>>>> >>
>>>> >> Hyukjin Kwon  于2021年4月14日周三 下午8:19写道:
>>>> >>>
>>>> >>> Good point! I had to clarify.
>>>> >>> Once is enough. The sync is needed for your branch to include the
>>>> changes of https://github.com/apache/spark/pull/32092.
>>>

Re: [PSA] Please read: PR builder now runs test and build in your forked repository

2021-04-14 Thread Hyukjin Kwon
The fix will be straightforward. In the GitHub Actions workflow, we can either:
- remove the fast-forward option and see if it works
- or git rebase before merging the branch

2021년 4월 15일 (목) 오전 11:00, Hyukjin Kwon 님이 작성:

> I think it works mostly correctly as Dongjoon investigated and shared
> (Thanks a lot!).
> One problem seems to be syncing to the master seems too strict (
> https://github.com/apache/spark/pull/32168#issuecomment-819736508).
> Thanks Yikun.
> I think we should make it less strict. I can create a PR right away but
> would like to encourage Yikun or Kent to do it in order to keep the credits
> of their investigation.
>
> 2021년 4월 15일 (목) 오전 7:21, Dongjoon Hyun 님이 작성:
>
>> Hi, Kent.
>>
>> I checked (1) in your PR, but those test result comments look correct to
>> me.
>> Please note that both Jenkins and GitHub Action leave the same number of
>> comments on the same GitHash.
>> Given that, there are not fake comments. It looks like a real result of
>> your commits on that PR.
>>
>> GitHash: 23248c3
>>  https://github.com/apache/spark/pull/32144#issuecomment-819679970
>> (GitHub Action)
>>  https://github.com/apache/spark/pull/32144#issuecomment-819647368
>> (Jenkins)
>>
>> GitHash: 8dbed7b
>> https://github.com/apache/spark/pull/32144#issuecomment-819684782
>> (GitHub Action)
>> https://github.com/apache/spark/pull/32144#issuecomment-819578976
>> (Jenkins)
>>
>> GitHash: a3a6c5e
>> https://github.com/apache/spark/pull/32144#issuecomment-819690465
>> (GitHub Action)
>> https://github.com/apache/spark/pull/32144#issuecomment-819793557
>> (Jenkins)
>>
>> GitHash: b6d26b7
>> https://github.com/apache/spark/pull/32144#issuecomment-819691416
>> (GitHub Action)
>> https://github.com/apache/spark/pull/32144#issuecomment-819791485
>> (Jenkins)
>>
>> Could you recheck it?
>>
>>
>> 1. Github-actions notification could be wrong when another PR opened
>>> with some same commits, and you will get a lot of fake comments then.
>>> Meanwhile, the new PR get no comments, even if it is actually the
>>> chosen one.
>>>1.1 https://github.com/apache/spark/pull/32144#issuecomment-819679970
>>>
>>
>>
>> On Wed, Apr 14, 2021 at 10:41 AM Kent Yao  wrote:
>>
>>> Hi ALL, here is something I notice after this change:
>>>
>>> 1. Github-actions notification could be wrong when another PR opened
>>> with some same commits, and you will get a lot of fake comments then.
>>> Meanwhile, the new PR get no comments, even if it is actually the
>>> chosen one.
>>>1.1 https://github.com/apache/spark/pull/32144#issuecomment-819679970
>>> 2. New Forks have to turn on GitHub action by the fork owner manually
>>> 3. `Notify test workflow` keeps waiting when the build flow canceled
>>> or the whole fork gone
>>> 4. After refreshed master or even re-forked :(, I still got failures
>>> and seems not alone
>>>4.1. https://github.com/apache/spark/pull/32168 (PR after sync)
>>>4.2. https://github.com/apache/spark/pull/32172 (PR after re-forked)
>>>4.3.
>>> https://github.com/attilapiros/spark/runs/2344911058?check_suite_focus=true
>>> (some other failures noticed)
>>>
>>>
>>> Bests,
>>>
>>> Kent
>>>
>>> Dongjoon Hyun  于2021年4月14日周三 下午11:34写道:
>>> >
>>> > Thank you again, Hyukjin.
>>> >
>>> > Bests,
>>> > Dongjoon.
>>> >
>>> > On Wed, Apr 14, 2021 at 5:25 AM Kent Yao  wrote:
>>> >>
>>> >> Cool, thanks!
>>> >>
>>> >> Hyukjin Kwon  于2021年4月14日周三 下午8:19写道:
>>> >>>
>>> >>> Good point! I had to clarify.
>>> >>> Once is enough. The sync is needed for your branch to include the
>>> changes of https://github.com/apache/spark/pull/32092.
>>> >>>
>>> >>>
>>> >>> 2021년 4월 14일 (수) 오후 9:11, Kent Yao 님이 작성:
>>> >>>>
>>> >>>> Hi Hyukjin,
>>> >>>>
>>> >>>> > Please sync your branch to the latest master branch in Apache
>>> Spark in order for the main repository to run the workflow and detect it.
>>> >>>>
>>> >>>> Do we need to sync master for every PR or just one-time cost to
>>> keep up with the current master branch?
>>> >>>>
>>> >>>> Ken

Re: [PSA] Please read: PR builder now runs test and build in your forked repository

2021-04-14 Thread Hyukjin Kwon
I think it works mostly correctly as Dongjoon investigated and shared
(Thanks a lot!).
One problem seems to be that the check for syncing to the latest master is too
strict (https://github.com/apache/spark/pull/32168#issuecomment-819736508). Thanks
Yikun.
I think we should make it less strict. I can create a PR right away but
would like to encourage Yikun or Kent to do it in order to keep the credits
of their investigation.

2021년 4월 15일 (목) 오전 7:21, Dongjoon Hyun 님이 작성:

> Hi, Kent.
>
> I checked (1) in your PR, but those test result comments look correct to
> me.
> Please note that both Jenkins and GitHub Action leave the same number of
> comments on the same GitHash.
> Given that, there are not fake comments. It looks like a real result of
> your commits on that PR.
>
> GitHash: 23248c3
>  https://github.com/apache/spark/pull/32144#issuecomment-819679970
> (GitHub Action)
>  https://github.com/apache/spark/pull/32144#issuecomment-819647368
> (Jenkins)
>
> GitHash: 8dbed7b
> https://github.com/apache/spark/pull/32144#issuecomment-819684782
> (GitHub Action)
> https://github.com/apache/spark/pull/32144#issuecomment-819578976
> (Jenkins)
>
> GitHash: a3a6c5e
> https://github.com/apache/spark/pull/32144#issuecomment-819690465
> (GitHub Action)
> https://github.com/apache/spark/pull/32144#issuecomment-819793557
> (Jenkins)
>
> GitHash: b6d26b7
> https://github.com/apache/spark/pull/32144#issuecomment-819691416
> (GitHub Action)
> https://github.com/apache/spark/pull/32144#issuecomment-819791485
> (Jenkins)
>
> Could you recheck it?
>
>
> 1. Github-actions notification could be wrong when another PR opened
>> with some same commits, and you will get a lot of fake comments then.
>> Meanwhile, the new PR get no comments, even if it is actually the
>> chosen one.
>>1.1 https://github.com/apache/spark/pull/32144#issuecomment-819679970
>>
>
>
> On Wed, Apr 14, 2021 at 10:41 AM Kent Yao  wrote:
>
>> Hi ALL, here is something I notice after this change:
>>
>> 1. Github-actions notification could be wrong when another PR opened
>> with some same commits, and you will get a lot of fake comments then.
>> Meanwhile, the new PR get no comments, even if it is actually the
>> chosen one.
>>1.1 https://github.com/apache/spark/pull/32144#issuecomment-819679970
>> 2. New Forks have to turn on GitHub action by the fork owner manually
>> 3. `Notify test workflow` keeps waiting when the build flow canceled
>> or the whole fork gone
>> 4. After refreshed master or even re-forked :(, I still got failures
>> and seems not alone
>>4.1. https://github.com/apache/spark/pull/32168 (PR after sync)
>>4.2. https://github.com/apache/spark/pull/32172 (PR after re-forked)
>>4.3.
>> https://github.com/attilapiros/spark/runs/2344911058?check_suite_focus=true
>> (some other failures noticed)
>>
>>
>> Bests,
>>
>> Kent
>>
>> Dongjoon Hyun  于2021年4月14日周三 下午11:34写道:
>> >
>> > Thank you again, Hyukjin.
>> >
>> > Bests,
>> > Dongjoon.
>> >
>> > On Wed, Apr 14, 2021 at 5:25 AM Kent Yao  wrote:
>> >>
>> >> Cool, thanks!
>> >>
>> >> Hyukjin Kwon  于2021年4月14日周三 下午8:19写道:
>> >>>
>> >>> Good point! I had to clarify.
>> >>> Once is enough. The sync is needed for your branch to include the
>> changes of https://github.com/apache/spark/pull/32092.
>> >>>
>> >>>
>> >>> 2021년 4월 14일 (수) 오후 9:11, Kent Yao 님이 작성:
>> >>>>
>> >>>> Hi Hyukjin,
>> >>>>
>> >>>> > Please sync your branch to the latest master branch in Apache
>> Spark in order for the main repository to run the workflow and detect it.
>> >>>>
>> >>>> Do we need to sync master for every PR or just one-time cost to keep
>> up with the current master branch?
>> >>>>
>> >>>> Kent Yao
>> >>>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>> >>>> a spark enthusiast
>> >>>> kyuubiis a unified multi-tenant JDBC interface for large-scale data
>> processing and analytics, built on top of Apache Spark.
>> >>>>
>> >>>> spark-authorizerA Spark SQL extension which provides SQL Standard
>> Authorization for Apache Spark.
>> >>>> spark-postgres A library for reading data from and transferring data
>> to Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
>> >>>

Re: please read: current state and the future of the apache spark build system

2021-04-14 Thread Hyukjin Kwon
Thanks Shane!!

On Thu, 15 Apr 2021, 09:03 shane knapp ☠,  wrote:

> medium term (in 6 months):
>> * prepare jenkins worker ansible configs and stick in the spark repo
>>   - nothing fancy, but enough to config ubuntu workers
>>   - could be used to create docker containers for testing in
>> THE CLOUD
>>
>> fwiw, i just decided to bang this out today:
> https://github.com/apache/spark/pull/32178
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


Re: [PSA] Please read: PR builder now runs test and build in your forked repository

2021-04-14 Thread Hyukjin Kwon
Good point! I had to clarify.
Once is enough. The sync is needed for your branch to include the changes
of https://github.com/apache/spark/pull/32092.


2021년 4월 14일 (수) 오후 9:11, Kent Yao 님이 작성:

> Hi Hyukjin,
>
> > Please sync your branch to the latest master branch in Apache Spark in
> order for the main repository to run the workflow and detect it.
>
> Do we need to sync master for every PR or just one-time cost to keep up
> with the current master branch?
>
> *Kent Yao *
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> *a spark enthusiast*
> *kyuubi <https://github.com/yaooqinn/kyuubi>is a unified multi-tenant JDBC
> interface for large-scale data processing and analytics, built on top
> of Apache Spark <http://spark.apache.org/>.*
> *spark-authorizer <https://github.com/yaooqinn/spark-authorizer>A Spark
> SQL extension which provides SQL Standard Authorization for **Apache
> Spark <http://spark.apache.org/>.*
> *spark-postgres <https://github.com/yaooqinn/spark-postgres> A library for
> reading data from and transferring data to Postgres / Greenplum with Spark
> SQL and DataFrames, 10~100x faster.*
> *spark-func-extras <https://github.com/yaooqinn/spark-func-extras>A
> library that brings excellent and useful functions from various modern
> database management systems to Apache Spark <http://spark.apache.org/>.*
>
>
>
> On 04/14/2021 15:41,Kent Yao  wrote:
>
> Cool~Thanks, Hyukjin
>
> Yuanjian Li  于2021年4月14日周三 下午3:39写道:
>
>> Awesome! Thanks for making this happen, Hyukjin!
>>
>> Yi Wu  于2021年4月14日周三 下午2:51写道:
>>
>>> Thanks for the great work, Hyukjin!
>>>
>>> On Wed, Apr 14, 2021 at 1:00 PM Gengliang Wang  wrote:
>>>
>>>> Thanks for the amazing work, Hyukjin!
>>>> I created a PR for trial and it looks well so far:
>>>> https://github.com/apache/spark/pull/32158
>>>>
>>>> On Wed, Apr 14, 2021 at 12:47 PM Hyukjin Kwon 
>>>> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> After https://github.com/apache/spark/pull/32092 merged, now we run
>>>>> the GitHub Actions
>>>>> workflows in your forked repository.
>>>>>
>>>>> In short, please see this example HyukjinKwon#34
>>>>> <https://github.com/HyukjinKwon/spark/pull/34>
>>>>>
>>>>>1. You create a PR and your repository triggers the workflow. Your
>>>>>PR uses the resources allocated to you for testing.
>>>>>2. Apache Spark repository finds your workflow, and links it in a
>>>>>comment in your PR
>>>>>
>>>>> Please let me know if you guys find any weird behaviour related to
>>>>> this.
>>>>>
>>>>>
>>>>> *What does that mean to contributors?*
>>>>>
>>>>> Please sync your branch to the latest master branch in Apache Spark in
>>>>> order for your forked repository to run the workflow, and
>>>>> for the main repository to detect the workflow.
>>>>>
>>>>>
>>>>> *What does that mean to committers?*
>>>>>
>>>>> Now, GitHub Actions will show a green even when GitHub Actions builds
>>>>> are running (in contributor's forked repository).
>>>>> Please check the build notified by github-actions bot before merging
>>>>> it.
>>>>> There would be a followup work to reflect the status of the forked
>>>>> repository's build to the status of PR.
>>>>>
>>>>> 2021년 4월 14일 (수) 오후 1:42, Hyukjin Kwon 님이 작성:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> After https://github.com/apache/spark/pull/32092 merged, now we run
>>>>>> the GitHub Actions
>>>>>> workflows in your forked repository.
>>>>>>
>>>>>> In short, please see this example HyukjinKwon#34
>>>>>> <https://github.com/HyukjinKwon/spark/pull/34>
>>>>>>
>>>>>>1. You create a PR and your repository triggers the workflow.
>>>>>>Your PR uses the resources allocated to you for testing.
>>>>>>2. Apache Spark repository finds your workflow, and links it in a
>>>>>>comment in your PR
>>>>>>
>>>>>> Please let me know if you guys find any weird behaviour related to
>>>>>> this.
>>>>>>
>>>>>>
>>>>>> *What does that mean to contributors?*
>>>>>>
>>>>>> Please sync your branch to the latest master branch in Apache Spark
>>>>>> in order for the main repository to run the workflow and detect it.
>>>>>>
>>>>>>
>>>>>> *What does that mean to committers?*
>>>>>>
>>>>>> Now, GitHub Actions will show a green even when GitHub Actions builds
>>>>>> are running (in contributor's forked repository). Please check the build
>>>>>> notified by github-actions bot before merging it.
>>>>>> There would be a followup work to reflect the status of the forked
>>>>>> repository's build to
>>>>>> the status of PR.
>>>>>>
>>>>>>
>>>>>>


Re: [PSA] Please read: PR builder now runs test and build in your forked repository

2021-04-13 Thread Hyukjin Kwon
Hi all,

After https://github.com/apache/spark/pull/32092 merged, now we run the
GitHub Actions
workflows in your forked repository.

In short, please see this example HyukjinKwon#34
<https://github.com/HyukjinKwon/spark/pull/34>

   1. You create a PR and your repository triggers the workflow. Your PR
   uses the resources allocated to you for testing.
   2. Apache Spark repository finds your workflow, and links it in a
   comment in your PR

Please let me know if you guys find any weird behaviour related to this.


*What does that mean to contributors?*

Please sync your branch to the latest master branch in Apache Spark in
order for your forked repository to run the workflow, and
for the main repository to detect the workflow.


*What does that mean to committers?*

Now, GitHub Actions will show a green even when GitHub Actions builds are
running (in contributor's forked repository).
Please check the build notified by github-actions bot before merging it.
There would be a followup work to reflect the status of the forked
repository's build to the status of PR.

2021년 4월 14일 (수) 오후 1:42, Hyukjin Kwon 님이 작성:

> Hi all,
>
> After https://github.com/apache/spark/pull/32092 merged, now we run the
> GitHub Actions
> workflows in your forked repository.
>
> In short, please see this example HyukjinKwon#34
> <https://github.com/HyukjinKwon/spark/pull/34>
>
>1. You create a PR and your repository triggers the workflow. Your PR
>uses the resources allocated to you for testing.
>2. Apache Spark repository finds your workflow, and links it in a
>comment in your PR
>
> Please let me know if you guys find any weird behaviour related to this.
>
>
> *What does that mean to contributors?*
>
> Please sync your branch to the latest master branch in Apache Spark in
> order for the main repository to run the workflow and detect it.
>
>
> *What does that mean to committers?*
>
> Now, GitHub Actions will show a green even when GitHub Actions builds are
> running (in contributor's forked repository). Please check the build
> notified by github-actions bot before merging it.
> There would be a followup work to reflect the status of the forked
> repository's build to
> the status of PR.
>
>
>


[PSA] Please read: PR builder now runs test and build in your forked repository

2021-04-13 Thread Hyukjin Kwon
Hi all,

After https://github.com/apache/spark/pull/32092 merged, now we run the
GitHub Actions
workflows in your forked repository.

In short, please see this example HyukjinKwon#34


   1. You create a PR and your repository triggers the workflow. Your PR
   uses the resources allocated to you for testing.
   2. Apache Spark repository finds your workflow, and links it in a
   comment in your PR

Please let me know if you guys find any weird behaviour related to this.


*What does that mean to contributors?*

Please sync your branch to the latest master branch in Apache Spark in
order for the main repository to run the workflow and detect it.


*What does that mean to committers?*

Now, GitHub Actions will show a green even when GitHub Actions builds are
running (in contributor's forked repository). Please check the build
notified by github-actions bot before merging it.
There would be a followup work to reflect the status of the forked
repository's build to
the status of PR.
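
For contributors who have not synced a fork before, the "sync your branch to the
latest master branch" step above usually comes down to a few git commands. A minimal
sketch, assuming the fork is cloned locally with origin pointing at the fork and no
upstream remote configured yet (the remote names are illustrative, not from the mail):

  git remote add upstream https://github.com/apache/spark.git
  git fetch upstream
  git checkout master
  git merge --ff-only upstream/master   # or: git rebase upstream/master
  git push origin master                # the fork now carries the new workflow changes

After that, rebasing or merging the PR branch onto the fork's updated master lets the
forked repository run the workflow that the main repository then detects and links.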


Re: [DISCUSS] Build error message guideline

2021-04-13 Thread Hyukjin Kwon
I would just go ahead and create a PR for that. Nothing written there looks
unreasonable.
But it would probably be best to wait a couple of days to make sure people
are happy with it.

2021년 4월 14일 (수) 오전 6:38, Karen 님이 작성:

> If the proposed guidelines look good, it would be useful to share these
> guidelines with the wider community. A good landing page for contributors
> could be https://spark.apache.org/contributing.html. What do you think?
>
> Thank you,
>
> Karen Feng
>
> On Wed, Apr 7, 2021 at 8:19 PM Hyukjin Kwon  wrote:
>
>> LGTM (I took a look, and had some offline discussions w/ some corrections
>> before it came out)
>>
>> 2021년 4월 8일 (목) 오전 5:28, Karen 님이 작성:
>>
>>> Hi all,
>>>
>>> As discussed in SPIP: Standardize Exception Messages in Spark (
>>> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing),
>>> improving error message quality in Apache Spark involves establishing an
>>> error message guideline for developers. Error message style guidelines are
>>> common practice across open-source projects, for example PostgreSQL (
>>> https://www.postgresql.org/docs/current/error-style-guide.html).
>>>
>>> To move towards the goal of improving error message quality, we would
>>> like to start building an error message guideline. We have attached a rough
>>> draft to kick off this discussion:
>>> https://docs.google.com/document/d/12k4zmaKmmdm6Pk63HS0N1zN1QT-6TihkWaa5CkLmsn8/edit?usp=sharing
>>> .
>>>
>>> Please let us know what you think should be in the guideline! We look
>>> forward to building this as a community.
>>>
>>> Thank you,
>>>
>>> Karen Feng
>>>
>>


Re: [VOTE] Release Spark 2.4.8 (RC2)

2021-04-12 Thread Hyukjin Kwon
+1

On Tue, 13 Apr 2021, 02:58 Sean Owen,  wrote:

> +1 same result as last RC for me.
>
> On Mon, Apr 12, 2021, 12:53 AM Liang-Chi Hsieh  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 2.4.8.
>>
>> The vote is open until Apr 15th at 9AM PST and passes if a majority +1 PMC
>> votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 2.4.8
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> There are currently no issues targeting 2.4.8 (try project = SPARK AND
>> "Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In
>> Progress"))
>>
>> The tag to be voted on is v2.4.8-rc2 (commit
>> a0ab27ca6b46b8e5a7ae8bb91e30546082fc551c):
>> https://github.com/apache/spark/tree/v2.4.8-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1373/
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc2-docs/
>>
>> The list of bug fixes going into 2.4.8 can be found at the following URL:
>> https://s.apache.org/spark-v2.4.8-rc2
>>
>> This release is using the release script of the tag v2.4.8-rc2.
>>
>> FAQ
>>
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks, in the Java/Scala
>> you can add the staging repository to your projects resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 2.4.8?
>> ===
>>
>> The current list of open tickets targeted at 2.4.8 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 2.4.8
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Hyukjin Kwon
- builds

FYI, cc'ing Spark dev was dropped during the discussion. If you haven't
subscribed to builds@a.g, you have seen only part of the discussion.
Please subscribe to the bui...@apache.org mailing list to participate in the
discussion further.


On Thu, Apr 8, 2021 at 1:50 PM, Wenchen Fan wrote:

> > for example, having sub-groups where each group shares the resources -
> currently one GitHub organisation shares all resources across the projects.
>
> That's a good idea. We do need to thank GitHub for giving free resources to
> ASF projects, but it would be better if we could make it a business: we allow
> individual projects to sign deals with GitHub to get dedicated resources.
> It's a bit wasteful to ask every project to set up its own dev ops;
> using GitHub Actions is more convenient. Maybe we should raise this with GitHub?
>
> On Wed, Apr 7, 2021 at 9:31 PM Hyukjin Kwon  wrote:
>
>> Thanks Martin for your feedback.
>>
>> > What was your reason to migrate from Apache Jenkins to Github Actions ?
>>
>> I am sure there were more reasons for migrating from AMPLab Jenkins
>> <https://amplab.cs.berkeley.edu/jenkins/> to GitHub Actions but as far
>> as I can remember:
>> - To reduce the maintenance cost of machines
>> - The Jenkins machines became unstable and slow causing CI jobs to fail
>> or be very flaky.
>> - Difficulty to manage the installed libraries.
>> - Intermittent unknown issues in the machines
>>
>> Yes, one option might be to consider other options to migrate again.
>> However, other projects will very likely suffer the
>> same problem. In addition, the migration in a large project is not an
>> easy work to do
>>
>> I would like to know the feasibility of having more resources in GitHub
>> Actions, or, for example, having sub-groups where
>> each group shares the resources - currently one GitHub organisation
>> shares all resources across the projects.
>>
>>
>> On Wed, Apr 7, 2021 at 10:04 PM, Martin Grigorov wrote:
>>
>>>
>>>
>>> On Wed, Apr 7, 2021 at 3:41 PM Hyukjin Kwon  wrote:
>>>
>>>> Hi Greg,
>>>>
>>>> I raised this thread to figure out a way that we can work together to
>>>> resolve this issue, gather feedback, and to understand how other
>>>> projects
>>>> work around.
>>>> Several projects I observed, as far as I can tell, have made enough
>>>> efforts
>>>> to save the resources in GitHub Actions but still suffer from the lack
>>>> of
>>>> resources.
>>>>
>>>
>>> And it will get even worse because:
>>> 1) more and more Apache projects migrate from TravisCI to Github Actions
>>> (GA)
>>> 2) new projects join ASF and many of them already use GA
>>>
>>>
>>> What was your reason to migrate from Apache Jenkins to Github Actions ?
>>> If you want dedicated resources then you will need to manage the CI
>>> yourself.
>>> You could use Apache Jenkins/Buildbot with dedicated agents for your
>>> project.
>>> Or you could set up your own CI infrastructure with Jenkins, DroneIO,
>>> ConcourceCI, ...
>>>
>>> Yet another option is to move to CircleCI or Cirrus. They are similar to
>>> TravisCI / GA and less crowded (for now).
>>>
>>> Martin
>>>
>>> I appreciate the resources provided to us but that does not resolve the
>>>> issue of the development being slowed down.
>>>>
>>>>
>>>> On Wed, Apr 7, 2021 at 5:52 PM, Greg Stein wrote:
>>>>
>>>> > On Wed, Apr 7, 2021 at 12:25 AM Hyukjin Kwon 
>>>> wrote:
>>>> >
>>>> >> Hi all,
>>>> >>
>>>> >> I am an Apache Spark PMC,
>>>> >
>>>> >
>>>> > You are a member of the Apache Spark PMC. You are *not* a PMC. Please
>>>> stop
>>>> > with that terminology. The Foundation has about 200 PMCs, and you are
>>>> a
>>>> > member of one of them. You are NOT a "PMC" .. you're a person. A PMC
>>>> is a
>>>> > construct of the Foundation.
>>>> >
>>>> > >...
>>>> >
>>>> >> I am aware of the limited GitHub Actions resources that are shared
>>>> >> across all projects in ASF,
>>>> >> and many projects suffer from it. This issue significantly slows
>>>> down the
>>>> >> development cycle of
>>>> >>  other projects, at least Apache Spark.
>>>> >>
>>>> >
>>>> > And the Foundation gets those build minutes for GitHub Actions
>>>> provided to
>>>> > us from GitHub and Microsoft, and we are thankful that they provide
>>>> them to
>>>> > the Foundation. Maybe it isn't all the build minutes that every group
>>>> > wants, but that is what we have. So it is incumbent upon all of us to
>>>> > figure out how to build more, with fewer minutes.
>>>> >
>>>> > Say "thank you" to GitHub, please.
>>>> >
>>>> > Regards,
>>>> > -g
>>>> >
>>>> >
>>>>
>>>


Re: [DISCUSS] Build error message guideline

2021-04-07 Thread Hyukjin Kwon
LGTM (I took a look, and had some offline discussions w/ some corrections
before it came out)

On Thu, Apr 8, 2021 at 5:28 AM, Karen wrote:

> Hi all,
>
> As discussed in SPIP: Standardize Exception Messages in Spark (
> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing),
> improving error message quality in Apache Spark involves establishing an
> error message guideline for developers. Error message style guidelines are
> common practice across open-source projects, for example PostgreSQL (
> https://www.postgresql.org/docs/current/error-style-guide.html).
>
> To move towards the goal of improving error message quality, we would like
> to start building an error message guideline. We have attached a rough
> draft to kick off this discussion:
> https://docs.google.com/document/d/12k4zmaKmmdm6Pk63HS0N1zN1QT-6TihkWaa5CkLmsn8/edit?usp=sharing
> .
>
> Please let us know what you think should be in the guideline! We look
> forward to building this as a community.
>
> Thank you,
>
> Karen Feng
>


Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Hyukjin Kwon
Thanks Martin for your feedback.

> What was your reason to migrate from Apache Jenkins to Github Actions ?

I am sure there were more reasons for migrating from AMPLab Jenkins
<https://amplab.cs.berkeley.edu/jenkins/> to GitHub Actions but as far as I
can remember:
- To reduce the maintenance cost of machines
- The Jenkins machines became unstable and slow causing CI jobs to fail or
be very flaky.
- Difficulty to manage the installed libraries.
- Intermittent unknown issues in the machines

Yes, one option might be to consider other options and migrate again.
However, other projects will very likely suffer from the
same problem. In addition, migrating a large project is not
easy work to do.

I would like to know the feasibility of having more resources in GitHub
Actions, or, for example, having sub-groups where
each group shares the resources - currently one GitHub organisation shares
all resources across the projects.


On Wed, Apr 7, 2021 at 10:04 PM, Martin Grigorov wrote:

>
>
> On Wed, Apr 7, 2021 at 3:41 PM Hyukjin Kwon  wrote:
>
>> Hi Greg,
>>
>> I raised this thread to figure out a way that we can work together to
>> resolve this issue, gather feedback, and to understand how other projects
>> work around.
>> Several projects I observed, as far as I can tell, have made enough
>> efforts
>> to save the resources in GitHub Actions but still suffer from the lack of
>> resources.
>>
>
> And it will get even worse because:
> 1) more and more Apache projects migrate from TravisCI to Github Actions
> (GA)
> 2) new projects join ASF and many of them already use GA
>
>
> What was your reason to migrate from Apache Jenkins to Github Actions ?
> If you want dedicated resources then you will need to manage the CI
> yourself.
> You could use Apache Jenkins/Buildbot with dedicated agents for your
> project.
> Or you could set up your own CI infrastructure with Jenkins, DroneIO,
> ConcourceCI, ...
>
> Yet another option is to move to CircleCI or Cirrus. They are similar to
> TravisCI / GA and less crowded (for now).
>
> Martin
>
> I appreciate the resources provided to us but that does not resolve the
>> issue of the development being slowed down.
>>
>>
>> On Wed, Apr 7, 2021 at 5:52 PM, Greg Stein wrote:
>>
>> > On Wed, Apr 7, 2021 at 12:25 AM Hyukjin Kwon 
>> wrote:
>> >
>> >> Hi all,
>> >>
>> >> I am an Apache Spark PMC,
>> >
>> >
>> > You are a member of the Apache Spark PMC. You are *not* a PMC. Please
>> stop
>> > with that terminology. The Foundation has about 200 PMCs, and you are a
>> > member of one of them. You are NOT a "PMC" .. you're a person. A PMC is
>> a
>> > construct of the Foundation.
>> >
>> > >...
>> >
>> >> I am aware of the limited GitHub Actions resources that are shared
>> >> across all projects in ASF,
>> >> and many projects suffer from it. This issue significantly slows down
>> the
>> >> development cycle of
>> >>  other projects, at least Apache Spark.
>> >>
>> >
>> > And the Foundation gets those build minutes for GitHub Actions provided
>> to
>> > us from GitHub and Microsoft, and we are thankful that they provide
>> them to
>> > the Foundation. Maybe it isn't all the build minutes that every group
>> > wants, but that is what we have. So it is incumbent upon all of us to
>> > figure out how to build more, with fewer minutes.
>> >
>> > Say "thank you" to GitHub, please.
>> >
>> > Regards,
>> > -g
>> >
>> >
>>
>


Re: Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-07 Thread Hyukjin Kwon
Hi Greg,

I raised this thread to figure out a way that we can work together to
resolve this issue, gather feedback, and to understand how other projects
work around.
Several projects I observed, as far as I can tell, have made enough efforts
to save the resources in GitHub Actions but still suffer from the lack of
resources.
I appreciate the resources provided to us but that does not resolve the
issue of the development being slowed down.


On Wed, Apr 7, 2021 at 5:52 PM, Greg Stein wrote:

> On Wed, Apr 7, 2021 at 12:25 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I am an Apache Spark PMC,
>
>
> You are a member of the Apache Spark PMC. You are *not* a PMC. Please stop
> with that terminology. The Foundation has about 200 PMCs, and you are a
> member of one of them. You are NOT a "PMC" .. you're a person. A PMC is a
> construct of the Foundation.
>
> >...
>
>> I am aware of the limited GitHub Actions resources that are shared
>> across all projects in ASF,
>> and many projects suffer from it. This issue significantly slows down the
>> development cycle of
>>  other projects, at least Apache Spark.
>>
>
> And the Foundation gets those build minutes for GitHub Actions provided to
> us from GitHub and Microsoft, and we are thankful that they provide them to
> the Foundation. Maybe it isn't all the build minutes that every group
> wants, but that is what we have. So it is incumbent upon all of us to
> figure out how to build more, with fewer minutes.
>
> Say "thank you" to GitHub, please.
>
> Regards,
> -g
>
>


Increase the number of parallel jobs in GitHub Actions at ASF organization level

2021-04-06 Thread Hyukjin Kwon
Hi all,

I am an Apache Spark PMC, and would like to know the future plan about
GitHub Actions in ASF.
Please also see the INFRA ticket I filed:
https://issues.apache.org/jira/browse/INFRA-21646.

I am aware of the limited GitHub Actions resources that are shared
across all projects in ASF,
and many projects suffer from it. This issue significantly slows down the
development cycle of
 other projects, at least Apache Spark.

How do we plan to increase the resources in GitHub Actions, and what are
the blockers? I would appreciate any input and thoughts on this.

Thank you so much.

CC'ing Spark @dev  for more visibility. Please take
it out if considered inappropriate.


Re: Support User Defined Types in pandas_udf for Spark's own Python API

2021-04-06 Thread Hyukjin Kwon
Yeah, we still should improve PySpark APIs together. I am currently stuck
at some work and porting Koalas at this moment so couldn't have a chance to
take a very close look (but drop some comments and skim).

On Tue, Apr 6, 2021 at 5:31 PM, Darcy Shen wrote:

> was: [DISCUSS] Support pandas API layer on PySpark
>
>
> I'm working on [SPARK-34600] Support user defined types in Pandas UDF -
> ASF JIRA (apache.org) <https://issues.apache.org/jira/browse/SPARK-34600>.
>
> I'm wondering if we are still working on improving Spark's own Python API.
>
> SPARK-34600 is relatively a big feature for PySpark. I splited it into
> several small tickets and submitted the first small PR:
>
> [SPARK-34771] Support UDT for Pandas/Spark conversion with Arrow support
> Enabled by sadhen · Pull Request #32026 · apache/spark (github.com)
> <https://github.com/apache/spark/pull/32026>
>
> I'm afraid that the Spark community are busy working on pandas API layer
> on PySpark and the improvements for Spark's own Python API will be
> postponed and postponed.
>
> As Dongjoon Hyun said:
> > BTW, what is the future plan for the existing APIs?
>
> If we are keeping these existing APIs, will we add new features for
> Spark's own Python API?
>
> Or will we fix bugs for Spark's own Python API?
>
> Specifically, will we add support for User Defined Types in pandas_udf for
> Spark's own Python API?
>
>
>  On Mon, 2021-03-15 at 14:12:28, Reynold Xin wrote:
>
> I don't think we should deprecate existing APIs.
>
> Spark's own Python API is relatively stable and not difficult to support.
> It has a pretty large number of users and existing code. Also pretty easy
> to learn by data engineers.
>
> pandas API is a great for data science, but isn't that great for some
> other tasks. It's super wide. Great for data scientists that have learned
> it, or great for copy paste from Stackoverflow.
>
>
>
>
>
> On Sun, Mar 14, 2021 at 11:08 PM, Dongjoon Hyun 
> wrote:
>
> Thank you for the proposal. It looks like a good addition.
> BTW, what is the future plan for the existing APIs?
> Are we going to deprecate it eventually in favor of Koalas (because we
> don't remove the existing APIs in general)?
>
> > Fourthly, PySpark is still not Pythonic enough. For example, I hear
> complaints such as "why does
> > PySpark follow pascalCase?" or "PySpark APIs are difficult to learn",
> and APIs are very difficult to change
> > in Spark (as I emphasized above).
>
>
> On Sun, Mar 14, 2021 at 4:03 AM Hyukjin Kwon  wrote:
>
> Firstly my biggest reason is that I would like to promote this more as a
> built-in support because it is simply
> important to have it with the impact on the large user group, and the
> needs are increasing
> as the charts indicate. I usually think that features or add-ons stay as
> third parties when it’s rather for a
> smaller set of users, it addresses a corner case of needs, etc. I think
> this is similar to the datasources
> we have added. Spark ported CSV and Avro because more and more people use
> it, and it became important
> to have it as a built-in support.
>
> Secondly, Koalas needs more help from Spark, PySpark, Python and pandas
> experts from the
> bigger community. Koalas’ team isn’t experts in all the areas, and there
> are many missing corner
> cases to fix, Some require deep expertise from specific areas.
>
> One example is the type hints. Koalas uses type hints for schema inference.
> Due to the lack of Python’s type hinting way, Koalas added its own
> (hacky) way
> <https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas>
> .
> Fortunately the way Koalas implemented is now partially proposed into
> Python officially (PEP 646).
> But Koalas could have been better with interacting with the Python
> community more and actively
> joining in the design issues together to lead the best output that
> benefits both and more projects.
>
> Thirdly, I would like to contribute to the growth of PySpark. The growth
> of the Koalas is very fast given the
> internal and external stats. The number of users has jumped up twice
> almost every 4 ~ 6 months.
> I think Koalas will be a good momentum to keep Spark up.
> Fourthly, PySpark is still not Pythonic enough. For example, I hear
> complaints such as "why does
> PySpark follow pascalCase?" or "PySpark APIs are difficult to learn", and
> APIs are very difficult to change
> in Spark (as I emphasized above). This set of Koalas APIs will be able to
> address these concerns
> in PySpark.
>
> Lastly, I really think PySpark needs its native plotting features. As I
> emphas

Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-04-04 Thread Hyukjin Kwon
I would +1 for just going ahead. That looks flaky to me too.

Thanks Liang-Chi for driving this!

On Sun, 4 Apr 2021, 18:17 Liang-Chi Hsieh,  wrote:

> Hi devs,
>
> Currently no open issues or ongoing issues targeting 2.4.
>
> On QA test dashboard, only spark-branch-2.4-test-sbt-hadoop-2.6 is in red
> status. The failed test is
>
> org.apache.spark.sql.streaming.StreamingQueryManagerSuite.awaitAnyTermination
> with timeout and resetTerminated. It looks like a flaky test to me. It
> passed
> locally too.
>
> So I'm wondering if I could directly go ahead and cut 2.4.8 RC1 given one red
> light? Or do we need to re-trigger the failed Jenkins build and wait for it to be
> green?
>
>
> Liang-Chi
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-29 Thread Hyukjin Kwon
The vote passed with the following 20 +1 votes and no -1 or +0 votes:

Hyukjin Kwon*
Dongjoon Hyun*
Maciej Szymkiewicz
Bryan Cutler
Reynold Xin*
Liang-Chi Hsieh
Takeshi Yamamuro
Xiao Li*
Mridul Muralidharan*
Gengliang Wang
Matei Zaharia*
Maxim Gekk
郑瑞峰 (Ruifeng Zheng)
Denny Lee
Kousuke Saruta
Holden Karau*
Wenchen Fan*
Ismaël Mejía
Takuya Ueshin*
Femi Anthony

* = binding

Thank you guys all for your feedback and votes.

On Tue, Mar 30, 2021 at 5:21 AM, Femi Anthony wrote:

> +1
>
> On Mon, Mar 29, 2021 at 1:34 PM Takuya UESHIN 
> wrote:
>
>> +1
>>
>> On Mon, Mar 29, 2021 at 3:35 AM Ismaël Mejía  wrote:
>>
>>> +1 (non-binding)
>>>
>>> On Mon, Mar 29, 2021 at 7:54 AM Wenchen Fan  wrote:
>>> >
>>> > +1
>>> >
>>> > On Mon, Mar 29, 2021 at 1:45 PM Holden Karau 
>>> wrote:
>>> >>
>>> >> +1
>>> >>
>>> >> On Sun, Mar 28, 2021 at 10:25 PM sarutak 
>>> wrote:
>>> >>>
>>> >>> +1 (non-binding)
>>> >>>
>>> >>> - Kousuke
>>> >>>
>>> >>> > +1 (non-binding)
>>> >>> >
>>> >>> > On Sun, Mar 28, 2021 at 9:06 PM 郑瑞峰 
>>> >>> > wrote:
>>> >>> >
>>> >>> >> +1 (non-binding)
>>> >>> >>
>>> >>> >> -- Original Message --
>>> >>> >>
>>> >>> >> From: "Maxim Gekk" ;
>>> >>> >> Sent: Monday, March 29, 2021, 2:08 AM
>>> >>> >> To: "Matei Zaharia";
>>> >>> >> Cc: "Gengliang Wang";"Mridul
>>> >>> >> Muralidharan";"Xiao
>>> >>> >> Li";"Spark dev
>>> >>> >> list";"Takeshi
>>> >>> >> Yamamuro";
>>> >>> >> Subject: Re: [VOTE] SPIP: Support pandas API layer on PySpark
>>> >>> >>
>>> >>> >> +1 (non-binding)
>>> >>> >>
>>> >>> >> On Sun, Mar 28, 2021 at 8:53 PM Matei Zaharia
>>> >>> >>  wrote:
>>> >>> >>
>>> >>> >> +1
>>> >>> >>
>>> >>> >> Matei
>>> >>> >>
>>> >>> >> On Mar 28, 2021, at 1:45 AM, Gengliang Wang 
>>> >>> >> wrote:
>>> >>> >>
>>> >>> >> +1 (non-binding)
>>> >>> >>
>>> >>> >> On Sun, Mar 28, 2021 at 11:12 AM Mridul Muralidharan
>>> >>> >>  wrote:
>>> >>> >>
>>> >>> >> +1
>>> >>> >>
>>> >>> >> Regards,
>>> >>> >> Mridul
>>> >>> >>
>>> >>> >> On Sat, Mar 27, 2021 at 6:09 PM Xiao Li 
>>> >>> >> wrote:
>>> >>> >>
>>> >>> >> +1
>>> >>> >>
>>> >>> >> Xiao
>>> >>> >>
>>> >>> >> On Fri, Mar 26, 2021 at 4:14 PM, Takeshi Yamamuro wrote:
>>> >>> >>
>>> >>> >> +1 (non-binding)
>>> >>> >>
>>> >>> >> On Sat, Mar 27, 2021 at 4:53 AM Liang-Chi Hsieh >> >
>>> >>> >> wrote:
>>> >>> >> +1 (non-binding)
>>> >>> >>
>>> >>> >> rxin wrote
>>> >>> >>> +1. Would open up a huge persona for Spark.
>>> >>> >>>
>>> >>> >>> On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler <
>>> >>> >>
>>> >>> >>> cutlerb@
>>> >>> >>
>>> >>> >>>> wrote:
>>> >>> >>>
>>> >>> >>>>
>>> >>> >>>> +1 (non-binding)
>>> >>> >>>>
>>> >>> >>>>
>>> >>> >>>> On Fri, Mar 26, 2021 at 9:49 AM Maciej <
>>> >>> >>
>>> >>> >>> mszymkiewicz@
>>> >>> >>
>>> >>> >>>> wrote:
>>> >>> >>>>
>>> >>> >>>>
>>> >>> >>>>> +1 (nonbinding)
>>> >>> >>
>>> >>> >> --
>>> >>> >> Sent from:
>>> >>> >> http://apache-spark-developers-list.1001551.n3.nabble.com/
>>> >>> >>
>>> >>> >>
>>> >>> >
>>> -
>>> >>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>> >>
>>> >>> >> --
>>> >>> >>
>>> >>> >> ---
>>> >>> >> Takeshi Yamamuro
>>> >>>
>>> >>> -
>>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>>
>>> >> --
>>> >> Twitter: https://twitter.com/holdenkarau
>>> >> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> Takuya UESHIN
>>
>>
>
> --
> http://dataphantik.com
>
> "Great spirits have always encountered violent opposition from mediocre
> minds." - Albert Einstein.
>


Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Hyukjin Kwon
Congrats guys. Well deserved!

On Sat, 27 Mar 2021, 05:28 Matei Zaharia,  wrote:

> Hi all,
>
> The Spark PMC recently voted to add several new committers. Please join me
> in welcoming them to their new role! Our new committers are:
>
> - Maciej Szymkiewicz (contributor to PySpark)
> - Max Gekk (contributor to Spark SQL)
> - Kent Yao (contributor to Spark SQL)
> - Attila Zsolt Piros (contributor to decommissioning and Spark on
> Kubernetes)
> - Yi Wu (contributor to Spark Core and SQL)
> - Gabor Somogyi (contributor to Streaming and security)
>
> All six of them contributed to Spark 3.1 and we’re very excited to have
> them join as committers.
>
> Matei and the Spark PMC
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-26 Thread Hyukjin Kwon
I'll start with my +1 (binding)

On Fri, 26 Mar 2021, 23:52 Hyukjin Kwon,  wrote:

> Hi all,
>
> I’d like to start a vote for SPIP: Support pandas API layer on PySpark.
>
> The proposal is to embrace Koalas in PySpark to have the pandas API layer
> on PySpark.
>
>
> Please also refer to:
>
>- Previous discussion in dev mailing list: [DISCUSS] Support pandas
>API layer on PySpark
>
> <http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html>
>.
>- JIRA: SPARK-34849 <https://issues.apache.org/jira/browse/SPARK-34849>
>- Koalas internals documentation:
>
> https://docs.google.com/document/d/1tk24aq6FV5Wu2bX_Ym606doLFnrZsh4FdUd52FqojZU/edit
>
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
>
>


[VOTE] SPIP: Support pandas API layer on PySpark

2021-03-26 Thread Hyukjin Kwon
Hi all,

I’d like to start a vote for SPIP: Support pandas API layer on PySpark.

The proposal is to embrace Koalas in PySpark to have the pandas API layer
on PySpark.


Please also refer to:

   - Previous discussion in dev mailing list: [DISCUSS] Support pandas API
   layer on PySpark
   

   .
   - JIRA: SPARK-34849 
   - Koalas internals documentation:
   
https://docs.google.com/document/d/1tk24aq6FV5Wu2bX_Ym606doLFnrZsh4FdUd52FqojZU/edit


Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Hyukjin Kwon
Thanks Nicholas for the pointer :-).

On Thu, 18 Mar 2021, 00:11 Nicholas Chammas, 
wrote:

> On Tue, Mar 16, 2021 at 9:15 PM Hyukjin Kwon  wrote:
>
>>   I am currently thinking we will have to convert the Koalas tests to use
>> unittests to match with PySpark for now.
>>
> Keep in mind that pytest supports unittest-based tests out of the box
> <https://docs.pytest.org/en/stable/unittest.html>, so you should be able
> to run pytest against the PySpark codebase without changing much about the
> tests.
>
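
As a side note on that point, here is a minimal sketch (not from the original
thread; the test module and class names are hypothetical) of a plain
unittest-style test that pytest can collect and run as-is, for example with
`python -m pytest test_example.py`:

# Minimal sketch: pytest discovers plain unittest.TestCase classes unchanged.
import unittest


class StringUtilsTest(unittest.TestCase):  # hypothetical example test
    def test_lower(self):
        self.assertEqual("SPARK".lower(), "spark")


if __name__ == "__main__":
    unittest.main()  # the same file also runs under the stock unittest runner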


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Hyukjin Kwon
Yeah, that's a good point, Georg. I think we will port it as is first, and
discuss that indexing system further.
We should probably either add a non-index mode or switch to a distributed
default index type that minimizes the side effect on the query plan.
We still have some months left. I will very likely raise another discussion
about it in a PR or on the dev mailing list after finishing the initial porting.
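
For reference, a minimal sketch of what switching the default index type looks
like with the standalone Koalas package today (the option name is assumed from
the Koalas best-practices guide linked in the quoted message below, Koalas 1.x):

# "compute.default_index_type" accepts "sequence", "distributed-sequence"
# or "distributed"; the last one avoids the single-partition window at the
# cost of non-sequential index values.
import databricks.koalas as ks

ks.set_option("compute.default_index_type", "distributed")
kdf = ks.range(5)
print(kdf.to_pandas())  # index values are monotonically increasing, not 0..4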

On Wed, Mar 17, 2021 at 8:33 PM, Georg Heiler wrote:

> Would you plan to keep the existing indexing mechanism then?
>
> https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#use-distributed-or-distributed-sequence-default-index
> For me, even when trying to use the distributed version, it always resulted
> in various window functions being chained, a query plan different from the
> default one, and slower execution of the job due to this overhead.
>
> Especially when some people here are thinking about making it the
> default/replacing the regular API, I would strongly suggest defaulting to an
> indexing mechanism that does not change the query plan.
>
> Best,
> Georg
>
> On Wed, Mar 17, 2021 at 12:13 PM, Hyukjin Kwon <
> gurwls...@gmail.com> wrote:
>
>> > Just out of curiosity, does Koalas pretty much implement all of the
>> Pandas APIs now? If there are some that are yet to be implemented or others
>> that have differences, are these documented so users won't be caught
>> off-guard?
>>
>> It's roughly 75% done so far (in Series, DataFrame and Index).
>> Yeah, and it throws an exception that says it's not implemented yet
>> properly (or intentionally not implemented, e.g.) Series.__iter__ that will
>> easily make users shoot their feet by, for example, for loop ... ).
>>
>>
>> On Wed, Mar 17, 2021 at 2:17 PM, Bryan Cutler wrote:
>>
>>> +1 the proposal sounds good to me. Having a familiar API built-in will
>>> really help new users get into using Spark that might only have Pandas
>>> experience. It sounds like maintenance costs should be manageable, once the
>>> hurdle with setting up tests is done. Just out of curiosity, does Koalas
>>> pretty much implement all of the Pandas APIs now? If there are some that
>>> are yet to be implemented or others that have differences, are these
>>> documented so users won't be caught off-guard?
>>>
>>> On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> Integrating Koalas with pyspark might help enable a richer integration
>>>> between the two. Something that would be useful with a tighter
>>>> integration is support for custom column array types. Currently, Spark
>>>> takes dataframes, converts them to arrow buffers then transmits them
>>>> over the socket to Python. On the other side, pyspark takes the arrow
>>>> buffer and converts it to a Pandas dataframe. Unfortunately, the
>>>> default Pandas representation of a list-type for a column causes it to
>>>> turn what was contiguous value/offset arrays in Arrow into
>>>> deserialized Python objects for each row. Obviously, this kills
>>>> performance.
>>>>
>>>> A PR to extend the pyspark API to elide the pandas conversion
>>>> (https://github.com/apache/spark/pull/26783) was submitted and
>>>> rejected, which is unfortunate, but perhaps this proposed integration
>>>> would provide the hooks via Pandas' ExtensionArray interface to allow
>>>> Spark to performantly interchange jagged/ragged lists to/from python
>>>> UDFs.
>>>>
>>>> Cheers
>>>> Andrew
>>>>
>>>> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon 
>>>> wrote:
>>>> >
>>>> > Thank you guys for all your feedback. I will start working on SPIP
>>>> with Koalas team.
>>>> > I would expect the SPIP can be sent late this week or early next week.
>>>> >
>>>> >
>>>> > I inlined and answered the questions unanswered as below:
>>>> >
>>>> > Is the community developing the pandas API layer for Spark interested
>>>> in being part of Spark or do they prefer having their own release cycle?
>>>> >
>>>> > Yeah, Koalas team used to have its own release cycle to develop and
>>>> move quickly.
>>>> > Now it became pretty mature with reaching 1.7.0, and the team thinks
>>>> that it’s now
>>>> > fine to have less frequent releases, and they are happy to work
>>>> together with S

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-17 Thread Hyukjin Kwon
> Just out of curiosity, does Koalas pretty much implement all of the
Pandas APIs now? If there are some that are yet to be implemented or others
that have differences, are these documented so users won't be caught
off-guard?

It's roughly 75% done so far (in Series, DataFrame and Index).
Yeah, and it properly throws an exception saying that it is not implemented yet
(or is intentionally not implemented, e.g. Series.__iter__, which would
easily make users shoot themselves in the foot by, for example, using it in a for loop ...).
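
As a concrete illustration, here is a minimal sketch of how that disabled API
surfaces (the exact exception class is assumed from the Koalas docs, not from
this email, so a generic except is used):

# Iterating a Koalas Series is intentionally disabled, so the loop below
# fails fast instead of silently collecting the whole column to the driver.
import databricks.koalas as ks

kser = ks.Series([1, 2, 3])
try:
    for value in kser:          # Series.__iter__ is blocked on purpose
        print(value)
except Exception as error:     # Koalas raises a PandasNotImplementedError here
    print(type(error).__name__, error)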


On Wed, Mar 17, 2021 at 2:17 PM, Bryan Cutler wrote:

> +1 the proposal sounds good to me. Having a familiar API built-in will
> really help new users get into using Spark that might only have Pandas
> experience. It sounds like maintenance costs should be manageable, once the
> hurdle with setting up tests is done. Just out of curiosity, does Koalas
> pretty much implement all of the Pandas APIs now? If there are some that
> are yet to be implemented or others that have differences, are these
> documented so users won't be caught off-guard?
>
> On Tue, Mar 16, 2021 at 6:54 PM Andrew Melo  wrote:
>
>> Hi,
>>
>> Integrating Koalas with pyspark might help enable a richer integration
>> between the two. Something that would be useful with a tighter
>> integration is support for custom column array types. Currently, Spark
>> takes dataframes, converts them to arrow buffers then transmits them
>> over the socket to Python. On the other side, pyspark takes the arrow
>> buffer and converts it to a Pandas dataframe. Unfortunately, the
>> default Pandas representation of a list-type for a column causes it to
>> turn what was contiguous value/offset arrays in Arrow into
>> deserialized Python objects for each row. Obviously, this kills
>> performance.
>>
>> A PR to extend the pyspark API to elide the pandas conversion
>> (https://github.com/apache/spark/pull/26783) was submitted and
>> rejected, which is unfortunate, but perhaps this proposed integration
>> would provide the hooks via Pandas' ExtensionArray interface to allow
>> Spark to performantly interchange jagged/ragged lists to/from python
>> UDFs.
>>
>> Cheers
>> Andrew
>>
>> On Tue, Mar 16, 2021 at 8:15 PM Hyukjin Kwon  wrote:
>> >
>> > Thank you guys for all your feedback. I will start working on SPIP with
>> Koalas team.
>> > I would expect the SPIP can be sent late this week or early next week.
>> >
>> >
>> > I inlined and answered the questions unanswered as below:
>> >
>> > Is the community developing the pandas API layer for Spark interested
>> in being part of Spark or do they prefer having their own release cycle?
>> >
>> > Yeah, Koalas team used to have its own release cycle to develop and
>> move quickly.
>> > Now it became pretty mature with reaching 1.7.0, and the team thinks
>> that it’s now
>> > fine to have less frequent releases, and they are happy to work
>> together with Spark with
>> > contributing to it. The active contributors in the Koalas community
>> will continue to
>> > make the contributions in Spark.
>> >
>> > How about test code? Does it fit into the PySpark test framework?
>> >
>> > Yes, this will be one of the places where it needs some efforts. Koalas
>> currently uses pytest
>> > with various dependency version combinations (e.g., Python version,
>> conda vs pip) whereas
>> > PySpark uses the plain unittests with less dependency version
>> combinations.
>> >
>> > For pytest in Koalas <> unittests in PySpark:
>> >
>> >   I am currently thinking we will have to convert the Koalas tests to
>> use unittests to match
>> >   with PySpark for now.
>> >   It is a feasible option for PySpark to migrate to pytest too but it
>> will need extra effort to
>> >   make it working with our own PySpark testing framework seamlessly.
>> >   Koalas team (presumably and likely I) will take a look in any event.
>> >
>> > For the combinations of dependency versions:
>> >
>> >   Due to the lack of the resources in GitHub Actions, I currently plan
>> to just add the
>> >   Koalas tests into the matrix PySpark is currently using.
>> >
>> > one question I have; what’s an initial goal of the proposal?
>> > Is that to port all the pandas interfaces that Koalas has already
>> implemented?
>> > Or, the basic set of them?
>> >
>> > The goal of the proposal is to port all of Koalas project into PySpark.
>> > For example,
>> >
>> > impo

Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Hyukjin Kwon
Thank you guys for all your feedback. I will start working on SPIP with
Koalas team.
I would expect the SPIP can be sent late this week or early next week.


I inlined and answered the questions unanswered as below:

Is the community developing the pandas API layer for Spark interested in
being part of Spark or do they prefer having their own release cycle?

Yeah, Koalas team used to have its own release cycle to develop and move
quickly.
Now it became pretty mature with reaching 1.7.0, and the team thinks that
it’s now
fine to have less frequent releases, and they are happy to work together
with Spark with
contributing to it. The active contributors in the Koalas community will
continue to
make the contributions in Spark.

How about test code? Does it fit into the PySpark test framework?

Yes, this will be one of the places where it needs some efforts. Koalas
currently uses pytest
with various dependency version combinations (e.g., Python version, conda
vs pip) whereas
PySpark uses the plain unittests with less dependency version combinations.

For pytest in Koalas <> unittests in PySpark:

  I am currently thinking we will have to convert the Koalas tests to use
unittests to match
  with PySpark for now.
  It is a feasible option for PySpark to migrate to pytest too, but it will
need extra effort to
  make it work with our own PySpark testing framework seamlessly.
  Koalas team (presumably and likely I) will take a look in any event.

For the combinations of dependency versions:

  Due to the lack of the resources in GitHub Actions, I currently plan to
just add the
  Koalas tests into the matrix PySpark is currently using.

one question I have; what’s an initial goal of the proposal?
Is that to port all the pandas interfaces that Koalas has already
implemented?
Or, the basic set of them?

The goal of the proposal is to port all of Koalas project into PySpark.
For example,

import koalas

will be equivalent to

# Names, etc. might change in the final proposal or during the review
from pyspark.sql import pandas

Koalas supports pandas APIs with a separate layer that covers the
differences between
DataFrame structures in pandas and PySpark, e.g. other types as column
names (labels),
an index (something like a row number in DBMSs) and so on. So I think it would
make more sense
to port the whole layer instead of a subset of the APIs.





On Wed, Mar 17, 2021 at 12:32 AM, Wenchen Fan wrote:

> +1, it's great to have Pandas support in Spark out of the box.
>
> On Tue, Mar 16, 2021 at 10:12 PM Takeshi Yamamuro 
> wrote:
>
>> +1; the pandas interfaces are pretty popular and supporting them in
>> pyspark looks promising, I think.
>> one question I have; what's an initial goal of the proposal?
>> Is that to port all the pandas interfaces that Koalas has already
>> implemented?
>> Or, the basic set of them?
>>
>> On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía  wrote:
>>
>>> +1
>>>
>>> Bringing a Pandas API for pyspark to upstream Spark will only bring
>>> benefits for everyone (more eyes to use/see/fix/improve the API) as
>>> well as better alignment with core Spark improvements, the extra
>>> weight looks manageable.
>>>
>>> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
>>>  wrote:
>>> >
>>> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin 
>>> wrote:
>>> >>
>>> >> I don't think we should deprecate existing APIs.
>>> >
>>> >
>>> > +1
>>> >
>>> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I
>>> could be wrong, but I wager most people who have worked with both Spark and
>>> Pandas feel the same way.
>>> >
>>> > For the large community of current PySpark users, or users switching
>>> to PySpark from another Spark language API, it doesn't make sense to
>>> deprecate the current API, even by convention.
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-14 Thread Hyukjin Kwon
Firstly my biggest reason is that I would like to promote this more as a
built-in support because it is simply
important to have it with the impact on the large user group, and the needs
are increasing
as the charts indicate. I usually think that features or add-ons stay as
third parties when it’s rather for a
smaller set of users, it addresses a corner case of needs, etc. I think
this is similar to the datasources
we have added. Spark ported CSV and Avro because more and more people use
it, and it became important
to have it as a built-in support.

Secondly, Koalas needs more help from Spark, PySpark, Python and pandas
experts from the
bigger community. The Koalas team isn't expert in all the areas, and there
are many missing corner
cases to fix; some require deep expertise in specific areas.

One example is the type hints. Koalas uses type hints for schema inference.
Due to the limitations of Python's type hinting, Koalas added its own (hacky)
way
<https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas>
.
Fortunately the way Koalas implemented is now partially proposed into
Python officially (PEP 646).
But Koalas could have been better with interacting with the Python
community more and actively
joining in the design issues together to lead the best output that benefits
both and more projects.
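
To make the type-hint example above concrete, here is a minimal sketch in the
style of that guide (the accessor and method names such as apply_batch are
assumed from the Koalas 1.x docs, not from this email):

# The return-type hint tells Koalas the output schema (two float columns)
# without sampling the data; this is the "hacky" positional form.
import pandas as pd
import databricks.koalas as ks


def normalize(pdf: pd.DataFrame) -> ks.DataFrame[float, float]:
    return (pdf - pdf.mean()) / pdf.std()


kdf = ks.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})
print(kdf.koalas.apply_batch(normalize).head())  # schema comes from the hint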

Thirdly, I would like to contribute to the growth of PySpark. The growth of
the Koalas is very fast given the
internal and external stats. The number of users has jumped up twice almost
every 4 ~ 6 months.
I think Koalas will be a good momentum to keep Spark up.
Fourthly, PySpark is still not Pythonic enough. For example, I hear
complaints such as "why does
PySpark follow pascalCase?" or "PySpark APIs are difficult to learn", and
APIs are very difficult to change
in Spark (as I emphasized above). This set of Koalas APIs will be able to
address these concerns
in PySpark.

Lastly, I really think PySpark needs its native plotting features. As I
emphasized before with
elaboration, I do think this is an important feature missing in PySpark
that users need.
I do think Koalas completes what PySpark is currently missing.



On Sun, Mar 14, 2021 at 7:12 PM, Sean Owen wrote:

> I like koalas a lot. Playing devil's advocate, why not just let it
> continue to live as an add on? Usually the argument is it'll be maintained
> better in Spark but it's well maintained. It adds some overhead to
> maintaining Spark conversely. On the upside it makes it a little more
> discoverable. Are there more 'synergies'?
>
> On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I would like to start the discussion on supporting pandas API layer on
>> Spark.
>>
>>
>>
>> If we have a general consensus on having it in PySpark, I will initiate
>> and drive an SPIP with a detailed explanation about the implementation’s
>> overview and structure.
>>
>> I would appreciate it if I can know whether you guys support this or not
>> before starting the SPIP.
>> What do you want to propose?
>>
>> I have been working on the Koalas <https://github.com/databricks/koalas>
>> project that is essentially: pandas API support on Spark, and I would like
>> to propose embracing Koalas in PySpark.
>>
>>
>>
>> More specifically, I am thinking about adding a separate package, to
>> PySpark, for pandas APIs on PySpark Therefore it wouldn’t break anything in
>> the existing codes. The overview would look as below:
>>
>> pyspark_dataframe.[... PySpark APIs ...]
>> pandas_dataframe.[... pandas APIs (local) ...]
>>
>> # The package names will change in the final proposal and during review.
>> koalas_dataframe = koalas.from_pandas(pandas_dataframe)
>> koalas_dataframe = koalas.from_spark(pyspark_dataframe)
>> koalas_dataframe.[... pandas APIs on Spark ...]
>>
>> pyspark_dataframe = koalas_dataframe.to_spark()
>> pandas_dataframe = koalas_dataframe.to_pandas()
>>
>> Koalas provides a pandas API layer on PySpark. It supports almost the
>> same API usages. Users can leverage their existing Spark cluster to scale
>> their pandas workloads. It works interchangeably with PySpark by allowing
>> both pandas and PySpark APIs to users.
>>
>> The project has grown separately more than two years, and this has been
>> successfully going. With version 1.7.0 Koalas has greatly improved maturity
>> and stability. Its usability has been proven with numerous users’ adoptions
>> and by reaching more than 75% API coverage in pandas’ Index, Series and
>> DataFrame.
>>
>> I strongly think this is the direction we should go for Apache Spark, and
>> it is a win-win strategy for the growth of both Apache Sp

[DISCUSS] Support pandas API layer on PySpark

2021-03-13 Thread Hyukjin Kwon
Hi all,

I would like to start the discussion on supporting pandas API layer on
Spark.



If we have a general consensus on having it in PySpark, I will initiate and
drive an SPIP with a detailed explanation about the implementation’s
overview and structure.

I would appreciate it if I can know whether you guys support this or not
before starting the SPIP.
What do you want to propose?

I have been working on the Koalas 
project that is essentially: pandas API support on Spark, and I would like
to propose embracing Koalas in PySpark.



More specifically, I am thinking about adding a separate package, to
PySpark, for pandas APIs on PySpark. Therefore it wouldn't break anything in
the existing code. The overview would look as below:

pyspark_dataframe.[... PySpark APIs ...]
pandas_dataframe.[... pandas APIs (local) ...]

# The package names will change in the final proposal and during review.
koalas_dataframe = koalas.from_pandas(pandas_dataframe)
koalas_dataframe = koalas.from_spark(pyspark_dataframe)
koalas_dataframe.[... pandas APIs on Spark ...]

pyspark_dataframe = koalas_dataframe.to_spark()
pandas_dataframe = koalas_dataframe.to_pandas()
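
For reference, a minimal sketch of how the same round trip already looks with
the standalone Koalas 1.x package (the names in the overview above are
placeholders; the ones below are Koalas', and to_koalas is monkey-patched onto
PySpark DataFrames by the koalas import):

import pandas as pd
import databricks.koalas as ks

pdf = pd.DataFrame({"a": [1, 2, 3]})
kdf = ks.from_pandas(pdf)    # local pandas -> pandas API on Spark
sdf = kdf.to_spark()         # pandas API on Spark -> PySpark DataFrame
kdf2 = sdf.to_koalas()       # PySpark DataFrame -> pandas API on Spark
pdf2 = kdf2.to_pandas()      # back to local pandas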

Koalas provides a pandas API layer on PySpark. It supports almost the same
API usages. Users can leverage their existing Spark cluster to scale their
pandas workloads. It works interchangeably with PySpark by allowing both
pandas and PySpark APIs to users.

The project has grown separately more than two years, and this has been
successfully going. With version 1.7.0 Koalas has greatly improved maturity
and stability. Its usability has been proven with numerous users’ adoptions
and by reaching more than 75% API coverage in pandas’ Index, Series and
DataFrame.

I strongly think this is the direction we should go for Apache Spark, and
it is a win-win strategy for the growth of both Apache Spark and pandas.
Please see the reasons below.
Why do we need it?

   -

   Python has grown dramatically in the last few years and became one of
   the most popular languages, see also StackOverFlow trend
   
   for Python, Java, R and Scala languages.
   -

   pandas became almost the standard library of data science. Please also
   see the StackOverFlow trend
   
   for pandas, Apache Spark and PySpark.
   -

   PySpark is not Pythonic enough. At least I myself hear a lot of
   complaints. That initiated Project Zen
   , and we have greatly
   improved PySpark usability and made it more Pythonic.

Nevertheless, data scientists tend to prefer pandas libraries according to
the trends but APIs are hard to change in PySpark. We should redesign all
APIs and improve them from scratch, which is very difficult.

One straightforward and fast approach is to benchmark a successful case,
and pandas does not support distributed execution. Once PySpark supports
pandas-like APIs, it can be a good option for pandas users to scale their
workloads easily. I do believe this is a win-win strategy for the growth of
both pandas and PySpark.

In fact, there are already similar tries such as Dask 
and Modin  (other than Koalas
). They are all growing fast and
successfully, and I find that people compare it to PySpark from time to
time, for example, see Beyond Pandas: Spark, Dask, Vaex and other big data
technologies battling head to head

.



   -

   There are many important features missing that are very common in data
   science. One of the most important features is plotting and drawing a
   chart. Almost every data scientist plots and draws a chart to understand
   their data quickly and visually in their daily work but this is missing in
   PySpark. Please see one example in pandas:
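
(The chart image embedded in the original email does not survive in this
archive; the following is a minimal, made-up pandas sketch of the kind of
one-line plotting being referred to.)

import pandas as pd

df = pd.DataFrame({"year": [2018, 2019, 2020, 2021],
                   "monthly_users": [10, 25, 60, 130]})   # made-up numbers
ax = df.plot.line(x="year", y="monthly_users")  # one line to get a chart (needs matplotlib)
ax.figure.savefig("growth.png")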




I do recommend taking a quick look for blog posts and talks made for pandas
on Spark:
https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
They explain why we need this far more better.


Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-11 Thread Hyukjin Kwon
+1

On Fri, Mar 12, 2021 at 2:54 PM, Jungtaek Lim wrote:

> +1 (non-binding) Excellent description on SPIP doc! Thanks for the amazing
> effort!
>
> On Wed, Mar 10, 2021 at 3:19 AM Liang-Chi Hsieh  wrote:
>
>>
>> +1 (non-binding).
>>
>> Thanks for the work!
>>
>>
>> Erik Krogen wrote
>> > +1 from me (non-binding)
>> >
>> > On Tue, Mar 9, 2021 at 9:27 AM huaxin gao 
>>
>> > huaxin.gao11@
>>
>> >  wrote:
>> >
>> >> +1 (non-binding)
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: Apache Spark 3.2 Expectation

2021-03-11 Thread Hyukjin Kwon
Just for an update, I will send a discussion email about my idea late this
week or early next week.

On Thu, Mar 11, 2021 at 7:00 PM, Wenchen Fan wrote:

> There are many projects going on right now, such as new DS v2 APIs, ANSI
> interval types, join improvement, disaggregated shuffle, etc. I don't
> think it's realistic to do the branch cut in April.
>
> I'm +1 to release 3.2 around July, but it doesn't mean we have to cut the
> branch 3 months earlier. We should make the release process faster and cut
> the branch around June probably.
>
>
>
> On Thu, Mar 11, 2021 at 4:41 AM Xiao Li  wrote:
>
>> Below are some nice-to-have features we can work on in Spark 3.2: Lateral
>> Join support ,
>> interval data type, timestamp without time zone, un-nesting arbitrary
>> queries, the returned metrics of DSV2, and error message standardization.
>> Spark 3.2 will be another exciting release I believe!
>>
>> Go Spark!
>>
>> Xiao
>>
>>
>>
>>
>> On Wed, Mar 10, 2021 at 12:25 PM, Dongjoon Hyun wrote:
>>
>>> Hi, Xiao.
>>>
>>> This thread started 13 days ago. Since you asked the community about
>>> major features or timelines at that time, could you share your roadmap or
>>> expectations if you have something in your mind?
>>>
>>> > Thank you, Dongjoon, for initiating this discussion. Let us keep it
>>> open. It might take 1-2 weeks to collect from the community all the
>>> features we plan to build and ship in 3.2 since we just finished the 3.1
>>> voting.
>>> > TBH, cutting the branch this April does not look good to me. That
>>> means, we only have one month left for feature development of Spark 3.2. Do
>>> we have enough features in the current master branch? If not, are we able
>>> to finish major features we collected here? Do they have a timeline or
>>> project plan?
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>>
>>> On Wed, Mar 3, 2021 at 2:58 PM Dongjoon Hyun 
>>> wrote:
>>>
 Hi, John.

 This thread aims to share your expectations and goals (and maybe work
 progress) to Apache Spark 3.2 because we are making this together. :)

 Bests,
 Dongjoon.


 On Wed, Mar 3, 2021 at 1:59 PM John Zhuge  wrote:

> Hi Dongjoon,
>
> Is it possible to get ViewCatalog in? The community already had fairly
> detailed discussions.
>
> Thanks,
> John
>
> On Thu, Feb 25, 2021 at 8:57 AM Dongjoon Hyun 
> wrote:
>
>> Hi, All.
>>
>> Since we have been preparing Apache Spark 3.2.0 in master branch
>> since December 2020, March seems to be a good time to share our thoughts
>> and aspirations on Apache Spark 3.2.
>>
>> According to the progress on Apache Spark 3.1 release, Apache Spark
>> 3.2 seems to be the last minor release of this year. Given the timeframe,
>> we might consider the following. (This is a small set. Please add your
>> thoughts to this limited list.)
>>
>> # Languages
>>
>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>> and investigating the publishing issue. Thank you for your contributions
>> and feedback on this.
>>
>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2021.
>> Like Java 11, we need lots of support from our dependencies. Let's see.
>>
>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>> 2021-12-23. So, the deprecation is not required yet, but we had better
>> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>
>> - SparkR CRAN publishing: As we know, it's discontinued so far.
>> Resuming it depends on the success of Apache SparkR 3.1.1 CRAN 
>> publishing.
>> If it succeeds to revive it, we can keep publishing. Otherwise, I believe
>> we had better drop it from the releasing work item list officially.
>>
>> # Dependencies
>>
>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop
>> profile in Apache Spark 3.1. Currently, Spark master branch lives on 
>> Hadoop
>> 3.2.2's shaded clients via SPARK-33212. So far, there is one on-going
>> report at YARN environment. We hope it will be fixed soon at Spark 3.2
>> timeframe and we can move toward Hadoop 3.3.2.
>>
>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile 
>> completely
>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>> official dependency via SPARK-32981. We are steadily improving this area
>> and will consume Hive 2.3.9 if available.
>>
>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>> support K8s model 1.19.
>>
>> - Kafka Client 2.8: To bring the client fixes, Spark 
