Re: [ANNOUNCE] Apache Spark 3.0.3 released

2021-06-27 Thread Takeshi Yamamuro
Thank you, Yi ~

On Sat, Jun 26, 2021 at 7:44 AM L. C. Hsieh  wrote:

> Thanks Yi for the work!
>
> On 2021/06/25 05:51:38, Yi Wu  wrote:
> > We are happy to announce the availability of Spark 3.0.3!
> >
> > Spark 3.0.3 is a maintenance release containing stability fixes. This
> > release is based on the branch-3.0 maintenance branch of Spark. We
> strongly
> > recommend all 3.0 users to upgrade to this stable release.
> >
> > To download Spark 3.0.3, head over to the download page:
> > https://spark.apache.org/downloads.html
> >
> > To view the release notes:
> > https://spark.apache.org/releases/spark-release-3-0-3.html
> >
> > We would like to acknowledge all community members for contributing to
> this
> > release. This release would not have been possible without you.
> >
> > Yi
> >
>
> ---------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Spark 3.0.3 (RC1)

2021-06-21 Thread Takeshi Yamamuro
The list of bug fixes going into 3.0.3 can be found at the
>>>>>> following URL:
>>>>>> > https://issues.apache.org/jira/projects/SPARK/versions/12349723
>>>>>> >
>>>>>> > This release is using the release script of the tag v3.0.3-rc1.
>>>>>> >
>>>>>> > FAQ
>>>>>> >
>>>>>> > =
>>>>>> > How can I help test this release?
>>>>>> > =
>>>>>> >
>>>>>> > If you are a Spark user, you can help us test this release by taking
>>>>>> > an existing Spark workload and running on this release candidate,
>>>>>> then
>>>>>> > reporting any regressions.
>>>>>> >
>>>>>> > If you're working in PySpark you can set up a virtual env and
>>>>>> install
>>>>>> > the current RC and see if anything important breaks, in the
>>>>>> Java/Scala
>>>>>> > you can add the staging repository to your project's resolvers and
>>>>>> test
>>>>>> > with the RC (make sure to clean up the artifact cache before/after
>>>>>> so
>>>>>> > you don't end up building with an out-of-date RC going forward).
>>>>>> >
>>>>>> > ===
>>>>>> > What should happen to JIRA tickets still targeting 3.0.3?
>>>>>> > ===
>>>>>> >
>>>>>> > The current list of open tickets targeted at 3.0.3 can be found at:
>>>>>> > https://issues.apache.org/jira/projects/SPARK and search for
>>>>>> "Target
>>>>>> > Version/s" = 3.0.3
>>>>>> >
>>>>>> > Committers should look at those and triage. Extremely important bug
>>>>>> > fixes, documentation, and API tweaks that impact compatibility
>>>>>> should
>>>>>> > be worked on immediately. Everything else please retarget to an
>>>>>> > appropriate release.
>>>>>> >
>>>>>> > ==
>>>>>> > But my bug isn't fixed?
>>>>>> > ==
>>>>>> >
>>>>>> > In order to make timely releases, we will typically not hold the
>>>>>> > release unless the bug in question is a regression from the previous
>>>>>> > release. That being said, if there is something which is a
>>>>>> regression
>>>>>> > that has not been correctly targeted please ping me or a committer
>>>>>> to
>>>>>> > help target the issue.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>>>>>
>>>>>> -
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>
>>>>>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: Apache Spark 3.0.3 Release?

2021-06-08 Thread Takeshi Yamamuro
+1. Thank you, Yi ~

Bests,
Takeshi

On Wed, Jun 9, 2021 at 12:18 PM Mridul Muralidharan 
wrote:

>
> +1
>
> Regards,
> Mridul
>
> On Tue, Jun 8, 2021 at 10:11 PM Hyukjin Kwon  wrote:
>
>> Yeah, +1
>>
>> On Wed, Jun 9, 2021 at 12:06 PM, Yi Wu wrote:
>>
>>> Hi, All.
>>>
>>> Since Apache Spark 3.0.2 tag creation (Feb 16),
>>> 119 new patches (92 issues
>>> <https://issues.apache.org/jira/projects/SPARK/versions/12349723>
>>> resolved) arrived at branch-3.0.
>>>
>>> Shall we make a new release, Apache Spark 3.0.3, as the 3rd release at
>>> the 3.0 line?
>>> I'd like to volunteer as the release manager for Apache Spark 3.0.3.
>>> I'm thinking about starting the first RC at the end of this week.
>>>
>>> $ git log --oneline v3.0.2..HEAD | wc -l
>>>  119
>>>
>>> # Known correctness issues
>>> SPARK-34534 <https://issues.apache.org/jira/browse/SPARK-34534> New
>>> protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or
>>> correctness
>>> SPARK-34545 <https://issues.apache.org/jira/browse/SPARK-34545>
>>> PySpark Python UDF return inconsistent results when applying 2 UDFs with
>>> different return type to 2 columns together
>>> SPARK-34719 <https://issues.apache.org/jira/browse/SPARK-34719>
>>> fail if the view query has duplicated column names
>>> SPARK-34794 <https://issues.apache.org/jira/browse/SPARK-34794>
>>> Nested higher-order functions broken in DSL
>>>
>>> # Notable user-facing changes
>>> SPARK-32924 <https://issues.apache.org/jira/browse/SPARK-32924> Web
>>> UI sort on duration is wrong
>>> SPARK-35405 <https://issues.apache.org/jira/browse/SPARK-35405>
>>>  Submitting Applications documentation has outdated information about K8s
>>> client mode support
>>>
>>> Thanks,
>>> Yi
>>>
>>

-- 
---
Takeshi Yamamuro


Re: [ANNOUNCE] Apache Spark 3.1.2 released

2021-06-01 Thread Takeshi Yamamuro
Thank you, Dongjoon!

On Wed, Jun 2, 2021 at 2:29 PM Xiao Li  wrote:

> Thank you!
>
> Xiao
>
> On Tue, Jun 1, 2021 at 9:29 PM Hyukjin Kwon  wrote:
>
>> awesome!
>>
>> On Wed, Jun 2, 2021 at 9:59 AM, Dongjoon Hyun wrote:
>>
>>> We are happy to announce the availability of Spark 3.1.2!
>>>
>>> Spark 3.1.2 is a maintenance release containing stability fixes. This
>>> release is based on the branch-3.1 maintenance branch of Spark. We
>>> strongly
>>> recommend all 3.1 users to upgrade to this stable release.
>>>
>>> To download Spark 3.1.2, head over to the download page:
>>> https://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-1-2.html
>>>
>>> We would like to acknowledge all community members for contributing to
>>> this
>>> release. This release would not have been possible without you.
>>>
>>> Dongjoon Hyun
>>>
>>
>
> --
>
>

-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Spark 3.1.2 (RC1)

2021-05-25 Thread Takeshi Yamamuro
+1 (non-binding)

I ran the tests, checked the related jira tickets, and compared TPCDS
performance differences between
this v3.1.2 candidate and v3.1.1.
Everything looks fine.

Thank you, Dongjoon!


On Wed, May 26, 2021 at 2:32 AM Gengliang Wang  wrote:

> SGTM. Thanks for the work!
>
> +1 (non-binding)
>
> On Wed, May 26, 2021 at 1:28 AM Dongjoon Hyun 
> wrote:
>
>> Thank you, Sean and Gengliang.
>>
>> To Gengliang, it looks not that serious to me because that's a doc-only
>> issue which also can be mitigated simply by updating `facetFilters` from
>> htmls after release.
>>
>> Bests,
>> Dongjoon.
>>
>>
>> On Tue, May 25, 2021 at 9:45 AM Gengliang Wang  wrote:
>>
>>> Hi Dongjoon,
>>>
>>> After Spark 3.1.1, we need an extra step for updating the DocSearch
>>> version index in the release process. I didn't expect Spark 3.1.2 to come
>>> at this time so I haven't updated the release process
>>> <https://github.com/apache/spark-website/pull/344> until yesterday.
>>> I think we should use the latest branch-3.1 to regenerate the Spark
>>> documentation. See https://github.com/apache/spark/pull/32654 for
>>> details. I have also enhanced the release process script
>>> <https://github.com/apache/spark/pull/32662> for this.
>>>
>>> Thanks
>>> Gengliang
>>>
>>>
>>>
>>>
>>> On Tue, May 25, 2021 at 11:31 PM Sean Owen  wrote:
>>>
>>>> +1 same result as in previous tests
>>>>
>>>> On Mon, May 24, 2021 at 1:14 AM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>> version 3.1.2.
>>>>>
>>>>> The vote is open until May 27th 1AM (PST) and passes if a majority +1
>>>>> PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>
>>>>> [ ] +1 Release this package as Apache Spark 3.1.2
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see https://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v3.1.2-rc1 (commit
>>>>> de351e30a90dd988b133b3d00fa6218bfcaba8b8):
>>>>> https://github.com/apache/spark/tree/v3.1.2-rc1
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.2-rc1-bin/
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1384/
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.2-rc1-docs/
>>>>>
>>>>> The list of bug fixes going into 3.1.2 can be found at the following
>>>>> URL:
>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12349602
>>>>>
>>>>> This release is using the release script of the tag v3.1.2-rc1.
>>>>>
>>>>> FAQ
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC and see if anything important breaks, in the Java/Scala
>>>>> you can add the staging repository to your project's resolvers and test
>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>> you don't end up building with an out-of-date RC going forward).
>>>>>
>>>>> ===
>>>>> What should happen to JIRA tickets still targeting 3.1.2?
>>>>> ===
>>>>>
>>>>> The current list of open tickets targeted at 3.1.2 can be found at:
>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>>> Version/s" = 3.1.2
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>> be worked on immediately. Everything else please retarget to an
>>>>> appropriate release.
>>>>>
>>>>> ==
>>>>> But my bug isn't fixed?
>>>>> ==
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from the previous
>>>>> release. That being said, if there is something which is a regression
>>>>> that has not been correctly targeted please ping me or a committer to
>>>>> help target the issue.
>>>>>
>>>>

-- 
---
Takeshi Yamamuro


Re: Resolves too old JIRAs as incomplete

2021-05-24 Thread Takeshi Yamamuro
😊

On Tue, May 25, 2021 at 11:00 AM Hyukjin Kwon  wrote:

> Awesome, thanks Takeshi!
>
> On Tue, May 25, 2021 at 10:59 AM, Takeshi Yamamuro wrote:
>
>> FYI:
>>
>> Thank you for all the comments.
>> I closed 754 tickets in bulk a few minutes ago.
>> Please let me know if there is any problem.
>>
>> Bests,
>> Takeshi
>>
>> On Fri, May 21, 2021 at 10:29 AM Kent Yao  wrote:
>>
>>> +1,thanks Takeshi
>>>
>>> Kent Yao
>>> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
>>> a spark enthusiast
>>> kyuubi <https://github.com/yaooqinn/kyuubi> is a
>>> unified multi-tenant JDBC interface for large-scale data processing and
>>> analytics, built on top of Apache Spark <http://spark.apache.org/>.
>>> spark-authorizer <https://github.com/yaooqinn/spark-authorizer> A Spark
>>> SQL extension which provides SQL Standard Authorization for Apache
>>> Spark <http://spark.apache.org/>.
>>> spark-postgres <https://github.com/yaooqinn/spark-postgres> A library
>>> for reading data from and transferring data to Postgres / Greenplum with
>>> Spark SQL and DataFrames, 10~100x faster.
>>> itatchi <https://github.com/yaooqinn/spark-func-extras> A library that
>>> brings useful functions from various modern database management systems
>>> to Apache Spark <http://spark.apache.org/>.
>>>
>>>
>>> On 05/21/2021 07:12, Takeshi Yamamuro  wrote:
>>> Thank you, all~
>>>
>>> okay, so I will close them in bulk next week.
>>> If you have more comments, please let me know here.
>>>
>>> Bests,
>>> Takeshi
>>>
>>> On Fri, May 21, 2021 at 5:05 AM Mridul Muralidharan 
>>> wrote:
>>>
>>>> +1, thanks Takeshi !
>>>>
>>>> Regards,
>>>> Mridul
>>>>
>>>> On Wed, May 19, 2021 at 8:48 PM Takeshi Yamamuro 
>>>> wrote:
>>>>
>>>>> Hi, dev,
>>>>>
>>>>> As you know, we have too many open JIRAs now:
>>>>> # of open JIRAs=2698: JQL='project = SPARK AND status in (Open, "In
>>>>> Progress", Reopened)'
>>>>>
>>>>> We've recently released v2.4.8(EOL), so I'd like to bulk-close too old
>>>>> JIRAs
>>>>> for making the JIRAs manageable.
>>>>>
>>>>> As Hyukjin did the same action two years ago (for details, see:
>>>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27838.html),
>>>>> I'm planning to use a similar JQL below to close them:
>>>>>
>>>>> project = SPARK AND status in (Open, "In Progress", Reopened) AND
>>>>> (affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*")))
>>>>> AND updated <= -52w
>>>>>
>>>>> The total number of matched JIRAs is 741.
>>>>> Or, we might be able to close them more aggressively by removing the
>>>>> version condition:
>>>>>
>>>>> project = SPARK AND status in (Open, "In Progress", Reopened) AND
>>>>> updated <= -52w
>>>>>
>>>>> The matched number is 1484 (almost half of the current open JIRAs).
>>>>>
>>>>> If there is no objection, I'd like to do it next week or later.
>>>>> Any thoughts?
>>>>>
>>>>> Bests,
>>>>> Takeshi
>>>>> --
>>>>> ---
>>>>> Takeshi Yamamuro
>>>>>
>>>>
>>>
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

-- 
---
Takeshi Yamamuro


Re: Resolves too old JIRAs as incomplete

2021-05-24 Thread Takeshi Yamamuro
FYI:

Thank you for all the comments.
I closed 754 tickets in bulk a few minutes ago.
Please let me know if there is any problem.

Bests,
Takeshi

On Fri, May 21, 2021 at 10:29 AM Kent Yao  wrote:

> +1,thanks Takeshi
>
> Kent Yao
> @ Data Science Center, Hangzhou Research Institute, NetEase Corp.
> a spark enthusiast
> kyuubi <https://github.com/yaooqinn/kyuubi> is a unified multi-tenant JDBC
> interface for large-scale data processing and analytics, built on top
> of Apache Spark <http://spark.apache.org/>.
> spark-authorizer <https://github.com/yaooqinn/spark-authorizer> A Spark
> SQL extension which provides SQL Standard Authorization for Apache
> Spark <http://spark.apache.org/>.
> spark-postgres <https://github.com/yaooqinn/spark-postgres> A library for
> reading data from and transferring data to Postgres / Greenplum with Spark
> SQL and DataFrames, 10~100x faster.
> itatchi <https://github.com/yaooqinn/spark-func-extras> A library that
> brings useful functions from various modern database management systems
> to Apache Spark <http://spark.apache.org/>.
>
>
> On 05/21/2021 07:12, Takeshi Yamamuro  wrote:
> Thank you, all~
>
> okay, so I will close them in bulk next week.
> If you have more comments, please let me know here.
>
> Bests,
> Takeshi
>
> On Fri, May 21, 2021 at 5:05 AM Mridul Muralidharan 
> wrote:
>
>> +1, thanks Takeshi !
>>
>> Regards,
>> Mridul
>>
>> On Wed, May 19, 2021 at 8:48 PM Takeshi Yamamuro 
>> wrote:
>>
>>> Hi, dev,
>>>
>>> As you know, we have too many open JIRAs now:
>>> # of open JIRAs=2698: JQL='project = SPARK AND status in (Open, "In
>>> Progress", Reopened)'
>>>
>>> We've recently released v2.4.8(EOL), so I'd like to bulk-close too old
>>> JIRAs
>>> for making the JIRAs manageable.
>>>
>>> As Hyukjin did the same action two years ago (for details, see:
>>> http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27838.html),
>>> I'm planning to use a similar JQL below to close them:
>>>
>>> project = SPARK AND status in (Open, "In Progress", Reopened) AND
>>> (affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*")))
>>> AND updated <= -52w
>>>
>>> The total number of matched JIRAs is 741.
>>> Or, we might be able to close them more aggressively by removing the
>>> version condition:
>>>
>>> project = SPARK AND status in (Open, "In Progress", Reopened) AND
>>> updated <= -52w
>>>
>>> The matched number is 1484 (almost half of the current open JIRAs).
>>>
>>> If there is no objection, I'd like to do it next week or later.
>>> Any thoughts?
>>>
>>> Bests,
>>> Takeshi
>>> --
>>> ---
>>> Takeshi Yamamuro
>>>
>>
>
> --
> ---
> Takeshi Yamamuro
>
>

-- 
---
Takeshi Yamamuro


Re: Resolves too old JIRAs as incomplete

2021-05-20 Thread Takeshi Yamamuro
Thank you, all~

okay, so I will close them in bulk next week.
If you have more comments, please let me know here.

Bests,
Takeshi

On Fri, May 21, 2021 at 5:05 AM Mridul Muralidharan 
wrote:

> +1, thanks Takeshi !
>
> Regards,
> Mridul
>
> On Wed, May 19, 2021 at 8:48 PM Takeshi Yamamuro 
> wrote:
>
>> Hi, dev,
>>
>> As you know, we have too many open JIRAs now:
>> # of open JIRAs=2698: JQL='project = SPARK AND status in (Open, "In
>> Progress", Reopened)'
>>
>> We've recently released v2.4.8(EOL), so I'd like to bulk-close too old
>> JIRAs
>> for making the JIRAs manageable.
>>
>> As Hyukjin did the same action two years ago (for details, see:
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27838.html),
>> I'm planning to use a similar JQL below to close them:
>>
>> project = SPARK AND status in (Open, "In Progress", Reopened) AND
>> (affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*")))
>> AND updated <= -52w
>>
>> The total number of matched JIRAs is 741.
>> Or, we might be able to close them more aggressively by removing the
>> version condition:
>>
>> project = SPARK AND status in (Open, "In Progress", Reopened) AND updated
>> <= -52w
>>
>> The matched number is 1484 (almost half of the current open JIRAs).
>>
>> If there is no objection, I'd like to do it next week or later.
>> Any thoughts?
>>
>> Bests,
>> Takeshi
>> --
>> ---
>> Takeshi Yamamuro
>>
>

-- 
---
Takeshi Yamamuro


Resolves too old JIRAs as incomplete

2021-05-19 Thread Takeshi Yamamuro
Hi, dev,

As you know, we have too many open JIRAs now:
# of open JIRAs=2698: JQL='project = SPARK AND status in (Open, "In
Progress", Reopened)'

We've recently released v2.4.8(EOL), so I'd like to bulk-close too old JIRAs
for making the JIRAs manageable.

As Hyukjin did the same action two years ago (for details, see:
http://apache-spark-developers-list.1001551.n3.nabble.com/Resolving-all-JIRAs-affecting-EOL-releases-td27838.html),
I'm planning to use a similar JQL below to close them:

project = SPARK AND status in (Open, "In Progress", Reopened) AND
(affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*")))
AND updated <= -52w

The total number of matched JIRAs is 741.
Or, we might be able to close them more aggressively by removing the
version condition:

project = SPARK AND status in (Open, "In Progress", Reopened) AND updated
<= -52w

The matched number is 1484 (almost half of the current open JIRAs).
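
If it helps, a bulk close of this kind could also be scripted rather than done through the JIRA bulk-edit UI. The sketch below uses the third-party `jira` Python client against the public ASF JIRA; the credentials, the transition name ("Resolve Issue"), and the resolution value ("Incomplete") are assumptions that depend on the SPARK project's workflow and on having the right permissions:

from jira import JIRA  # third-party client: pip install jira

JQL = (
    'project = SPARK AND status in (Open, "In Progress", Reopened) '
    'AND (affectedVersion = EMPTY OR NOT (affectedVersion in versionMatch("^3.*"))) '
    'AND updated <= -52w'
)

# Placeholder credentials; transitioning issues needs write access.
client = JIRA(server="https://issues.apache.org/jira", basic_auth=("user", "password"))
stale = client.search_issues(JQL, maxResults=False)
print(f"{len(stale)} issues matched")

for issue in stale:
    # Transition name and resolution are assumptions about the SPARK workflow.
    client.transition_issue(issue, "Resolve Issue",
                            fields={"resolution": {"name": "Incomplete"}})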

If there is no objection, I'd like to do it next week or later.
Any thoughts?

Bests,
Takeshi
-- 
---
Takeshi Yamamuro


Re: WholeStageCodeGen + DSv2

2021-05-19 Thread Takeshi Yamamuro
hi, Andrew,

Any improvement proposal for that is welcome.
Could you first file a JIRA issue to show us your idea and an example query
to reproduce the issue you described?

Bests,
Takeshi

On Wed, May 19, 2021 at 11:38 AM Andrew Melo  wrote:

> Hello,
>
> When reading a very wide (> 1000 cols) input, WholeStageCodeGen blows
> past the 64kB source limit and fails. Looking at the generated code, a
> big part of the code is simply the DSv2 convention that the codegen'd
> variable names are the same as the columns instead of something more
> compact like 'c1', 'c2', etc..
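
For illustration, a synthetic job along these lines can push a single whole-stage-codegen method toward the 64kB limit; the column count and the long column names below are made up, and whether the generated code actually overflows depends on the data source and the plan:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# ~1200 columns with deliberately long names, mimicking a wide input schema.
ncols = 1200
wide = spark.range(10).select(
    *[(F.col("id") + i).alias(f"very_long_measurement_channel_name_{i:04d}")
      for i in range(ncols)])

# A single wide projection forces codegen to emit one variable per column, so
# the generated method grows with both the column count and the name lengths.
wide.select(sum((F.col(c) for c in wide.columns), F.lit(0)).alias("total")).show()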
>
> Would there be any interest in accepting a patch that shortens these
> variable names to try and stay under the limit?
>
> Thanks
> Andrew
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: [ANNOUNCE] Apache Spark 2.4.8 released

2021-05-17 Thread Takeshi Yamamuro
Thank you for the release work, Liang-Chi~

On Tue, May 18, 2021 at 2:11 PM Hyukjin Kwon  wrote:

> Yay!
>
> On Tue, May 18, 2021 at 12:57 PM, Liang-Chi Hsieh wrote:
>
>> We are happy to announce the availability of Spark 2.4.8!
>>
>> Spark 2.4.8 is a maintenance release containing stability, correctness,
>> and
>> security fixes.
>> This release is based on the branch-2.4 maintenance branch of Spark. We
>> strongly recommend all 2.4 users to upgrade to this stable release.
>>
>> To download Spark 2.4.8, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> Note that you might need to clear your browser cache or to use
>> `Private`/`Incognito` mode according to your browsers.
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2-4-8.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
---
Takeshi Yamamuro


Re: Apache Spark 3.1.2 Release?

2021-05-17 Thread Takeshi Yamamuro
Thank you, Dongjoon~ sgtm, too.

On Tue, May 18, 2021 at 7:34 AM Cheng Su  wrote:

> +1 for a new release, thanks Dongjoon!
>
> Cheng Su
>
> On 5/17/21, 2:44 PM, "Liang-Chi Hsieh"  wrote:
>
> +1 sounds good. Thanks Dongjoon for volunteering on this!
>
>
> Liang-Chi
>
>
> Dongjoon Hyun-2 wrote
> > Hi, All.
> >
> > Since Apache Spark 3.1.1 tag creation (Feb 21),
> > 172 new patches including 9 correctness patches and 4 K8s patches
> arrived
> > at branch-3.1.
> >
> > Shall we make a new release, Apache Spark 3.1.2, as the second
> release at
> > 3.1 line?
> > I'd like to volunteer for the release manager for Apache Spark 3.1.2.
> > I'm thinking about starting the first RC next week.
> >
> > $ git log --oneline v3.1.1..HEAD | wc -l
> >  172
> >
> > # Known correctness issues
> > SPARK-34534 New protocol FetchShuffleBlocks in
> OneForOneBlockFetcher
> > lead to data loss or correctness
> > SPARK-34545 PySpark Python UDF return inconsistent results when
> > applying 2 UDFs with different return type to 2 columns together
> > SPARK-34681 Full outer shuffled hash join when building left side
> > produces wrong result
> > SPARK-34719 fail if the view query has duplicated column names
> > SPARK-34794 Nested higher-order functions broken in DSL
> > SPARK-34829 transform_values return identical values when it's
> used
> > with udf that returns reference type
> > SPARK-34833 Apply right-padding correctly for correlated
> subqueries
> > SPARK-35381 Fix lambda variable name issues in nested DataFrame
> > functions in R APIs
> > SPARK-35382 Fix lambda variable name issues in nested DataFrame
> > functions in Python APIs
> >
> > # Notable K8s patches since K8s GA
> > SPARK-34674 Close SparkContext after the Main method has finished
> > SPARK-34948 Add ownerReference to executor configmap to fix
> leakages
> > SPARK-34820 add apt-update before gnupg install
> > SPARK-34361 In case of downscaling avoid killing of executors
> already
> > known by the scheduler backend in the pod allocator
> >
> > Bests,
> > Dongjoon.
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>
>

-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Spark 2.4.8 (RC4)

2021-05-09 Thread Takeshi Yamamuro
+1 (non-binding)

I don't see any critical issue in the SQL part.
Many thanks for working on it, Liang-Chi~

On Mon, May 10, 2021 at 6:22 AM Liang-Chi Hsieh  wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.4.8.
>
> The vote is open until May 14th at 9AM PST and passes if a majority +1 PMC
> votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.4.8
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> There are currently no issues targeting 2.4.8 (try project = SPARK AND
> "Target Version/s" = "2.4.8" AND status in (Open, Reopened, "In Progress"))
>
> The tag to be voted on is v2.4.8-rc4 (commit
> 163fbd2528a18bf062bddf7b7753631a12a369b5):
> https://github.com/apache/spark/tree/v2.4.8-rc4
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc4-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1383/
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.4.8-rc4-docs/
>
> The list of bug fixes going into 2.4.8 can be found at the following URL:
> https://s.apache.org/spark-v2.4.8-rc4
>
> This release is using the release script of the tag v2.4.8-rc4.
>
> FAQ
>
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out of date RC going forward).
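
As a concrete illustration of the PySpark route, something like the script below can be run after pip-installing the RC's pyspark tarball (from the -bin/ directory above) into a fresh virtual env. It is only a minimal sanity check, not a substitute for running a real workload:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("rc-smoke-test").getOrCreate()
print("Running against Spark", spark.version)  # should report the RC version

# A tiny end-to-end job: generate, transform, aggregate, and verify.
df = spark.range(1000000).withColumn("bucket", F.col("id") % 7)
rows = df.groupBy("bucket").count().orderBy("bucket").collect()
assert sum(r["count"] for r in rows) == 1000000

spark.stop()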
>
> ===
> What should happen to JIRA tickets still targeting 2.4.8?
> ===
>
> The current list of open tickets targeted at 2.4.8 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.8
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Spark 2.4.8 (RC1)

2021-04-07 Thread Takeshi Yamamuro
Thanks for driving this, Liang-Chi~
IIUC there is no critical issue in the SQL part, so it looks fine.

+1 (non-binding)

On Thu, Apr 8, 2021 at 11:20 AM Wenchen Fan  wrote:

> +1
>
> On Thu, Apr 8, 2021 at 9:24 AM Sean Owen  wrote:
>
>> Looks good to me testing on Java 8, Hadoop 2.7, Ubuntu, with about all
>> profiles enabled.
>> I still get an odd failure in the Hive versions suite, but I keep seeing
>> that in my env and think it's something odd about my setup.
>> +1
>>
>

-- 
---
Takeshi Yamamuro


Re: [VOTE] SPIP: Support pandas API layer on PySpark

2021-03-26 Thread Takeshi Yamamuro
+1 (non-binding)

On Sat, Mar 27, 2021 at 4:53 AM Liang-Chi Hsieh  wrote:

> +1 (non-binding)
>
>
> rxin wrote
> > +1. Would open up a huge persona for Spark.
> >
> > On Fri, Mar 26 2021 at 11:30 AM, Bryan Cutler <
>
> > cutlerb@
>
> >  > wrote:
> >
> >>
> >> +1 (non-binding)
> >>
> >>
> >> On Fri, Mar 26, 2021 at 9:49 AM Maciej <
>
> > mszymkiewicz@
>
> >  > wrote:
> >>
> >>
> >>> +1 (nonbinding)
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: Welcoming six new Apache Spark committers

2021-03-26 Thread Takeshi Yamamuro
Congrats, all~

On Sat, Mar 27, 2021 at 7:46 AM Jungtaek Lim 
wrote:

> Congrats all!
>
> On Sat, Mar 27, 2021 at 6:56 AM, Liang-Chi Hsieh wrote:
>
>> Congrats! Welcome!
>>
>>
>> Matei Zaharia wrote
>> > Hi all,
>> >
>> > The Spark PMC recently voted to add several new committers. Please join
>> me
>> > in welcoming them to their new role! Our new committers are:
>> >
>> > - Maciej Szymkiewicz (contributor to PySpark)
>> > - Max Gekk (contributor to Spark SQL)
>> > - Kent Yao (contributor to Spark SQL)
>> > - Attila Zsolt Piros (contributor to decommissioning and Spark on
>> > Kubernetes)
>> > - Yi Wu (contributor to Spark Core and SQL)
>> > - Gabor Somogyi (contributor to Streaming and security)
>> >
>> > All six of them contributed to Spark 3.1 and we’re very excited to have
>> > them join as committers.
>> >
>> > Matei and the Spark PMC
>> > -
>> > To unsubscribe e-mail:
>>
>> > dev-unsubscribe@.apache
>>
>>
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
---
Takeshi Yamamuro


Re: [build system] short downtime today, new workers coming soon

2021-03-23 Thread Takeshi Yamamuro
Thanks, Shane!
Everything seems to be going well.

Bests,
Takeshi

On Wed, Mar 24, 2021 at 5:23 AM shane knapp ☠  wrote:

> we're back!
>
> On Tue, Mar 23, 2021 at 12:31 PM shane knapp ☠ 
> wrote:
>
>> jenkins is acting up, and i'm going to take the opportunity to reboot the
>> primary and all the workers.
>>
>> sorry for the short notice, but on the bright side we have a bunch of
>> shiny new workers coming soon!
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
---
Takeshi Yamamuro


Re: [DISCUSS] Support pandas API layer on PySpark

2021-03-16 Thread Takeshi Yamamuro
+1; the pandas interfaces are pretty popular and supporting them in pyspark
looks promising, I think.
One question I have: what's the initial goal of the proposal?
Is that to port all the pandas interfaces that Koalas has already
implemented?
Or, the basic set of them?

On Tue, Mar 16, 2021 at 1:44 AM Ismaël Mejía  wrote:

> +1
>
> Bringing a Pandas API for pyspark to upstream Spark will only bring
> benefits for everyone (more eyes to use/see/fix/improve the API) as
> well as better alignment with core Spark improvements, the extra
> weight looks manageable.
>
> On Mon, Mar 15, 2021 at 4:45 PM Nicholas Chammas
>  wrote:
> >
> > On Mon, Mar 15, 2021 at 2:12 AM Reynold Xin  wrote:
> >>
> >> I don't think we should deprecate existing APIs.
> >
> >
> > +1
> >
> > I strongly prefer Spark's immutable DataFrame API to the Pandas API. I
> could be wrong, but I wager most people who have worked with both Spark and
> Pandas feel the same way.
> >
> > For the large community of current PySpark users, or users switching to
> PySpark from another Spark language API, it doesn't make sense to deprecate
> the current API, even by convention.
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-15 Thread Takeshi Yamamuro
Hi, viirya

I'm now looking into "SPARK-34607: Add `Utils.isMemberClass` to fix a
malformed class name error
on jdk8u".

Bests,
Takeshi

On Tue, Mar 16, 2021 at 4:45 AM Liang-Chi Hsieh  wrote:

> To update with current status.
>
> There are three tickets targeting 2.4 that are still ongoing.
>
> SPARK-34719: Correctly resolve the view query with duplicated column names
> SPARK-34607: Add `Utils.isMemberClass` to fix a malformed class name error
> on jdk8u
> SPARK-34726: Fix collectToPython timeouts
>
> SPARK-34719 doesn't have PR for 2.4 yet.
>
> SPARK-34607 and SPARK-34726 are under review. SPARK-34726 is a bit arguable
> as it involves a behavior change even it is very rare case. Welcome any
> suggestion on the PR if any. Thanks.
>
>
>
> Dongjoon Hyun-2 wrote
> > Thank you for the update.
> >
> > +1 for your plan.
> >
> > Bests,
> > Dongjoon.
>
>
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: [VOTE] SPIP: Add FunctionCatalog

2021-03-12 Thread Takeshi Yamamuro
+1, too.

On Fri, Mar 12, 2021 at 8:51 PM kordex  wrote:

> +1 (for what it's worth). It will definitely help our efforts.
>
> On Fri, Mar 12, 2021 at 12:14 PM Gengliang Wang  wrote:
> >
> > +1 (non-binding)
> >
> > On Fri, Mar 12, 2021 at 3:00 PM Hyukjin Kwon 
> wrote:
> >>
> >> +1
> >>
> >>> On Fri, Mar 12, 2021 at 2:54 PM, Jungtaek Lim wrote:
> >>>
> >>> +1 (non-binding) Excellent description on SPIP doc! Thanks for the
> amazing effort!
> >>>
> >>> On Wed, Mar 10, 2021 at 3:19 AM Liang-Chi Hsieh 
> wrote:
> >>>>
> >>>>
> >>>> +1 (non-binding).
> >>>>
> >>>> Thanks for the work!
> >>>>
> >>>>
> >>>> Erik Krogen wrote
> >>>> > +1 from me (non-binding)
> >>>> >
> >>>> > On Tue, Mar 9, 2021 at 9:27 AM huaxin gao <
> >>>>
> >>>> > huaxin.gao11@
> >>>>
> >>>> > > wrote:
> >>>> >
> >>>> >> +1 (non-binding)
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
> >>>>
> >>>> -
> >>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >>>>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: Apache Spark 2.4.8 (and EOL of 2.4)

2021-03-03 Thread Takeshi Yamamuro
+1 for releasing 2.4.8 and thanks, Liang-chi, for volunteering.
Btw, does anyone roughly know how many v2.4 users there still are, based on some stats
(e.g., # of v2.4.7 downloads from the official repos)?
Have most users started using v3.x?

On Thu, Mar 4, 2021 at 8:34 AM Hyukjin Kwon  wrote:

> Yeah, I would prefer to have a 2.4.8 release as an EOL too. I don't mind
> having 2.4.9 as EOL too if that's preferred from more people.
>
> On Thu, Mar 4, 2021 at 4:01 AM, Sean Owen wrote:
>
>> Sure, I'm even arguing that 2.4.8 could possibly be the final release. No
>> objection of course to continuing to backport to 2.4.x where appropriate
>> and cutting 2.4.9 later in the year as a final EOL release, either.
>>
>> On Wed, Mar 3, 2021 at 12:59 PM Dongjoon Hyun 
>> wrote:
>>
>>> Thank you, Sean.
>>>
>>> Ya, exactly, we can release 2.4.8 as a normal release first and use
>>> 2.4.9 as the EOL release.
>>>
>>> Since 2.4.7 was released almost 6 months ago, 2.4.8 is a little late in
>>> terms of the cadence.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Wed, Mar 3, 2021 at 10:55 AM Sean Owen  wrote:
>>>
>>>> For reference, 2.3.x was maintained from February 2018 (2.3.0) to Sep
>>>> 2019 (2.3.4), or about 19 months. The 2.4 branch should probably be
>>>> maintained longer than that, as the final 2.x branch. 2.4.0 was released in
>>>> Nov 2018. A final release in, say, April 2021 would be about 30 months.
>>>> That feels about right timing-wise.
>>>>
>>>> We should in any event release 2.4.8, yes. We can of course choose to
>>>> release a 2.4.9 if some critical issue is found, later.
>>>>
>>>> But yeah based on the velocity of back-ports to 2.4.x, it seems about
>>>> time to call it EOL.
>>>>
>>>> Sean
>>>>
>>>

-- 
---
Takeshi Yamamuro


Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-02 Thread Takeshi Yamamuro
Congrats, all!

Bests,
Takeshi

On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan  wrote:

>
> Thanks Hyukjin and congratulations everyone on the release !
>
> Regards,
> Mridul
>
> On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang  wrote:
>
>> Great work, Hyukjin!
>>
>> On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon  wrote:
>>
>>> We are excited to announce Spark 3.1.1 today.
>>>
>>> Apache Spark 3.1.1 is the second release of the 3.x line. This release
>>> adds
>>> Python type annotations and Python dependency management support as part
>>> of Project Zen.
>>> Other major updates include improved ANSI SQL compliance support,
>>> history server support
>>> in structured streaming, the general availability (GA) of Kubernetes and
>>> node decommissioning
>>> in Kubernetes and Standalone. In addition, this release continues to
>>> focus on usability, stability,
>>> and polish while resolving around 1500 tickets.
>>>
>>> We'd like to thank our contributors and users for their contributions
>>> and early feedback to
>>> this release. This release would not have been possible without you.
>>>
>>> To download Spark 3.1.1, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-3-1-1.html
>>>
>>>

-- 
---
Takeshi Yamamuro


Re: Apache Spark 3.2 Expectation

2021-02-28 Thread Takeshi Yamamuro
Thanks, Dongjoon, for the discussion.
I would like to add Gengliang's work: SPARK-34246 New type coercion syntax
rules in ANSI mode
I think it is worth describing it in the next release note, too.

Bests,
Takeshi

On Sat, Feb 27, 2021 at 11:41 AM Yi Wu  wrote:

> +1 to continue the incompleted push-based shuffle.
>
> --
> Yi
>
> On Fri, Feb 26, 2021 at 1:26 AM Mridul Muralidharan 
> wrote:
>
>>
>>
>> Nit: Java 17 -> should be available by Sept 2021 :-)
>> Adoption would also depend on some of our nontrivial dependencies
>> supporting it - it might be a stretch to get it in for Apache Spark 3.2 ?
>>
>> Features:
>> Push based shuffle and disaggregated shuffle should also be in 3.2
>>
>>
>> Regards,
>> Mridul
>>
>>
>>
>>
>>
>>
>> On Thu, Feb 25, 2021 at 10:57 AM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Since we have been preparing Apache Spark 3.2.0 in master branch since
>>> December 2020, March seems to be a good time to share our thoughts and
>>> aspirations on Apache Spark 3.2.
>>>
>>> According to the progress on Apache Spark 3.1 release, Apache Spark 3.2
>>> seems to be the last minor release of this year. Given the timeframe, we
>>> might consider the following. (This is a small set. Please add your
>>> thoughts to this limited list.)
>>>
>>> # Languages
>>>
>>> - Scala 2.13 Support: This was expected on 3.1 via SPARK-25075 but
>>> slipped out. Currently, we are trying to use Scala 2.13.5 via SPARK-34505
>>> and investigating the publishing issue. Thank you for your contributions
>>> and feedback on this.
>>>
>>> - Java 17 LTS Support: Java 17 LTS will arrive in September 2017. Like
>>> Java 11, we need lots of support from our dependencies. Let's see.
>>>
>>> - Python 3.6 Deprecation(?): Python 3.6 community support ends at
>>> 2021-12-23. So, the deprecation is not required yet, but we had better
>>> prepare it because we don't have an ETA of Apache Spark 3.3 in 2022.
>>>
>>> - SparkR CRAN publishing: As we know, it's discontinued so far. Resuming
>>> it depends on the success of Apache SparkR 3.1.1 CRAN publishing. If it
>>> succeeds to revive it, we can keep publishing. Otherwise, I believe we had
>>> better drop it from the releasing work item list officially.
>>>
>>> # Dependencies
>>>
>>> - Apache Hadoop 3.3.2: Hadoop 3.2.0 becomes the default Hadoop profile
>>> in Apache Spark 3.1. Currently, Spark master branch lives on Hadoop 3.2.2's
>>> shaded clients via SPARK-33212. So far, there is one on-going report at
>>> YARN environment. We hope it will be fixed soon at Spark 3.2 timeframe and
>>> we can move toward Hadoop 3.3.2.
>>>
>>> - Apache Hive 2.3.9: Spark 3.0 starts to use Hive 2.3.7 by default
>>> instead of old Hive 1.2 fork. Spark 3.1 removed hive-1.2 profile completely
>>> via SPARK-32981 and replaced the generated hive-service-rpc code with the
>>> official dependency via SPARK-32981. We are steadily improving this area
>>> and will consume Hive 2.3.9 if available.
>>>
>>> - K8s Client 4.13.2: During K8s GA activity, Spark 3.1 upgrades K8s
>>> client dependency to 4.12.0. Spark 3.2 upgrades it to 4.13.2 in order to
>>> support K8s model 1.19.
>>>
>>> - Kafka Client 2.8: To bring the client fixes, Spark 3.1 is using Kafka
>>> Client 2.6. For Spark 3.2, SPARK-33913 upgraded to Kafka 2.7 with Scala
>>> 2.12.13, but it was reverted later due to Scala 2.12.13 issue. Since
>>> KAFKA-12357 fixed the Scala requirement two days ago, Spark 3.2 will go
>>> with Kafka Client 2.8 hopefully.
>>>
>>> # Some Features
>>>
>>> - Data Source v2: Spark 3.2 will deliver much richer DSv2 with Apache
>>> Iceberg integration. Especially, we hope the on-going function catalog SPIP
>>> and up-coming storage partitioned join SPIP can be delivered as a part of
>>> Spark 3.2 and become an additional foundation.
>>>
>>> - Columnar Encryption: As of today, Apache Spark master branch supports
>>> columnar encryption via Apache ORC 1.6 and it's documented via SPARK-34036.
>>> Also, upcoming Apache Parquet 1.12 has a similar capability. Hopefully,
>>> Apache Spark 3.2 is going to be the first release to have this feature
>>> officially. Any feedback is welcome.
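
As a sketch of how that surfaces in the DataFrame API, the write options below follow the ORC column-encryption properties referenced in SPARK-34036; the KMS endpoint, key name, and columns are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "123-45-6789", "user@example.com")],
                           ["id", "ssn", "email"])

(df.write.format("orc")
   # Placeholder KMS/key settings; a real deployment needs a working key provider.
   .option("hadoop.security.key.provider.path", "kms://http@kms-host:9600/kms")
   .option("orc.key.provider", "hadoop")
   .option("orc.encrypt", "pii:ssn,email")          # encrypt these columns with key "pii"
   .option("orc.mask", "nullify:ssn;sha256:email")  # masks shown to readers without the key
   .mode("overwrite")
   .save("/tmp/orc_encrypted"))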
>>>
>>> - Improved ZStandard Support: Spark 3.2 will bring more benefits for
>>> ZStandard users: 1) SPARK-34340 added native ZSTD JNI buffer pool support
>>> for all IO operations, 2) SPARK-33978 makes ORC datasource support ZSTD
>>> compression, 3) SPARK-34503 sets ZSTD as the default codec for event log
>>> compression, 4) SPARK-34479 aims to support ZSTD at Avro data source. Also,
>>> the upcoming Parquet 1.12 supports ZSTD (and supports JNI buffer pool),
>>> too. I'm expecting more benefits.
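
For example, the ZSTD-related settings above can be opted into with standard configs; the values are illustrative, and the event-log codec only becomes the default once SPARK-34503 lands in 3.2:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.compress", "true")
         .config("spark.eventLog.compression.codec", "zstd")  # SPARK-34503 default in 3.2
         .config("spark.io.compression.codec", "zstd")        # shuffle/broadcast/RDD blocks
         .getOrCreate())

# ORC written with ZSTD compression (SPARK-33978, available from 3.2).
spark.range(1000).write.mode("overwrite").option("compression", "zstd").orc("/tmp/zstd_orc")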
>>>
>>> - Structure Streaming with RocksDB backend: According to the latest
>>> update, it looks active enough for merging to master branch in Spark 3.2.
>>>
>>> Please share your thoughts and let's build better Apache Spark 3.2
>>> together.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>

-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Spark 3.1.1 (RC3)

2021-02-23 Thread Takeshi Yamamuro
>>>>> [ ] +1 Release this package as Apache Spark 3.1.1
>>>>> [ ] -1 Do not release this package because ...
>>>>>
>>>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>>>
>>>>> The tag to be voted on is v3.1.1-rc3 (commit
>>>>> 1d550c4e90275ab418b9161925049239227f3dc9):
>>>>> https://github.com/apache/spark/tree/v3.1.1-rc3
>>>>>
>>>>> The release files, including signatures, digests, etc. can be found at:
>>>>> <https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/>
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/
>>>>>
>>>>> Signatures used for Spark RCs can be found in this file:
>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>
>>>>> The staging repository for this release can be found at:
>>>>> https://repository.apache.org/content/repositories/orgapachespark-1367
>>>>>
>>>>> The documentation corresponding to this release can be found at:
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-docs/
>>>>>
>>>>> The list of bug fixes going into 3.1.1 can be found at the following
>>>>> URL:
>>>>> https://s.apache.org/41kf2
>>>>>
>>>>> This release is using the release script of the tag v3.1.1-rc3.
>>>>>
>>>>> FAQ
>>>>>
>>>>> ===
>>>>> What happened to 3.1.0?
>>>>> ===
>>>>>
>>>>> There was a technical issue during Apache Spark 3.1.0 preparation, and
>>>>> it was discussed and decided to skip 3.1.0.
>>>>> Please see
>>>>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
>>>>> more details.
>>>>>
>>>>> =
>>>>> How can I help test this release?
>>>>> =
>>>>>
>>>>> If you are a Spark user, you can help us test this release by taking
>>>>> an existing Spark workload and running on this release candidate, then
>>>>> reporting any regressions.
>>>>>
>>>>> If you're working in PySpark you can set up a virtual env and install
>>>>> the current RC via "pip install
>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc3-bin/pyspark-3.1.1.tar.gz
>>>>> "
>>>>> and see if anything important breaks.
>>>>> In the Java/Scala, you can add the staging repository to your project's
>>>>> resolvers and test
>>>>> with the RC (make sure to clean up the artifact cache before/after so
>>>>> you don't end up building with an out of date RC going forward).
>>>>>
>>>>> ===
>>>>> What should happen to JIRA tickets still targeting 3.1.1?
>>>>> ===
>>>>>
>>>>> The current list of open tickets targeted at 3.1.1 can be found at:
>>>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>>> Version/s" = 3.1.1
>>>>>
>>>>> Committers should look at those and triage. Extremely important bug
>>>>> fixes, documentation, and API tweaks that impact compatibility should
>>>>> be worked on immediately. Everything else please retarget to an
>>>>> appropriate release.
>>>>>
>>>>> ==
>>>>> But my bug isn't fixed?
>>>>> ==
>>>>>
>>>>> In order to make timely releases, we will typically not hold the
>>>>> release unless the bug in question is a regression from the previous
>>>>> release. That being said, if there is something which is a regression
>>>>> that has not been correctly targeted please ping me or a committer to
>>>>> help target the issue.
>>>>>
>>>>>
>
> --
> John Zhuge
>


-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Spark 3.0.2 (RC1)

2021-02-17 Thread Takeshi Yamamuro
+1

I've looked around the jira tickets and I couldn't find any blocker in the
SQL part.
Also, I ran the tests on aws env and I couldn't find any critical error
there, too.


On Wed, Feb 17, 2021 at 5:21 PM John Zhuge  wrote:

> +1 (non-binding)
>
> On Tue, Feb 16, 2021 at 11:11 PM Maxim Gekk 
> wrote:
>
>> +1 (non-binding)
>>
>> On Wed, Feb 17, 2021 at 9:54 AM Wenchen Fan  wrote:
>>
>>> +1
>>>
>>> On Wed, Feb 17, 2021 at 1:43 PM Dongjoon Hyun 
>>> wrote:
>>>
>>>> +1
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>>
>>>> On Tue, Feb 16, 2021 at 2:27 AM Herman van Hovell <
>>>> her...@databricks.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Feb 16, 2021 at 11:08 AM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> On Tue, Feb 16, 2021 at 5:10 PM, Prashant Sharma wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Tue, Feb 16, 2021 at 1:22 PM Dongjoon Hyun <
>>>>>>> dongjoon.h...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Please vote on releasing the following candidate as Apache Spark
>>>>>>>> version 3.0.2.
>>>>>>>>
>>>>>>>> The vote is open until February 19th 9AM (PST) and passes if a
>>>>>>>> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>>>>>>>
>>>>>>>> [ ] +1 Release this package as Apache Spark 3.0.2
>>>>>>>> [ ] -1 Do not release this package because ...
>>>>>>>>
>>>>>>>> To learn more about Apache Spark, please see
>>>>>>>> https://spark.apache.org/
>>>>>>>>
>>>>>>>> The tag to be voted on is v3.0.2-rc1 (commit
>>>>>>>> 648457905c4ea7d00e3d88048c63f360045f0714):
>>>>>>>> https://github.com/apache/spark/tree/v3.0.2-rc1
>>>>>>>>
>>>>>>>> The release files, including signatures, digests, etc. can be found
>>>>>>>> at:
>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-bin/
>>>>>>>>
>>>>>>>> Signatures used for Spark RCs can be found in this file:
>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>>>>>>
>>>>>>>> The staging repository for this release can be found at:
>>>>>>>>
>>>>>>>> https://repository.apache.org/content/repositories/orgapachespark-1366/
>>>>>>>>
>>>>>>>> The documentation corresponding to this release can be found at:
>>>>>>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-docs/
>>>>>>>>
>>>>>>>> The list of bug fixes going into 3.0.2 can be found at the
>>>>>>>> following URL:
>>>>>>>> https://issues.apache.org/jira/projects/SPARK/versions/12348739
>>>>>>>>
>>>>>>>> FAQ
>>>>>>>>
>>>>>>>> =
>>>>>>>> How can I help test this release?
>>>>>>>> =
>>>>>>>>
>>>>>>>> If you are a Spark user, you can help us test this release by taking
>>>>>>>> an existing Spark workload and running on this release candidate,
>>>>>>>> then
>>>>>>>> reporting any regressions.
>>>>>>>>
>>>>>>>> If you're working in PySpark you can set up a virtual env and
>>>>>>>> install
>>>>>>>> the current RC and see if anything important breaks, in the
>>>>>>>> Java/Scala
>>>>>>>> you can add the staging repository to your project's resolvers and
>>>>>>>> test
>>>>>>>> with the RC (make sure to clean up the artifact cache before/after
>>>>>>>> so
>>>>>>>> you don't end up building with an out-of-date RC going forward).
>>>>>>>>
>>>>>>>> ===
>>>>>>>> What should happen to JIRA tickets still targeting 3.0.2?
>>>>>>>> ===
>>>>>>>>
>>>>>>>> The current list of open tickets targeted at 3.0.2 can be found at:
>>>>>>>> https://issues.apache.org/jira/projects/SPARK and search for
>>>>>>>> "Target Version/s" = 3.0.2
>>>>>>>>
>>>>>>>> Committers should look at those and triage. Extremely important bug
>>>>>>>> fixes, documentation, and API tweaks that impact compatibility
>>>>>>>> should
>>>>>>>> be worked on immediately. Everything else please retarget to an
>>>>>>>> appropriate release.
>>>>>>>>
>>>>>>>> ==
>>>>>>>> But my bug isn't fixed?
>>>>>>>> ==
>>>>>>>>
>>>>>>>> In order to make timely releases, we will typically not hold the
>>>>>>>> release unless the bug in question is a regression from the previous
>>>>>>>> release. That being said, if there is something which is a
>>>>>>>> regression
>>>>>>>> that has not been correctly targeted please ping me or a committer
>>>>>>>> to
>>>>>>>> help target the issue.
>>>>>>>>
>>>>>>>
>
> --
> John Zhuge
>


-- 
---
Takeshi Yamamuro


Re: Apache Spark 3.0.2 Release ?

2021-02-12 Thread Takeshi Yamamuro
+1, too. Thanks, Dongjoon!

> On Feb 13, 2021, at 11:07, Xiao Li wrote:
> 
> 
> +1 
> 
> Happy Lunar New Year!
> 
> Xiao
> 
>> On Fri, Feb 12, 2021 at 5:33 PM Hyukjin Kwon  wrote:
>> Yeah, +1 too
>> 
>> On Sat, Feb 13, 2021 at 4:49 AM, Dongjoon Hyun wrote:
>>> Thank you, Sean!
>>> 
 On Fri, Feb 12, 2021 at 11:41 AM Sean Owen  wrote:
 Sounds like a fine time to me, sure.
 
> On Fri, Feb 12, 2021 at 1:39 PM Dongjoon Hyun  
> wrote:
> Hi, All.
> 
> As of today, `branch-3.0` has 307 patches (including 25 correctness 
> patches) since v3.0.1 tag (released on September 8th, 2020).
> 
> Since we stabilized branch-3.0 during 3.1.x preparation so far,
> it would be great if we start to release Apache Spark 3.0.2 next week.
> And, I'd like to volunteer for Apache Spark 3.0.2 release manager.
> 
> What do you think about the Apache Spark 3.0.2 release?
> 
> Bests,
> Dongjoon.
> 
> 
> --
> SPARK-31511 Make BytesToBytesMap iterator() thread-safe
> SPARK-32635 When pyspark.sql.functions.lit() function is used with 
> dataframe cache, it returns wrong result
> SPARK-32753 Deduplicating and repartitioning the same column create 
> duplicate rows with AQE
> SPARK-32764 compare of -0.0 < 0.0 return true
> SPARK-32840 Invalid interval value can happen to be just adhesive with 
> the unit
> SPARK-32908 percentile_approx() returns incorrect results
> SPARK-33019 Use 
> spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=1 by default
> SPARK-33183 Bug in optimizer rule EliminateSorts
> SPARK-33260 SortExec produces incorrect results if sortOrder is a Stream
> SPARK-33290 REFRESH TABLE should invalidate cache even though the table 
> itself may not be cached
> SPARK-33358 Spark SQL CLI command processing loop can't exit while one 
> comand fail
> SPARK-33404 "date_trunc" expression returns incorrect results
> SPARK-33435 DSv2: REFRESH TABLE should invalidate caches
> SPARK-33591 NULL is recognized as the "null" string in partition specs
> SPARK-33593 Vector reader got incorrect data with binary partition value
> SPARK-33726 Duplicate field names causes wrong answers during aggregation
> SPARK-33950 ALTER TABLE .. DROP PARTITION doesn't refresh cache
> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
> SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
> SPARK-34055 ALTER TABLE .. ADD PARTITION doesn't refresh cache
> SPARK-34187 Use available offset range obtained during polling when 
> checking offset validation
> SPARK-34212 For parquet table, after changing the precision and scale of 
> decimal type in hive, spark reads incorrect value
> SPARK-34213 LOAD DATA doesn't refresh v1 table cache
> SPARK-34229 Avro should read decimal values with the file schema
> SPARK-34262 ALTER TABLE .. SET LOCATION doesn't refresh v1 table cache
> 
> 
> -- 
> 


Re: [VOTE] Release Spark 3.1.1 (RC2)

2021-02-10 Thread Takeshi Yamamuro
+1

I looked around the jira tickets and I think there is no explicit blocker
issue on the Spark SQL component.
Also, I ran the tests on AWS envs and I couldn't find any issue there, too.

Bests,
Takeshi

On Thu, Feb 11, 2021 at 7:37 AM Mridul Muralidharan 
wrote:

>
> Signatures, digests, etc check out fine.
> Checked out tag and build/tested with -Pyarn -Phadoop-2.7 -Phive
> -Phive-thriftserver -Pmesos -Pkubernetes
>
> I keep getting test failures
> with org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite: removing this
> suite gets the build through though - does anyone have suggestions on how
> to fix it ?
> Perhaps a local problem at my end ?
>
>
> Regards,
> Mridul
>
>
>
> On Mon, Feb 8, 2021 at 6:24 PM Hyukjin Kwon  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version
>> 3.1.1.
>>
>> The vote is open until February 15th 5PM PST and passes if a majority +1
>> PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> Note that it is 7 days this time because it is a holiday season in
>> several countries including South Korea (where I live), China etc., and I
>> would like to make sure people do not miss it because it is a holiday
>> season.
>>
>> [ ] +1 Release this package as Apache Spark 3.1.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see http://spark.apache.org/
>>
>> The tag to be voted on is v3.1.1-rc2 (commit
>> cf0115ac2d60070399af481b14566f33d22ec45e):
>> https://github.com/apache/spark/tree/v3.1.1-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> <https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc1-bin/>
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-bin/
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1365
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-docs/
>>
>> The list of bug fixes going into 3.1.1 can be found at the following URL:
>> https://s.apache.org/41kf2
>>
>> This release is using the release script of the tag v3.1.1-rc2.
>>
>> FAQ
>>
>> ===
>> What happened to 3.1.0?
>> ===
>>
>> There was a technical issue during Apache Spark 3.1.0 preparation, and it
>> was discussed and decided to skip 3.1.0.
>> Please see
>> https://spark.apache.org/news/next-official-release-spark-3.1.1.html for
>> more details.
>>
>> =
>> How can I help test this release?
>> =
>>
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC via "pip install
>> https://dist.apache.org/repos/dist/dev/spark/v3.1.1-rc2-bin/pyspark-3.1.1.tar.gz
>> "
>> and see if anything important breaks.
>> In the Java/Scala, you can add the staging repository to your project's
>> resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out of date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.1.1?
>> ===
>>
>> The current list of open tickets targeted at 3.1.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.1.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>>
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>>

-- 
---
Takeshi Yamamuro


Re: When is the Spark 3.1 release date?

2021-01-09 Thread Takeshi Yamamuro
Hi,

We've already started a vote for the v3.1 release:
https://www.mail-archive.com/dev@spark.apache.org/msg27133.html
But I think we need more time before the official release.
Please keep an eye on the vote threads on the spark-dev mailing list if you're
interested.

Bests,
Takeshi

On Sat, Jan 9, 2021 at 3:02 PM Vivek Bhaskar  wrote:

> I see early Jan for voting date?
> https://spark.apache.org/versioning-policy.html
>
> Regards,
> Vivek
>


-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Spark 3.1.0 (RC1)

2021-01-07 Thread Takeshi Yamamuro
> I will prepare to upload news in spark-website to explain that 3.1.0 is
incompletely published because there was something wrong during the release
process, and we go to 3.1.1 right away.

+1

Thanks for working on this, Hyukjin!

Bests,
Takeshi

On Thu, Jan 7, 2021 at 2:13 PM Hyukjin Kwon  wrote:

> Thank you Holden and Wenchen!
>
> Let me:
> - prepare a PR for news in spark-website first about 3.1.0 accident late
> tonight (in KST)
> - and start to prepare 3.1.1 probably in few more days like next monday in
> case other people have different thoughts
>
>
>
> On Thu, Jan 7, 2021 at 2:04 PM Holden Karau  wrote:
>
>> I think that posting the 3.1.0 maven release was an accident, and going
>> to 3.1.1 RCs is the right step forward.
>> I'd ask for maybe a day before cutting the 3.1.1 release, I think
>> https://issues.apache.org/jira/browse/SPARK-34018 is also a blocker (at
>> first I thought it was just a test issue, but Dongjoon pointed out the NPE
>> happens in prod too).
>>
>> I'd also like to echo the: it's totally ok we all make mistakes
>> especially in partially manual & partially automated environments, I've
>> created a bunch of RCs labels without recognizing they were getting pushed
>> automatically.
>>
>> On Wed, Jan 6, 2021 at 8:57 PM Wenchen Fan  wrote:
>>
>>> I agree with Jungtaek that people are likely to be biased when testing
>>> 3.1.0. At least this will not be the same community-blessed release as
>>> previous ones, because the voting is already affected by the fact that
>>> 3.1.0 is already in maven central. Skipping 3.1.0 sounds better to me.
>>>
>>> On Thu, Jan 7, 2021 at 12:54 PM Hyukjin Kwon 
>>> wrote:
>>>
>>>> Okay, let me just start to prepare 3.1.1. I think that will address all
>>>> concerns except that 3.1.0 will remain in Maven as incomplete.
>>>> By right, removal in the Maven repo is disallowed. Overwrite is
>>>> possible as far as I know but other mirrors that maintain cache will get
>>>> affected.
>>>> Maven is one of the downstream publish channels, and we haven't
>>>> officially announced and published it to Apache repo anyway.
>>>> I will prepare to upload news in spark-website to explain that 3.1.0 is
>>>> incompletely published because there was something wrong during the release
>>>> process, and we go to 3.1.1 right away.
>>>> Are we all good with this?
>>>>
>>>>
>>>>
>>>>> On Thu, Jan 7, 2021 at 1:11 PM Hyukjin Kwon  wrote:
>>>>
>>>>> I think that it would be great though if we have a clear blocker that
>>>>> makes the release pointless if we want to drop this RC practically given
>>>>> that we will schedule 3.1.1 faster - non-regression bug fixes will be
>>>>> delivered to end users relatively fast.
>>>>> That would make it clear which option we should take. I personally
>>>>> don't mind dropping 3.1.0 as well; we'll have to wait for the INFRA team's
>>>>> response anyway.
>>>>>
>>>>>
>>>>> On Thu, Jan 7, 2021 at 1:03 PM Sean Owen  wrote:
>>>>>
>>>>>> I don't agree the first two are blockers for reasons I gave earlier.
>>>>>> Those two do look like important issues - are they regressions from
>>>>>> 3.0.1?
>>>>>> I do agree we'd probably cut a new RC for those in any event, so
>>>>>> agree with the plan to drop 3.1.0 (if the Maven release can't be
>>>>>> overwritten)
>>>>>>
>>>>>> On Wed, Jan 6, 2021 at 9:38 PM Dongjoon Hyun 
>>>>>> wrote:
>>>>>>
>>>>>>> Before we discover the pre-uploaded artifacts, both Jungtaek and
>>>>>>> Hyukjin already made two blockers shared here.
>>>>>>> IIUC, it meant implicitly RC1 failure at that time.
>>>>>>>
>>>>>>> In addition to that, there are two correctness issues. So, I made up
>>>>>>> my mind to cast -1 for this RC1 before joining this thread.
>>>>>>>
>>>>>>> SPARK-34011 ALTER TABLE .. RENAME TO PARTITION doesn't refresh cache
>>>>>>> (committed after tagging)
>>>>>>> SPARK-34027 ALTER TABLE .. RECOVER PARTITIONS doesn't refresh cache
>>>>>>> (PR is under review)
>>>>>>>
>>>>>>> Although the above issues are not regression, those are enough for
>>>>>>> me to give -1 for 3.1.0 RC1.
>>>>>>>
>>>>>>> On Wed, Jan 6, 2021 at 3:52 PM Sean Owen  wrote:
>>>>>>>
>>>>>>>> I just don't see a reason to believe there's a rush? just test it
>>>>>>>> as normal? I did, you can too, etc.
>>>>>>>> Or specifically what blocks the current RC?
>>>>>>>>
>>>>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
---
Takeshi Yamamuro


Re: [build system] WE'RE LIVE!

2020-12-01 Thread Takeshi Yamamuro
Many thanks, guys!
I've confirmed that I can re-trigger Jenkins tests.

Bests,
Takeshi

On Wed, Dec 2, 2020 at 9:55 AM shane knapp ☠  wrote:

> https://amplab.cs.berkeley.edu/jenkins/
>
> i cleared the build queue, so you'll need to retrigger your PRs.  there
> will be occasional downtime over the next few days and weeks as we uncover
> system-level errors and more reimaging happens...  but for now, we're
> building.
>
> a big thanks goes out to jon for his work on the project!  we couldn't
> have done it w/o him.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
---
Takeshi Yamamuro


Re: jenkins downtime tomorrow evening/weekend

2020-11-21 Thread Takeshi Yamamuro
Thanks for the work, Shane!


On Sun, Nov 22, 2020 at 8:53 AM shane knapp ☠  wrote:

> this is starting now
>
> On Thu, Nov 19, 2020 at 4:34 PM shane knapp ☠  wrote:
>
>> i'm going to be upgrading jenkins to something more reasonable, and there
>> will definitely be some downtime as i get things sorted.
>>
>> we should be back up and building by monday.
>>
>> shane
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
---
Takeshi Yamamuro


Re: [VOTE] Standardize Spark Exception Messages SPIP

2020-11-08 Thread Takeshi Yamamuro
+1

On Thu, Nov 5, 2020 at 3:41 AM Xinyi Yu  wrote:

> Hi all,
>
> We had the discussion of SPIP: Standardize Spark Exception Messages at
>
> http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-SPIP-Standardize-Spark-Exception-Messages-td30341.html
>
> . The SPIP document link is at
>
> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing
>
> . We want to have the vote on this, for 72 hours.
>
> Please vote before November 7th at noon:
>
> [ ] +1: Accept this SPIP proposal
> [ ] -1: Do not agree to standardize Spark exception messages, because ...
>
>
> Thanks for your time and feedback!
>
> --
> Xinyi
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: [DISCUSS][SPIP] Standardize Spark Exception Messages

2020-10-29 Thread Takeshi Yamamuro
I've already left comments about this idea in
https://github.com/apache/spark/pull/29497; it sounds plausible to me. +1.

On Mon, Oct 26, 2020 at 9:04 AM Xinyi Yu  wrote:

> Hi all,
>
> We like to post a SPIP of Standardize Exception Messages in Spark. Here is
> the document link:
>
> https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing
>
>
> This SPIP aims to standardize the exception messages in Spark. It has three
> major focuses:
> 1. Group exception messages in dedicated files for easy maintenance and
> auditing.
> 2. Establish an error message guideline for developers.
> 3. Improve error message quality.
>
> Thanks for your time and patience. Looking forward to your feedback!
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> ---------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: [build system] jenkins wedged again

2020-10-14 Thread Takeshi Yamamuro
Thank you always, Shane!


On Thu, Oct 15, 2020 at 4:55 AM shane knapp ☠  wrote:

> everything's up and jenkins is slowly chewing through the queue!  :)
>
> On Wed, Oct 14, 2020 at 12:00 PM Xiao Li  wrote:
>
>> Thank you, Shane!
>>
>> Xiao
>>
>> On Wed, Oct 14, 2020 at 12:00 PM shane knapp ☠ 
>> wrote:
>>
>>> we're mostly back up, and just waiting for a couple of ubuntu boxes to
>>> finish booting...  prb seem to be building now!
>>>
>>> On Wed, Oct 14, 2020 at 11:48 AM shane knapp ☠ 
>>> wrote:
>>>
>>>> i'm going to reboot the primary and worker nodes, so it'll be a few
>>>> minutes before everything is back up.
>>>>
>>>> shane
>>>> --
>>>> Shane Knapp
>>>> Computer Guy / Voice of Reason
>>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>>> https://rise.cs.berkeley.edu
>>>>
>>>
>>>
>>> --
>>> Shane Knapp
>>> Computer Guy / Voice of Reason
>>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>>> https://rise.cs.berkeley.edu
>>>
>>
>>
>> --
>>
>>
>
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
---
Takeshi Yamamuro


Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-15 Thread Takeshi Yamamuro
+1, looks interesting.

On Wed, Sep 16, 2020 at 11:29 AM 郑瑞峰  wrote:

> +1
>
>
> -- Original Message --
> *From:* "叶先进" ;
> *Sent:* Tuesday, September 15, 2020, 7:09 PM
> *To:* "Yi Wu";
> *Cc:* "Wenchen Fan";"Dongjoon Hyun"<
> dongjoon.h...@gmail.com>;"kalyan";"Joseph Torres"<
> joseph.tor...@databricks.com>;"angers.zhu";"Xiao
> Li";"csi...@apache.org";"Tom
> Graves";"Apache Spark Dev"<
> dev@spark.apache.org>;"Mridul Muralidharan";"DB Tsai"<
> dbt...@dbtsai.com>;
> *Subject:* Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve
> shuffle efficiency
>
> +1.
>
> On Sep 15, 2020, at 5:10 PM, Yi Wu  wrote:
>
> +1. Thank you for the effort!
>
> Best regards,
> Yi
>
> On Tue, Sep 15, 2020 at 3:44 PM Wenchen Fan  wrote:
>
>> +1
>>
>> On Tue, Sep 15, 2020 at 2:42 PM Dongjoon Hyun 
>> wrote:
>>
>>> +1
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Mon, Sep 14, 2020 at 9:19 PM kalyan  wrote:
>>>
>>>> +1
>>>>
>>>> Will positively improve the performance and reliability of spark...
>>>> Looking fwd to this..
>>>>
>>>> Regards
>>>> Kalyan.
>>>>
>>>> On Tue, Sep 15, 2020, 9:26 AM Joseph Torres <
>>>> joseph.tor...@databricks.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Mon, Sep 14, 2020 at 6:39 PM angers.zhu 
>>>>> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> angers.zhu
>>>>>> angers@gmail.com
>>>>>>
>>>>>>
>>>>>> On 09/15/2020 08:21,Xiao Li
>>>>>>  wrote:
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> Xiao
>>>>>>
>>>>>> On Mon, Sep 14, 2020 at 4:09 PM DB Tsai  wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Mon, Sep 14, 2020 at 12:30 PM Chandni Singh 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Chandni
>>>>>>>>
>>>>>>>> On Mon, Sep 14, 2020 at 11:41 AM Tom Graves <
>>>>>>>> tgraves...@yahoo.com.invalid> wrote:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>> On Sunday, September 13, 2020, 10:00:05 PM CDT, Mridul
>>>>>>>>> Muralidharan  wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'd like to call for a vote on SPARK-30602 - SPIP: Support
>>>>>>>>> push-based shuffle to improve shuffle efficiency.
>>>>>>>>> Please take a look at:
>>>>>>>>>
>>>>>>>>>- SPIP jira: https://issues.apache.org/jira/browse/SPARK-30602
>>>>>>>>>- SPIP doc:
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1mYzKVZllA5Flw8AtoX7JUcXBOnNIDADWRbJ7GI6Y71Q/edit
>>>>>>>>>- POC against master and results summary :
>>>>>>>>>
>>>>>>>>> https://docs.google.com/document/d/1Q5m7YAp0HyG_TNFL4p_bjQgzzw33ik5i49Vr86UNZgg/edit
>>>>>>>>>
>>>>>>>>> Active discussions on the jira and SPIP document have settled.
>>>>>>>>>
>>>>>>>>> I will leave the vote open until Friday (the 18th September
>>>>>>>>> 2020), 5pm CST.
>>>>>>>>>
>>>>>>>>> [ ] +1: Accept the proposal as an official SPIP
>>>>>>>>> [ ] +0
>>>>>>>>> [ ] -1: I don't think this is a good idea because ...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Mridul
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Sincerely,
>>>>>>>
>>>>>>> DB Tsai
>>>>>>> --
>>>>>>> Web: https://www.dbtsai.com
>>>>>>> PGP Key ID: 42E5B25A8F7A82C1
>>>>>>>
>>>>>>
>

-- 
---
Takeshi Yamamuro


Re: [ANNOUNCE] Announcing Apache Spark 3.0.1

2020-09-11 Thread Takeshi Yamamuro
Congrats and thanks, Ruifeng!


On Fri, Sep 11, 2020 at 9:50 PM Dongjoon Hyun 
wrote:

> It's great. Thank you, Ruifeng!
>
> Bests,
> Dongjoon.
>
> On Fri, Sep 11, 2020 at 1:54 AM 郑瑞峰  wrote:
>
>> Hi all,
>>
>> We are happy to announce the availability of Spark 3.0.1!
>> Spark 3.0.1 is a maintenance release containing stability fixes. This
>> release is based on the branch-3.0 maintenance branch of Spark. We strongly
>> recommend all 3.0 users to upgrade to this stable release.
>>
>> To download Spark 3.0.1, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> Note that you might need to clear your browser cache or to use
>> `Private`/`Incognito` mode according to your browsers.
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-3-0-1.html
>>
>> We would like to acknowledge all community members for contributing to
>> this release. This release would not have been possible without you.
>>
>>
>> Thanks,
>> Ruifeng Zheng
>>
>>

-- 
---
Takeshi Yamamuro


Re: Question about Expression Encoders

2020-08-23 Thread Takeshi Yamamuro
Hi,

Have you tried it like this?

--
{ r: InternalRow => enc1.fromRow(r) }

===>

{ r: InternalRow =>
  val fromRow = enc1.createDeserializer()
  fromRow(r)
}

https://github.com/apache/spark/commit/e7fef70fbbea08a38316abdaa9445123bb8c39e2
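
For reference, a minimal sketch of how the replacement API fits together
(this assumes the createSerializer/createDeserializer methods from the commit
above and a plain case class of my own; it is only an illustration, not your
library's code):

import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

case class Point(x: Int, label: String)

// Round-trip a case class through InternalRow with the newer API.
val enc = ExpressionEncoder[Point]()
val toRow = enc.createSerializer()                        // Point => InternalRow
val fromRow = enc.resolveAndBind().createDeserializer()   // InternalRow => Point

val row  = toRow(Point(1, "one"))
val back = fromRow(row)                                   // Point(1, "one")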

Bests,
Takeshi

On Thu, Aug 20, 2020 at 1:52 PM Mark Hamilton
 wrote:

> Dear Spark Developers,
>
>
>
> In our teams Spark Library <http://aka.ms/mmlspark> we utilize
> ExpressionEncoders to help us automatically generate spark SQL types from
> scala case classes.
>
>
>
>
> https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/core/schema/SparkBindings.scala
>
>
>
> However it seems in 3.0 the ability to get types from internal rows and
> rows has been removed. Is there any guidance on how to get similar behavior
> in 3.0? Thanks for your help!
>
>
>
> Best,
>
> Mark
>


-- 
---
Takeshi Yamamuro


Re: 回复: [DISCUSS] Apache Spark 3.0.1 Release

2020-08-16 Thread Takeshi Yamamuro
I've checked the Jenkins log, and it seems the commit from
https://github.com/apache/spark/pull/29404 caused the failure.
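
For reference, a rough sketch of the kind of query the quoted test failure below
exercises (SPARK-28224). The values are hypothetical, it assumes a spark-shell
session named `spark`, and the exact outcome (a null result vs. an
ArithmeticException) depends on the overflow-handling settings such as
spark.sql.ansi.enabled:

import org.apache.spark.sql.functions.sum
import spark.implicits._

// Each value fits Decimal(38, 18), but their sum needs 39 digits of precision.
val big = BigDecimal("99999999999999999999.999999999999999999")
val df = Seq(big, big).toDF("d")
df.agg(sum($"d")).show()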


On Sat, Aug 15, 2020 at 10:43 PM Koert Kuipers  wrote:

> i noticed commit today that seems to prepare for 3.0.1-rc1:
> commit 05144a5c10cd37ebdbb55fde37d677def49af11f
> Author: Ruifeng Zheng 
> Date:   Sat Aug 15 01:37:47 2020 +
>
> Preparing Spark release v3.0.1-rc1
>
> so i tried to build spark on that commit and i get failure in sql:
>
> 09:36:57.371 ERROR org.apache.spark.scheduler.TaskSetManager: Task 0 in
> stage 77.0 failed 1 times; aborting job
> [info] - SPARK-28224: Aggregate sum big decimal overflow *** FAILED ***
> (306 milliseconds)
> [info]   org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 0 in stage 77.0 failed 1 times, most recent failure: Lost
> task 0.0 in stage 77.0 (TID 197, 192.168.11.17, executor driver):
> java.lang.ArithmeticException:
> Decimal(expanded,0.246000,39,18}) cannot be
> represented as Decimal(38, 18).
> [info] at org.apache.spark.sql.types.Decimal.toPrecision(Decimal.scala:369)
> [info] at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregate_sum_0$(Unknown
> Source)
> [info] at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doConsume_0$(Unknown
> Source)
> [info] at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithoutKey_0$(Unknown
> Source)
> [info] at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
> Source)
> [info] at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> [info] at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
> [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
> [info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804)
> [info] at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227)
> [info] at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227)
> [info] at
> org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2138)
> [info] at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> [info] at org.apache.spark.scheduler.Task.run(Task.scala:127)
> [info] at
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
> [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
> [info] at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
> [info] at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [info] at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [info] at java.lang.Thread.run(Thread.java:748)
>
> [error] Failed tests:
> [error] org.apache.spark.sql.DataFrameSuite
>
> On Thu, Aug 13, 2020 at 8:19 PM Jason Moore
>  wrote:
>
>> Thank you so much!  Any update on getting the RC1 up for vote?
>>
>> Jason.
>>
>>
>> --
>> *From:* 郑瑞峰 
>> *Sent:* Wednesday, 5 August 2020 12:54 PM
>> *To:* Jason Moore ; Spark dev list <
>> dev@spark.apache.org>
>> *Subject:* 回复: [DISCUSS] Apache Spark 3.0.1 Release
>>
>> Hi all,
>> I am going to prepare the release of 3.0.1 RC1, with the help of Wenchen.
>>
>>
>> -- Original Message --
>> *From:* "Jason Moore" ;
>> *Sent:* Thursday, July 30, 2020, 10:35 AM
>> *To:* "dev";
>> *Subject:* Re: [DISCUSS] Apache Spark 3.0.1 Release
>>
>> Hi all,
>>
>>
>>
>> Discussion around 3.0.1 seems to have trickled away.  What was blocking
>> the release process kicking off?  I can see some unresolved bugs raised
>> against 3.0.0, but conversely there were quite a few critical correctness
>> fixes waiting to be released.
>>
>>
>>
>> Cheers,
>>
>> Jason.
>>
>>
>>
>> *From: *Takeshi Yamamuro 
>> *Date: *Wednesday, 15 July 2020 at 9:00 am
>> *To: *Shivaram Venkataraman 
>> *Cc: *"dev@spark.apache.org" 
>> *Subject: *Re: [DISCUSS] Apache Spark 3.0.1 Release
>>
>>
>>
>> > Just wanted to check if there are any blockers that we are still
>> waiting for to start the new release process.
>>
>> I don't see any on-going blocker in my area.
>>
>> Thanks for the notification.

Re: [VOTE] Release Spark 2.4.7 (RC1)

2020-08-08 Thread Takeshi Yamamuro
C and see if anything important breaks, in the Java/Scala
>>>> > you can add the staging repository to your projects resolvers and test
>>>> > with the RC (make sure to clean up the artifact cache before/after so
>>>> > you don't end up building with an out of date RC going forward).
>>>> >
>>>> > ===
>>>> > What should happen to JIRA tickets still targeting 2.4.7?
>>>> > =======
>>>> >
>>>> > The current list of open tickets targeted at 2.4.7 can be found at:
>>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>> Version/s" = 2.4.7
>>>> >
>>>> > Committers should look at those and triage. Extremely important bug
>>>> > fixes, documentation, and API tweaks that impact compatibility should
>>>> > be worked on immediately. Everything else please retarget to an
>>>> > appropriate release.
>>>> >
>>>> > ==
>>>> > But my bug isn't fixed?
>>>> > ==
>>>> >
>>>> > In order to make timely releases, we will typically not hold the
>>>> > release unless the bug in question is a regression from the previous
>>>> > release. That being said, if there is something which is a regression
>>>> > that has not been correctly targeted please ping me or a committer to
>>>> > help target the issue.
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>

-- 
---
Takeshi Yamamuro


Re: spark-on-k8s is still experimental?

2020-08-05 Thread Takeshi Yamamuro
Thanks for the info, all. Okay, I understand that we need more time before we
can officially announce GA.
But I'm still worried that users may hesitate to use this feature after reading
that statement in the doc,
so how about updating it to reflect the current situation? Please check
my suggestion in https://github.com/apache/spark/pull/29368.

Anyway, many thanks!


On Tue, Aug 4, 2020 at 12:26 AM Holden Karau  wrote:

> There was discussion around removing the statement and declaring it GA but
> I believe it was decided to leave it in until an external shuffle service
> is supported on K8s.
>
> On Mon, Aug 3, 2020 at 2:45 AM JackyLee  wrote:
>
>> +1. It has worked well in our company, and we have used it to support
>> online services since March this year.
>>
>>
>>
>> --
>> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
---
Takeshi Yamamuro


spark-on-k8s is still experimental?

2020-08-03 Thread Takeshi Yamamuro
Hi, all

A Spark user I know asked me this question.
I checked the spark-on-k8s document and it says:
```
**The Kubernetes scheduler is currently experimental.
In future versions, there may be behavioral changes around configuration,
container images and entrypoints.**
```
https://github.com/apache/spark/blame/master/docs/running-on-kubernetes.md#L26-L28

This statement was added when v2.3.0 released (
https://issues.apache.org/jira/browse/SPARK-23104).
Does this mean that the v2.3.0 and v3.0.0 releases are at the same
development stage?
I'm a bit worried that users will read the statement in the doc and give up
on trying this feature.
(In fact, the user who asked me seemed to think so)

I'm not familiar with that part, so does anyone know
when this statement can be removed from the doc?
Any milestone for that?

Thanks in advance,
Takeshi

-- 
---
Takeshi Yamamuro


Re: Contributing to JIRA Maintenance

2020-08-01 Thread Takeshi Yamamuro
>>> >> > I would like to ask for some help about JIRA maintenance
>>> contributions in Apache Spark.
>>> >> > I tend to see less and less people active in JIRA maintenance
>>> contributions.
>>> >> >
>>> >> > I have regularly checked all JIRAs and monitored them continuously
>>> for the last 4 years.
>>> >> > For the last week, I didn't have time to take a look, and I felt
>>> frustrated that there are
>>> >> > many JIRAs that look clearly needing action. Here are the examples
>>> only from the last week:
>>> >> >
>>> >> > Exact duplication:
>>> >> > Resolve one and link another one as a duplicate.
>>> >> > - https://issues.apache.org/jira/browse/SPARK-32370
>>> >> > - https://issues.apache.org/jira/browse/SPARK-32369
>>> >> >
>>> >> > Different languages:
>>> >> > Ask English translations which dev people use to communicate.
>>> >> > If the reporter is inactive, we can resolve it till then.
>>> >> > - https://issues.apache.org/jira/browse/SPARK-32355
>>> >> >
>>> >> > No JIRA description:
>>> >> >  Ask to fill the JIRA description. Not so many people know what
>>> the issue the
>>> >> > JIRA describes just from reading the title which will end up
>>> that nobody can work
>>> >> > on the JIRA.
>>> >> > - https://issues.apache.org/jira/browse/SPARK-32361
>>> >> > - https://issues.apache.org/jira/browse/SPARK-32359
>>> >> > - https://issues.apache.org/jira/browse/SPARK-32388
>>> >> > - https://issues.apache.org/jira/browse/SPARK-32390
>>> >> > - https://issues.apache.org/jira/browse/SPARK-32400
>>> >> >
>>> >> > Malformed image:
>>> >> > If the attached image looks malformed to you, ask to fix.
>>> >> > - https://issues.apache.org/jira/browse/SPARK-32433
>>> >> >
>>> >> > Questions:
>>> >> > Questions should usually go to mailing list or stackoverflow
>>> per http://spark.apache.org/community.html
>>> >> > - https://issues.apache.org/jira/browse/SPARK-32460
>>> >> >
>>> >> >
>>> >> > There is clear guidance about JIRA maintenance "Contributing to
>>> JIRA Maintenance"
>>> >> > in http://spark.apache.org/contributing.html (thanks @Sean Owen
>>> for writing this).
>>> >> > I hope to see more people and ask for some help in the JIRA
>>> maintenance.
>>> >> >
>>> >> > FWIW, at least I, as a PMC, monitor most of these JIRA maintenance
>>> contributions from the
>>> >> > community and take them into account when/where it should be.
>>> >> >
>>> >> >
>>> >> > Thanks all in advance.
>>> >>
>>> >> -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>
>>>
>>

-- 
---
Takeshi Yamamuro


Re: [PSA] Apache Spark uses GitHub Actions to run the tests

2020-07-14 Thread Takeshi Yamamuro
Thanks, Hyukjin!

> Therefore, I do believe PRs can be merged in most general cases once the
> Jenkins PR builder or Github Actions build passes

That's greatly helpful!

Bests,
Takeshi

On Tue, Jul 14, 2020 at 4:14 PM Hyukjin Kwon  wrote:

> Perfect. Plus, Github Actions is only for master branch at this moment.
>
> BTW, I think we can enable Java(Scala) doc build and dependency test back
> in Jenkins for simplicity.
> Seems like the Jenkins machine came back to normal.
>
> On Tue, Jul 14, 2020 at 4:08 PM Wenchen Fan  wrote:
>
>> To clarify, we need to wait for:
>> 1. Java documentation build test in github actions
>> 2. dependency test in github actions
>> 3. either github action all green or jenkin pass
>>
>> If the PR touches Kinesis, or it uses other profiles, we must wait for
>> jenkins to pass.
>>
>> Do I miss something?
>>
>> On Tue, Jul 14, 2020 at 2:18 PM Hyukjin Kwon  wrote:
>>
>>> Hi dev,
>>>
>>> Github Actions build was introduced to run the regular Spark test cases
>>> at https://github.com/apache/spark/pull/29057and
>>> https://github.com/apache/spark/pull/29086.
>>> This is virtually the duplication of default Jenkins PR builder at this
>>> moment.
>>>
>>> The only differences are:
>>> - Github Actions does not run the tests for Kinesis, see SPARK-32246
>>> - Github Actions does not support other profiles such as JDK 11 or Hive
>>> 1.2, see SPARK-32255
>>> - Jenkins build does not run Java documentation build, see SPARK-32233
>>> - Jenkins build does not run the dependency test, see SPARK-32178
>>>
>>> Therefore, I do believe PRs can be merged in most general cases once the
>>> Jenkins PR
>>> builder or Github Actions build passes when we expect the successful
>>> test results from
>>> the default Jenkins PR builder.
>>>
>>> Thanks.
>>>
>>

-- 
---
Takeshi Yamamuro


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-07-14 Thread Takeshi Yamamuro
> Just wanted to check if there are any blockers that we are still waiting
for to start the new release process.
I don't see any ongoing blocker in my area.
Thanks for the notification.

Bests,
Takeshi

On Wed, Jul 15, 2020 at 4:03 AM Dongjoon Hyun 
wrote:

> Hi, Yi.
>
> Could you explain why you think that is a blocker? For the given example
> from the JIRA description,
>
> spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
>
> Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")
>
> checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil)
>
>
> Apache Spark 3.0.0 seems to work like the following.
>
> scala> spark.version
> res0: String = 3.0.0
>
> scala> spark.udf.register("key", udf((m: Map[String, String]) =>
> m.keys.head.toInt))
> res1: org.apache.spark.sql.expressions.UserDefinedFunction =
> SparkUserDefinedFunction($Lambda$1958/948653928@5d6bed7b,IntegerType,List(Some(class[value[0]:
> map])),None,false,true)
>
> scala> Seq(Map("1" -> "one", "2" ->
> "two")).toDF("a").createOrReplaceTempView("t")
>
> scala> sql("SELECT key(a) AS k FROM t GROUP BY key(a)").collect
> res3: Array[org.apache.spark.sql.Row] = Array([1])
>
>
> Could you provide a reproducible example?
>
> Bests,
> Dongjoon.
>
>
> On Tue, Jul 14, 2020 at 10:04 AM Yi Wu  wrote:
>
>> This probably be a blocker:
>> https://issues.apache.org/jira/browse/SPARK-32307
>>
>> On Tue, Jul 14, 2020 at 11:13 PM Sean Owen  wrote:
>>
>>> https://issues.apache.org/jira/browse/SPARK-32234 ?
>>>
>>> On Tue, Jul 14, 2020 at 9:57 AM Shivaram Venkataraman
>>>  wrote:
>>> >
>>> > Hi all
>>> >
>>> > Just wanted to check if there are any blockers that we are still
>>> waiting for to start the new release process.
>>> >
>>> > Thanks
>>> > Shivaram
>>> >
>>>
>>

-- 
---
Takeshi Yamamuro


Re: Welcoming some new Apache Spark committers

2020-07-14 Thread Takeshi Yamamuro
Congrats, all!

On Wed, Jul 15, 2020 at 5:15 AM Takuya UESHIN 
wrote:

> Congrats and welcome!
>
> On Tue, Jul 14, 2020 at 1:07 PM Bryan Cutler  wrote:
>
>> Congratulations and welcome!
>>
>> On Tue, Jul 14, 2020 at 12:36 PM Xingbo Jiang 
>> wrote:
>>
>>> Welcome, Huaxin, Jungtaek, and Dilip!
>>>
>>> Congratulations!
>>>
>>> On Tue, Jul 14, 2020 at 10:37 AM Matei Zaharia 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>> The Spark PMC recently voted to add several new committers. Please join
>>>> me in welcoming them to their new roles! The new committers are:
>>>>
>>>> - Huaxin Gao
>>>> - Jungtaek Lim
>>>> - Dilip Biswal
>>>>
>>>> All three of them contributed to Spark 3.0 and we’re excited to have
>>>> them join the project.
>>>>
>>>> Matei and the Spark PMC
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>
> --
> Takuya UESHIN
>
>

-- 
---
Takeshi Yamamuro


Re: Jenkins is down

2020-07-05 Thread Takeshi Yamamuro
Great! Thanks, Shane!

On Mon, Jul 6, 2020 at 10:21 AM Hyukjin Kwon  wrote:

> Thanks Shane!
>
> On Mon, Jul 6, 2020 at 9:30 AM shane knapp ☠  wrote:
>
>> hey all, i was out of town for the weekend and noticed it was down this
>> morning and restarted the service.  it's been pretty flaky recently, so
>> i'll take a much closer look at things this coming week.
>>
>> On Sun, Jul 5, 2020 at 1:14 PM Dongjoon Hyun 
>> wrote:
>>
>>> Hi, All.
>>>
>>> Now, AmpLab Jenkins farm came back online.
>>>
>>>
>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/
>>>
>>> Also, many PRBuilder jobs were re-started 10 minutes ago.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>>
>>> On Fri, Jul 3, 2020 at 4:43 AM Hyukjin Kwon  wrote:
>>>
>>>> Hi all and Shane,
>>>>
>>>> Is there something wrong with the Jenkins machines? Seems they are down.
>>>>
>>>
>>
>> --
>> Shane Knapp
>> Computer Guy / Voice of Reason
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>

-- 
---
Takeshi Yamamuro


Re: [VOTE] Decommissioning SPIP

2020-07-02 Thread Takeshi Yamamuro
+1; looks like a great feature.

On Thu, Jul 2, 2020 at 3:33 PM devesh.agra...@gmail.com <
devesh.agra...@gmail.com> wrote:

> +1
>
> This proposal will improve the operation and cost of Spark on cloud
> environments.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: Contribute to Apache Spark

2020-06-30 Thread Takeshi Yamamuro
Hi,

Thanks for your interest!
Please read the contribution guide first:
https://spark.apache.org/contributing.html

There is no special contributor permission needed; you can file issues in JIRA
yourself and then open a PR for them.

Enjoy your work!

On Tue, Jun 30, 2020 at 2:34 PM 飘鹅玉雪 <397189...@qq.com> wrote:

> Hi,
> I want to contribute to Apache Spark.
> Would you please give me the contributor permission?
> My JIRA ID is suizhe007.
>


-- 
---
Takeshi Yamamuro


Re: [DISCUSS] Apache Spark 3.0.1 Release

2020-06-23 Thread Takeshi Yamamuro
Thanks for the heads-up, Yuanjian!

> I also noticed branch-3.0 already has 39 commits after Spark 3.0.0.
Wow, the updates have come quickly. Anyway, +1 for the release.

Bests,
Takeshi

On Tue, Jun 23, 2020 at 4:59 PM Yuanjian Li  wrote:

> Hi dev-list,
>
> I’m writing this to raise the discussion about Spark 3.0.1 feasibility
> since 4 blocker issues were found after Spark 3.0.0:
>
>
>1.
>
>[SPARK-31990] <https://issues.apache.org/jira/browse/SPARK-31990> The
>state store compatibility broken will cause a correctness issue when
>Streaming query with `dropDuplicate` uses the checkpoint written by the old
>Spark version.
>2.
>
>[SPARK-32038] <https://issues.apache.org/jira/browse/SPARK-32038> The
>regression bug in handling NaN values in COUNT(DISTINCT)
>3.
>
>[SPARK-31918] <https://issues.apache.org/jira/browse/SPARK-31918>[WIP]
>CRAN requires to make it working with the latest R 4.0. It makes the 3.0
>release unavailable on CRAN, and only supports R [3.5, 4.0)
>4.
>
>[SPARK-31967] <https://issues.apache.org/jira/browse/SPARK-31967>
>Downgrade vis.js to fix Jobs UI loading time regression
>
>
> I also noticed branch-3.0 already has 39 commits
> <https://issues.apache.org/jira/browse/SPARK-32038?jql=project%20%3D%20SPARK%20AND%20fixVersion%20%3D%203.0.1>
> after Spark 3.0.0. I think it would be great if we have Spark 3.0.1 to
> deliver the critical fixes.
>
> Any comments are appreciated.
>
> Best,
>
> Yuanjian
>
>

-- 
---
Takeshi Yamamuro


Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Takeshi Yamamuro
Congrats, all!

Bests,
Takeshi

On Fri, Jun 19, 2020 at 1:16 PM Felix Cheung 
wrote:

> Congrats
>
> --
> *From:* Jungtaek Lim 
> *Sent:* Thursday, June 18, 2020 8:18:54 PM
> *To:* Hyukjin Kwon 
> *Cc:* Mridul Muralidharan ; Reynold Xin <
> r...@databricks.com>; dev ; user <
> u...@spark.apache.org>
> *Subject:* Re: [ANNOUNCE] Apache Spark 3.0.0
>
> Great, thanks all for your efforts on the huge step forward!
>
> On Fri, Jun 19, 2020 at 12:13 PM Hyukjin Kwon  wrote:
>
> Yay!
>
> On Fri, Jun 19, 2020 at 4:46 AM Mridul Muralidharan  wrote:
>
> Great job everyone ! Congratulations :-)
>
> Regards,
> Mridul
>
> On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin  wrote:
>
> Hi all,
>
> Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many
> of the innovations from Spark 2.x, bringing new ideas as well as continuing
> long-term projects that have been in development. This release resolves
> more than 3400 tickets.
>
> We'd like to thank our contributors and users for their contributions and
> early feedback to this release. This release would not have been possible
> without you.
>
> To download Spark 3.0.0, head over to the download page:
> http://spark.apache.org/downloads.html
>
> To view the release notes:
> https://spark.apache.org/releases/spark-release-3-0-0.html
>
>
>
>

-- 
---
Takeshi Yamamuro


Re: [ANNOUNCE] Apache Spark 2.4.6 released

2020-06-10 Thread Takeshi Yamamuro
Congrats and thanks, Holden!

Bests,
Takeshi

On Thu, Jun 11, 2020 at 11:16 AM Dongjoon Hyun 
wrote:

> Thank you so much, Holden! :)
>
> On Wed, Jun 10, 2020 at 6:59 PM Hyukjin Kwon  wrote:
>
>> Yay!
>>
>>> On Thu, Jun 11, 2020 at 10:38 AM Holden Karau  wrote:
>>
>>> We are happy to announce the availability of Spark 2.4.6!
>>>
>>> Spark 2.4.6 is a maintenance release containing stability, correctness,
>>> and security fixes.
>>> This release is based on the branch-2.4 maintenance branch of Spark. We
>>> strongly recommend all 2.4 users to upgrade to this stable release.
>>>
>>> To download Spark 2.4.6, head over to the download page:
>>> http://spark.apache.org/downloads.html
>>> Spark 2.4.6 is also available in Maven Central, PyPI, and CRAN.
>>>
>>> Note that you might need to clear your browser cache or
>>> to use `Private`/`Incognito` mode according to your browsers.
>>>
>>> To view the release notes:
>>> https://spark.apache.org/releases/spark-release-2.4.6.html
>>>
>>> We would like to acknowledge all community members for contributing to
>>> this
>>> release. This release would not have been possible without you.
>>>
>>

-- 
---
Takeshi Yamamuro


Re: [vote] Apache Spark 3.0 RC3

2020-06-07 Thread Takeshi Yamamuro
+1 (non-binding)

I don't see any ongoing PR to fix critical bugs in my area.
Bests,
Takeshi

On Sun, Jun 7, 2020 at 7:24 PM Mridul Muralidharan  wrote:

> +1
>
> Regards,
> Mridul
>
> On Sat, Jun 6, 2020 at 1:20 PM Reynold Xin  wrote:
>
>> Apologies for the mistake. The vote is open till 11:59pm Pacific time on
>> Mon June 9th.
>>
>> On Sat, Jun 6, 2020 at 1:08 PM Reynold Xin  wrote:
>>
>>> Please vote on releasing the following candidate as Apache Spark version
>>> 3.0.0.
>>>
>>> The vote is open until [DUE DAY] and passes if a majority +1 PMC votes
>>> are cast, with a minimum of 3 +1 votes.
>>>
>>> [ ] +1 Release this package as Apache Spark 3.0.0
>>> [ ] -1 Do not release this package because ...
>>>
>>> To learn more about Apache Spark, please see http://spark.apache.org/
>>>
>>> The tag to be voted on is v3.0.0-rc3 (commit
>>> 3fdfce3120f307147244e5eaf46d61419a723d50):
>>> https://github.com/apache/spark/tree/v3.0.0-rc3
>>>
>>> The release files, including signatures, digests, etc. can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-bin/
>>>
>>> Signatures used for Spark RCs can be found in this file:
>>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>
>>> The staging repository for this release can be found at:
>>> https://repository.apache.org/content/repositories/orgapachespark-1350/
>>>
>>> The documentation corresponding to this release can be found at:
>>> https://dist.apache.org/repos/dist/dev/spark/v3.0.0-rc3-docs/
>>>
>>> The list of bug fixes going into 3.0.0 can be found at the following URL:
>>> https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>>
>>> This release is using the release script of the tag v3.0.0-rc3.
>>>
>>> FAQ
>>>
>>> =
>>> How can I help test this release?
>>> =
>>>
>>> If you are a Spark user, you can help us test this release by taking
>>> an existing Spark workload and running on this release candidate, then
>>> reporting any regressions.
>>>
>>> If you're working in PySpark you can set up a virtual env and install
>>> the current RC and see if anything important breaks, in the Java/Scala
>>> you can add the staging repository to your projects resolvers and test
>>> with the RC (make sure to clean up the artifact cache before/after so
>>> you don't end up building with an out-of-date RC going forward).
>>>
>>> ===
>>> What should happen to JIRA tickets still targeting 3.0.0?
>>> ===
>>>
>>> The current list of open tickets targeted at 3.0.0 can be found at:
>>> https://issues.apache.org/jira/projects/SPARK and search for "Target
>>> Version/s" = 3.0.0
>>>
>>> Committers should look at those and triage. Extremely important bug
>>> fixes, documentation, and API tweaks that impact compatibility should
>>> be worked on immediately. Everything else please retarget to an
>>> appropriate release.
>>>
>>> ==
>>> But my bug isn't fixed?
>>> ==
>>>
>>> In order to make timely releases, we will typically not hold the
>>> release unless the bug in question is a regression from the previous
>>> release. That being said, if there is something which is a regression
>>> that has not been correctly targeted please ping me or a committer to
>>> help target the issue.
>>>
>>>
>>>

-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Spark 2.4.6 (RC1)

2020-05-10 Thread Takeshi Yamamuro
aking
>>>>> > an existing Spark workload and running on this release candidate,
>>>>> then
>>>>> > reporting any regressions.
>>>>> >
>>>>> > If you're working in PySpark you can set up a virtual env and install
>>>>> > the current RC and see if anything important breaks, in the
>>>>> Java/Scala
>>>>> > you can add the staging repository to your projects resolvers and
>>>>> test
>>>>> > with the RC (make sure to clean up the artifact cache before/after so
>>>>> > you don't end up building with an out-of-date RC going forward).
>>>>> >
>>>>> > ===
>>>>> > What should happen to JIRA tickets still targeting 2.4.6?
>>>>> > ===
>>>>> >
>>>>> > The current list of open tickets targeted at 2.4.5 can be found at:
>>>>> > https://issues.apache.org/jira/projects/SPARK and search for
>>>>> "Target Version/s" = 2.4.6
>>>>> >
>>>>> > Committers should look at those and triage. Extremely important bug
>>>>> > fixes, documentation, and API tweaks that impact compatibility should
>>>>> > be worked on immediately. Everything else please retarget to an
>>>>> > appropriate release.
>>>>> >
>>>>> > ==
>>>>> > But my bug isn't fixed?
>>>>> > ==
>>>>> >
>>>>> > In order to make timely releases, we will typically not hold the
>>>>> > release unless the bug in question is a regression from the previous
>>>>> > release. That being said, if there is something which is a regression
>>>>> > that has not been correctly targeted please ping me or a committer to
>>>>> > help target the issue.
>>>>> >
>>>>> > --
>>>>> > Twitter: https://twitter.com/holdenkarau
>>>>> > Books (Learning Spark, High Performance Spark, etc.):
>>>>> https://amzn.to/2MaRAG9
>>>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>
>>>>> -
>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>
>>>>>
>>>
>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
---
Takeshi Yamamuro


Re: [VOTE] Apache Spark 3.0.0 RC1

2020-04-02 Thread Takeshi Yamamuro
Also, I think the 3.0 release should include all the SQL documentation
updates:
https://issues.apache.org/jira/browse/SPARK-28588

On Fri, Apr 3, 2020 at 12:36 AM Sean Owen  wrote:

> (If it wasn't stated explicitly, yeah I think we knew there are a few
> important unresolved issues and that this RC was going to fail. Let's
> all please test anyway of course, to flush out any additional issues,
> rather than wait. Pipelining and all that.)
>
> On Thu, Apr 2, 2020 at 10:31 AM Maxim Gekk 
> wrote:
> >
> > -1 (non-binding)
> >
> > The problem of compatibility with Spark 2.4 in reading/writing
> dates/timestamps hasn't been solved completely so far. In particular, the
> sub-task https://issues.apache.org/jira/browse/SPARK-31328 hasn't
> resolved yet.
> >
> > Maxim Gekk
> >
> > Software Engineer
> >
> > Databricks, Inc.
> >
> >
> >
> > On Wed, Apr 1, 2020 at 7:09 PM Ryan Blue 
> wrote:
> >>
> >> -1 (non-binding)
> >>
> >> I agree with Jungtaek. The change to create datasource tables instead
> of Hive tables by default (no USING or STORED AS clauses) has created
> confusing behavior and should either be rolled back or fixed before 3.0.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

2020-03-14 Thread Takeshi Yamamuro
Ah, I see now what the "broken" means. Thanks, Yi.
I personally think option 1 is the best for existing Spark users, to
support the use case you suggested above.
So, I think this decision depends on how difficult it is to implement "get
Scala lambda parameter types by reflection"
and the complexity of its implementation.
(I'm not familiar with the 2.12 implementation, so I'm not really sure how
difficult it is.)

If we cannot choose option 1, I like option 2 better than
adding a new API for the use case (option 3).
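
As a concrete illustration, here is the Row-based use case mentioned earlier in
this thread, written out as a sketch (the column name, return schema, and the
flag note are my own additions; in 3.0 this call is deprecated and may require
spark.sql.legacy.allowUntypedScalaUDF=true):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._

// A Row-in/Row-out function whose struct schema is only known at runtime,
// registered through the untyped udf(f: AnyRef, dataType: DataType) variant.
val doubleFirst = udf(
  (r: Row) => Row(r.getAs[Int](0) * 2),
  StructType(Seq(StructField("doubled", IntegerType))))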

Bests,
Takeshi

On Sat, Mar 14, 2020 at 6:24 PM wuyi  wrote:

> Hi Takeshi, thanks for your reply.
>
> Before the broken, we only do the null check for primitive types and leave
> null value of non-primitive type to UDF itself in case it will be handled
> specifically, e.g., a UDF may return something else for null String.
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: [DISCUSS] Null-handling of primitive-type of untyped Scala UDF in Scala 2.12

2020-03-14 Thread Takeshi Yamamuro
Hi, Yi,

I'm probably missing something, but can't we just wrap the udf with
`if (isnull(x)) null else udf(knownnotnull(x))`?
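
A rough sketch of that wrapping, written with Catalyst expressions (the helper
name is mine, and the null literal's type should be the UDF's return type, Int in
the example from this thread, so this is only an illustration of the idea, not the
actual analyzer rule):

import org.apache.spark.sql.catalyst.expressions._
import org.apache.spark.sql.types.IntegerType

// if (isnull(x)) null else udf(knownnotnull(x)), applied per primitive-typed input x
def nullSafeCall(input: Expression, callUdf: Expression => Expression): Expression =
  If(IsNull(input), Literal.create(null, IntegerType), callUdf(KnownNotNull(input)))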

On Fri, Mar 13, 2020 at 6:22 PM wuyi  wrote:

> Hi all, I'd like to raise a discussion here about null-handling of
> primitive-type of untyped Scala UDF [ udf(f: AnyRef, dataType: DataType) ].
>
> After we switch to Scala 2.12 in 3.0, the untyped Scala UDF is broken
> because now we can't use reflection to get the parameter types of the Scala
> lambda.
> This leads to silent result changing, for example, with UDF defined as `val
> f = udf((x: Int) => x, IntegerType)`, the query `select f($"x")` has
> different
> behavior between 2.4 and 3.0 when the input value of column x is null.
>
> Spark 2.4:  null
> Spark 3.0:  0
>
> Because of it, we deprecate the untyped Scala UDF in 3.0 and recommend
> users
> to use the typed ones. However, recently I identified several valid use
> cases,
> e.g., `val f = (r: Row) => Row(r.getAs[Int](0) * 2)`, where the schema
> cannot be detected in typed Scala UDF [ udf[RT: TypeTag, A1: TypeTag](f:
> Function1[A1, RT]) ].
>
> There are 3 solutions:
> 1. find a way to get Scala lambda parameter types by reflection (I tried it
> very hard but has no luck. The Java SAM type is so dynamic)
> 2. support case class as the input of typed Scala UDF, so at least people
> can still deal with struct type input column with UDF
> 3. add a new variant of untyped Scala UDF which users can specify input
> types
>
> I'd like to see more feedbacks or ideas about how to move forward.
>
> Thanks,
> Yi Wu
>
>
>
> --
> Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: Auto-linking from PRs to Jira tickets

2020-03-11 Thread Takeshi Yamamuro
Cool! Thanks, Dongjoon!

Bests,
Takeshi

On Thu, Mar 12, 2020 at 8:27 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Autolinking from PR to JIRA started.
>
> *Inside PR*
> https://github.com/apache/spark/pull/27881
>
> *Inside commit log*
> https://github.com/apache/spark/commits/master
>
> You don't need to add hyperlink to `SPARK-XXX` manually from now.
>
> Bests,
> Dongjoon.
>
>>

-- 
---
Takeshi Yamamuro


Re: [VOTE] Amend Spark's Semantic Versioning Policy

2020-03-09 Thread Takeshi Yamamuro
>> >>
>>> >> >> The Spark project strives to avoid breaking APIs or silently
>>> changing behavior, even at major versions. While this is not always
>>> possible, the balance of the following factors should be considered before
>>> choosing to break an API.
>>> >> >>
>>> >> >>
>>> >> >> Cost of Breaking an API
>>> >> >>
>>> >> >> Breaking an API almost always has a non-trivial cost to the users
>>> of Spark. A broken API means that Spark programs need to be rewritten
>>> before they can be upgraded. However, there are a few considerations when
>>> thinking about what the cost will be:
>>> >> >>
>>> >> >> Usage - an API that is actively used in many different places, is
>>> always very costly to break. While it is hard to know usage for sure, there
>>> are a bunch of ways that we can estimate:
>>> >> >>
>>> >> >> How long has the API been in Spark?
>>> >> >>
>>> >> >> Is the API common even for basic programs?
>>> >> >>
>>> >> >> How often do we see recent questions in JIRA or mailing lists?
>>> >> >>
>>> >> >> How often does it appear in StackOverflow or blogs?
>>> >> >>
>>> >> >> Behavior after the break - How will a program that works today,
>>> work after the break? The following are listed roughly in order of
>>> increasing severity:
>>> >> >>
>>> >> >> Will there be a compiler or linker error?
>>> >> >>
>>> >> >> Will there be a runtime exception?
>>> >> >>
>>> >> >> Will that exception happen after significant processing has been
>>> done?
>>> >> >>
>>> >> >> Will we silently return different answers? (very hard to debug,
>>> might not even notice!)
>>> >> >>
>>> >> >>
>>> >> >> Cost of Maintaining an API
>>> >> >>
>>> >> >> Of course, the above does not mean that we will never break any
>>> APIs. We must also consider the cost both to the project and to our users
>>> of keeping the API in question.
>>> >> >>
>>> >> >> Project Costs - Every API we have needs to be tested and needs to
>>> keep working as other parts of the project changes. These costs are
>>> significantly exacerbated when external dependencies change (the JVM,
>>> Scala, etc). In some cases, while not completely technically infeasible,
>>> the cost of maintaining a particular API can become too high.
>>> >> >>
>>> >> >> User Costs - APIs also have a cognitive cost to users learning
>>> Spark or trying to understand Spark programs. This cost becomes even higher
>>> when the API in question has confusing or undefined semantics.
>>> >> >>
>>> >> >>
>>> >> >> Alternatives to Breaking an API
>>> >> >>
>>> >> >> In cases where there is a "Bad API", but where the cost of removal
>>> is also high, there are alternatives that should be considered that do not
>>> hurt existing users but do address some of the maintenance costs.
>>> >> >>
>>> >> >>
>>> >> >> Avoid Bad APIs - While this is a bit obvious, it is an important
>>> point. Anytime we are adding a new interface to Spark we should consider
>>> that we might be stuck with this API forever. Think deeply about how new
>>> APIs relate to existing ones, as well as how you expect them to evolve over
>>> time.
>>> >> >>
>>> >> >> Deprecation Warnings - All deprecation warnings should point to a
>>> clear alternative and should never just say that an API is deprecated.
>>> >> >>
>>> >> >> Updated Docs - Documentation should point to the "best"
>>> recommended way of performing a given task. In the cases where we maintain
>>> legacy documentation, we should clearly point to newer APIs and suggest to
>>> users the "right" way.
>>> >> >>
>>> >> >> Community Work - Many people learn Spark by reading blogs and
>>> other sites such as StackOverflow. However, many of these resources are out
>>> of date. Update them, to reduce the cost of eventually removing deprecated
>>> APIs.
>>> >> >>
>>> >> >>
>>> >> >> 
>>> >>
>>> >> -
>>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>> >>
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>

-- 
---
Takeshi Yamamuro


Re: [Proposal] Modification to Spark's Semantic Versioning Policy

2020-03-07 Thread Takeshi Yamamuro
s,
>>>> there are several more going on now, some pretty broad. I am not even sure
>>>> what all of them are. In addition to below,
>>>> https://github.com/apache/spark/pull/27839. Would it be too much
>>>> overhead to post to this thread any changes that one believes are endorsed
>>>> by these principles and perhaps a more strict interpretation of them now?
>>>> It's important enough we should get any data points or input, and now.
>>>> (We're obviously not going to debate each one.) A draft PR, or several,
>>>> actually sounds like a good vehicle for that -- as long as people know
>>>> about them!
>>>>
>>>> Also, is there any usage data available to share? many arguments turn
>>>> around 'commonly used' but can we know that more concretely?
>>>>
>>>> Otherwise I think we'll back into implementing personal interpretations
>>>> of general principles, which is arguably the issue in the first place, even
>>>> when everyone believes in good faith in the same principles.
>>>>
>>>>
>>>>
>>>> On Fri, Mar 6, 2020 at 1:08 PM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Hi, All.
>>>>>
>>>>> Recently, reverting PRs seems to start to spread like the *well-known*
>>>>> virus.
>>>>> Can we finalize this first before doing unofficial personal decisions?
>>>>> Technically, this thread was not a vote and our website doesn't have a
>>>>> clear policy yet.
>>>>>
>>>>> https://github.com/apache/spark/pull/27821
>>>>> [SPARK-25908][SQL][FOLLOW-UP] Add Back Multiple Removed APIs
>>>>> ==> This technically revert most of the SPARK-25908.
>>>>>
>>>>> https://github.com/apache/spark/pull/27835
>>>>> Revert "[SPARK-25457][SQL] IntegralDivide returns data type of the
>>>>> operands"
>>>>>
>>>>> https://github.com/apache/spark/pull/27834
>>>>> Revert [SPARK-24640][SQL] Return `NULL` from `size(NULL)` by default
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Thu, Mar 5, 2020 at 9:08 PM Dongjoon Hyun 
>>>>> wrote:
>>>>>
>>>>>> Hi, All.
>>>>>>
>>>>>> There is a on-going Xiao's PR referencing this email.
>>>>>>
>>>>>> https://github.com/apache/spark/pull/27821
>>>>>>
>>>>>> Bests,
>>>>>> Dongjoon.
>>>>>>
>>>>>> On Fri, Feb 28, 2020 at 11:20 AM Sean Owen  wrote:
>>>>>>
>>>>>>> On Fri, Feb 28, 2020 at 12:03 PM Holden Karau 
>>>>>>> wrote:
>>>>>>> >> 1. Could you estimate how many revert commits are required in
>>>>>>> `branch-3.0` for new rubric?
>>>>>>>
>>>>>>> Fair question about what actual change this implies for 3.0? so far
>>>>>>> it
>>>>>>> seems like some targeted, quite reasonable reverts. I don't think
>>>>>>> anyone's suggesting reverting loads of changes.
>>>>>>>
>>>>>>>
>>>>>>> >> 2. Are you going to revert all removed test cases for the
>>>>>>> deprecated ones?
>>>>>>> > This is a good point, making sure we keep the tests as well is
>>>>>>> important (worse than removing a deprecated API is shipping it broken),.
>>>>>>>
>>>>>>> (I'd say, yes of course! which seems consistent with what is
>>>>>>> happening now)
>>>>>>>
>>>>>>>
>>>>>>> >> 3. Does it make any delay for Apache Spark 3.0.0 release?
>>>>>>> >> (I believe it was previously scheduled on June before
>>>>>>> Spark Summit 2020)
>>>>>>> >
>>>>>>> > I think if we need to delay to make a better release this is ok,
>>>>>>> especially given our current preview releases being available to gather
>>>>>>> community feedback.
>>>>>>>
>>>>>>> Of course these things block 3.0 -- all the more reason to keep it
>>>>>>> specific and targeted -- but nothing so far seems inconsistent with
>>>>>>> finishing in a month or two.
>>>>>>>
>>>>>>>
>>>>>>> >> Although there was a discussion already, I want to make the
>>>>>>> following tough parts sure.
>>>>>>> >> 4. We are not going to add Scala 2.11 API, right?
>>>>>>> > I hope not.
>>>>>>> >>
>>>>>>> >> 5. We are not going to support Python 2.x in Apache Spark
>>>>>>> 3.1+, right?
>>>>>>> > I think doing that would be bad, it's already end of lifed
>>>>>>> elsewhere.
>>>>>>>
>>>>>>> Yeah this is an important subtext -- the valuable principles here
>>>>>>> could be interpreted in many different ways depending on how much you
>>>>>>> weight value of keeping APIs for compatibility vs value in
>>>>>>> simplifying
>>>>>>> Spark and pushing users to newer APIs more forcibly. They're all
>>>>>>> judgment calls, based on necessarily limited data about the universe
>>>>>>> of users. We can only go on rare direct user feedback, on feedback
>>>>>>> perhaps from vendors as proxies for a subset of users, and the
>>>>>>> general
>>>>>>> good faith judgment of committers who have lived Spark for years.
>>>>>>>
>>>>>>> My specific interpretation is that the standard is (correctly)
>>>>>>> tightening going forward, and retroactively a bit for 3.0. But, I do
>>>>>>> not think anyone is advocating for the logical extreme of, for
>>>>>>> example, maintaining Scala 2.11 compatibility indefinitely. I think
>>>>>>> that falls out readily from the rubric here: maintaining 2.11
>>>>>>> compatibility is really quite painful if you ever support 2.13 too,
>>>>>>> for example.
>>>>>>>
>>>>>>

-- 
---
Takeshi Yamamuro


Re: [DISCUSSION] Avoiding duplicate work

2020-02-21 Thread Takeshi Yamamuro
Yea, +1 to Sean's suggestion.
When we see an "I'm working on this" comment on a JIRA ticket,
I think we need to ask "Are you still working on this?" to avoid duplicate
work there.

On Sat, Feb 22, 2020 at 2:20 AM Nicholas Chammas 
wrote:

> +1 to what Sean said.
>
> On Fri, Feb 21, 2020 at 10:14 AM Sean Owen  wrote:
>
>> We've avoided using Assignee because it implies that someone 'owns'
>> resolving the issue, when we want to keep it collaborative, and many
>> times in the past someone would ask to be assigned and then didn't
>> follow through.
>>
>> You can comment on the JIRA to say "I'm working on this" but that has
>> the same problem. Frequently people see that and don't work on it, and
>> then the original person doesn't follow through either.
>>
>> The best practice is probably to write down your analysis of the
>> problem and solution so far in a comment. That helps everyone and
>> doesn't suggest others shouldn't work on it; we want them to, we want
>> them to work together. That also shows some commitment to working on
>> it.
>>
>>
>> On Fri, Feb 21, 2020 at 9:11 AM younggyu Chun 
>> wrote:
>> >
>> > what if both are looking at code and they don't make a merge request? I
>> guess we still can't see what's going on because that Jira ticket won't
>> show the linked PR.
>> >
>> > On Fri, 21 Feb 2020 at 09:58, Wenchen Fan  wrote:
>> >>
>> >> The JIRA ticket will show the linked PR if there are any, which
>> indicates that someone is working on it if the PR is active. Maybe the bot
>> should also leave a comment on the JIRA ticket to make it clearer?
>> >>
>> >> On Fri, Feb 21, 2020 at 10:54 PM younggyu Chun <
>> younggyuchu...@gmail.com> wrote:
>> >>>
>> >>> Hi All,
>> >>>
>> >>> I would like to suggest using the "Assignee" functionality in JIRA
>> when we are working on a project. When we pick a ticket to work on, we don't
>> know whether someone is already working on it.
>> >>>
>> >>> Recently I spent time solving an issue and made a merge request,
>> but it was actually duplicate work. The ticket I was working on didn't
>> have any clue that somebody else was already working on it.
>> >>>
>> >>> Are there ways to avoid duplicate work that I don't know about yet?
>> >>>
>> >>> Thank you,
>> >>> Younggyu
>> >>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
---
Takeshi Yamamuro


Re: [DISCUSSION] Esoteric Spark function `TRIM/LTRIM/RTRIM`

2020-02-16 Thread Takeshi Yamamuro
>>>> * MySQL doesn't allow multiple trim characters
>>>> * Spark 2.3 ~ 2.4 have the function in a different way.
>>>>
>>>> Here is the illustrative example of the problem.
>>>>
>>>> postgres=# SELECT trim('yxTomxx', 'xyz');
>>>> btrim
>>>> ---
>>>> Tom
>>>>
>>>> presto:default> SELECT trim('yxTomxx', 'xyz');
>>>> _col0
>>>> ---
>>>> Tom
>>>>
>>>> spark-sql> SELECT trim('yxTomxx', 'xyz');
>>>> z
>>>>
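As an illustrative aside (not part of the quoted message), this is roughly how the fixed behavior can be checked from spark-shell on a build that includes SPARK-28093; the SQL-standard TRIM syntax is the order-independent form. Exact output formatting may differ.

  // Both queries are expected to return "Tom" once SPARK-28093 is applied.
  spark.sql("SELECT trim('yxTomxx', 'xyz')").show()          // srcStr comes first after the fix
  spark.sql("SELECT trim(BOTH 'xyz' FROM 'yxTomxx')").show() // SQL-standard syntax, unambiguous
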
>>>> Here is our history to fix the above issue.
>>>>
>>>> [SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order
>>>> issue
>>>> 1. https://github.com/apache/spark/pull/24902
>>>>(Merged 2019-06-18 for v3.0.0, 3.0.0-preview and 3.0.0-preview2
>>>> released.)
>>>> 2. https://github.com/apache/spark/pull/24907
>>>>(Merged 2019-06-20 for v2.3.4, but reverted)
>>>> 3. https://github.com/apache/spark/pull/24908
>>>>(Merged 2019-06-21 for v2.4.4, but reverted)
>>>>
>>>> (2) and (3) were reverted before releases because we didn't want to fix
>>>> that in the maintenance releases. Please see the following references of
>>>> the decision.
>>>>
>>>> https://github.com/apache/spark/pull/24908#issuecomment-504799028
>>>> (2.3)
>>>> https://github.com/apache/spark/pull/24907#issuecomment-504799021
>>>> (2.4)
>>>>
>>>> Now, there are some requests to revert SPARK-28093 and to keep these
>>>> esoteric functions for backward compatibility and the following reason.
>>>>
>>>> > Reordering function parameters to match another system,
>>>> > for a method that is otherwise working correctly,
>>>> > sounds exactly like a cosmetic change to me.
>>>>
>>>> > How can we silently change the parameter of an existing SQL
>>>> function?
>>>> > I don't think this is a correctness issue as the SQL standard
>>>> > doesn't say that the function signature have to be trim(srcStr,
>>>> trimStr).
>>>>
>>>> The concern and the point of views make sense.
>>>>
>>>> My concerns are the followings.
>>>>
>>>> 1. These kinds of esoteric differences are what people negatively call
>>>> `vendor lock-in`.
>>>>- It's difficult for new users to understand.
>>>>- It's hard to migrate between Apache Spark and other systems.
>>>> 2. Although we did our best, Apache Spark SQL has not always been
>>>> sufficient.
>>>> 3. We need to do improvement in the future releases.
>>>>
>>>> In short, we can keep the 3.0.0-preview behaviors here or revert
>>>> SPARK-28093 in order to keep these vendor lock-in behaviors for
>>>> backward compatibility.
>>>>
>>>> What I want is to build a consistent point of view in this category
>>>> for the upcoming PR reviews.
>>>>
>>>> If we have a clear policy, we can save future community efforts in many
>>>> ways.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>

-- 
---
Takeshi Yamamuro


Re: [DISCUSS] naming policy of Spark configs

2020-02-12 Thread Takeshi Yamamuro
+1; the idea sounds reasonable.

Bests,
Takeshi

On Thu, Feb 13, 2020 at 12:39 PM Wenchen Fan  wrote:

> Hi Dongjoon,
>
> It's too much work to revisit all the configs that were added in 3.0, but I'll
> revisit the recent commits that update config names and see if they follow
> the new policy.
>
>
> Hi Reynold,
>
> There are a few interval configs:
> spark.sql.streaming.fileSink.log.compactInterval
> spark.sql.streaming.continuous.executorPollIntervalMs
>
> I think it's better to put the interval unit in the config name, like
> `executorPollIntervalMs`. Also the config should be created with
> `.timeConf`, so that users can set values like "1 second", "2 minutes", etc.
>
> There is no config that uses date/timestamp as value AFAIK.
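As an illustrative aside (not part of the quoted message), a rough sketch of the `.timeConf` suggestion above. ConfigBuilder is internal to Spark, so this only compiles inside the Spark source tree; the key, doc text, and default below are made up for illustration.

  import java.util.concurrent.TimeUnit
  import org.apache.spark.internal.config.ConfigBuilder

  // Declaring the config with .timeConf lets users write values such as "1 second" or "2 minutes".
  val EXECUTOR_POLL_INTERVAL = ConfigBuilder("spark.sql.streaming.continuous.executorPollInterval")
    .doc("How often continuous execution polls executors for progress.")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("100ms")
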
>
>
> Thanks,
> Wenchen
>
> On Thu, Feb 13, 2020 at 11:29 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> +1 Thanks for the proposal. Looks very reasonable to me.
>>
>> On Thu, Feb 13, 2020 at 10:53 AM Hyukjin Kwon 
>> wrote:
>>
>>> +1.
>>>
>>> 2020년 2월 13일 (목) 오전 9:30, Gengliang Wang 님이
>>> 작성:
>>>
>>>> +1, this is really helpful. We should make the SQL configurations
>>>> consistent and more readable.
>>>>
>>>> On Wed, Feb 12, 2020 at 3:33 PM Rubén Berenguel 
>>>> wrote:
>>>>
>>>>> I love it, it will make configs easier to read and write. Thanks
>>>>> Wenchen.
>>>>>
>>>>> R
>>>>>
>>>>> On 13 Feb 2020, at 00:15, Dongjoon Hyun 
>>>>> wrote:
>>>>>
>>>>> 
>>>>> Thank you, Wenchen.
>>>>>
>>>>> The new policy looks clear to me. +1 for the explicit policy.
>>>>>
>>>>> So, are we going to revise the existing conf names before 3.0.0
>>>>> release?
>>>>>
>>>>> Or, is it applied to new up-coming configurations from now?
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>> On Wed, Feb 12, 2020 at 7:43 AM Wenchen Fan 
>>>>> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I'd like to discuss the naming policy of Spark configs, as for now it
>>>>>> depends on personal preference, which leads to inconsistent naming.
>>>>>>
>>>>>> In general, the config name should be a noun that describes its
>>>>>> meaning clearly.
>>>>>> Good examples:
>>>>>> spark.sql.session.timeZone
>>>>>> spark.sql.streaming.continuous.executorQueueSize
>>>>>> spark.sql.statistics.histogram.numBins
>>>>>> Bad examples:
>>>>>> spark.sql.defaultSizeInBytes (default size for what?)
>>>>>>
>>>>>> Also note that a config name has many parts, joined by dots. Each part
>>>>>> is a namespace. Don't create namespaces unnecessarily.
>>>>>> Good example:
>>>>>> spark.sql.execution.rangeExchange.sampleSizePerPartition
>>>>>> spark.sql.execution.arrow.maxRecordsPerBatch
>>>>>> Bad examples:
>>>>>> spark.sql.windowExec.buffer.in.memory.threshold ("in" is not a
>>>>>> useful namespace, better to be .buffer.inMemoryThreshold)
>>>>>>
>>>>>> For a big feature, usually we need to create an umbrella config to
>>>>>> turn it on/off, and other configs for fine-grained controls. These 
>>>>>> configs
>>>>>> should share the same namespace, and the umbrella config should be named
>>>>>> like featureName.enabled. For example:
>>>>>> spark.sql.cbo.enabled
>>>>>> spark.sql.cbo.starSchemaDetection
>>>>>> spark.sql.cbo.starJoinFTRatio
>>>>>> spark.sql.cbo.joinReorder.enabled
>>>>>> spark.sql.cbo.joinReorder.dp.threshold (BTW "dp" is not a good
>>>>>> namespace)
>>>>>> spark.sql.cbo.joinReorder.card.weight (BTW "card" is not a good
>>>>>> namespace)
>>>>>>
>>>>>> For boolean configs, in general it should end with a verb, e.g.
>>>>>> spark.sql.join.preferSortMergeJoin. If the config is for a feature
>>>>>> and you can't find a good verb for the feature, featureName.enabled
>>>>>> is also good.
>>>>>>
>>>>>> I'll update https://spark.apache.org/contributing.html after we
>>>>>> reach a consensus here. Any comments are welcome!
>>>>>>
>>>>>> Thanks,
>>>>>> Wenchen
>>>>>>
>>>>>>
>>>>>>

-- 
---
Takeshi Yamamuro


Re: [build system] enabled the ubuntu staging node to help w/build queue

2020-02-11 Thread Takeshi Yamamuro
Thanks always..!

Bests,
Takeshi

On Wed, Feb 12, 2020 at 3:28 AM shane knapp ☠  wrote:

> the build queue has been increasing and to help throughput i enabled the
> 'ubuntu-testing' node.  i spot-checked a bunch of the spark maven builds,
> and they passed.
>
> i'll keep an eye out for any failures caused by the system and either
> remove it from the worker pool of fix what i need to.
>
> shane
> --
> Shane Knapp
> Computer Guy / Voice of Reason
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
---
Takeshi Yamamuro


Re: [ANNOUNCE] Announcing Apache Spark 2.4.5

2020-02-08 Thread Takeshi Yamamuro
Happy to hear the release news!

Bests,
Takeshi

On Sun, Feb 9, 2020 at 10:28 AM Dongjoon Hyun 
wrote:

> There was a typo in one URL. The correct release note URL is here.
>
> https://spark.apache.org/releases/spark-release-2-4-5.html
>
>
>
> On Sat, Feb 8, 2020 at 5:22 PM Dongjoon Hyun 
> wrote:
>
>> We are happy to announce the availability of Spark 2.4.5!
>>
>> Spark 2.4.5 is a maintenance release containing stability fixes. This
>> release is based on the branch-2.4 maintenance branch of Spark. We
>> strongly
>> recommend all 2.4 users to upgrade to this stable release.
>>
>> To download Spark 2.4.5, head over to the download page:
>> http://spark.apache.org/downloads.html
>>
>> Note that you might need to clear your browser cache or
>> to use `Private`/`Incognito` mode according to your browsers.
>>
>> To view the release notes:
>> https://spark.apache.org/releases/spark-release-2.4.5.html
>>
>> We would like to acknowledge all community members for contributing to
>> this
>> release. This release would not have been possible without you.
>>
>> Dongjoon Hyun
>>
>

-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Apache Spark 2.4.5 (RC2)

2020-02-04 Thread Takeshi Yamamuro
+1;
I ran the tests with
`-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pkubernetes
-Psparkr`
on macOS (Java 8).
Everything looks fine in my env.

Bests,
Takeshi

On Tue, Feb 4, 2020 at 12:35 PM Hyukjin Kwon  wrote:

> +1 from me too.
>
> 2020년 2월 4일 (화) 오후 12:26, Wenchen Fan 님이 작성:
>
>> AFAIK there is no ongoing critical bug fixes, +1
>>
>> On Mon, Feb 3, 2020 at 11:46 PM Dongjoon Hyun 
>> wrote:
>>
>>> Yes, it does officially since 2.4.0.
>>>
>>> 2.4.5 is a maintenance release of 2.4.x line and the community didn't
>>> support Hadoop 3.x on 'branch-2.4'. We didn't run test at all.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Sun, Feb 2, 2020 at 22:58 Ajith shetty 
>>> wrote:
>>>
>>>> Is the hadoop-3.1 profile supported for this release? I see a lot of UTs
>>>> failing under this profile.
>>>> https://github.com/apache/spark/blob/v2.4.5-rc2/pom.xml
>>>>
>>>> *Example:*
>>>>  [INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
>>>> [ERROR] Tests run: 3, Failures: 0, Errors: 3, Skipped: 0, Time elapsed:
>>>> 1.717 s <<< FAILURE! - in
>>>> org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite
>>>> [ERROR]
>>>> saveExternalTableAndQueryIt(org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite)
>>>> Time elapsed: 1.675 s  <<< ERROR!
>>>> java.lang.ExceptionInInitializerError
>>>> at
>>>> org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite.setUp(JavaMetastoreDataSourcesSuite.java:66)
>>>> Caused by: java.lang.IllegalArgumentException: *Unrecognized Hadoop
>>>> major version number: 3.1.0*
>>>> at
>>>> org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite.setUp(JavaMetastoreDataSourcesSuite.java:66)
>>>>
>>>

-- 
---
Takeshi Yamamuro


Re: Block a user from spark-website who repeatedly open the invalid same PR

2020-01-26 Thread Takeshi Yamamuro
+1

Bests,
Takeshi

On Sun, Jan 26, 2020 at 3:05 PM Hyukjin Kwon  wrote:

> Hi all,
>
> I am thinking about opening an infra ticket to block the @DataWanderer
> <https://github.com/DataWanderer> user from the spark-website
> repository, who repeatedly opens the same invalid PR.
>
> The PR is about fixing documentation in the released version 2.4.4, and it
> should be fixed in the spark
> repository. This was explained multiple times by me and Sean, but this user
> opens the same PR
> repeatedly, which brings overhead to the dev community.
>
> See the PRs below:
>
> https://github.com/apache/spark-website/pull/257
> https://github.com/apache/spark-website/pull/256
> https://github.com/apache/spark-website/pull/255
> https://github.com/apache/spark-website/pull/254
> https://github.com/apache/spark-website/pull/250
> https://github.com/apache/spark-website/pull/249
>
> If there is no objection, and this guy opens the PR again, I am going to
> open an infra ticket to block
> this guy from the spark-website repo to prevent such behaviours.
>
> Please let me know if you guys have any concerns.
>
>

-- 
---
Takeshi Yamamuro


Re: More publicly documenting the options under spark.sql.*

2020-01-16 Thread Takeshi Yamamuro
The idea looks nice. I think web documents always help end users.

Bests,
Takeshi

On Fri, Jan 17, 2020 at 4:04 AM Shixiong(Ryan) Zhu 
wrote:

> "spark.sql("set -v")" returns a Dataset that has all non-internal SQL
> configurations. Should be pretty easy to automatically generate a SQL
> configuration page.
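As an illustrative aside (not from the quoted message), a minimal sketch of such generation, assuming `SET -v` keeps its three key/value/meaning columns:

  // Run in spark-shell; renders the non-internal SQL configs as a Markdown table.
  val rows = spark.sql("SET -v").collect()
  val header = "| Key | Default | Meaning |\n| --- | --- | --- |"
  val body = rows.map(r => s"| ${r.getString(0)} | ${r.getString(1)} | ${r.getString(2)} |")
  println((header +: body).mkString("\n"))
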
>
> Best Regards,
> Ryan
>
>
> On Wed, Jan 15, 2020 at 5:47 AM Hyukjin Kwon  wrote:
>
>> I think automatically creating a configuration page isn't a bad idea
>> because I think we deprecate and remove configurations which are not
>> created via .internal() in SQLConf anyway.
>>
>> I already tried this automatic generation from the code for SQL built-in
>> functions, and I'm pretty sure we can do a similar thing for
>> configurations as well.
>>
>> We could perhaps mimic what hadoop does
>> https://hadoop.apache.org/docs/r2.8.0/hadoop-project-dist/hadoop-common/core-default.xml
>>
>> On Wed, 15 Jan 2020, 10:46 Sean Owen,  wrote:
>>
>>> Some of it is intentionally undocumented, as far as I know, as an
>>> experimental option that may change, or legacy, or safety valve flag.
>>> Certainly anything that's marked an internal conf. (That does raise
>>> the question of who it's for, if you have to read source to find it.)
>>>
>>> I don't know if we need to overhaul the conf system, but there may
>>> indeed be some confs that could legitimately be documented. I don't
>>> know which.
>>>
>>> On Tue, Jan 14, 2020 at 7:32 PM Nicholas Chammas
>>>  wrote:
>>> >
>>> > I filed SPARK-30510 thinking that we had forgotten to document an
>>> option, but it turns out that there's a whole bunch of stuff under
>>> SQLConf.scala that has no public documentation under
>>> http://spark.apache.org/docs.
>>> >
>>> > Would it be appropriate to somehow automatically generate a
>>> documentation page from SQLConf.scala, as Hyukjin suggested on that ticket?
>>> >
>>> > Another thought that comes to mind is moving the config definitions
>>> out of Scala and into a data format like YAML or JSON, and then sourcing
>>> that both for SQLConf as well as for whatever documentation page we want to
>>> generate. What do you think of that idea?
>>> >
>>> > Nick
>>> >
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>

-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Apache Spark 2.4.5 (RC1)

2020-01-14 Thread Takeshi Yamamuro
s if a
> majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
> > >> >
> > >> > [ ] +1 Release this package as Apache Spark 2.4.5
> > >> > [ ] -1 Do not release this package because ...
> > >> >
> > >> > To learn more about Apache Spark, please see
> http://spark.apache.org/
> > >> >
> > >> > The tag to be voted on is v2.4.5-rc1 (commit
> 33bd2beee5e3772a9af1d782f195e6a678c54cf0):
> > >> > https://github.com/apache/spark/tree/v2.4.5-rc1
> > >> >
> > >> > The release files, including signatures, digests, etc. can be found
> at:
> > >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.5-rc1-bin/
> > >> >
> > >> > Signatures used for Spark RCs can be found in this file:
> > >> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> > >> >
> > >> > The staging repository for this release can be found at:
> > >> >
> https://repository.apache.org/content/repositories/orgapachespark-1339/
> > >> >
> > >> > The documentation corresponding to this release can be found at:
> > >> > https://dist.apache.org/repos/dist/dev/spark/v2.4.5-rc1-docs/
> > >> >
> > >> > The list of bug fixes going into 2.4.5 can be found at the
> following URL:
> > >> > https://issues.apache.org/jira/projects/SPARK/versions/12346042
> > >> >
> > >> > This release is using the release script of the tag v2.4.5-rc1.
> > >> >
> > >> > FAQ
> > >> >
> > >> > =
> > >> > How can I help test this release?
> > >> > =
> > >> >
> > >> > If you are a Spark user, you can help us test this release by taking
> > >> > an existing Spark workload and running on this release candidate,
> then
> > >> > reporting any regressions.
> > >> >
> > >> > If you're working in PySpark you can set up a virtual env and
> install
> > >> > the current RC and see if anything important breaks, in the
> Java/Scala
> > >> > you can add the staging repository to your projects resolvers and
> test
> > >> > with the RC (make sure to clean up the artifact cache before/after
> so
> >> > you don't end up building with an out-of-date RC going forward).
> > >> >
> > >> > ===
> > >> > What should happen to JIRA tickets still targeting 2.4.5?
> > >> > ===
> > >> >
> > >> > The current list of open tickets targeted at 2.4.5 can be found at:
> > >> > https://issues.apache.org/jira/projects/SPARK and search for
> "Target Version/s" = 2.4.5
> > >> >
> > >> > Committers should look at those and triage. Extremely important bug
> > >> > fixes, documentation, and API tweaks that impact compatibility
> should
> > >> > be worked on immediately. Everything else please retarget to an
> > >> > appropriate release.
> > >> >
> > >> > ==
> > >> > But my bug isn't fixed?
> > >> > ==
> > >> >
> > >> > In order to make timely releases, we will typically not hold the
> > >> > release unless the bug in question is a regression from the previous
> > >> > release. That being said, if there is something which is a
> regression
> > >> > that has not been correctly targeted please ping me or a committer
> to
> > >> > help target the issue.
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: Release Apache Spark 2.4.5

2020-01-07 Thread Takeshi Yamamuro
+1, sorry for the late response... :(
Anyway, happy new year, all!

Bests,
Takeshi

On Tue, Jan 7, 2020 at 2:50 AM Dongjoon Hyun 
wrote:

> Thank you all.
>
> I'll start to check and prepare the 2.4.5 release.
>
> Bests,
> Dongjoon.
>
> On Sun, Jan 5, 2020 at 22:51 Xiao Li  wrote:
>
>> +1
>>
>> Xiao
>>
>> On Sun, Jan 5, 2020 at 9:50 PM Holden Karau  wrote:
>>
>>> +1
>>>
>>> On Sun, Jan 5, 2020 at 9:40 PM Wenchen Fan  wrote:
>>>
>>>> +1
>>>>
>>>> On Mon, Jan 6, 2020 at 12:02 PM Jungtaek Lim <
>>>> kabhwan.opensou...@gmail.com> wrote:
>>>>
>>>>> +1 to have another Spark 2.4 release, as Spark 2.4.4 was released in 4
>>>>> months old and there's release window for this.
>>>>>
>>>>> On Mon, Jan 6, 2020 at 12:38 PM Hyukjin Kwon 
>>>>> wrote:
>>>>>
>>>>>> Yeah, I think it's nice to have another maintenance release given
>>>>>> Spark 3.0 timeline.
>>>>>>
>>>>>> 2020년 1월 6일 (월) 오전 7:58, Dongjoon Hyun 님이
>>>>>> 작성:
>>>>>>
>>>>>>> Hi, All.
>>>>>>>
>>>>>>> Happy New Year (2020)!
>>>>>>>
>>>>>>> Although we slightly missed the timeline for 3.0 branch cut last
>>>>>>> month,
>>>>>>> it seems that we are keeping the 2.4.x timeline on track.
>>>>>>>
>>>>>>> https://spark.apache.org/versioning-policy.html
>>>>>>>
>>>>>>> As of today, `branch-2.4` has 154 patches since v2.4.4.
>>>>>>>
>>>>>>> $ git log --oneline v2.4.4..HEAD | wc -l
>>>>>>> 154
>>>>>>>
>>>>>>> Shall we start to vote on Apache Spark 2.4.5 next week (around January
>>>>>>> 13th, Monday)?
>>>>>>> It would be great if there is a fresh new release manager volunteer,
>>>>>>> but I can also do the release if people are busy.
>>>>>>>
>>>>>>> What do you think about the 2.4.5 release?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dongjoon.
>>>>>>>
>>>>>> --
>>> Twitter: https://twitter.com/holdenkarau
>>> Books (Learning Spark, High Performance Spark, etc.):
>>> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>
>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> <https://databricks.com/sparkaisummit/north-america>
>>
>

-- 
---
Takeshi Yamamuro


Re: [ANNOUNCE] Announcing Apache Spark 3.0.0-preview2

2019-12-24 Thread Takeshi Yamamuro
Great work, Yuming!

Bests,
Takeshi

On Wed, Dec 25, 2019 at 6:00 AM Xiao Li  wrote:

> Thank you all. Happy Holidays!
>
> Xiao
>
> On Tue, Dec 24, 2019 at 12:53 PM Yuming Wang  wrote:
>
>> Hi all,
>>
>> To enable wide-scale community testing of the upcoming Spark 3.0 release,
>> the Apache Spark community has posted a new preview release of Spark 3.0.
>> This preview is *not a stable release in terms of either API or
>> functionality*, but it is meant to give the community early access to
>> try the code that will become Spark 3.0. If you would like to test the
>> release, please download it, and send feedback using either the mailing
>> lists <https://spark.apache.org/community.html> or JIRA
>> <https://issues.apache.org/jira/projects/SPARK?selectedItem=com.atlassian.jira.jira-projects-plugin%3Asummary-page>
>> .
>>
>> There are a lot of exciting new features added to Spark 3.0, including
>> Dynamic Partition Pruning, Adaptive Query Execution, Accelerator-aware
>> Scheduling, Data Source API with Catalog Supports, Vectorization in SparkR,
>> support of Hadoop 3/JDK 11/Scala 2.12, and many more. For a full list of
>> major features and changes in Spark 3.0.0-preview2, please check the thread(
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-3-0-preview-release-feature-list-and-major-changes-td28050.html
>>  and
>> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-3-0-preview-release-2-td28491.html
>> ).
>>
>> We'd like to thank our contributors and users for their contributions and
>> early feedback to this release. This release would not have been possible
>> without you.
>>
>> To download Spark 3.0.0-preview2, head over to the download page:
>> https://archive.apache.org/dist/spark/spark-3.0.0-preview2
>>
>> Happy Holidays.
>>
>> Yuming
>>
>
>
> --
> [image: Databricks Summit - Watch the talks]
> <https://databricks.com/sparkaisummit/north-america>
>


-- 
---
Takeshi Yamamuro


Re: Spark 3.0 branch cut and code freeze on Jan 31?

2019-12-24 Thread Takeshi Yamamuro
lerBackend
>>> SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
>>> SPARK-25186 Stabilize Data Source V2 API
>>> SPARK-25376 Scenarios we should handle but missed in 2.4 for barrier
>>> execution mode
>>> SPARK-7768 Make user-defined type (UDT) API public
>>> SPARK-14922 Alter Table Drop Partition Using Predicate-based Partition
>>> Spec
>>> SPARK-15694 Implement ScriptTransformation in sql/core
>>> SPARK-18134 SQL: MapType in Group BY and Joins not working
>>> SPARK-19842 Informational Referential Integrity Constraints Support in
>>> Spark
>>> SPARK-22231 Support of map, filter, withColumn, dropColumn in nested
>>> list of structures
>>> SPARK-22386 Data Source V2 improvements
>>> SPARK-24723 Discuss necessary info and access in barrier mode + YARN
>>>
>>>
>>> On Mon, Dec 23, 2019 at 5:48 PM Reynold Xin  wrote:
>>>
>>>> We've pushed out 3.0 multiple times. The latest release window
>>>> documented on the website
>>>> <http://spark.apache.org/versioning-policy.html> says we'd code freeze
>>>> and cut branch-3.0 early Dec. It looks like we are suffering a bit from the
>>>> tragedy of the commons, that nobody is pushing for getting the release out.
>>>> I understand the natural tendency for each individual is to finish or
>>>> extend the feature/bug that the person has been working on. At some point
>>>> we need to say "this is it" and get the release out. I'm happy to help
>>>> drive this process.
>>>>
>>>> To be realistic, I don't think we should just code freeze *today*.
>>>> Although we have updated the website, contributors have all been operating
>>>> under the assumption that all active developments are still going on. I
>>>> propose we *cut the branch on **Jan 31**, and code freeze and switch
>>>> over to bug squashing mode, and try to get the 3.0 official release out in
>>>> Q1*. That is, by default no new features can go into the branch
>>>> starting Jan 31.
>>>>
>>>> What do you think?
>>>>
>>>> And happy holidays everybody.
>>>>
>>>>
>>>>
>>>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> <https://databricks.com/sparkaisummit/north-america>
>>
>

-- 
---
Takeshi Yamamuro


Re: [VOTE][RESULT] SPARK 3.0.0-preview2 (RC2)

2019-12-22 Thread Takeshi Yamamuro
Thanks for the work, Yuming, and Happy Holiday, all!

Bests,
Takeshi

On Mon, Dec 23, 2019 at 7:54 AM Xiao Li  wrote:

> This is the fastest release! Thank you all for making this happen.
>
> Happy Holiday!
>
> Xiao
>
> On Sun, Dec 22, 2019 at 10:58 AM Dongjoon Hyun 
> wrote:
>
>> Thank you all. Especially, Yuming as a release manager!
>> Happy Holidays!
>>
>> Cheers,
>> Dongjoon.
>>
>>
>> On Sun, Dec 22, 2019 at 12:51 AM Yuming Wang  wrote:
>>
>>> Hi, All.
>>>
>>> The vote passes. Thanks to all who helped with this release
>>> 3.0.0-preview2!
>>> I'll follow up later with a release announcement once everything is
>>> published.
>>>
>>> +1 (* = binding):
>>> - Sean Owen *
>>> - Dongjoon Hyun *
>>> - Takeshi Yamamuro *
>>> - Wenchen Fan *
>>>
>>> +0: None
>>>
>>> -1: None
>>>
>>>
>>>
>>>
>>> Regards,
>>> Yuming
>>>
>>
>
> --
> [image: Databricks Summit - Watch the talks]
> <https://databricks.com/sparkaisummit/north-america>
>


-- 
---
Takeshi Yamamuro


Re: [VOTE] SPARK 3.0.0-preview2 (RC2)

2019-12-18 Thread Takeshi Yamamuro
Thanks, Yuming!

I checked the links and the prepared binaries.
Also, I ran tests with -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver
-Pmesos -Pkubernetes -Psparkr
on java version "1.8.0_181".
All the things above look fine.

Bests,
Takeshi

On Thu, Dec 19, 2019 at 6:31 AM Dongjoon Hyun 
wrote:

> +1
>
> I also check the signatures and docs. And, built and tested with JDK
> 11.0.5, Hadoop 3.2, Hive 2.3.
> In addition, the newly added
> `spark-3.0.0-preview2-bin-hadoop2.7-hive1.2.tgz` distribution looks correct.
>
> Thank you Yuming and all.
>
> Bests,
> Dongjoon.
>
>
> On Tue, Dec 17, 2019 at 4:11 PM Sean Owen  wrote:
>
>> Same result as last time. It all looks good and tests pass for me on
>> Ubuntu with all profiles enables (Hadoop 3.2 + Hive 2.3), building
>> from source.
>> 'pyspark-3.0.0.dev2.tar.gz' appears to be the desired python artifact
>> name, yes.
>> +1
>>
>> On Tue, Dec 17, 2019 at 12:36 AM Yuming Wang  wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 3.0.0-preview2.
>> >
>> > The vote is open until December 20 PST and passes if a majority +1 PMC
>> votes are cast, with
>> > a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.0.0-preview2
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see http://spark.apache.org/
>> >
>> > The tag to be voted on is v3.0.0-preview2-rc2 (commit
>> bcadd5c3096109878fe26fb0d57a9b7d6fdaa257):
>> > https://github.com/apache/spark/tree/v3.0.0-preview2-rc2
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview2-rc2-bin/
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1338/
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview2-rc2-docs/
>> >
>> > The list of bug fixes going into 3.0.0 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> >
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks, in the Java/Scala
>> > you can add the staging repository to your projects resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out of date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.0.0?
>> > ===
>> >
>> > The current list of open tickets targeted at 3.0.0 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.0.0
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> >
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
---
Takeshi Yamamuro


Re: I would like to add JDBCDialect to support Vertica database

2019-12-11 Thread Takeshi Yamamuro
>>
>>
>> On Wed, Dec 11, 2019 at 6:52 AM Sean Owen  wrote:
>>
>> It's probably OK, IMHO. The overhead of another dialect is small. Are
>> there differences that require a new dialect? I assume so, and it might
>> just be useful to summarize them if you open a PR.
>>
>> On Tue, Dec 10, 2019 at 7:14 AM Bryan Herger
>>  wrote:
>> >
>> > Hi, I am a Vertica support engineer, and we have open support requests
>> around NULL values and SQL type conversion with DataFrame read/write over
>> JDBC when connecting to a Vertica database.  The stack traces point to
>> issues with the generic JDBCDialect in Spark-SQL.
>> >
>> > I saw that other vendors (Teradata, DB2...) have contributed a
>> JDBCDialect class to address JDBC compatibility, so I wrote up a dialect
>> for Vertica.
>> >
>> > The changeset is on my fork of apache/spark at
>> https://github.com/bryanherger/spark/commit/84d3014e4ead18146147cf299e8996c5c56b377d
>> >
>> > I have tested this against Vertica 9.3 and found that this changeset
>> addresses both issues reported to us (issue with NULL values - setNull() -
>> for valid java.sql.Types, and String to VARCHAR conversion)
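As an illustrative aside (not the actual changeset), a hypothetical skeleton of what such a dialect can look like with Spark's public JdbcDialect API; the URL prefix and the VARCHAR length below are assumptions for illustration.

  import java.sql.Types
  import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
  import org.apache.spark.sql.types.{DataType, StringType}

  case object VerticaDialectSketch extends JdbcDialect {
    override def canHandle(url: String): Boolean = url.startsWith("jdbc:vertica")

    // Map Catalyst StringType to VARCHAR instead of the generic TEXT used by the default dialect.
    override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
      case StringType => Some(JdbcType("VARCHAR(65000)", Types.VARCHAR))
      case _          => None
    }
  }

  // Registering makes Spark pick this dialect for matching JDBC URLs.
  JdbcDialects.registerDialect(VerticaDialectSketch)
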
>> >
>> > Is this an acceptable change?  If so, how should I go about submitting a
>> pull request?
>> >
>> > Thanks, Bryan Herger
>> > Vertica Solution Engineer
>> >
>> >
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>> --
>>
>> [image: Databricks Summit - Watch the talks]
>> <https://databricks.com/sparkaisummit/north-america>
>>
>>
>>
>>
>> --
>>
>> [image: Databricks Summit - Watch the talks]
>> <https://databricks.com/sparkaisummit/north-america>
>>
>>

-- 
---
Takeshi Yamamuro


Re: Release Apache Spark 2.4.5 and 2.4.6

2019-12-10 Thread Takeshi Yamamuro
That looks nice, thanks!
I checked the previous v2.4.4 release; it has around 130 commits (from
2.4.3 to 2.4.4), so
I think branch-2.4 already has enough commits for the next release.

A commit list from 2.4.3 to 2.4.4;
https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8064c4a8ee9d182a...7955b3962ac46b89564e0613db7bea98a1478bf2

Bests,
Takeshi

On Tue, Dec 10, 2019 at 3:32 AM Sean Owen  wrote:

> Sure, seems fine. The release cadence slows down in a branch over time
> as there is probably less to fix, so Jan-Feb 2020 for 2.4.5 and
> something like middle or Q3 2020 for 2.4.6 is a reasonable
> expectation. It might plausibly be the last 2.4.x release but who
> knows.
>
> On Mon, Dec 9, 2019 at 12:29 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > Along with the discussion on 3.0.0, I'd like to discuss the next
> releases on `branch-2.4`.
> >
> > As we know, `branch-2.4` is our LTS branch, and there are also some
> questions about the release plans. More releases are important not only for
> the latest K8s version support, but also for delivering important bug fixes
> regularly (at least until 3.x becomes dominant).
> >
> > In short, I'd like to propose the followings.
> >
> > 1. Apache Spark 2.4.5 release (2020 January)
> > 2. Apache Spark 2.4.6 release (2020 July)
> >
> > Of course, we can adjust the schedule.
> > This aims to have a pre-defined cadence in order to give release
> managers time to prepare.
> >
> > Bests,
> > Dongjoon.
> >
> > PS. As of now, `branch-2.4` has 135 additional patches after `2.4.4`.
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: Spark 3.0 preview release 2?

2019-12-09 Thread Takeshi Yamamuro
+1; another preview looks great in terms of collecting user feedback, if we can do it.

Bests,
Takeshi

On Tue, Dec 10, 2019 at 3:14 AM Dongjoon Hyun 
wrote:

> Thank you, All.
>
> +1 for another `3.0-preview`.
>
> Also, thank you Yuming for volunteering for that!
>
> Bests,
> Dongjoon.
>
>
> On Mon, Dec 9, 2019 at 9:39 AM Xiao Li  wrote:
>
>> When entering the official release candidates, the new features have to
>> be disabled or even reverted [if the conf is not available] if the fixes
>> are not trivial; otherwise, we might need 10+ RCs to make the final
>> release. The new features should not block the release based on the
>> previous discussions.
>>
>> I agree we should have code freeze at the beginning of 2020. The preview
>> releases should not block the official releases. The preview is just to
>> collect more feedback about these new features or behavior changes.
>>
>> Also, for the release of Spark 3.0, we still need the Hive community to
>> do us a favor and release 2.3.7 to pick up HIVE-22190
>> <https://issues.apache.org/jira/browse/HIVE-22190>. Before asking the Hive
>> community to do a 2.3.7 release, if possible, we want our Spark community to
>> try it out more, especially the support for JDK 11 on Hadoop 2.7 and 3.2,
>> which is based on the Hive 2.3 execution JAR. During the preview stage, we
>> might find more issues that are not covered by our test cases.
>>
>>
>>
>> On Mon, Dec 9, 2019 at 4:55 AM Sean Owen  wrote:
>>
>>> Seems fine to me of course. Honestly that wouldn't be a bad result for
>>> a release candidate, though we would probably roll another one now.
>>> How about simply moving to a release candidate? If not now then at
>>> least move to code freeze from the start of 2020. There is also some
>>> downside in pushing out the 3.0 release further with previews.
>>>
>>> On Mon, Dec 9, 2019 at 12:32 AM Xiao Li  wrote:
>>> >
>>> > I got a lot of great feedback from the community about the recent 3.0
>>> preview release. Since the last 3.0 preview release, we already have 353
>>> commits [https://github.com/apache/spark/compare/v3.0.0-preview...master].
>>> There are various important features and behavior changes we want the
>>> community to try before entering the official release candidates of Spark
>>> 3.0.
>>> >
>>> >
>>> > Below is my selected items that are not part of the last 3.0 preview
>>> but already available in the upstream master branch:
>>> >
>>> > Support JDK 11 with Hadoop 2.7
>>> > Spark SQL will respect its own default format (i.e., parquet) when
>>> users do CREATE TABLE without USING or STORED AS clauses
>>> > Enable Parquet nested schema pruning and nested pruning on expressions
>>> by default
>>> > Add observable Metrics for Streaming queries
>>> > Column pruning through nondeterministic expressions
>>> > RecordBinaryComparator should check endianness when compared by long
>>> > Improve parallelism for local shuffle reader in adaptive query
>>> execution
>>> > Upgrade Apache Arrow to version 0.15.1
>>> > Various interval-related SQL support
>>> > Add a mode to pin Python thread into JVM's
>>> > Provide option to clean up completed files in streaming query
>>> >
>>> > I am wondering if we can have another preview release for Spark 3.0?
>>> This can help us find design/API defects as early as possible and avoid
>>> significantly delaying the upcoming Spark 3.0 release.
>>> >
>>> >
>>> > Also, any committer is willing to volunteer as the release manager of
>>> the next preview release of Spark 3.0, if we have such a release?
>>> >
>>> >
>>> > Cheers,
>>> >
>>> >
>>> > Xiao
>>>
>>> -
>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> <https://databricks.com/sparkaisummit/north-america>
>>
>

-- 
---
Takeshi Yamamuro


Re: FYI: SPARK-30098 Use default datasource as provider for CREATE TABLE syntax

2019-12-06 Thread Takeshi Yamamuro
Oh, looks nice. Thanks for sharing, Dongjoon!

Bests,
Takeshi

On Sat, Dec 7, 2019 at 3:35 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> I want to share the following change to the community.
>
> SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
>
> This was merged today, and now Spark's `CREATE TABLE` uses Spark's
> default data source instead of the `hive` provider. This is a good and big
> improvement for Apache Spark 3.0, but it might surprise someone. (Please
> note that there is a fallback option for them.)
>
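As an illustrative aside (not from the quoted message), a quick way to see the change on a master build that includes SPARK-30098; the exact DESCRIBE output layout may differ.

  // A bare CREATE TABLE now creates a native datasource (parquet) table, not a Hive serde table.
  spark.sql("CREATE TABLE t(i INT)")
  spark.sql("DESCRIBE TABLE EXTENDED t").show(100, truncate = false)  // check the Provider row
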
> Thank you, Yi, Wenchen, Xiao.
>
> Cheers,
> Dongjoon.
>


-- 
---
Takeshi Yamamuro


Re: [DISCUSS] PostgreSQL dialect

2019-11-26 Thread Takeshi Yamamuro
e start an effort to achieve feature parity between Spark and
>>> PostgreSQL: https://issues.apache.org/jira/browse/SPARK-27764
>>>
>>> This is going very well. We've added many missing features (parser rules,
>>> built-in functions, etc.) to Spark, and also corrected several
>>> inappropriate behaviors of Spark to follow the SQL standard and PostgreSQL.
>>> Many thanks to all the people who contributed to it!
>>>
>>> There are several cases when adding a PostgreSQL feature:
>>> 1. Spark doesn't have this feature: just add it.
>>> 2. Spark has this feature, but the behavior is different:
>>> 2.1 Spark's behavior doesn't make sense: change it to follow SQL
>>> standard and PostgreSQL, with a legacy config to restore the behavior.
>>> 2.2 Spark's behavior makes sense but violates SQL standard: change
>>> the behavior to follow SQL standard and PostgreSQL, when the ansi mode is
>>> enabled (default false).
>>> 2.3 Spark's behavior makes sense and doesn't violate SQL standard:
>>> adds the PostgreSQL behavior under the PostgreSQL dialect (default is Spark
>>> native dialect).
>>>
>>> The PostgreSQL dialect itself is a good idea. It can help users to
>>> migrate PostgreSQL workloads to Spark. Other databases have this strategy
>>> too. For example, DB2 provides an oracle dialect
>>> <https://www.ibm.com/developerworks/data/library/techarticle/dm-0907oracleappsondb2/index.html>
>>> .
>>>
>>> However, there are so many differences between Spark and PostgreSQL,
>>> including SQL parsing, type coercion, function/operator behavior, data
>>> types, etc. I'm afraid that we may spend a lot of effort on it, and make
>>> the Spark codebase pretty complicated, but still not be able to provide a
>>> usable PostgreSQL dialect.
>>>
>>> Furthermore, it's not clear to me how many users have the requirement of
>>> migrating PostgreSQL workloads. I think it's much more important to make
>>> Spark ANSI-compliant first, which doesn't need that much of work.
>>>
>>> Recently I've seen multiple PRs adding PostgreSQL cast functions, while
>>> our own cast function is not ANSI-compliant yet. This makes me think that,
>>> we should do something to properly prioritize ANSI mode over other dialects.
>>>
>>> Here I'm proposing to hold off on the PostgreSQL dialect. Let's remove it
>>> from the codebase before it's too late. Currently we only have 3 features
>>> under the PostgreSQL dialect:
>>> 1. when casting string to boolean, `t`, `tr`, `tru`, `yes`, .. are also
>>> allowed as true strings.
>>> 2. `date - date` returns interval in Spark (SQL standard behavior), but
>>> returns int in PostgreSQL
>>> 3. `int / int` returns double in Spark, but returns int in PostgreSQL.
>>> (there is no standard)
>>>
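As an illustrative aside (not from the quoted message), the three behaviors above expressed as queries; expected results are as described in the list, depending on which dialect is active.

  spark.sql("SELECT CAST('tru' AS BOOLEAN)")               // true only under the PostgreSQL dialect
  spark.sql("SELECT DATE'2019-12-01' - DATE'2019-11-01'")  // Spark: interval; PostgreSQL: integer 30
  spark.sql("SELECT 3 / 2")                                // Spark: 1.5 (double); PostgreSQL: 1 (int)
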
>>> We should still add PostgreSQL features that Spark doesn't have, or
>>> Spark's behavior violates SQL standard. But for others, let's just update
>>> the answer files of PostgreSQL tests.
>>>
>>> Any comments are welcome!
>>>
>>> Thanks,
>>> Wenchen
>>>
>>> --
>>> Best regards,
>>> Maciej
>>>
>>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> <https://databricks.com/sparkaisummit/north-america>
>>
>

-- 
---
Takeshi Yamamuro


Re: [build system] jenkins wedged, needed a quick restart

2019-11-12 Thread Takeshi Yamamuro
thx as always, Shane!


On Wed, Nov 13, 2019 at 3:25 AM Shane Knapp  wrote:

> it's coming back up now.
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: Adding JIRA ID as the prefix for the test case name

2019-11-11 Thread Takeshi Yamamuro
+1 for having a consistent rule for test names.
This is a trivial problem, though; I think documenting this rule in the
contribution guide
could make the reviewer overhead a little smaller.

Bests,
Takeshi

On Tue, Nov 12, 2019 at 1:46 AM Hyukjin Kwon  wrote:

> Hi all,
>
> Maybe it's not a big deal, but it has brought some confusion from time to time
> into the Spark dev community. I think it's time to discuss when, and in which
> format, to add a JIRA ID as a prefix for test case names in Scala test
> cases.
>
> Currently we have many test case names with prefixes as below:
>
>- test("SPARK-X blah blah")
>- test("SPARK-X: blah blah")
>- test("SPARK-X - blah blah")
>- test("[SPARK-X] blah blah")
>- …
>
> It is a good practice to have the JIRA ID in general because, for instance,
> it takes less effort to track commit histories (even when the
> files
> are moved entirely), or to track information related to failed tests.
> Considering Spark is getting big, I think it's good to document this.
>
> I would like to suggest this and document it in our guideline:
>
> 1. Add a prefix to a test name when a PR adds a couple of tests.
> 2. Use the "SPARK-: test name" format, which is used in our code base most
>   often[1].
>
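For illustration only (a sketch, not from the quoted mail), the suggested format in a plain ScalaTest suite; SPARK-12345 is a made-up ticket number, and Spark's own suites would extend SparkFunSuite instead.

  import org.scalatest.funsuite.AnyFunSuite

  class ExampleSuite extends AnyFunSuite {
    // The "SPARK-12345:" prefix ties the regression test back to its JIRA ticket.
    test("SPARK-12345: empty input should not throw") {
      assert(Seq.empty[Int].sum === 0)
    }
  }
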
> We should make it simple and clear but closer to the actual practice. So,
> I would like to listen to what other people think. I would appreciate it if
> you guys give some feedback about when to add the JIRA prefix. One
> alternative is that we only add the prefix when the JIRA's type is bug.
>
> [1]
> git grep -E 'test\("\SPARK-([0-9]+):' | wc -l
>  923
> git grep -E 'test\("\SPARK-([0-9]+) ' | wc -l
>  477
> git grep -E 'test\("\[SPARK-([0-9]+)\]' | wc -l
>   16
> git grep -E 'test\("\SPARK-([0-9]+) -' | wc -l
>   13
>
>
>
>

-- 
---
Takeshi Yamamuro


Re: [VOTE] SPARK 3.0.0-preview (RC2)

2019-11-01 Thread Takeshi Yamamuro
+1, too.

On Sat, Nov 2, 2019 at 3:36 AM Hyukjin Kwon  wrote:

> +1
>
> On Fri, 1 Nov 2019, 15:36 Wenchen Fan,  wrote:
>
>> The PR builder uses Hadoop 2.7 profile, which makes me think that 2.7 is
>> more stable and we should make releases using 2.7 by default.
>>
>> +1
>>
>> On Fri, Nov 1, 2019 at 7:16 AM Xiao Li  wrote:
>>
>>> Spark 3.0 will still use the Hadoop 2.7 profile by default, I think.
>>> Hadoop 2.7 profile is much more stable than Hadoop 3.2 profile.
>>>
>>> On Thu, Oct 31, 2019 at 3:54 PM Sean Owen  wrote:
>>>
>>>> This isn't a big thing, but I see that the pyspark build includes
>>>> Hadoop 2.7 rather than 3.2. Maybe later we change the build to put in
>>>> 3.2 by default.
>>>>
>>>> Otherwise, the tests all seems to pass with JDK 8 / 11 with all
>>>> profiles enabled, so I'm +1 on it.
>>>>
>>>>
>>>> On Thu, Oct 31, 2019 at 1:00 AM Xingbo Jiang 
>>>> wrote:
>>>> >
>>>> > Please vote on releasing the following candidate as Apache Spark
>>>> version 3.0.0-preview.
>>>> >
>>>> > The vote is open until November 3 PST and passes if a majority +1 PMC
>>>> votes are cast, with
>>>> > a minimum of 3 +1 votes.
>>>> >
>>>> > [ ] +1 Release this package as Apache Spark 3.0.0-preview
>>>> > [ ] -1 Do not release this package because ...
>>>> >
>>>> > To learn more about Apache Spark, please see http://spark.apache.org/
>>>> >
>>>> > The tag to be voted on is v3.0.0-preview-rc2 (commit
>>>> 007c873ae34f58651481ccba30e8e2ba38a692c4):
>>>> > https://github.com/apache/spark/tree/v3.0.0-preview-rc2
>>>> >
>>>> > The release files, including signatures, digests, etc. can be found
>>>> at:
>>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-bin/
>>>> >
>>>> > Signatures used for Spark RCs can be found in this file:
>>>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>>>> >
>>>> > The staging repository for this release can be found at:
>>>> >
>>>> https://repository.apache.org/content/repositories/orgapachespark-1336/
>>>> >
>>>> > The documentation corresponding to this release can be found at:
>>>> > https://dist.apache.org/repos/dist/dev/spark/v3.0.0-preview-rc2-docs/
>>>> >
>>>> > The list of bug fixes going into 3.0.0 can be found at the following
>>>> URL:
>>>> > https://issues.apache.org/jira/projects/SPARK/versions/12339177
>>>> >
>>>> > FAQ
>>>> >
>>>> > =
>>>> > How can I help test this release?
>>>> > =
>>>> >
>>>> > If you are a Spark user, you can help us test this release by taking
>>>> > an existing Spark workload and running on this release candidate, then
>>>> > reporting any regressions.
>>>> >
>>>> > If you're working in PySpark you can set up a virtual env and install
>>>> > the current RC and see if anything important breaks, in the Java/Scala
>>>> > you can add the staging repository to your projects resolvers and test
>>>> > with the RC (make sure to clean up the artifact cache before/after so
>>>> > you don't end up building with an out of date RC going forward).
>>>> >
>>>> > ===
>>>> > What should happen to JIRA tickets still targeting 3.0.0?
>>>> > ===
>>>> >
>>>> > The current list of open tickets targeted at 3.0.0 can be found at:
>>>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>>>> Version/s" = 3.0.0
>>>> >
>>>> > Committers should look at those and triage. Extremely important bug
>>>> > fixes, documentation, and API tweaks that impact compatibility should
>>>> > be worked on immediately.
>>>> >
>>>> > ==
>>>> > But my bug isn't fixed?
>>>> > ==
>>>> >
>>>> > In order to make timely releases, we will typically not hold the
>>>> > release unless the bug in question is a regression from the previous
>>>> > release. That being said, if there is something which is a regression
>>>> > that has not been correctly targeted please ping me or a committer to
>>>> > help target the issue.
>>>>
>>>> -
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>
>>>>
>>>
>>> --
>>> [image: Databricks Summit - Watch the talks]
>>> <https://databricks.com/sparkaisummit/north-america>
>>>
>>

-- 
---
Takeshi Yamamuro


Re: [DISCUSS] Deprecate Python < 3.6 in Spark 3.0

2019-10-29 Thread Takeshi Yamamuro
+1, too.

On Tue, Oct 29, 2019 at 4:16 PM Holden Karau  wrote:

> +1 to deprecating but not yet removing support for 3.6
>
> On Tue, Oct 29, 2019 at 3:47 AM Shane Knapp  wrote:
>
>> +1 to testing the absolute minimum number of python variants as
>> possible.  ;)
>>
>> On Mon, Oct 28, 2019 at 7:46 PM Hyukjin Kwon  wrote:
>>
>>> +1 from me as well.
>>>
>>> 2019년 10월 29일 (화) 오전 5:34, Xiangrui Meng 님이 작성:
>>>
>>>> +1. And we should start testing 3.7 and maybe 3.8 in Jenkins.
>>>>
>>>> On Thu, Oct 24, 2019 at 9:34 AM Dongjoon Hyun 
>>>> wrote:
>>>>
>>>>> Thank you for starting the thread.
>>>>>
>>>>> In addition to that, we currently are testing Python 3.6 only in
>>>>> Apache Spark Jenkins environment.
>>>>>
>>>>> Given that Python 3.8 is already out and Apache Spark 3.0.0 RC1 will
>>>>> start next January
>>>>> (https://spark.apache.org/versioning-policy.html), I'm +1 for the
>>>>> deprecation (Python < 3.6) at Apache Spark 3.0.0.
>>>>>
>>>>> It's just a deprecation to prepare the next-step development cycle.
>>>>>
>>>>> Bests,
>>>>> Dongjoon.
>>>>>
>>>>>
>>>>> On Thu, Oct 24, 2019 at 1:10 AM Maciej Szymkiewicz <
>>>>> mszymkiew...@gmail.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> While deprecation of Python 2 in 3.0.0 has been announced
>>>>>> <https://spark.apache.org/news/plan-for-dropping-python-2-support.html>,
>>>>>> there is no clear statement about continuing support for specific
>>>>>> Python 3 versions.
>>>>>>
>>>>>> Specifically:
>>>>>>
>>>>>>- Python 3.4 has been retired this year.
>>>>>>- Python 3.5 is already in the "security fixes only" mode and
>>>>>>should be retired in the middle of 2020.
>>>>>>
>>>>>> Continued support of these two blocks the adoption of many new Python
>>>>>> features (PEP 468), and it is hard to justify beyond 2020.
>>>>>>
>>>>>> Should these two be deprecated in 3.0.0 as well?
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Maciej
>>>>>>
>>>>>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  <https://amzn.to/2MaRAG9>
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>


-- 
---
Takeshi Yamamuro


Re: Packages to release in 3.0.0-preview

2019-10-25 Thread Takeshi Yamamuro
Thanks for that work!

> I don't think JDK 11 is a separate release (by design). We build
> everything targeting JDK 8 and it should work on JDK 11 too.
+1. A single package working on both JVMs looks nice.


On Sat, Oct 26, 2019 at 4:18 AM Sean Owen  wrote:

> I don't think JDK 11 is a separate release (by design). We build
> everything targeting JDK 8 and it should work on JDK 11 too.
>
> So, just two releases, but, frankly I think we soon need to stop
> multiple releases for multiple Hadoop versions, and stick to Hadoop 3.
> I think it's fine to try to release for Hadoop 2 as the support still
> exists, and because the difference happens to be larger due to the
> different Hive dependency.
>
> On Fri, Oct 25, 2019 at 2:08 PM Xingbo Jiang 
> wrote:
> >
> > Hi all,
> >
> > I would like to bring out a discussion on how many packages shall be
> released in 3.0.0-preview, the ones I can think of now:
> >
> > * scala 2.12 + hadoop 2.7
> > * scala 2.12 + hadoop 3.2
> > * scala 2.12 + hadoop 3.2 + JDK 11
> >
> > Do you have other combinations to add to the above list?
> >
> > Cheers,
> >
> > Xingbo
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: [build system] intermittent network issues + potential power shutoff over the weekend

2019-10-25 Thread Takeshi Yamamuro
ok, thanks, Shane!

Bests,
Takeshi

On Sat, Oct 26, 2019 at 7:01 AM Shane Knapp  wrote:

> > 2) the OTHER thing is that PG&E will be potentially cutting power to
> > campus again tomorrow evening and over the weekend.  i am unsure 1) if
> > this is happening, and 2) if our campus colo will go on to generator
> > backup power.
> >
> > if the colo does go on backup power, all of the workers will shut down
> > and power back on automatically when power is restored.
> >
> ok, it looks like the colo will have power until monday morning, and
> it will be shut down from 8am to noon to perform some maintenance.
>
> this means jenkins will be up all weekend, but down monday morning.
>
> shane
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: Minimum JDK8 version

2019-10-24 Thread Takeshi Yamamuro
> All versions of JDK8 are not the same naturally. For example, Hadoop
community also have the following document although they are not specifying
the minimum versions.
oh, I didn't know that. Thanks for the info and updating the doc!

Bests,
Takeshi

On Fri, Oct 25, 2019 at 12:26 PM Dongjoon Hyun 
wrote:

> Thank you. I created a PR for that. For now, the minimum requirement is
> 8u92 in that PR.
>
> https://github.com/apache/spark/pull/26249
>
> Bests,
> Dongjoon.
>
>
> On Thu, Oct 24, 2019 at 7:55 PM Sean Owen  wrote:
>
>> I think that's fine, personally. Anyone using JDK 8 should / probably
>> is on a recent release.
>>
>> On Thu, Oct 24, 2019 at 8:56 PM Dongjoon Hyun 
>> wrote:
>> >
>> > Thank you for reply, Sean, Shane, Takeshi.
>> >
>> > The reason is that there is a PR to aim to add
>> `-XX:OnOutOfMemoryError="kill -9 %p"` as a default behavior at 3.0.0.
>> > (Please note that the PR will add it by *default* always. There is no
>> way for user to remove it.)
>> >
>> > - [SPARK-27900][CORE][K8s] Add `spark.driver.killOnOOMError` flag
>> in cluster mode
>> > - https://github.com/apache/spark/pull/26161
>> >
>> > If we can deprecate old JDK8 versions, we are able to use JVM option
>> `ExitOnOutOfMemoryError` instead.
>> > (This is added at JDK 8u92. In my previous email, 8u82 was a typo.)
>> >
>> > -
>> https://www.oracle.com/technetwork/java/javase/8u92-relnotes-2949471.html
>> >
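As an illustrative aside (not from the quoted message, and with no claim about what the PR actually does), this is what pointing the JVMs at the newer flag could look like from the configuration side.

  import org.apache.spark.SparkConf

  // Illustrative only: -XX:+ExitOnOutOfMemoryError requires JDK 8u92 or later.
  val conf = new SparkConf()
    .set("spark.executor.extraJavaOptions", "-XX:+ExitOnOutOfMemoryError")
    .set("spark.driver.extraJavaOptions", "-XX:+ExitOnOutOfMemoryError")
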
>> > All versions of JDK8 are not the same naturally. For example, Hadoop
>> community also have the following document although they are not specifying
>> the minimum versions.
>> >
>> > -
>> https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions
>> >
>> > Bests,
>> > Dongjoon.
>> >
>> >
>> > On Thu, Oct 24, 2019 at 6:05 PM Takeshi Yamamuro 
>> wrote:
>> >>
>> >> Hi, Dongjoon
>> >>
>> >> It might be worth clearly describing which jdk versions we check in
>> the testing infra
>> >> in some documents, e.g.,
>> https://spark.apache.org/docs/latest/#downloading
>> >>
>> >> btw, is any other project announcing a minimum supported JDK version?
>> >> It seems that Hadoop does not.
>> >>
>> >> On Fri, Oct 25, 2019 at 6:51 AM Sean Owen  wrote:
>> >>>
>> >>> Probably, but what is the difference that makes it different to
>> >>> support u81 vs later?
>> >>>
>> >>> On Thu, Oct 24, 2019 at 4:39 PM Dongjoon Hyun <
>> dongjoon.h...@gmail.com> wrote:
>> >>> >
>> >>> > Hi, All.
>> >>> >
>> >>> > Apache Spark 3.x will support both JDK8 and JDK11.
>> >>> >
>> >>> > I'm wondering if we can have a minimum JDK8 version in Apache Spark
>> 3.0.
>> >>> >
>> >>> > Specifically, can we start to deprecate JDK8u81 and older at 3.0.
>> >>> >
>> >>> > Currently, Apache Spark testing infra are testing only with
>> jdk1.8.0_191 and above.
>> >>> >
>> >>> > Bests,
>> >>> > Dongjoon.
>> >>>
>> >>> -
>> >>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>>
>> >>
>> >>
>> >> --
>> >> ---
>> >> Takeshi Yamamuro
>>
>

-- 
---
Takeshi Yamamuro


Re: Minimum JDK8 version

2019-10-24 Thread Takeshi Yamamuro
Hi, Dongjoon

It might be worth clearly describing which jdk versions we check in the
testing infra
in some documents, e.g., https://spark.apache.org/docs/latest/#downloading

btw, is any other project announcing a minimum supported JDK version?
It seems that Hadoop does not.

On Fri, Oct 25, 2019 at 6:51 AM Sean Owen  wrote:

> Probably, but what is the difference between u81 and later releases
> that matters here?
>
> On Thu, Oct 24, 2019 at 4:39 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > Apache Spark 3.x will support both JDK8 and JDK11.
> >
> > I'm wondering if we can have a minimum JDK8 version in Apache Spark 3.0.
> >
> > Specifically, can we start to deprecate JDK8u81 and older at 3.0.
> >
> > Currently, the Apache Spark testing infra is testing only with jdk1.8.0_191
> and above.
> >
> > Bests,
> > Dongjoon.
>
> ---------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: [VOTE][SPARK-28885] Follow ANSI store assignment rules in table insertion by default

2019-10-10 Thread Takeshi Yamamuro
Thanks for the great work, Gengliang!

+1 for that.
As I said before, the behaviour is pretty common in DBMSs, so the change
helps DBMS users.

Bests,
Takeshi


On Mon, Oct 7, 2019 at 5:24 PM Gengliang Wang 
wrote:

> Hi everyone,
>
> I'd like to call for a new vote on SPARK-28885
> <https://issues.apache.org/jira/browse/SPARK-28885> "Follow ANSI store
> assignment rules in table insertion by default" after revising the ANSI
> store assignment policy(SPARK-29326
> <https://issues.apache.org/jira/browse/SPARK-29326>).
> When inserting a value into a column with a different data type, Spark
> performs type coercion. Currently, we support 3 policies for the store
> assignment rules: ANSI, legacy and strict, which can be set via the option
> "spark.sql.storeAssignmentPolicy":
> 1. ANSI: Spark performs the store assignment as per ANSI SQL. In practice,
> the behavior is mostly the same as PostgreSQL. It disallows certain
> unreasonable type conversions such as converting `string` to `int` and
> `double` to `boolean`. It will throw a runtime exception if the value is
> out of range (overflow).
> 2. Legacy: Spark allows the store assignment as long as it is a valid
> `Cast`, which is very loose. E.g., converting either `string` to `int` or
> `double` to `boolean` is allowed. It is the current behavior in Spark 2.x
> for compatibility with Hive. When inserting an out-of-range value into an
> integral field, the low-order bits of the value are inserted (the same as
> Java/Scala numeric type casting). For example, if 257 is inserted into a
> field of Byte type, the result is 1.
> 3. Strict: Spark doesn't allow any possible precision loss or data
> truncation in store assignment, e.g., converting either `double` to `int`
> or `decimal` to `double` is not allowed. The rules were originally designed
> for the Dataset encoder. As far as I know, no mainstream DBMS uses this policy by
> default.
>
> Currently, the V1 data source uses "Legacy" policy by default, while V2
> uses "Strict". This proposal is to use "ANSI" policy by default for both V1
> and V2 in Spark 3.0.
>
> This vote is open until Friday (Oct. 11).
>
> [ ] +1: Accept the proposal
> [ ] +0
> [ ] -1: I don't think this is a good idea because ...
>
> Thank you!
>
> Gengliang
>


-- 
---
Takeshi Yamamuro
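
To make the proposal concrete, here is a small Scala sketch of the option being
voted on. It assumes a running SparkSession named `spark` (as in spark-shell)
and uses the option name quoted above; which behaviour the final INSERT
triggers is exactly what the chosen policy decides.

  // Sketch: pick one of the three store assignment policies described above.
  spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")

  spark.sql("CREATE TABLE ints (i INT) USING parquet")

  // A string-to-int assignment: per the description above, this is rejected
  // under the ANSI and strict policies, but silently cast under the legacy one.
  spark.sql("INSERT INTO ints VALUES ('1')")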


Re: [build system] our colo is having power issues again. there will be a few 'events' this week

2019-09-24 Thread Takeshi Yamamuro
Shane, thanks for the hard work!

Bests,
Takeshi

On Wed, Sep 25, 2019 at 6:07 AM Jungtaek Lim  wrote:

> Awesome, thanks for the quick update!
>
> On Wed, Sep 25, 2019 at 6:04 AM Shane Knapp  wrote:
>
>> no worries.  since we deprecated the packaging builds i put that
>> worker back in to the rotation...  there was a slight env issue but
>> that's fixed and we should be g2g w/python.
>>
>> On Tue, Sep 24, 2019 at 1:50 PM Jungtaek Lim  wrote:
>> >
>> > Hi Shane,
>> >
>> > Thanks for the update, and take care of the build system!
>> >
>> > Looks like some of the builds just failed without test failures; looks
>> like an env issue.
>> >
>> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/111308/
>> >
>> > + ./dev/run-tests-jenkins
>> > Python versions prior to 2.7 are not supported.
>> >
>> >
>> > Could you please check on this?
>> >
>> > Thanks,
>> > Jungtaek Lim (HeartSaVioR)
>> >
>> >
>> > On Wed, Sep 25, 2019 at 2:48 AM Shane Knapp 
>> wrote:
>> >>
>> >> quick update from our colo admin:  they are going to keep the colo on
>> >> generator power until monday morning and not switch back and forth.
>> >> this is great as we'll have solid uptime until monday morning when we
>> >> go back to grid power.
>> >>
>> >> if this changes i will be certain to update everyone here.
>> >>
>> >> thanks,
>> >>
>> >> shane
>> >>
>> >> On Tue, Sep 24, 2019 at 9:50 AM Shane Knapp 
>> wrote:
>> >> >
>> >> > aand that was quick!  everything is back up and building
>> >> >
>> >> > On Tue, Sep 24, 2019 at 9:39 AM Shane Knapp 
>> wrote:
>> >> > >
>> >> > > power switchover is happening now.  more updates to come once
>> machines
>> >> > > come back up.
>> >> > >
>> >> > > On Mon, Sep 23, 2019 at 3:16 PM Shane Knapp 
>> wrote:
>> >> > > >
>> >> > > > the main transformer for our colo is experiencing major issues,
>> and
>> >> > > > campus will be performing emergency work on it starting tomorrow
>> morning
>> >> > > > (tuesday sept 24, 9am PDT).  it's pretty dire.  :(
>> >> > > >
>> >> > > > there's a lot going on, but please expect some sporadic jenkins
>> >> > > > downtime until monday.  here's the abbreviated list of when we
>> can
>> >> > > > expect things to happen:
>> >> > > >
>> >> > > > * tomorrow @ 9am we are switching to generator power.  there
>> will be
>> >> > > > temporary loss of power to the jenkins workers, and if any don't
>> come
>> >> > > > back after the switch i will need to head down there and manually
>> >> > > > power them back on.
>> >> > > >
>> >> > > > * thursday @ 9am, switch back to campus power.  again, i will
>> deal
>> >> > > > w/any workers that don't come back manually.
>> >> > > >
>> >> > > > * saturday @ 9am, back to generator power.  same deal as above.
>> >> > > >
>> >> > > > * saturday or sunday the repairs will be completed.  i will
>> monitor
>> >> > > > the situation, and deal w/any problem servers on my way in to
>> work on
>> >> > > > monday.
>> >> > > >
>> >> > > > i apologize for any inconvenience that this will cause...  and i
>> hope
>> >> > > > that this is the final piece of the power puzzle that our colo
>> team
>> >> > > > has to deal with.
>> >> > > >
>> >> > > > thanks in advance,
>> >> > > >
>> >> > > > shane
>> >> > > > --
>> >> > > > Shane Knapp
>> >> > > > UC Berkeley EECS Research / RISELab Staff Technical Lead
>> >> > > > https://rise.cs.berkeley.edu
>> >> > >
>> >> > >
>> >> > >
>> >> > > --
>> >> > > Shane Knapp
>> >> > > UC Berkeley EECS Research / RISELab Staff Technical Lead
>> >> > > https://rise.cs.berkeley.edu
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Shane Knapp
>> >> > UC Berkeley EECS Research / RISELab Staff Technical Lead
>> >> > https://rise.cs.berkeley.edu
>> >>
>> >>
>> >>
>> >> --
>> >> Shane Knapp
>> >> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> >> https://rise.cs.berkeley.edu
>> >>
>> >> -
>> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>> >>
>> >
>> >
>> > --
>> > Name : Jungtaek Lim
>> > Blog : http://medium.com/@heartsavior
>> > Twitter : http://twitter.com/heartsavior
>> > LinkedIn : http://www.linkedin.com/in/heartsavior
>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>


-- 
---
Takeshi Yamamuro


Re: Request for contributor permissions

2019-09-10 Thread Takeshi Yamamuro
Hi, Alaa

Thanks for reaching out!
You can file a JIRA without any special permissions.

btw, have you checked the contribution guide?
https://spark.apache.org/contributing.html
It's worth checking before you start contributing.

Bests,
Takeshi

On Wed, Sep 11, 2019 at 4:37 AM Alaa Zbair  wrote:

> Hello dev,
>
> I am interested in contributing to the Spark project. Please add me to the
> contributors list. My Jira username is: Chilio
>
> Thanks.
>
> Alaa Zbair.
>
>

-- 
---
Takeshi Yamamuro


Re: Welcoming some new committers and PMC members

2019-09-09 Thread Takeshi Yamamuro
Congrats, all!

On Tue, Sep 10, 2019 at 9:45 AM Shane Knapp  wrote:

> congrats everyone!  :)
>
> On Mon, Sep 9, 2019 at 5:32 PM Matei Zaharia 
> wrote:
> >
> > Hi all,
> >
> > The Spark PMC recently voted to add several new committers and one PMC
> member. Join me in welcoming them to their new roles!
> >
> > New PMC member: Dongjoon Hyun
> >
> > New committers: Ryan Blue, Liang-Chi Hsieh, Gengliang Wang, Yuming Wang,
> Weichen Xu, Ruifeng Zheng
> >
> > The new committers cover lots of important areas including ML, SQL, and
> data sources, so it’s great to have them here. All the best,
> >
> > Matei and the Spark PMC
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: Resolving all JIRAs affecting EOL releases

2019-09-07 Thread Takeshi Yamamuro
Thanks for the kind explanation.
I see and it looks ok to me.

On Sun, Sep 8, 2019 at 1:54 PM Hyukjin Kwon  wrote:

> Thanks for checking it.
>
> I think it's fine for the two reasons below:
>
> 1. It has another condition for such cases - a one-year time range.
>   Basically, such PRs have not been merged for one year, so I believe they
> are not likely to be merged soon.
>   The JIRA status will be updated when such PRs are merged anyway.
>
> 2. The JIRAs and PRs should ideally be kept up to date. If the PR authors forgot
> to update the affected versions,
> this could be a good reminder to update the affected versions in their JIRAs,
> which I believe is a good practice.
>
> FWIW, setting 'In Progress' currently doesn't work properly. It has been like
> that for a few months.
> I raised this issue several times at
> http://apache-spark-developers-list.1001551.n3.nabble.com/In-Apache-Spark-JIRA-spark-dev-github-jira-sync-py-not-running-properly-td27077.html
>  because
> it blocked me from searching JIRAs. I had to change my JQL to check JIRAs. It
> still hasn't been fixed. I don't know who to ask about this.
>
> If this is not going to be fixed, we might not have to care about 'In Progress'
> anymore.
>
>
> On Sun, Sep 8, 2019 at 1:31 PM Takeshi Yamamuro wrote:
>
>> Hi, Hyukjin,
>>
>> I checked the entries in the list and found that some of them have
>> 'In Progress' status and open PRs (e.g., SPARK-25211
>> <https://issues.apache.org/jira/browse/SPARK-25211>).
>> Should we also close these PRs as part of the bulk close?
>> (But we might need to check the corresponding PRs manually?)
>>
>> Bests,
>> Takeshi
>>
>>
>> On Sun, Sep 8, 2019 at 12:15 PM Hyukjin Kwon  wrote:
>>
>>> HI all,
>>>
>>> We have previously resolved JIRAs that target EOL releases (up to Spark 2.2.x) in
>>> order to keep the backlog
>>> at a manageable size.
>>> Since Spark 2.3.4 will be an EOL release, I plan to do this again roughly
>>> in a week.
>>>
>>> The JIRAs that have not been updated for the last year and that have an
>>> affected version of an EOL release will be:
>>>   - Resolved as 'Incomplete' status
>>>   - Has a 'bulk-closed' label.
>>>
>>> I plan to use this JQL
>>>
>>> project = SPARK
>>>   AND status in (Open, "In Progress", Reopened)
>>>   AND (
>>> affectedVersion = EMPTY OR
>>> NOT (affectedVersion in versionMatch("^3.*")
>>>   OR affectedVersion in versionMatch("^2.4.*")
>>> )
>>>   )
>>>   AND updated <= -52w
>>>
>>>
>>> You could click this link and check.
>>>
>>>
>>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20(affectedVersion%20%3D%20EMPTY%20OR%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)))%20AND%20updated%20%3C%3D%20-52w
>>>
>>> Please let me know if you guys have any concern or opinion on this.
>>>
>>> Thanks.
>>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

-- 
---
Takeshi Yamamuro


Re: Resolving all JIRAs affecting EOL releases

2019-09-07 Thread Takeshi Yamamuro
Hi, Hyukjin,

I checked the entries in the list and found that some of them have
'In Progress' status and open PRs (e.g., SPARK-25211
<https://issues.apache.org/jira/browse/SPARK-25211>).
Should we also close these PRs as part of the bulk close?
(But we might need to check the corresponding PRs manually?)

Bests,
Takeshi


On Sun, Sep 8, 2019 at 12:15 PM Hyukjin Kwon  wrote:

> HI all,
>
> We have previously resolved JIRAs that target EOL releases (up to Spark 2.2.x) in
> order to keep the backlog
> at a manageable size.
> Since Spark 2.3.4 will be an EOL release, I plan to do this again roughly in
> a week.
>
> The JIRAs that have not been updated for the last year and that have an
> affected version of an EOL release will be:
>   - Resolved as 'Incomplete' status
>   - Has a 'bulk-closed' label.
>
> I plan to use this JQL
>
> project = SPARK
>   AND status in (Open, "In Progress", Reopened)
>   AND (
> affectedVersion = EMPTY OR
> NOT (affectedVersion in versionMatch("^3.*")
>   OR affectedVersion in versionMatch("^2.4.*")
> )
>   )
>   AND updated <= -52w
>
>
> You could click this link and check.
>
>
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20(affectedVersion%20%3D%20EMPTY%20OR%20NOT%20(affectedVersion%20in%20versionMatch(%22%5E3.*%22)%20OR%20affectedVersion%20in%20versionMatch(%22%5E2.4.*%22)))%20AND%20updated%20%3C%3D%20-52w
>
> Please let me know if you guys have any concern or opinion on this.
>
> Thanks.
>


-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Apache Spark 2.4.4 (RC3)

2019-08-28 Thread Takeshi Yamamuro
I checked that the tests passed again on the same env.
It looks ok.


On Thu, Aug 29, 2019 at 6:15 AM Marcelo Vanzin 
wrote:

> +1
>
> On Tue, Aug 27, 2019 at 4:06 PM Dongjoon Hyun 
> wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.4.4.
> >
> > The vote is open until August 30th 5PM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.4
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.4-rc3 (commit
> 7955b3962ac46b89564e0613db7bea98a1478bf2):
> > https://github.com/apache/spark/tree/v2.4.4-rc3
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1332/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc3-docs/
> >
> > The list of bug fixes going into 2.4.4 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12345466
> >
> > This release is using the release script of the tag v2.4.4-rc3.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.4?
> > ===
> >
> > The current list of open tickets targeted at 2.4.4 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.4
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>
>
>
> --
> Marcelo
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro
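
For anyone testing the RC from Scala, a minimal build.sbt sketch of the "add
the staging repository to your projects resolvers" step from the FAQ above.
The URL is the staging repository given in the vote e-mail; the resolver name
and Scala version below are illustrative choices, not part of the thread.

  // build.sbt (sketch): resolve the 2.4.4 RC artifacts from the staging repo.
  scalaVersion := "2.11.12"

  resolvers += "spark-2.4.4-rc3-staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1332/"

  // Staged artifacts are expected under the final version number.
  libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"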


Re: [VOTE] Release Apache Spark 2.4.4 (RC2)

2019-08-26 Thread Takeshi Yamamuro
Hi, Dongjoon

I checked that all the tests passed on my Mac/x86_64 env with:
-Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pkubernetes
-Pkubernetes-integration-tests -Psparkr

maropu@~/spark-2.4.4-rc2:$java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

Bests,
Takeshi


On Tue, Aug 27, 2019 at 11:06 AM Sean Owen  wrote:

> +1 as per response to RC1. The existing issues identified there seem
> to have been fixed.
>
>
> On Mon, Aug 26, 2019 at 2:45 AM Dongjoon Hyun 
> wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 2.4.4.
> >
> > The vote is open until August 29th 1AM PST and passes if a majority +1
> PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 2.4.4
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see http://spark.apache.org/
> >
> > The tag to be voted on is v2.4.4-rc2 (commit
> b7a15b69aca8a2fc3f308105e5978a69dff0f4fb):
> > https://github.com/apache/spark/tree/v2.4.4-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc2-bin/
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1327/
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v2.4.4-rc2-docs/
> >
> > The list of bug fixes going into 2.4.4 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12345466
> >
> > This release is using the release script of the tag v2.4.4-rc2.
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> >
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks, in the Java/Scala
> > you can add the staging repository to your projects resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 2.4.4?
> > ===
> >
> > The current list of open tickets targeted at 2.4.4 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 2.4.4
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> >
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: [VOTE] Release Apache Spark 2.3.4 (RC1)

2019-08-26 Thread Takeshi Yamamuro
Hi,

Thanks for managing the release!
It seems the staging repository has not been exposed yet?
https://repository.apache.org/content/repositories/orgapachespark-1328/

On Tue, Aug 27, 2019 at 5:28 AM Kazuaki Ishizaki 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 2.3.4.
>
> The vote is open until August 29th 2PM PST and passes if a majority +1 PMC
> votes are cast, with
> a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 2.3.4
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org/
>
> The tag to be voted on is v2.3.4-rc1 (commit
> 8c6f8150f3c6298ff4e1c7e06028f12d7eaf0210):
> https://github.com/apache/spark/tree/v2.3.4-rc1
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-bin/
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1331/
> <https://repository.apache.org/content/repositories/orgapachespark-1328/>
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v2.3.4-rc1-docs/
>
> The list of bug fixes going into 2.3.4 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12344844
>
> FAQ
>
> =
> How can I help test this release?
> =
>
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks, in the Java/Scala
> you can add the staging repository to your projects resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
>
> ===
> What should happen to JIRA tickets still targeting 2.3.4?
> ===
>
> The current list of open tickets targeted at 2.3.4 can be found at:
> https://issues.apache.org/jira/projects/SPARKand search for "Target
> Version/s" = 2.3.4
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
>
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>

-- 
---
Takeshi Yamamuro


Re: JDK11 Support in Apache Spark

2019-08-25 Thread Takeshi Yamamuro
Cool, congrats!

Bests,
Takeshi

On Mon, Aug 26, 2019 at 1:01 PM Hichame El Khalfi 
wrote:

> That's Awesome !!!
>
> Thanks to everyone that made this possible :cheers:
>
> Hichame
>
> *From:* cloud0...@gmail.com
> *Sent:* August 25, 2019 10:43 PM
> *To:* lix...@databricks.com
> *Cc:* felixcheun...@hotmail.com; ravishankar.n...@gmail.com;
> dongjoon.h...@gmail.com; dev@spark.apache.org; u...@spark.apache.org
> *Subject:* Re: JDK11 Support in Apache Spark
>
> Great work!
>
> On Sun, Aug 25, 2019 at 6:03 AM Xiao Li  wrote:
>
>> Thank you for your contributions! This is a great feature for Spark
>> 3.0! We finally achieved it!
>>
>> Xiao
>>
>> On Sat, Aug 24, 2019 at 12:18 PM Felix Cheung 
>> wrote:
>>
>>> That’s great!
>>>
>>> --
>>> *From:* ☼ R Nair 
>>> *Sent:* Saturday, August 24, 2019 10:57:31 AM
>>> *To:* Dongjoon Hyun 
>>> *Cc:* dev@spark.apache.org ; user @spark/'user
>>> @spark'/spark users/user@spark 
>>> *Subject:* Re: JDK11 Support in Apache Spark
>>>
>>> Finally!!! Congrats
>>>
>>> On Sat, Aug 24, 2019, 11:11 AM Dongjoon Hyun 
>>> wrote:
>>>
>>>> Hi, All.
>>>>
>>>> Thanks to your many many contributions,
>>>> Apache Spark master branch starts to pass on JDK11 as of today.
>>>> (with `hadoop-3.2` profile: Apache Hadoop 3.2 and Hive 2.3.6)
>>>>
>>>>
>>>> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/
>>>> (JDK11 is used for building and testing.)
>>>>
>>>> We already verified all UTs (including PySpark/SparkR) before.
>>>>
>>>> Please feel free to use JDK11 in order to build/test/run `master`
>>>> branch and
>>>> share your experience including any issues. It will help Apache Spark
>>>> 3.0.0 release.
>>>>
>>>> For the follow-ups, please follow
>>>> https://issues.apache.org/jira/browse/SPARK-24417 .
>>>> The next step is `how to support JDK8/JDK11 together in a single
>>>> artifact`.
>>>>
>>>> Bests,
>>>> Dongjoon.
>>>>
>>>
>>
>> --
>> [image: Databricks Summit - Watch the talks]
>> <https://databricks.com/sparkaisummit/north-america>
>>
>

-- 
---
Takeshi Yamamuro


Re: Release Spark 2.3.4

2019-08-16 Thread Takeshi Yamamuro
+1, too

Bests,
Takeshi

On Sat, Aug 17, 2019 at 7:25 AM Dongjoon Hyun 
wrote:

> +1 for 2.3.4 release as the last release for `branch-2.3` EOL.
>
> Also, +1 for next week release.
>
> Bests,
> Dongjoon.
>
>
> On Fri, Aug 16, 2019 at 8:19 AM Sean Owen  wrote:
>
>> I think it's fine to do these in parallel, yes. Go ahead if you are
>> willing.
>>
>> On Fri, Aug 16, 2019 at 9:48 AM Kazuaki Ishizaki 
>> wrote:
>> >
>> > Hi, All.
>> >
>> > Spark 2.3.3 was released six months ago (15th February, 2019) at
>> http://spark.apache.org/news/spark-2-3-3-released.html. And about 18
>> months have passed since Spark 2.3.0 was released (28th February,
>> 2018).
>> > As of today (16th August), there are 103 commits (69 JIRAs) in
>> `branch-2.3` since 2.3.3.
>> >
>> > It would be great if we can have Spark 2.3.4.
>> > If it is ok, shall we start `2.3.4 RC1` concurrently with 2.4.4 or after
>> 2.4.4 is released?
>> >
>> > An issue list in JIRA:
>> https://issues.apache.org/jira/projects/SPARK/versions/12344844
>> > A commit list in github from the last release:
>> https://github.com/apache/spark/compare/66fd9c34bf406a4b5f86605d06c9607752bd637a...branch-2.3
>> > The 8 correctness issues resolved in branch-2.3:
>> >
>> https://issues.apache.org/jira/browse/SPARK-26873?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012344844%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC
>> >
>> > Best Regards,
>> > Kazuaki Ishizaki
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>

-- 
---
Takeshi Yamamuro


Re: Release Apache Spark 2.4.4

2019-08-13 Thread Takeshi Yamamuro
Hi,

Thanks for your notification, Dongjoon!
I put some links for the other committers/PMCs to access the info easily:

A commit list in github from the last release:
https://github.com/apache/spark/compare/5ac2014e6c118fbeb1fe8e5c8064c4a8ee9d182a...branch-2.4
An issue list in JIRA:
https://issues.apache.org/jira/projects/SPARK/versions/12345466#release-report-tab-body
The 5 correctness issues resolved in branch-2.4:
https://issues.apache.org/jira/browse/SPARK-27798?jql=project%20%3D%2012315420%20AND%20fixVersion%20%3D%2012345466%20AND%20labels%20in%20(%27correctness%27)%20ORDER%20BY%20priority%20DESC%2C%20key%20ASC

Anyway, +1

Best,
Takeshi

On Wed, Aug 14, 2019 at 8:25 AM DB Tsai  wrote:

> +1
>
> On Tue, Aug 13, 2019 at 4:16 PM Dongjoon Hyun 
> wrote:
> >
> > Hi, All.
> >
> > Spark 2.4.3 was released three months ago (8th May).
> > As of today (13th August), there are 112 commits (75 JIRAs) in
> `branch-2.4` since 2.4.3.
> >
> > It would be great if we can have Spark 2.4.4.
> > Shall we start `2.4.4 RC1` next Monday (19th August)?
> >
> > Last time, there was a request for a K8s issue, and now I'm waiting for
> SPARK-27900.
> > Please let me know if there is another issue.
> >
> > Thanks,
> > Dongjoon.
>
> -----
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>

-- 
---
Takeshi Yamamuro


Re: [build system] upcoming jenkins downtime: august 3rd 2019

2019-08-01 Thread Takeshi Yamamuro
hi, shane,
Thanks for your hard work!!

Bests,
Takeshi

On Fri, Aug 2, 2019 at 5:27 AM shane knapp  wrote:

> here's the latest timetable:
>
> * all machines powered off some time tomorrow (friday) night ~9pm
> * sunday morning, all machines will be powered back up
> * if any stragglers fail to come back, we will investigate monday morning
>
> On Tue, Jul 30, 2019 at 11:30 AM shane knapp  wrote:
>
>> On Fri, Jun 14, 2019 at 9:13 AM shane knapp  wrote:
>>
>>> the campus colo will be performing some electrical maintenance, which
>>> means that they'll be powering off the entire building.
>>>
>>> since the jenkins cluster is located in that colo, we are most
>>> definitely affected.  :)
>>>
>>> i'll be out of town that weekend, but will have one of my sysadmins
>>> bring everything back up on sunday, august 4th.  if they run in to issues,
>>> i will jump in first thing monday, august 5th.
>>>
>>> as the time approaches, i will send reminders and updates.
>>>
>>> hey everyone, just wanted to post a reminder about the upcoming jenkins
>> outage this weekend.
>>
>> machines will be powered off friday night, and hopefully everything comes
>> back up on sunday.
>>
>> if we have any problems, i will take care of things monday morning.
>>
>>
>>
>> --
>> Shane Knapp
>> UC Berkeley EECS Research / RISELab Staff Technical Lead
>> https://rise.cs.berkeley.edu
>>
>
>
> --
> Shane Knapp
> UC Berkeley EECS Research / RISELab Staff Technical Lead
> https://rise.cs.berkeley.edu
>


-- 
---
Takeshi Yamamuro


Re: [Discuss] Follow ANSI SQL on table insertion

2019-07-27 Thread Takeshi Yamamuro
ior follow SQL standard.
>>
>> On Sat, Jul 27, 2019 at 1:35 AM Ryan Blue  wrote:
>>
>>> I don’t think this is a good idea. Following the ANSI standard is
>>> usually fine, but here it would *silently corrupt data*.
>>>
>>> From your proposal doc, ANSI allows implicitly casting from long to int
>>> (any numeric type to any other numeric type) and inserts NULL when a value
>>> overflows. That would drop data values and is not safe.
>>>
>>> Fixing the silent corruption by adding a runtime exception is not a good
>>> option, either. That puts off the problem until much of the job has
>>> completed, instead of catching the error at analysis time. It is better to
>>> catch this earlier during analysis than to run most of a job and then fail.
>>>
>>> In addition, part of the justification for using the ANSI standard is to
>>> avoid breaking existing jobs. But the new behavior is only applied in
>>> DataSourceV2, so it won’t affect existing jobs until sources move to v2 and
>>> break other behavior anyway.
>>>
>>> I think that the correct solution is to go with the existing validation
>>> rules that require explicit casts to truncate values.
>>>
>>> That still leaves the use case that motivated this proposal, which is
>>> that floating point literals are parsed as decimals and fail simple insert
>>> statements. We already came up with two alternatives to fix that problem in
>>> the DSv2 sync and I think it is a better idea to go with one of those
>>> instead of “fixing” Spark in a way that will corrupt data or cause runtime
>>> failures.
>>>
>>> On Thu, Jul 25, 2019 at 9:11 AM Wenchen Fan  wrote:
>>>
>>>> I have heard about many complaints about the old table insertion
>>>> behavior. Blindly casting everything will leak the user mistake to a late
>>>> stage of the data pipeline, and make it very hard to debug. When a user
>>>> writes string values to an int column, it's probably a mistake and the
>>>> columns are misordered in the INSERT statement. We should fail the query
>>>> earlier and ask users to fix the mistake.
>>>>
>>>> In the meanwhile, I agree that the new table insertion behavior we
>>>> introduced for Data Source V2 is too strict. It may fail valid queries
>>>> unexpectedly.
>>>>
>>>> In general, I support the direction of following the ANSI SQL standard.
>>>> But I'd like to do it with 2 steps:
>>>> 1. only add cast when the assignment rule is satisfied. This should be
>>>> the default behavior and we should provide a legacy config to restore to
>>>> the old behavior.
>>>> 2. fail the cast operation at runtime if overflow happens. AFAIK Marco
>>>> Gaido is working on it already. This will have a config as well and by
>>>> default we still return null.
>>>>
>>>> After doing this, the default behavior will be slightly different from
>>>> the SQL standard (cast can return null), and users can turn on the ANSI
>>>> mode to fully follow the SQL standard. This is much better than before and
>>>> should prevent a lot of user mistakes. It's also a reasonable choice to me
>>>> to not throw exceptions at runtime by default, as it's usually bad for
>>>> long-running jobs.
>>>>
>>>> Thanks,
>>>> Wenchen
>>>>
>>>> On Thu, Jul 25, 2019 at 11:37 PM Gengliang Wang <
>>>> gengliang.w...@databricks.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I would like to discuss the table insertion behavior of Spark. In the
>>>>> current data source V2, only UpCast is allowed for table insertion. I 
>>>>> think
>>>>> following ANSI SQL is a better idea.
>>>>> For more information, please read the Discuss: Follow ANSI SQL on
>>>>> table insertion
>>>>> <https://docs.google.com/document/d/1b9nnWWbKVDRp7lpzhQS1buv1_lDzWIZY2ApFs5rBcGI/edit?usp=sharing>
>>>>> Please let me know if you have any thoughts on this.
>>>>>
>>>>> Regards,
>>>>> Gengliang
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Netflix
>>>
>>

-- 
---
Takeshi Yamamuro
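
To make the trade-off in this thread concrete, here is a short Scala sketch
(assuming a SparkSession named `spark`) of the kind of insertion being debated.
What happens to the overflowing value below is precisely the open question:
NULL under ANSI-style casting, a runtime error if overflow checks are added, or
an analysis-time error under the strict upcast-only rule.

  // Sketch: a BIGINT value inserted into an INT column.
  spark.sql("CREATE TABLE target (i INT) USING parquet")

  // 3000000000 does not fit in a 32-bit INT.  Legacy-style casting would keep
  // only the low-order bits, ANSI-as-proposed would produce NULL (or fail at
  // runtime), and the strict rule would reject the query during analysis.
  spark.sql("INSERT INTO target SELECT CAST(3000000000 AS BIGINT)")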

