[VOTE] Release Spark 3.0.2 (RC1)

2021-02-15 Thread Dongjoon Hyun
Please vote on releasing the following candidate as Apache Spark version
3.0.2.

The vote is open until February 19th at 9 AM (PST) and passes if a
majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.

[ ] +1 Release this package as Apache Spark 3.0.2
[ ] -1 Do not release this package because ...

To learn more about Apache Spark, please see https://spark.apache.org/

The tag to be voted on is v3.0.2-rc1 (commit
648457905c4ea7d00e3d88048c63f360045f0714):
https://github.com/apache/spark/tree/v3.0.2-rc1

The release files, including signatures, digests, etc. can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-bin/

Signatures used for Spark RCs can be found in this file:
https://dist.apache.org/repos/dist/dev/spark/KEYS

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1366/

The documentation corresponding to this release can be found at:
https://dist.apache.org/repos/dist/dev/spark/v3.0.2-rc1-docs/

The list of bug fixes going into 3.0.2 can be found at the following URL:
https://issues.apache.org/jira/projects/SPARK/versions/12348739

FAQ

=========================================
How can I help test this release?
=========================================

If you are a Spark user, you can help us test this release by taking
an existing Spark workload and running it on this release candidate, then
reporting any regressions.

If you're working in PySpark, you can set up a virtual env, install
the current RC, and see if anything important breaks; in Java/Scala, you
can add the staging repository to your project's resolvers and test
with the RC (make sure to clean up the artifact cache before/after so
you don't end up building with an out-of-date RC going forward).
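
For the Java/Scala path, here is a minimal sketch of what that can look
like in an sbt build (the Scala version and module list below are just
placeholder assumptions; adjust them to your project):

```scala
// build.sbt -- point the build at the staging repository from this vote.
ThisBuild / scalaVersion := "2.12.10"

resolvers += "Spark 3.0.2 RC1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1366/"

// Use whichever Spark modules your project depends on.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.0.2",
  "org.apache.spark" %% "spark-sql"  % "3.0.2"
)

// Remember to clear the local artifact cache (for sbt, typically
// ~/.ivy2/cache/org.apache.spark) before and after testing.
```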

=========================================
What should happen to JIRA tickets still targeting 3.0.2?
=========================================

The current list of open tickets targeted at 3.0.2 can be found by
searching for "Target Version/s" = 3.0.2 at:
https://issues.apache.org/jira/projects/SPARK

Committers should look at those and triage. Extremely important bug
fixes, documentation, and API tweaks that impact compatibility should
be worked on immediately. Everything else should be retargeted to an
appropriate release.

=========================================
But my bug isn't fixed?
=========================================

In order to make timely releases, we will typically not hold the
release unless the bug in question is a regression from the previous
release. That being said, if there is something which is a regression
that has not been correctly targeted, please ping me or a committer to
help target the issue.


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-15 Thread Ye Xianjin
Hi,

Thanks to Ryan and Wenchen for leading this.

I'd like to add my two cents here. In production environments, the function
catalog might be used by multiple systems, such as Spark, Presto, and Hive.
Is it possible to design this function catalog with a unified function
catalog in mind, or at least so that it wouldn't be too difficult to extend
it into a unified one?

P.S. We registered a lot of UDFs in the Hive HMS in our production
environment, and those UDFs are shared by Spark and Presto. It works well,
even though it has a lot of drawbacks.
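
To make the point concrete, here is a small Scala sketch (with placeholder
types) of the kind of engine-agnostic catalog surface this would need; the
method names follow the SPIP discussion, but the signatures are assumptions,
not a settled API:

```scala
// Hypothetical placeholder types, for illustration only.
case class Identifier(namespace: Array[String], name: String)
trait UnboundFunction { def name: String }

// Nothing here is Spark-specific, so a single implementation backed by
// the Hive HMS could in principle serve Spark, Presto, and Hive alike.
trait FunctionCatalog {
  // List the functions registered under a namespace (e.g. a database).
  def listFunctions(namespace: Array[String]): Array[Identifier]
  // Resolve one function by its identifier.
  def loadFunction(ident: Identifier): UnboundFunction
}
```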

Sent from my iPhone

> On Feb 16, 2021, at 2:44 AM, Ryan Blue  wrote:
> 
> 
> Thanks for the positive feedback, everyone. It sounds like there is a clear 
> path forward for calling functions. Even without a prototype, the `invoke` 
> plans show that Wenchen's suggested optimization can be done, and 
> incorporating it as an optional extension to this proposal solves many of the 
> unknowns.
> 
> With that area now understood, is there any discussion about other parts of 
> the proposal, besides the function call interface?

Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-15 Thread Jungtaek Lim
Thanks for the input, Hyukjin!

I have kept to my own policy in all the discussions I have raised: I
provide a hypothetical example close to the actual one and avoid pointing
at anyone directly. The main purpose of the discussion is to ensure our
policy / consensus makes sense, no more. I can provide a more detailed
explanation if someone feels this one wasn't sufficient.

This discussion could probably serve as a "reminder" to every committer
if a similar discussion was raised before and succeeded in building
consensus. If there's some point where we haven't built consensus yet,
it'd be a good time to discuss further. I don't know exactly what that
earlier discussion and its result were, so I can't say what is new here,
but I guess this might be a duplicate, as you say the issue is similar.



On Tue, Feb 16, 2021 at 11:09 AM Hyukjin Kwon  wrote:

> I remember I raised a similar issue a long time ago in the dev mailing
> list. I agree that setting no assignee makes sense in most cases, and I
> also think we share similar thoughts about assignees on umbrella JIRAs,
> follow-up tasks, cases where the work is clear from a design doc, etc.
> It makes me think that actual problems from setting an assignee happen
> rarely, and are limited to a few specific cases that need a
> case-by-case look.
> Were there specific cases that made you concerned?


Re: [DISCUSS] assignee practice on committers+ (possible issue on preemption)

2021-02-15 Thread Hyukjin Kwon
I remember I raised a similar issue a long time ago in the dev mailing
list. I agree that setting no assignee makes sense in most cases, and I
also think we share similar thoughts about assignees on umbrella JIRAs,
follow-up tasks, cases where the work is clear from a design doc, etc.
It makes me think that actual problems from setting an assignee happen
rarely, and are limited to a few specific cases that need a
case-by-case look.
Were there specific cases that made you concerned?


On Mon, Feb 15, 2021 at 8:58 AM, Jungtaek Lim wrote:

> Hi devs,
>
> I'd like to raise a discussion and hear voices on the "assignee" practice
> on committers which may lead issues on preemption.
>
> I feel this is one of the major unfairnesses between contributors and
> committers if used improperly, especially when someone assigns themselves
> multiple JIRA issues.
>
> Let's say there are features A and B, each of which may take a month (or
> require a design doc) - both are individual major features, not subtasks
> or some sort of "follow-up".
>
> Technically, committers can file two JIRA issues and assign themselves
> both, without actually making any progress, and implicitly ensure no one
> works on these issues for a couple of months. Even just a plan on the
> backlog can prevent others from taking them up.
>
> I don't think this is fair to contributors, because contributors don't
> tend to file a JIRA issue unless they've made a lot of progress. (I'd
> like to remind you, competition from a contributor's position is quite
> tense and stressful.) Say they already spent a month working on a feature
> and testing it in production. They feel ready and visit JIRA, only to
> realize the JIRA issue was already filed and assigned to someone, with no
> progress on it. They have no idea how much progress "someone" has made.
> They "might" ask about the progress, but nothing will change if "someone"
> simply says "I'm still working on this" (with even 1% of progress). Isn't
> this actually against the reason we don't allow setting contributors as
> assignees?
>
> For sure, assigning the issue would make sense if the issue is a subtask
> or follow-up, or if it has made explicit progress, like a design doc
> being posted. In other cases I don't see any reason to assign the issue
> explicitly. Someone may tell contributors to just leave a comment "I'm
> working on it", but isn't that also something committers can do when they
> are "actually" working?
>
> I think committers should have no advantage in the possible competition
> over contributions, and setting an assignee without explicit progress
> makes me worried.
> To make it fair, we should either allow contributors to assign themselves
> or not allow committers to assign themselves except in extreme cases -
> they can still use the approach contributors do.
> (Again, I'd feel OK with assigning if there's a design doc proving that
> they have really spent non-trivial effort already. My point is about
> preempting JIRA issues with only sketched ideas or even just
> rationalizations.)
>
> Would like to hear everyone's voices.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> ps. better yet, it may be better to restrict something explicitly if we
> sincerely respect the underlying culture behind the statement "In case
> several people contributed, prefer to assign to the more ‘junior’,
> non-committer contributor".
>
>
>


Re: [DISCUSS] SPIP: FunctionCatalog

2021-02-15 Thread Ryan Blue
Thanks for the positive feedback, everyone. It sounds like there is a clear
path forward for calling functions. Even without a prototype, the `invoke`
plans show that Wenchen's suggested optimization can be done, and
incorporating it as an optional extension to this proposal solves many of
the unknowns.
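
To make the two call paths concrete, here is a rough Scala sketch (with a
stub row type; the names follow this discussion, but the exact signatures
are assumptions, since the API is still being designed):

```scala
// Stub standing in for Spark's internal row type, to keep this self-contained.
trait Row { def getLong(i: Int): Long }

// The row-based interface from the proposal: fully generic, but every
// call passes arguments through a boxed row.
trait ScalarFunction[R] {
  def name: String
  def produceResult(input: Row): R
}

class LongAdd extends ScalarFunction[Long] {
  override def name: String = "long_add"
  override def produceResult(input: Row): Long =
    input.getLong(0) + input.getLong(1)

  // Wenchen's suggested optimization: an individual-arguments method the
  // planner can bind to directly and call without building a row. Under
  // the SupportsInvoke extension this path would be optional.
  def invoke(left: Long, right: Long): Long = left + right
}
```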

With that area now understood, is there any discussion about other parts of
the proposal, besides the function call interface?

On Fri, Feb 12, 2021 at 10:40 PM Chao Sun  wrote:

> This is an important feature which can unblock several other projects
> including bucket join support for DataSource v2, complete support for
> enforcing DataSource v2 distribution requirements on the write path, etc. I
> like Ryan's proposals which look simple and elegant, with nice support on
> function overloading and variadic arguments. On the other hand, I think
> Wenchen made a very good point about performance. Overall, I'm excited to
> see active discussions on this topic and believe the community will come to
> a proposal with the best of both sides.
>
> Chao
>
> On Fri, Feb 12, 2021 at 7:58 PM Hyukjin Kwon  wrote:
>
>> +1 for Liang-chi's.
>>
>> Thanks Ryan and Wenchen for leading this.
>>
>>
>> On Sat, Feb 13, 2021 at 12:18 PM, Liang-Chi Hsieh wrote:
>>
>>> Basically I think the proposal makes sense to me and I'd like to
>>> support the SPIP, as it looks like we have a strong need for this
>>> important feature.
>>>
>>> Thanks Ryan for working on this, and I also look forward to Wenchen's
>>> implementation. Thanks for the discussion too.
>>>
>>> Actually, I think the SupportsInvoke proposed by Ryan looks like a good
>>> alternative to me. Besides Wenchen's alternative implementation, is
>>> there a chance we could also have the SupportsInvoke one for comparison?
>>>
>>>
>>> John Zhuge wrote
>>> > Excited to see our Spark community rallying behind this important
>>> > feature!
>>> >
>>> > The proposal lays a solid foundation of minimal feature set with
>>> > careful considerations for future optimizations and extensions. Can't
>>> > wait to see it leading to more advanced functionalities like views
>>> > with shared custom functions, function pushdown, lambda, etc. It has
>>> > already borne fruit from the constructive collaborations in this
>>> > thread. Looking forward to Wenchen's prototype and further discussions
>>> > including the SupportsInvoke extension proposed by Ryan.
>>> >
>>> >
>>> > On Fri, Feb 12, 2021 at 4:35 PM Owen O'Malley <owen.omalley@> wrote:
>>> >
>>> >> I think this proposal is a very good thing, giving Spark a standard
>>> >> way of getting to and calling UDFs.
>>> >>
>>> >> I like having the ScalarFunction as the API to call the UDFs. It is
>>> >> simple, yet covers all of the polymorphic type cases well. I think it
>>> >> would also simplify using the functions in other contexts, like
>>> >> pushing down filters into the ORC & Parquet readers, although there
>>> >> are a lot of details that would need to be considered there.
>>> >>
>>> >> .. Owen
>>> >>
>>> >>
>>> >> On Fri, Feb 12, 2021 at 11:07 PM Erik Krogen <ekrogen@.com> wrote:
>>> >>
>>> >>> I agree that there is a strong need for a FunctionCatalog within
>>> >>> Spark to provide support for shareable UDFs, as well as make
>>> >>> movement towards more advanced functionality like views which
>>> >>> themselves depend on UDFs, so I support this SPIP wholeheartedly.
>>> >>>
>>> >>> I find both of the proposed UDF APIs to be sufficiently
>>> >>> user-friendly and extensible. I generally think Wenchen's proposal
>>> >>> is easier for a user to work with in the common case, but has
>>> >>> greater potential for confusing and hard-to-debug behavior due to
>>> >>> use of reflective method signature searches. The merits on both
>>> >>> sides can hopefully be more properly examined with code, so I look
>>> >>> forward to seeing an implementation of Wenchen's ideas to provide a
>>> >>> more concrete comparison. I am optimistic that we will not let the
>>> >>> debate over this point unreasonably stall the SPIP from making
>>> >>> progress.
>>> >>>
>>> >>> Thank you to both Wenchen and Ryan for your detailed consideration
>>> >>> and evaluation of these ideas!
>>> >>> --
>>> >>> *From:* Dongjoon Hyun <dongjoon.hyun@>
>>> >>> *Sent:* Wednesday, February 10, 2021 6:06 PM
>>> >>> *To:* Ryan Blue <blue@>
>>> >>> *Cc:* Holden Karau <holden@>; Hyukjin Kwon <gurwls223@>;
>>> >>> Spark Dev List <dev@.apache>; Wenchen Fan <cloud0fan@>
>>> >>> *Subject:* Re: [DISCUSS] SPIP: FunctionCatalog
>>> >>>
>>> >>> BTW, I forgot to add my opinion explicitly in this thread because
>>> >>> I was on the PR before this thread.
>>> >>>
>>> >>> 1. The `FunctionCatalog API` PR

Re: Is there any implicit RDD cache operation for query optimizations?

2021-02-15 Thread attilapiros
Hi,

There is a good reason why the decision about caching is left to the
user: Spark does not know about the future of your DataFrames and RDDs.

Think about how your program runs. At any given moment there is an exact
point the execution has reached; when Spark reaches an action it evaluates
that Spark job, but it knows nothing about future jobs. Cached data would
only be useful for a future job that reuses it.

On the other hand, this information is available to the user, who writes
all the jobs.
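
A minimal sketch of this in Scala (assuming a local SparkSession, just for
illustration): only the author knows that both jobs below reuse the same
dataset, so only the author can decide that caching it pays off.

```scala
import org.apache.spark.sql.SparkSession

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("cache-demo").getOrCreate()
    import spark.implicits._

    val doubled = spark.range(0, 1000000L).map(_ * 2)
    doubled.cache() // user-supplied hint: two future jobs reuse `doubled`

    println(doubled.count())                            // job 1 fills the cache
    println(doubled.filter($"value" % 3 === 0).count()) // job 2 reuses it

    doubled.unpersist()
    spark.stop()
  }
}
```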

Attila


