Re: Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-03 Thread Yang,Jie(INF)
Hi, Dongjoon

Our company (Baidu) is still using the combination of Spark 3.3 + Hadoop 2.7.4 
in production. Hadoop 2.7.4 is an internally maintained version compiled 
with Java 8. Although we are using Hadoop 2, I still support this 
proposal because it is positive and exciting.

Regards,
YangJie

From: Dongjoon Hyun 
Date: Tuesday, October 4, 2022, 11:16
To: dev 
Subject: Dropping Apache Spark Hadoop2 Binary Distribution?

Hi, All.

I'm wondering whether the following Apache Spark Hadoop 2 binary distribution
is still used by anyone in the community. If it's not used or not useful,
we may remove it from the Apache Spark 3.4.0 release.


https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz

Here is the background of this question.
Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache
Spark community has been building and releasing with Java 8 only.
I believe that user applications also use Java 8+ these days.
Recently, I received the following message from the Hadoop PMC.

  > "if you really want to claim hadoop 2.x compatibility, then you have to
  > be building against java 7". Otherwise a lot of people with hadoop 2.x
  > clusters won't be able to run your code. If your projects are java8+
  > only, then they are implicitly hadoop 3.1+, no matter what you use
  > in your build. Hence: no need for branch-2 branches except
  > to complicate your build/test/release processes [1]

If the Hadoop 2 binary distribution is no longer used today, or is incomplete
somewhere due to being built with Java 8, the following three
existing alternative Hadoop 3 binary distributions could be
a better official solution for old Hadoop 2 clusters.

1) Scala 2.12 and without-hadoop distribution
2) Scala 2.12 and Hadoop 3 distribution
3) Scala 2.13 and Hadoop 3 distribution

In short, is there anyone who is using Apache Spark 3.3.0 Hadoop2 Binary 
distribution?

Dongjoon

[1] 
https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247


Dropping Apache Spark Hadoop2 Binary Distribution?

2022-10-03 Thread Dongjoon Hyun
Hi, All.

I'm wondering whether the following Apache Spark Hadoop 2 binary distribution
is still used by anyone in the community. If it's not used or not useful,
we may remove it from the Apache Spark 3.4.0 release.


https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.tgz

Here is the background of this question.
Since Apache Spark 2.2.0 (SPARK-19493, SPARK-19550), the Apache
Spark community has been building and releasing with Java 8 only.
I believe that user applications also use Java 8+ these days.
Recently, I received the following message from the Hadoop PMC.

  > "if you really want to claim hadoop 2.x compatibility, then you have to
  > be building against java 7". Otherwise a lot of people with hadoop 2.x
  > clusters won't be able to run your code. If your projects are java8+
  > only, then they are implicitly hadoop 3.1+, no matter what you use
  > in your build. Hence: no need for branch-2 branches except
  > to complicate your build/test/release processes [1]

If the Hadoop 2 binary distribution is no longer used today, or is incomplete
somewhere due to being built with Java 8, the following three
existing alternative Hadoop 3 binary distributions could be
a better official solution for old Hadoop 2 clusters.

1) Scala 2.12 and without-hadoop distribution
2) Scala 2.12 and Hadoop 3 distribution
3) Scala 2.13 and Hadoop 3 distribution

In short, is there anyone who is using Apache Spark 3.3.0 Hadoop2 Binary
distribution?

Dongjoon

[1]
https://issues.apache.org/jira/browse/ORC-1251?focusedCommentId=17608247&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17608247


Re: [VOTE] Release Spark 3.3.1 (RC2)

2022-10-03 Thread Mridul Muralidharan
+1 from me, with a few comments.

I saw the following failures; are these known issues / flaky tests?

* PersistenceEngineSuite.ZooKeeperPersistenceEngine
Looks like a port conflict issue from a quick look at the logs (conflict with
starting the admin port at 8080) - is this expected behavior for the test?
I worked around it by shutting down the process that was using the port,
though I did not investigate deeply.

* org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite was aborted
It is expecting these artifacts in $HOME/.m2/repository

1. tomcat#jasper-compiler;5.5.23!jasper-compiler.jar
2. tomcat#jasper-runtime;5.5.23!jasper-runtime.jar
3. commons-el#commons-el;1.0!commons-el.jar
4. org.apache.hive#hive-exec;2.3.7!hive-exec.jar

I worked around it by adding them locally explicitly - we should probably
add them as a test dependency?
Not sure if this changed in this release though (I had cleaned my local .m2
recently).

Other than this, rest looks good to me.

Regards,
Mridul


On Wed, Sep 28, 2022 at 2:56 PM Sean Owen  wrote:

> +1 from me, same result as last RC.
>
> On Wed, Sep 28, 2022 at 12:21 AM Yuming Wang  wrote:
>
>> Please vote on releasing the following candidate as Apache Spark version 
>> 3.3.1.
>>
>> The vote is open until 11:59pm Pacific time October 3rd and passes if a 
>> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>>
>> [ ] +1 Release this package as Apache Spark 3.3.1
>> [ ] -1 Do not release this package because ...
>>
>> To learn more about Apache Spark, please see https://spark.apache.org
>>
>> The tag to be voted on is v3.3.1-rc2 (commit 
>> 1d3b8f7cb15283a1e37ecada6d751e17f30647ce):
>> https://github.com/apache/spark/tree/v3.3.1-rc2
>>
>> The release files, including signatures, digests, etc. can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-bin
>>
>> Signatures used for Spark RCs can be found in this file:
>> https://dist.apache.org/repos/dist/dev/spark/KEYS
>>
>> The staging repository for this release can be found at:
>> https://repository.apache.org/content/repositories/orgapachespark-1421
>>
>> The documentation corresponding to this release can be found at:
>> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-docs
>>
>> The list of bug fixes going into 3.3.1 can be found at the following URL:
>> https://issues.apache.org/jira/projects/SPARK/versions/12351710
>>
>> This release is using the release script of the tag v3.3.1-rc2.
>>
>>
>> FAQ
>>
>> =
>> How can I help test this release?
>> =
>> If you are a Spark user, you can help us test this release by taking
>> an existing Spark workload and running on this release candidate, then
>> reporting any regressions.
>>
>> If you're working in PySpark you can set up a virtual env and install
>> the current RC and see if anything important breaks. In Java/Scala,
>> you can add the staging repository to your project's resolvers and test
>> with the RC (make sure to clean up the artifact cache before/after so
>> you don't end up building with an out-of-date RC going forward).
>>
>> ===
>> What should happen to JIRA tickets still targeting 3.3.1?
>> ===
>> The current list of open tickets targeted at 3.3.1 can be found at:
>> https://issues.apache.org/jira/projects/SPARK and search for "Target 
>> Version/s" = 3.3.1
>>
>> Committers should look at those and triage. Extremely important bug
>> fixes, documentation, and API tweaks that impact compatibility should
>> be worked on immediately. Everything else please retarget to an
>> appropriate release.
>>
>> ==
>> But my bug isn't fixed?
>> ==
>> In order to make timely releases, we will typically not hold the
>> release unless the bug in question is a regression from the previous
>> release. That being said, if there is something which is a regression
>> that has not been correctly targeted please ping me or a committer to
>> help target the issue.
>>
>>
>>
>>


Re: [VOTE] Release Spark 3.3.1 (RC2)

2022-10-03 Thread Dongjoon Hyun
Sorry, but -1 due to the undocumented breaking query result change.

Apache Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0 have the following result for
`grouping_id()` and `grouping__id`.

scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS
t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
+--------+------------+
|count(1)|grouping__id|
+--------+------------+
|       1|           1|
|       1|           1|
+--------+------------+

Only the Apache Spark 3.3.1 RC has

scala> sql("SELECT count(*), grouping__id from (VALUES (1,1,1),(2,2,2)) AS
t(k1,k2,v) GROUP BY k1 GROUPING SETS (k2) ").show()
+--------+------------+
|count(1)|grouping__id|
+--------+------------+
|       1|           2|
|       1|           2|
+--------+------------+

Although this is the result of a valid fix from SPARK-40218, it would be great
if we could provide a migration guide and a legacy configuration (SPARK-40562)
to avoid any production outage during a migration from 3.3.0 to 3.3.1.
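
For illustration, a minimal sketch of how such a legacy fallback might be exercised once SPARK-40562 lands; the exact configuration key below is my assumption based on the ticket and may differ in the final release:

// Assumed configuration key from SPARK-40562; not confirmed in any release yet.
spark.conf.set("spark.sql.legacy.groupingIdWithAppendedUserGroupBy", "true")

// Re-run the query above; if the fallback works as proposed, this should
// restore the 3.3.0 grouping__id values.
spark.sql("""
  SELECT count(*), grouping__id
  FROM (VALUES (1,1,1), (2,2,2)) AS t(k1, k2, v)
  GROUP BY k1 GROUPING SETS (k2)
""").show()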

Dongjoon.


On Mon, Oct 3, 2022 at 11:11 AM Gengliang Wang  wrote:

> +1. I ran some simple tests and also verified that SPARK-40389 is fixed.
>
> Gengliang
>
> On Mon, Oct 3, 2022 at 8:56 AM Thomas Graves  wrote:
>
>> +1. Ran our internal tests and everything looks good.
>>
>> Tom Graves
>>
>> On Wed, Sep 28, 2022 at 12:20 AM Yuming Wang  wrote:
>> >
>> > Please vote on releasing the following candidate as Apache Spark
>> version 3.3.1.
>> >
>> > The vote is open until 11:59pm Pacific time October 3rd and passes if a
>> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> >
>> > [ ] +1 Release this package as Apache Spark 3.3.1
>> > [ ] -1 Do not release this package because ...
>> >
>> > To learn more about Apache Spark, please see https://spark.apache.org
>> >
>> > The tag to be voted on is v3.3.1-rc2 (commit
>> 1d3b8f7cb15283a1e37ecada6d751e17f30647ce):
>> > https://github.com/apache/spark/tree/v3.3.1-rc2
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-bin
>> >
>> > Signatures used for Spark RCs can be found in this file:
>> > https://dist.apache.org/repos/dist/dev/spark/KEYS
>> >
>> > The staging repository for this release can be found at:
>> > https://repository.apache.org/content/repositories/orgapachespark-1421
>> >
>> > The documentation corresponding to this release can be found at:
>> > https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-docs
>> >
>> > The list of bug fixes going into 3.3.1 can be found at the following
>> URL:
>> > https://issues.apache.org/jira/projects/SPARK/versions/12351710
>> >
>> > This release is using the release script of the tag v3.3.1-rc2.
>> >
>> >
>> > FAQ
>> >
>> > =
>> > How can I help test this release?
>> > =
>> > If you are a Spark user, you can help us test this release by taking
>> > an existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > If you're working in PySpark you can set up a virtual env and install
>> > the current RC and see if anything important breaks. In Java/Scala,
>> > you can add the staging repository to your project's resolvers and test
>> > with the RC (make sure to clean up the artifact cache before/after so
>> > you don't end up building with an out-of-date RC going forward).
>> >
>> > ===
>> > What should happen to JIRA tickets still targeting 3.3.1?
>> > ===
>> > The current list of open tickets targeted at 3.3.1 can be found at:
>> > https://issues.apache.org/jira/projects/SPARK and search for "Target
>> Version/s" = 3.3.1
>> >
>> > Committers should look at those and triage. Extremely important bug
>> > fixes, documentation, and API tweaks that impact compatibility should
>> > be worked on immediately. Everything else please retarget to an
>> > appropriate release.
>> >
>> > ==
>> > But my bug isn't fixed?
>> > ==
>> > In order to make timely releases, we will typically not hold the
>> > release unless the bug in question is a regression from the previous
>> > release. That being said, if there is something which is a regression
>> > that has not been correctly targeted please ping me or a committer to
>> > help target the issue.
>> >
>> >
>> >
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: [VOTE] Release Spark 3.3.1 (RC2)

2022-10-03 Thread Gengliang Wang
+1. I ran some simple tests and also verified that SPARK-40389 is fixed.

Gengliang

On Mon, Oct 3, 2022 at 8:56 AM Thomas Graves  wrote:

> +1. Ran our internal tests and everything looks good.
>
> Tom Graves
>
> On Wed, Sep 28, 2022 at 12:20 AM Yuming Wang  wrote:
> >
> > Please vote on releasing the following candidate as Apache Spark version
> 3.3.1.
> >
> > The vote is open until 11:59pm Pacific time October 3rd and passes if a
> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
> >
> > [ ] +1 Release this package as Apache Spark 3.3.1
> > [ ] -1 Do not release this package because ...
> >
> > To learn more about Apache Spark, please see https://spark.apache.org
> >
> > The tag to be voted on is v3.3.1-rc2 (commit
> 1d3b8f7cb15283a1e37ecada6d751e17f30647ce):
> > https://github.com/apache/spark/tree/v3.3.1-rc2
> >
> > The release files, including signatures, digests, etc. can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-bin
> >
> > Signatures used for Spark RCs can be found in this file:
> > https://dist.apache.org/repos/dist/dev/spark/KEYS
> >
> > The staging repository for this release can be found at:
> > https://repository.apache.org/content/repositories/orgapachespark-1421
> >
> > The documentation corresponding to this release can be found at:
> > https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-docs
> >
> > The list of bug fixes going into 3.3.1 can be found at the following URL:
> > https://issues.apache.org/jira/projects/SPARK/versions/12351710
> >
> > This release is using the release script of the tag v3.3.1-rc2.
> >
> >
> > FAQ
> >
> > =
> > How can I help test this release?
> > =
> > If you are a Spark user, you can help us test this release by taking
> > an existing Spark workload and running on this release candidate, then
> > reporting any regressions.
> >
> > If you're working in PySpark you can set up a virtual env and install
> > the current RC and see if anything important breaks. In Java/Scala,
> > you can add the staging repository to your project's resolvers and test
> > with the RC (make sure to clean up the artifact cache before/after so
> > you don't end up building with an out-of-date RC going forward).
> >
> > ===
> > What should happen to JIRA tickets still targeting 3.3.1?
> > ===
> > The current list of open tickets targeted at 3.3.1 can be found at:
> > https://issues.apache.org/jira/projects/SPARK and search for "Target
> Version/s" = 3.3.1
> >
> > Committers should look at those and triage. Extremely important bug
> > fixes, documentation, and API tweaks that impact compatibility should
> > be worked on immediately. Everything else please retarget to an
> > appropriate release.
> >
> > ==
> > But my bug isn't fixed?
> > ==
> > In order to make timely releases, we will typically not hold the
> > release unless the bug in question is a regression from the previous
> > release. That being said, if there is something which is a regression
> > that has not been correctly targeted please ping me or a committer to
> > help target the issue.
> >
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: EXT: Re: Missing string replace function

2022-10-03 Thread Vibhor Gupta
Hi Khalid,

See https://issues.apache.org/jira/browse/SPARK-31628.

It might just be syntactic sugar over the StringReplace class, but it makes 
things a little easier and neater.

There are a lot of such missing APIs in Scala and Python.

Regards,
Vibhor



From: russell.spit...@gmail.com 
Sent: Monday, October 3, 2022 12:31 AM
To: Khalid Mammadov 
Cc: dev 
Subject: EXT: Re: Missing string replace function

Ah, for that I think it makes sense to add a function, but it probably should 
not be an alias for regexp_replace, since that has very different semantics for 
certain string arguments.

Sent from my iPhone

On Oct 2, 2022, at 1:31 PM, Khalid Mammadov  wrote:


Thanks Russell for checking this out!

This is a good example of a replace that is available in Spark SQL but 
unfortunately not in the PySpark or Scala APIs.
Another alternative is the already-mentioned regexp_replace, but as developers 
looking for a replace function we tend to ignore the regex version, since it's 
not what we usually look for, and only then realise there is no built-in 
replace utility function and we have to use the regexp alternative.

So, to give an example, it is possible now to do something like this:
scala> val df = Seq("aaa zzz").toDF
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.select(expr("replace(value, 'aaa', 'bbb')")).show()
+------------------------+
|replace(value, aaa, bbb)|
+------------------------+
|                 bbb zzz|
+------------------------+

But not this:
df.select(replace('value, "aaa", "ooo")).show()
because a replace function is not available in the functions module of either 
PySpark or Scala.

And this is the output from my local prototype which would be good to see in 
the official API:
scala> df.select(replace('value, "aaa", "ooo")).show()
+----------------------------------+
|regexp_replace(value, aaa, ooo, 1)|
+----------------------------------+
|                           ooo zzz|
+----------------------------------+
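
For illustration, a minimal sketch of what such a sugar function could look like (this is an assumed implementation for discussion, not the actual prototype; Pattern.quote and Matcher.quoteReplacement are one way to keep literal-string semantics on top of regexp_replace):

import java.util.regex.{Matcher, Pattern}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, regexp_replace}

// Literal (non-regex) string replacement built on top of regexp_replace.
// Pattern.quote escapes the search string; Matcher.quoteReplacement escapes
// backslashes and '$' in the replacement so both are treated literally.
def replace(col: Column, search: String, replacement: String): Column =
  regexp_replace(col, Pattern.quote(search), Matcher.quoteReplacement(replacement))

// Usage in spark-shell:
//   val df = Seq("aaa zzz").toDF
//   df.select(replace(col("value"), "aaa", "ooo")).show()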

WDYT?


On Sun, Oct 2, 2022 at 6:24 PM Russell Spitzer  wrote:
Quick test on 3.2 confirms everything should be working as expected.

scala> spark.createDataset(Seq(("foo", "bar")))
res0: org.apache.spark.sql.Dataset[(String, String)] = [_1: string, _2: string]

scala> spark.createDataset(Seq(("foo", "bar"))).createTempView("temp")

scala> spark.sql("SELECT replace(_1, 'fo', 'bo') from temp").show
+-------------------+
|replace(_1, fo, bo)|
+-------------------+
|                boo|
+-------------------+

On Oct 2, 2022, at 12:21 PM, Russell Spitzer  wrote:

https://spark.apache.org/docs/3.3.0/api/sql/index.html#replace

This was added in Spark 2.3.0 as far as I can tell.

https://github.com/apache/spark/pull/18047

On Oct 2, 2022, at 11:19 AM, Khalid Mammadov  wrote:

Hi,

As you know, there's no string "replace" function inside pyspark.sql.functions 
for PySpark nor in org.apache.spark.sql.functions for Scala/Java, and I was 
wondering why that is so. I know there's regexp_replace instead, as well as 
na.replace, or SQL with expr.

I think it's one of the fundamental functions in a user's/developer's toolset 
and is available in almost every language. It takes time for new Spark devs to 
realise it's not there and to use the alternatives. So, I think it would be 
nice to have one.
I already have a prototype for Scala (which is just sugar over 
regexp_replace) and it works like a charm :)

Would like to know your opinion to contribute or not needed...

Thanks
Khalid





Re: [VOTE] Release Spark 3.3.1 (RC2)

2022-10-03 Thread Thomas Graves
+1. Ran our internal tests and everything looks good.

Tom Graves

On Wed, Sep 28, 2022 at 12:20 AM Yuming Wang  wrote:
>
> Please vote on releasing the following candidate as Apache Spark version 
> 3.3.1.
>
> The vote is open until 11:59pm Pacific time October 3rd and passes if a 
> majority of +1 PMC votes are cast, with a minimum of 3 +1 votes.
>
> [ ] +1 Release this package as Apache Spark 3.3.1
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see https://spark.apache.org
>
> The tag to be voted on is v3.3.1-rc2 (commit 
> 1d3b8f7cb15283a1e37ecada6d751e17f30647ce):
> https://github.com/apache/spark/tree/v3.3.1-rc2
>
> The release files, including signatures, digests, etc. can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-bin
>
> Signatures used for Spark RCs can be found in this file:
> https://dist.apache.org/repos/dist/dev/spark/KEYS
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1421
>
> The documentation corresponding to this release can be found at:
> https://dist.apache.org/repos/dist/dev/spark/v3.3.1-rc2-docs
>
> The list of bug fixes going into 3.3.1 can be found at the following URL:
> https://issues.apache.org/jira/projects/SPARK/versions/12351710
>
> This release is using the release script of the tag v3.3.1-rc2.
>
>
> FAQ
>
> =
> How can I help test this release?
> =
> If you are a Spark user, you can help us test this release by taking
> an existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> If you're working in PySpark you can set up a virtual env and install
> the current RC and see if anything important breaks. In Java/Scala,
> you can add the staging repository to your project's resolvers and test
> with the RC (make sure to clean up the artifact cache before/after so
> you don't end up building with an out-of-date RC going forward).
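
For example, a minimal build.sbt sketch of pointing a test project at the staging repository listed above (the Scala version and the provided scope are assumptions for a typical test setup, not part of the vote email):

// build.sbt — sketch only; the resolver URL is the staging repo from this vote.
ThisBuild / scalaVersion := "2.12.15"

resolvers += "Apache Spark 3.3.1 RC2 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1421"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.1" % "provided"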
>
> ===
> What should happen to JIRA tickets still targeting 3.3.1?
> ===
> The current list of open tickets targeted at 3.3.1 can be found at:
> https://issues.apache.org/jira/projects/SPARK and search for "Target 
> Version/s" = 3.3.1
>
> Committers should look at those and triage. Extremely important bug
> fixes, documentation, and API tweaks that impact compatibility should
> be worked on immediately. Everything else please retarget to an
> appropriate release.
>
> ==
> But my bug isn't fixed?
> ==
> In order to make timely releases, we will typically not hold the
> release unless the bug in question is a regression from the previous
> release. That being said, if there is something which is a regression
> that has not been correctly targeted please ping me or a committer to
> help target the issue.
>
>
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Depolying stage-level scheduling for Spark SQL

2022-10-03 Thread Tom Graves
1) In my opinion this is too complex for the average user. In this case I'm 
assuming you have some sort of optimizer that would apply it and do it 
automatically for the user? If it's just in the research stage of things, can 
you just modify Spark to do experiments?
2) I think the main thing is having the heuristics and logic for changing what 
the user requested. It sounds like you might be working on a component to do 
this, but I haven't read the paper you pointed to yet either.
Also note there are already plugin points in Spark for adding rules to the 
optimizer and to the physical plan for columnar execution; it sounds to me like 
you might be working on something that would fit better as a plugin, if it 
automatically figures out what it thinks the best configuration is. If that is 
the case, I go back to number 1 above: can you modify Spark to have the plugin 
point you need to do your experimentation and see if it makes sense?
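
For reference, a minimal sketch of the existing SQL extension point mentioned above (registered via the spark.sql.extensions configuration); the rule shown is a hypothetical no-op placeholder standing in for real optimization logic:

import org.apache.spark.sql.{SparkSession, SparkSessionExtensions}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Hypothetical rule: a no-op placeholder where such logic could hook in.
case class NoOpResourceRule(session: SparkSession) extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan
}

// Registered with: --conf spark.sql.extensions=MyResourceExtensions
class MyResourceExtensions extends (SparkSessionExtensions => Unit) {
  override def apply(extensions: SparkSessionExtensions): Unit = {
    extensions.injectOptimizerRule(session => NoOpResourceRule(session))
  }
}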
Tom

On Friday, September 30, 2022, 11:31:35 AM CDT, Chenghao Lyu wrote:

Thanks for the clarification Tom!

A bit more background on what we want to do: we have proposed a fine-grained 
(stage-level) resource optimization approach in VLDB 2022 
(https://www.vldb.org/pvldb/vol15/p3098-lyu.pdf) and would like to try it on 
Spark. Our approach can recommend the resource configuration for each stage 
automatically (using ML and our optimization framework), and we would like 
to see how to embed it in Spark. Initially, we assume there is no AQE to 
keep things simpler.

Now I see the problem as twofold (in both cases, the stage-level 
configurations will be set automatically by our algorithm, with the 
upper and lower bounds of each tunable resource given by a user):

(1) If AQE is disabled in Spark SQL, and hence the RDD DAG does not change 
after the physical plan is selected, do you think it is feasible and worthwhile 
to expose the RDDs and reuse the existing stage-level scheduling API for 
optimization?
(2) If AQE is enabled in Spark SQL, I would agree and prefer to add the 
stage-level resource optimization inside AQE. Since I am not very 
experienced with the AQE part, could you list more of the potential challenges 
it may lead to?

Thanks in advance and I would really appreciate it if you could give us more 
feedback!
Cheers,
Chenghao

On Sep 30, 2022, 4:22 PM +0200, Tom Graves wrote:

See the original SPIP for why we only support RDDs: 
https://issues.apache.org/jira/browse/SPARK-27495

The main problem is exactly what you are referring to. The RDD level is not 
exposed to the user when using the SQL or DataFrame API. This is on purpose, and 
the user shouldn't have to know anything about the underlying implementation 
using RDDs, especially with AQE and other optimizations that could change 
things. You may start out with one physical plan and AQE can change it along 
the way, so how does the user change the RDD at that point? It would be very 
difficult to expose this to the user, and I don't think it should be. I think 
we would have to come up with some other way to apply stage-level scheduling to 
SQL/DataFrame, or, as mentioned in the original issue, if AQE gets smart enough 
it would just do it for the user, but lots of factors come into play that make 
that difficult as well.
Tom

On Friday, September 30, 2022, 04:15:36 AM CDT, Chenghao Lyu wrote:

Thanks for the reply! 

To clarify, for issue 2, it could still break apart a query into multiple jobs 
without AQE — I have turned off the AQE in my posted example. 

For 1, an end user would just need to turn a knob on/off to use stage-level 
scheduling for Spark SQL — I am considering adding a component between the 
Spark SQL module and the Spark Core module to optimize the stage-level resources.

Yes, SQL is declarative. It uses a sequence of components (such as a logical 
planner, physical planner, and CBO) to get a selected physical plan. The RDDs 
(with the transformations) are generated based on the selected physical plan 
for execution. For now, we can only get the top-level RDD of the DAG of RDDs 
via `spark.sql(q1).queryExecution.toRdd`, but that is not enough to make 
stage-level scheduling decisions. The stage-level resources are profiled based 
on the RDDs. If we could expose all the RDDs instead of only the top-level RDD, 
it seems possible to apply stage-level scheduling here.
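
For reference, a minimal sketch of the existing RDD-level stage-level scheduling API being discussed here (it requires a cluster manager that supports stage-level scheduling with dynamic allocation, e.g. YARN or Kubernetes; the resource amounts are purely illustrative):

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Build a per-stage resource profile (illustrative values).
val executorReqs = new ExecutorResourceRequests().cores(4).memory("6g")
val taskReqs = new TaskResourceRequests().cpus(1)
val profile = new ResourceProfileBuilder()
  .require(executorReqs)
  .require(taskReqs)
  .build()

// Attach the profile to an RDD; stages computing this RDD request these resources.
val counts = spark.sparkContext
  .parallelize(1 to 1000, numSlices = 8)
  .map(x => (x % 10, 1))
  .reduceByKey(_ + _)
  .withResources(profile)

counts.collect()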


P.S. Let me attach the link about RDD regeneration explicitly in case it is 
not shown on the mailing-list website: 
https://stackoverflow.com/questions/73895506/how-to-avoid-rdd-regeneration-in-spark-sql
Cheers,
Chenghao

On Sep 29, 2022, 5:22 PM +0200, Herman van Hovell wrote:

I think issue 2 is caused by adaptive query execution. This will break a query 
apart into multiple jobs; each subsequent job will generate an RDD that is 
based on the previous ones.
As for 1. I am not sure how much you want to expose to an end user here. SQL is 
declarative, and it does not specify how a query should be executed. I can 
imagine that you might use different