[jira] [Updated] (SPARK-45767) Delete `TimeStampedHashMap` and its UT

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45767:
---
Labels: pull-request-available  (was: )

> Delete `TimeStampedHashMap` and its UT
> --
>
> Key: SPARK-45767
> URL: https://issues.apache.org/jira/browse/SPARK-45767
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Trivial
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45767) Delete `TimeStampedHashMap` and its UT

2023-11-01 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-45767:
---

 Summary: Delete `TimeStampedHashMap` and its UT
 Key: SPARK-45767
 URL: https://issues.apache.org/jira/browse/SPARK-45767
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Commented] (SPARK-36786) SPIP: Improving the compile time performance, by improving a couple of rules, from 24 hrs to under 8 minutes

2023-11-01 Thread Asif (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781976#comment-17781976
 ] 

Asif commented on SPARK-36786:
--

I had put this on the back burner because my changes were on 3.2, so I have to 
merge them onto the latest branch. Whatever optimizations I did on 3.2 are 
still applicable, as the drawback still exists, but the changes are going to be 
a little extensive.
If there is interest in it, I can pick it up in a few days; right now I am 
occupied with another SPIP that proposes changes for improving the performance 
of broadcast hash joins on non-partition-column joins.
 

> SPIP: Improving the compile time performance, by improving  a couple of 
> rules, from 24 hrs to under 8 minutes
> -
>
> Key: SPARK-36786
> URL: https://issues.apache.org/jira/browse/SPARK-36786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1, 3.1.2
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> The aim is to improve the compile-time performance of a query which, in 
> WorkDay's use case, takes > 24 hrs (and eventually fails), to < 8 min.
> To explain the problem, I will provide the context.
> The query plan in our production system is huge, with nested *case when* 
> expressions (the level of nesting could be > 8), where each *case when* can 
> sometimes have > 1000 branches.
> The plan could look like
> {quote}Project1
>     |
>    Filter 1
>     |
> Project2
>     |
>  Filter2
>     |
>  Project3
>     |
>  Filter3
>   |
> Join
> {quote}
> Now the optimizer has a Batch of Rules, intended to run at most 100 times.
> *Also note that the batch will continue to run until one of the conditions 
> is satisfied, i.e. either numIter == 100 || inputPlan == outputPlan 
> (idempotency is achieved).*
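
A toy Python sketch of the fixed-point loop described above (illustration only, not from the SPIP; the rules are plain callables and the plan is any comparable value, not real Catalyst classes):

{code:python}
# Toy sketch of an optimizer rule batch: keep applying the rules until the
# plan stops changing (idempotency) or the iteration cap is hit.
def run_batch(rules, plan, max_iterations=100):
    for _ in range(max_iterations):      # stop condition: numIter == 100
        new_plan = plan
        for rule in rules:
            new_plan = rule(new_plan)
        if new_plan == plan:             # stop condition: inputPlan == outputPlan
            break                        # idempotency achieved
        plan = new_plan
    return plan
{code}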
> One of the early rules is *PushDownPredicate*, followed by *CollapseProject*.
>  
> The first issue is the *PushDownPredicate* rule.
> It picks one filter at a time & pushes it to the lowest level (I understand 
> that in 3.1 it pushes through a join, while in 2.4 it stops at the Join); in 
> either case it picks 1 filter at a time, starting from the top, in each 
> iteration.
> *The above comment is no longer true in the 3.1 release, as it now combines 
> filters, so it pushes all the encountered filters in a single pass. But it 
> still materializes the filter on each push by re-aliasing.*
> So if there are, say, 50 projects interspersed with Filters, idempotency is 
> guaranteed not to be achieved until around 49 iterations. Moreover, 
> CollapseProject will also be modifying the tree on each iteration, as a 
> filter will get removed within a Project.
> Moreover, on each movement of a filter through the project tree, the filter 
> is re-aliased using a transformUp rule. transformUp is very expensive 
> compared to transformDown. As the filter keeps getting pushed down, its size 
> increases.
> To optimize this rule, 2 things are needed (see the sketch after this list):
>  # Instead of pushing one filter at a time, collect all the filters as we 
> traverse the tree in that iteration itself.
>  # Do not re-alias the filters on each push. Collect the sequence of projects 
> a filter has passed through, and when the filters have reached their resting 
> place, do the re-aliasing by processing the collected projects in a 
> down-to-up manner.
> This will result in achieving idempotency in a couple of iterations.
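
A toy sketch of the proposed change (illustration only, not from the SPIP), on a made-up linear plan model: filters are collected in one downward pass and re-aliased only once at the end, instead of being materialized at every Project:

{code:python}
# Toy model only, no Catalyst APIs: a linear plan is a top-to-bottom list of
# ("Filter", condition) and ("Project", {alias: expr}) nodes.
def push_filters_single_pass(plan):
    filters, crossed_aliases = [], []
    for kind, payload in plan:               # a single top-down traversal
        if kind == "Filter":
            filters.append(payload)          # collect; do not materialize yet
        elif kind == "Project":
            crossed_aliases.append(payload)  # remember the aliases crossed
    # re-alias each filter exactly once, after it has reached its resting place
    for aliases in crossed_aliases:
        filters = [resolve(f, aliases) for f in filters]
    return filters

def resolve(condition, aliases):
    # naive textual substitution standing in for expression re-aliasing
    for alias, expr in aliases.items():
        condition = condition.replace(alias, expr)
    return condition

print(push_filters_single_pass([
    ("Filter", "a > 0"),
    ("Project", {"a": "x + 1"}),
    ("Filter", "x < 10"),
]))   # ['x + 1 > 0', 'x < 10']
{code}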
> *How reducing the number of iterations helps performance*
> There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals 
> (... there are around 6 more such rules)* which traverse the tree using 
> transformUp, and they run unnecessarily in each iteration, even when the 
> expressions in an operator have not changed since the previous runs.
> *I have a different proposal, which I will share later, on how to keep the 
> above rules from running unnecessarily if it can be guaranteed that the 
> expression is not going to mutate in the operator.*
> The cause of our huge compilation time has been identified as the above.
>   
> h2. Q2. What problem is this proposal NOT designed to solve?
> It is not going to change any runtime profile.
> h2. Q3. How is it done today, and what are the limits of current practice?
> As mentioned above, currently PushDownPredicate pushes one filter at a 
> time, and at each Project it materializes the re-aliased filter. This 
> results in a large number of iterations to achieve idempotency, and the 
> immediate materialization of the Filter after each Project pass results in 
> unnecessary tree traversals of the filter expression, using transformUp at 
> that. And the expression tree of the filter is bound to keep increasing as 
> it is pushed down.

[jira] [Commented] (SPARK-36786) SPIP: Improving the compile time performance, by improving a couple of rules, from 24 hrs to under 8 minutes

2023-11-01 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781966#comment-17781966
 ] 

Abhinav Kumar commented on SPARK-36786:
---

[~ashahid7] [~adou...@sqli.com] where are we on this one?

> SPIP: Improving the compile time performance, by improving  a couple of 
> rules, from 24 hrs to under 8 minutes
> -
>
> Key: SPARK-36786
> URL: https://issues.apache.org/jira/browse/SPARK-36786
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.1, 3.1.2
>Reporter: Asif
>Priority: Major
>  Labels: SPIP
>
> h2. Q1. What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> The aim is to improve the compile-time performance of a query which, in 
> WorkDay's use case, takes > 24 hrs (and eventually fails), to < 8 min.
> To explain the problem, I will provide the context.
> The query plan in our production system is huge, with nested *case when* 
> expressions (the level of nesting could be > 8), where each *case when* can 
> sometimes have > 1000 branches.
> The plan could look like
> {quote}Project1
>     |
>    Filter 1
>     |
> Project2
>     |
>  Filter2
>     |
>  Project3
>     |
>  Filter3
>   |
> Join
> {quote}
> Now the optimizer has a Batch of Rules, intended to run at most 100 times.
> *Also note that the batch will continue to run until one of the conditions 
> is satisfied, i.e. either numIter == 100 || inputPlan == outputPlan 
> (idempotency is achieved).*
> One of the early rules is *PushDownPredicate*, followed by *CollapseProject*.
>  
> The first issue is the *PushDownPredicate* rule.
> It picks one filter at a time & pushes it to the lowest level (I understand 
> that in 3.1 it pushes through a join, while in 2.4 it stops at the Join); in 
> either case it picks 1 filter at a time, starting from the top, in each 
> iteration.
> *The above comment is no longer true in the 3.1 release, as it now combines 
> filters, so it pushes all the encountered filters in a single pass. But it 
> still materializes the filter on each push by re-aliasing.*
> So if there are, say, 50 projects interspersed with Filters, idempotency is 
> guaranteed not to be achieved until around 49 iterations. Moreover, 
> CollapseProject will also be modifying the tree on each iteration, as a 
> filter will get removed within a Project.
> Moreover, on each movement of a filter through the project tree, the filter 
> is re-aliased using a transformUp rule. transformUp is very expensive 
> compared to transformDown. As the filter keeps getting pushed down, its size 
> increases.
> To optimize this rule, 2 things are needed:
>  # Instead of pushing one filter at a time, collect all the filters as we 
> traverse the tree in that iteration itself.
>  # Do not re-alias the filters on each push. Collect the sequence of projects 
> a filter has passed through, and when the filters have reached their resting 
> place, do the re-aliasing by processing the collected projects in a 
> down-to-up manner.
> This will result in achieving idempotency in a couple of iterations.
> *How reducing the number of iterations helps performance*
> There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals 
> (... there are around 6 more such rules)* which traverse the tree using 
> transformUp, and they run unnecessarily in each iteration, even when the 
> expressions in an operator have not changed since the previous runs.
> *I have a different proposal, which I will share later, on how to keep the 
> above rules from running unnecessarily if it can be guaranteed that the 
> expression is not going to mutate in the operator.*
> The cause of our huge compilation time has been identified as the above.
>   
> h2. Q2. What problem is this proposal NOT designed to solve?
> It is not going to change any runtime profile.
> h2. Q3. How is it done today, and what are the limits of current practice?
> As mentioned above, currently PushDownPredicate pushes one filter at a 
> time, and at each Project it materializes the re-aliased filter. This 
> results in a large number of iterations to achieve idempotency, and the 
> immediate materialization of the Filter after each Project pass results in 
> unnecessary tree traversals of the filter expression, using transformUp at 
> that. And the expression tree of the filter is bound to keep increasing as 
> it is pushed down.
> h2. Q4. What is new in your approach and why do you think it will be 
> successful?
> In the new approach we push all the filters down in a single pass. And do not 
> materialize filters as it pass through Project. Instead keep collecting 
> projects in sequential order and materialize the final filter once i

[jira] [Commented] (SPARK-33164) SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to DataSet.dropColumn(someColumn)

2023-11-01 Thread Abhinav Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781959#comment-17781959
 ] 

Abhinav Kumar commented on SPARK-33164:
---

I see value in some use cases like [~arnaud.nauwynck] mentions. But "SELECT *" 
has very well documented risks, leading to maintainability issues. Should we 
still be trying to implement this?

> SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to 
> DataSet.dropColumn(someColumn)
> 
>
> Key: SPARK-33164
> URL: https://issues.apache.org/jira/browse/SPARK-33164
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1
>Reporter: Arnaud Nauwynck
>Priority: Minor
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> I would like to have the extended SQL syntax "SELECT * EXCEPT someColumn FROM 
> .." 
> to be able to select all columns except some in a SELECT clause.
> It would be similar to the SQL syntax of some databases, like Google BigQuery 
> or PostgreSQL.
> https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax
> Google the question "select * EXCEPT one column", and you will see that many 
> developers have the same problem.
> example posts: 
> https://blog.jooq.org/2018/05/14/selecting-all-columns-except-one-in-postgresql/
> https://www.thetopsites.net/article/53001825.shtml
> There are several typical examples where it is very helpful:
> use-case 1:
>  you add a "count ( * )  countCol" column, and then filter on it using for 
> example "having countCol = 1" 
>   ... and then you want to select all columns EXCEPT this dummy column, which 
> is always "1"
> {noformat}
>   select * (EXCEPT countCol)
>   from (  
>  select count(*) countCol, * 
>from MyTable 
>where ... 
>group by ... having countCol = 1
>   )
> {noformat}
>
> use-case 2:
>  same with an analytical function "partition over(...) rankCol  ... where 
> rankCol=1"
>  For example, to get the latest row before a given time in a time-series 
> table.
>  These are the "Time-Travel" queries addressed by frameworks like "DeltaLake"
> {noformat}
>  CREATE table t_updates (update_time timestamp, id string, col1 type1, col2 
> type2, ... col42)
>  pastTime=..
>  SELECT * (except rankCol)
>  FROM (
>SELECT *,
>   RANK() OVER (PARTITION BY id ORDER BY update_time) rankCol   
>FROM t_updates
>where update_time < pastTime
>  ) WHERE rankCol = 1
>  
> {noformat}
>  
> use-case 3:
>  copy some data from table "t" to corresponding table "t_snapshot", and back 
> to "t"
> {noformat}
>CREATE TABLE t (col1 type1, col2 type2, col3 type3, ... col42 type42) ...
>
>/* create corresponding table: (snap_id string, col1 type1, col2 type2, 
> col3 type3, ... col42 type42) */
>CREATE TABLE t_snapshot
>AS SELECT '' as snap_id, * FROM t WHERE 1=2
>/* insert data from t to some snapshot */
>INSERT INTO t_snapshot
>SELECT 'snap1' as snap_id, * from t 
>
>/* select some data from snapshot table (without snap_id column) .. */   
>SELECT * (EXCEPT snap_id) FROM t_snapshot where snap_id='snap1' 
>
> {noformat}
>
>
> *Q2.* What problem is this proposal NOT designed to solve?
> It is only SQL syntactic sugar. 
> It does not change the SQL execution plan or anything complex.
> *Q3.* How is it done today, and what are the limits of current practice?
>  
> Today, you can either use the DataSet API, with .dropColumn(someColumn),
> or you need to manually HARD-CODE all columns in your SQL. Therefore your 
> code is NOT generic (or you are using a SQL meta-code generator?).
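
For reference, a minimal PySpark sketch of the existing DataFrame-API workaround described above (assumes a running SparkSession `spark`; "MyTable", "key" and "countCol" are the hypothetical names from use-case 1):

{code:python}
# Today's workaround: compute the helper column in SQL, then drop it with the
# DataFrame API instead of hard-coding every remaining column.
df = spark.sql("""
    SELECT *
    FROM (SELECT key, COUNT(*) AS countCol FROM MyTable GROUP BY key)
    WHERE countCol = 1
""")
result = df.drop("countCol")   # DataFrame equivalent of "* EXCEPT countCol"
result.show()
{code}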
> *Q4.* What is new in your approach and why do you think it will be successful?
> It is NOT new... it is already a proven solution from DataSet.dropColumn(), 
> PostgreSQL, and BigQuery.
>  
> *Q5.* Who cares? If you are successful, what difference will it make?
> It simplifies the life of developers, DBAs, data analysts, and end users.
> It simplifies the development of SQL code, in a more generic way, for many tasks.
> *Q6.* What are the risks?
> There is VERY limited risk to Spark SQL, because the feature already exists in 
> the DataSet API.
> It is an extension of SQL syntax, so the risk is annoying some IDE SQL 
> editors with a new SQL syntax. 
> *Q7.* How long will it take?
> No idea. I guess someone experienced in the Spark SQL internals might do it 
> relatively "quickly".
> It is a kind of syntactic sugar to add as an ANTLR grammar rule and then 
> transform into the DataSet API.
> *Q8.* What are the mid-term and final “exams” to check for success?
> The 3 standard use-cases given in question Q1.




[jira] [Resolved] (SPARK-45761) Upgrade `Volcano` to 1.8.1

2023-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45761.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43624
[https://github.com/apache/spark/pull/43624]

> Upgrade `Volcano` to 1.8.1
> --
>
> Key: SPARK-45761
> URL: https://issues.apache.org/jira/browse/SPARK-45761
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Kubernetes, Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> To bring the latest features and bug fixes, in addition to the test coverage, 
> for Volcano scheduler 1.8.1.
> [https://github.com/volcano-sh/volcano/releases/tag/v1.8.1]
>  
> [https://github.com/volcano-sh/volcano/pull/3101] (volcano adapt k8s v1.27, 
> volcano-sh/volcano#3101)






[jira] [Updated] (SPARK-44419) Support to extract partial filters of datasource v2 table and push them down

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44419:
---
Labels: pull-request-available  (was: )

> Support to extract partial filters of datasource v2 table and push them down
> 
>
> Key: SPARK-44419
> URL: https://issues.apache.org/jira/browse/SPARK-44419
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0, 3.4.0
>Reporter: caican
>Priority: Major
>  Labels: pull-request-available
>
>  
> Run the following SQL: the date predicate in the where clause is not 
> pushed down, which causes a full table scan.
>  
> {code:java}
> SELECT
> id,
> data,
> date
> FROM
> testcat.db.table
> where
> (date = 20221110 and udfStrLen(data) = 8)
> or
> (date = 2022 and udfStrLen(data) = 8)  {code}






[jira] [Updated] (SPARK-44426) optimize adaptive skew join for ExistenceJoin

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44426:
---
Labels: pull-request-available  (was: )

> optimize adaptive skew join for ExistenceJoin
> -
>
> Key: SPARK-44426
> URL: https://issues.apache.org/jira/browse/SPARK-44426
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0, 3.3.0, 3.4.0
>Reporter: caican
>Priority: Major
>  Labels: pull-request-available
>
> For this query, the InSubQuery would be cast to an ExistenceJoin, and currently 
> ExistenceJoin does not support automatic data-skew handling for the left table.
> {code:java}
> SELECT * FROM skewData1
> where
> (key1 in (select key2 from skewData2)
> or value1 in (select value2 from skewData2)){code}






[jira] [Assigned] (SPARK-45680) ReleaseSession to close Spark Connect session

2023-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-45680:


Assignee: Juliusz Sompolski

> ReleaseSession to close Spark Connect session
> -
>
> Key: SPARK-45680
> URL: https://issues.apache.org/jira/browse/SPARK-45680
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-45680) ReleaseSession to close Spark Connect session

2023-11-01 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45680.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43546
[https://github.com/apache/spark/pull/43546]

> ReleaseSession to close Spark Connect session
> -
>
> Key: SPARK-45680
> URL: https://issues.apache.org/jira/browse/SPARK-45680
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Juliusz Sompolski
>Assignee: Juliusz Sompolski
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-45766) ObjectSerializerPruning fails to align null types in custom serializer 'If' expressions.

2023-11-01 Thread Piotr Szul (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Szul updated SPARK-45766:
---
Description: 
We have a custom encoder for union-like objects.

Our custom serializer uses an expression like:

{{If(IsNull(If(.)), Literal(null), NamedStruct()))}}

Using this encoder in a SQL expression that applies the 
`org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning` rule 
results in the exception below.

It's because the expression is transformed by the `PushFoldableIntoBranches` 
rule prior to `ObjectSerializerPruning`, which changes the expression to:

{{If(If(.), Literal(null), NamedStruct()))}}

which no longer matches the expressions for which null type alignment is 
performed.

See the attached Scala REPL code for a demonstration of this issue.

 

The exception:

 

java.lang.IllegalArgumentException: requirement failed: All input types must be 
the same except nullable, containsNull, valueContainsNull flags. The expression 
is: if (if (assertnotnull(input[0, UnionType, true]).hasValue) 
isnull(assertnotnull(input[0, UnionType, true]).value) else true) null else 
named_struct(given, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
knownnotnull(knownnotnull(assertnotnull(input[0, UnionType, 
true])).value).given, true, false, true)). The input types found are

StructType(StructField(given,StringType,true),StructField(family,StringType,true))

StructType(StructField(given,StringType,true)).

  at scala.Predef$.require(Predef.scala:281)

  at 
org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1304)

  at 
org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1297)

  at 
org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:41)

  at 
org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1309)

  at 
org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1308)

  at 
org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:41)

  at 
org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:41)

  at 
org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1313)

  at 
org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1313)

  at 
org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:41)

  at 
org.apache.spark.sql.catalyst.expressions.Alias.dataType(namedExpressions.scala:166)

  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.pruneSerializer(objects.scala:209)

  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.$anonfun$applyOrElse$3(objects.scala:230)

  at scala.collection.immutable.List.map(List.scala:293)

  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.applyOrElse(objects.scala:229)

  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.applyOrElse(objects.scala:217)

  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)

  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)

  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)

  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)

  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)

  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)

  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)

  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)

  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformWithPruning(TreeNode.scala:427)

  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:217)

  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:125)

  at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:222)

 

  was:
We have a custom encoder fo

[jira] [Created] (SPARK-45766) ObjectSerializerPruning fails to align null types in custom serializer 'If' expressions.

2023-11-01 Thread Piotr Szul (Jira)
Piotr Szul created SPARK-45766:
--

 Summary: ObjectSerializerPruning fails to align null types in 
custom serializer 'If' expressions.
 Key: SPARK-45766
 URL: https://issues.apache.org/jira/browse/SPARK-45766
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.1, 3.3.3
Reporter: Piotr Szul
 Attachments: prunning_bug.scala

We have a custom encoder for union-like objects.

Our custom serializer uses an expression like:

{{If(IsNull(If(.)), Literal(null), NamedStruct()))}}

Using this encoder in a SQL expression that applies the 
`org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning` rule 
results in the exception below.

It's because the expression is transformed by the `PushFoldableIntoBranches` 
rule prior to `ObjectSerializerPruning`, which changes the expression to:

{{If(If(.), Literal(null), NamedStruct()))}}

which no longer matches the expression for which null type alignment is 
performed.

See the attached Scala REPL code for a demonstration of this issue.

 

The exception:

 

java.lang.IllegalArgumentException: requirement failed: All input types must be 
the same except nullable, containsNull, valueContainsNull flags. The expression 
is: if (if (assertnotnull(input[0, UnionType, true]).hasValue) 
isnull(assertnotnull(input[0, UnionType, true]).value) else true) null else 
named_struct(given, staticinvoke(class 
org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
knownnotnull(knownnotnull(assertnotnull(input[0, UnionType, 
true])).value).given, true, false, true)). The input types found are

 
StructType(StructField(given,StringType,true),StructField(family,StringType,true))

 StructType(StructField(given,StringType,true)).

  at scala.Predef$.require(Predef.scala:281)

  at 
org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1304)

  at 
org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1297)

  at 
org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:41)

  at 
org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1309)

  at 
org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1308)

  at 
org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:41)

  at 
org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:41)

  at 
org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1313)

  at 
org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1313)

  at 
org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:41)

  at 
org.apache.spark.sql.catalyst.expressions.Alias.dataType(namedExpressions.scala:166)

  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.pruneSerializer(objects.scala:209)

  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.$anonfun$applyOrElse$3(objects.scala:230)

  at scala.collection.immutable.List.map(List.scala:293)

  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.applyOrElse(objects.scala:229)

  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.applyOrElse(objects.scala:217)

  at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)

  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)

  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)

  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)

  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)

  at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)

  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)

  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)

  at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformWithPruning(TreeNode.scala:427)

  at 
org.apache.spark.sql.catalyst.optimizer.ObjectSeria

[jira] [Updated] (SPARK-45766) ObjectSerializerPruning fails to align null types in custom serializer 'If' expressions.

2023-11-01 Thread Piotr Szul (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Piotr Szul updated SPARK-45766:
---
Attachment: prunning_bug.scala

> ObjectSerializerPruning fails to align null types in custom serializer 'If' 
> expressions.
> 
>
> Key: SPARK-45766
> URL: https://issues.apache.org/jira/browse/SPARK-45766
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: Piotr Szul
>Priority: Minor
> Attachments: prunning_bug.scala
>
>
> We have a custom encoder for union-like objects. 
> Our custom serializer uses an expression like:
> {{If(IsNull(If(.)), Literal(null), NamedStruct()))}}
> Using this encoder in a SQL expression that applies the 
> `org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning` rule 
> results in the exception below.
> It's because the expression is transformed by the `PushFoldableIntoBranches` 
> rule prior to `ObjectSerializerPruning`, which changes the expression to:
> {{If(If(.), Literal(null), NamedStruct()))}}
> which no longer matches the expression for which null type alignment is 
> performed.
> See the attached Scala REPL code for a demonstration of this issue.
>  
> The exception:
>  
> java.lang.IllegalArgumentException: requirement failed: All input types must 
> be the same except nullable, containsNull, valueContainsNull flags. The 
> expression is: if (if (assertnotnull(input[0, UnionType, true]).hasValue) 
> isnull(assertnotnull(input[0, UnionType, true]).value) else true) null else 
> named_struct(given, staticinvoke(class 
> org.apache.spark.unsafe.types.UTF8String, StringType, fromString, 
> knownnotnull(knownnotnull(assertnotnull(input[0, UnionType, 
> true])).value).given, true, false, true)). The input types found are
>  
> StructType(StructField(given,StringType,true),StructField(family,StringType,true))
>  StructType(StructField(given,StringType,true)).
>   at scala.Predef$.require(Predef.scala:281)
>   at 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1304)
>   at 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1297)
>   at 
> org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1309)
>   at 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1308)
>   at 
> org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1313)
>   at 
> org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1313)
>   at 
> org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:41)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.dataType(namedExpressions.scala:166)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.pruneSerializer(objects.scala:209)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.$anonfun$applyOrElse$3(objects.scala:230)
>   at scala.collection.immutable.List.map(List.scala:293)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.applyOrElse(objects.scala:229)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.applyOrElse(objects.scala:217)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.sc

[jira] [Updated] (SPARK-45765) Improve error messages when loading multiple paths in PySpark

2023-11-01 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45765:
-
Description: 
Currently, the error message is super confusing when a user tries to load 
multiple paths incorrectly.

For example, `spark.read.format("json").load("p1", "p2")` will have this error:

An error occurred while calling o36.load.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed 
to find the data source: p2. Please find packages at 
`[https://spark.apache.org/third-party-projects.html]`. SQLSTATE: 42K02

This can be confusing, but it's a valid error message, as "p2" will be considered 
the `format` field of the load() method. 

  was:
Currently, the error message is super confusing when a user tries to load 
multiple paths incorrectly.

For example, `spark.read.format("json").load("p1", "p2")` will have this error:

An error occurred while calling o36.load.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed 
to find the data source: p2. Please find packages at 
`https://spark.apache.org/third-party-projects.html`. SQLSTATE: 42K02

We should fix this.


> Improve error messages when loading multiple paths in PySpark
> -
>
> Key: SPARK-45765
> URL: https://issues.apache.org/jira/browse/SPARK-45765
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, the error message is super confusing when a user tries to load 
> multiple paths incorrectly.
> For example, `spark.read.format("json").load("p1", "p2")` will have this 
> error:
> An error occurred while calling o36.load.
> : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] 
> Failed to find the data source: p2. Please find packages at 
> `[https://spark.apache.org/third-party-projects.html]`. SQLSTATE: 42K02
> This can be confusing, but it's a valid error message, as "p2" will be 
> considered the `format` field of the load() method. 
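
Not part of the original report: a small sketch of the intended usage. In PySpark, `DataFrameReader.load` accepts a list of paths, so multiple paths should be passed as one list rather than as separate positional arguments (assumes a SparkSession `spark` and JSON data at paths p1 and p2):

{code:python}
# Passing a list loads both paths; passing them positionally makes the second
# argument be treated as the `format` parameter, producing the error above.
df = spark.read.format("json").load(["p1", "p2"])
# or, equivalently:
df = spark.read.json(["p1", "p2"])
{code}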






[jira] [Resolved] (SPARK-45765) Improve error messages when loading multiple paths in PySpark

2023-11-01 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang resolved SPARK-45765.
--
Resolution: Invalid

> Improve error messages when loading multiple paths in PySpark
> -
>
> Key: SPARK-45765
> URL: https://issues.apache.org/jira/browse/SPARK-45765
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, the error message is super confusing when a user tries to load 
> multiple paths incorrectly.
> For example, `spark.read.format("json").load("p1", "p2")` will have this 
> error:
> An error occurred while calling o36.load.
> : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] 
> Failed to find the data source: p2. Please find packages at 
> `https://spark.apache.org/third-party-projects.html`. SQLSTATE: 42K02
> We should fix this.






[jira] [Created] (SPARK-45765) Improve error messages when loading multiple paths in PySpark

2023-11-01 Thread Allison Wang (Jira)
Allison Wang created SPARK-45765:


 Summary: Improve error messages when loading multiple paths in 
PySpark
 Key: SPARK-45765
 URL: https://issues.apache.org/jira/browse/SPARK-45765
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Currently, the error message is super confusing when a user tries to load 
multiple paths incorrectly.

For example, `spark.read.format("json").load("p1", "p2")` will have this error:

An error occurred while calling o36.load.
: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed 
to find the data source: p2. Please find packages at 
`https://spark.apache.org/third-party-projects.html`. SQLSTATE: 42K02

We should fix this.






[jira] [Assigned] (SPARK-45756) Revisit and Improve Spark Standalone Cluster

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45756:
-

Assignee: Dongjoon Hyun

> Revisit and Improve Spark Standalone Cluster
> 
>
> Key: SPARK-45756
> URL: https://issues.apache.org/jira/browse/SPARK-45756
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
>







[jira] [Resolved] (SPARK-45756) Revisit and Improve Spark Standalone Cluster

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45756.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

> Revisit and Improve Spark Standalone Cluster
> 
>
> Key: SPARK-45756
> URL: https://issues.apache.org/jira/browse/SPARK-45756
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-45639) Support loading Python data sources in DataFrameReader

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45639:
---
Labels: pull-request-available  (was: )

> Support loading Python data sources in DataFrameReader
> --
>
> Key: SPARK-45639
> URL: https://issues.apache.org/jira/browse/SPARK-45639
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Allow users to read from a Python data source using 
> `spark.read.format(...).load()` in PySpark. For example, users can extend the 
> DataSource and DataSourceReader classes to create their own Python data 
> source reader and use it in PySpark:
> {code:java}
> class MyReader(DataSourceReader):
>     def read(self, partition):
>         yield (0, 1)
> class MyDataSource(DataSource):
>     def schema(self):
>         return "id INT, value INT"
>
>     def reader(self, schema):
>         return MyReader()
> df = spark.read.format("MyDataSource").load()
> df.show()
> +---+-+
> | id|value|
> +---+-+
> |  0|    1|
> +---+-+
> {code}
>  






[jira] [Commented] (SPARK-43972) Tests never succeed on pyspark 3.4.0 (work OK on pyspark 3.3.2)

2023-11-01 Thread Jamie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781917#comment-17781917
 ] 

Jamie commented on SPARK-43972:
---

This issue appears to be fixed in pyspark 3.5.0

 

Here's a run of the same tests: 
[https://github.com/jamiekt/jstark/actions/runs/6725570531/job/18280243390] 
that were [run on pyspark 
3.5.0|https://github.com/jamiekt/jstark/actions/runs/6725570531/job/18280243390#step:6:53].

> Tests never succeed on pyspark 3.4.0 (work OK on pyspark 3.3.2)
> ---
>
> Key: SPARK-43972
> URL: https://issues.apache.org/jira/browse/SPARK-43972
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.4.0
> Environment: Sorry, not sure what I'm supposed to put in this section.
>Reporter: Jamie
>Priority: Major
>
> I have a project that uses pyspark. The tests have always run fine on pyspark 
> versions prior to pyspark 3.4.0 but now fail on that version (which was 
> released on 2023-04-13).
> My project is configured to use the latest available version of pyspark:
> {code:json}
> dependencies = [
>   "pyspark",
>   "faker"
> ]
> {code}
> [https://github.com/jamiekt/jstark/blob/c1629cee4e4b8fb0b4471f6fc2941f1b0a99a4bf/pyproject.toml#L26-L29]
> The tests are run using GitHub Actions. An example of the failing tests is at 
> [https://github.com/jamiekt/jstark/actions/runs/4977164046]; you can see 
> there that the tests are run on various combinations of OS & Python 
> version, and they are all cancelled after running for over 5 hours.
> If I [pin the version of pyspark to 
> 3.3.2|https://github.com/jamiekt/jstark/commit/5fd7115d3719a7d6ef2547e8e35feb3ed76ee99f]
>  then the tests all succeed in ~10 minutes, see 
> [https://github.com/jamiekt/jstark/actions/runs/5061332947] for such a 
> successful run.
> 
> This can be reproduced by cloning the repository and running only one test. 
> The project uses hatch for managing environments and dependencies so you 
> would need that installed ({{{}pipx install hatch{}}}/{{{}brew install 
> hatch{}}}). I have reproduced the problem on python3.10.
> Reproduce the problem by running these commands:
> {code:bash}
> # force use of python3.10
> export HATCH_PYTHON=/path/to/python3.10
> git clone https://github.com/jamiekt/jstark.git
> cd jstark
> # following command will create a virtualenv & install all dependencies, 
> including pyspark 3.4.0
> hatch run pytest -k test_basketweeks_by_product_and_customer
> {code}
> On my machine this never completes. I need to CTRL+C to crash out of it. I 
> consider this to be equivalent behaviour to the tests that fail in the GitHub 
> Actions pipeline after 6 hours.
> Now let's checkout the branch which pins pyspark to 3.3.2 and run the same 
> thing (the hatch environment will get rebuilt with pyspark 3.3.2)
> {code:bash}
> git checkout try-pyspark3-3-2
> hatch run pytest -k test_basketweeks_by_product_and_customer
> {code}
> this time it succeeds in ~31seconds:
> {code:bash}
> ➜  hatch run pytest -k test_basketweeks_by_product_and_customer
> ==
>  test session starts 
> ==
> platform darwin -- Python 3.10.10, pytest-7.3.1, pluggy-1.0.0
> rootdir: /private/tmp/jstark
> plugins: Faker-18.9.0, cov-4.0.0
> collected 79 items / 78 deselected / 1 selected
> tests/test_grocery_retailer_feature_generator.py .
>   
>   
>[100%]
> = 1 passed, 
> 78 deselected in 31.30s =
> {code}
> That particular test constructs a very, very complex pyspark dataframe, which I 
> suspect might be contributing to the problem; however, the issue here is that 
> it works on pyspark 3.3.2 but not on pyspark 3.4.0.






[jira] [Resolved] (SPARK-45763) Improve `MasterPage` to show `Resource` column only when it exists

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45763.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43628
[https://github.com/apache/spark/pull/43628]

> Improve `MasterPage` to show `Resource` column only when it exists
> --
>
> Key: SPARK-45763
> URL: https://issues.apache.org/jira/browse/SPARK-45763
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-45763) Improve `MasterPage` to show `Resource` column only when it exists

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45763:
-

Assignee: Dongjoon Hyun

> Improve `MasterPage` to show `Resource` column only when it exists
> --
>
> Key: SPARK-45763
> URL: https://issues.apache.org/jira/browse/SPARK-45763
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-45764) Make code block copyable

2023-11-01 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang updated SPARK-45764:
-
Description: 
We should consider adding a copy button next to the pyspark code blocks.

For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]

  was:
We should consider 

For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]


> Make code block copyable
> 
>
> Key: SPARK-45764
> URL: https://issues.apache.org/jira/browse/SPARK-45764
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> We should consider adding a copy button next to the pyspark code blocks.
> For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]
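
As a reference (not from the issue), enabling that plugin is typically a small change to the Sphinx `conf.py`, assuming the `sphinx-copybutton` package is installed:

{code:python}
# docs conf.py -- sketch only; the real extension list in Spark's docs differs.
extensions = [
    "sphinx_copybutton",   # adds a copy button to every code block
]
# Optionally strip interpreter prompts such as ">>> " from the copied text.
copybutton_prompt_text = ">>> "
{code}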






[jira] [Commented] (SPARK-45764) Make code block copyable

2023-11-01 Thread Allison Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781887#comment-17781887
 ] 

Allison Wang commented on SPARK-45764:
--

cc [~podongfeng] WDYT?

> Make code block copyable
> 
>
> Key: SPARK-45764
> URL: https://issues.apache.org/jira/browse/SPARK-45764
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>
> We should consider adding a copy button next to the pyspark code blocks.
> For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]






[jira] [Created] (SPARK-45764) Make code block copyable

2023-11-01 Thread Allison Wang (Jira)
Allison Wang created SPARK-45764:


 Summary: Make code block copyable
 Key: SPARK-45764
 URL: https://issues.apache.org/jira/browse/SPARK-45764
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


We should consider 

For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/]






[jira] [Updated] (SPARK-45731) Update partition statistics with ANALYZE TABLE command

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45731:
---
Labels: pull-request-available  (was: )

> Update partition statistics with ANALYZE TABLE command
> --
>
> Key: SPARK-45731
> URL: https://issues.apache.org/jira/browse/SPARK-45731
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Chao Sun
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the {{ANALYZE TABLE}} command only updates table-level stats but not 
> partition stats, even though it can be applied to both non-partitioned and 
> partitioned tables. It seems to make sense for it to update partition stats as 
> well.
> Note users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, 
> but the syntax is more verbose as they need to specify all the partition 
> columns. 
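
For illustration (not from the issue), the two command forms discussed above, wrapped in PySpark calls against a hypothetical table `t` partitioned by `ds` (assumes a SparkSession `spark`):

{code:python}
# The command this issue proposes to extend; today it updates table-level stats only.
spark.sql("ANALYZE TABLE t COMPUTE STATISTICS")

# The current, more verbose way to also refresh partition-level stats.
spark.sql("ANALYZE TABLE t PARTITION(ds) COMPUTE STATISTICS")
{code}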






[jira] [Resolved] (SPARK-45754) Support `spark.deploy.appIdPattern`

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45754.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43616
[https://github.com/apache/spark/pull/43616]

> Support `spark.deploy.appIdPattern`
> ---
>
> Key: SPARK-45754
> URL: https://issues.apache.org/jira/browse/SPARK-45754
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-45763) Improve `MasterPage` to show `Resource` column only when it exists

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45763:
---
Labels: pull-request-available  (was: )

> Improve `MasterPage` to show `Resource` column only when it exists
> --
>
> Key: SPARK-45763
> URL: https://issues.apache.org/jira/browse/SPARK-45763
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45763) Improve `MasterPage` to show `Resource` column only when it exists

2023-11-01 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45763:
-

 Summary: Improve `MasterPage` to show `Resource` column only when 
it exists
 Key: SPARK-45763
 URL: https://issues.apache.org/jira/browse/SPARK-45763
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun









[jira] [Commented] (SPARK-38473) Use error classes in org.apache.spark.scheduler

2023-11-01 Thread Hannah Amundson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781827#comment-17781827
 ] 

Hannah Amundson commented on SPARK-38473:
-

I am working on this ticket now!

> Use error classes in org.apache.spark.scheduler
> ---
>
> Key: SPARK-38473
> URL: https://issues.apache.org/jira/browse/SPARK-38473
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45761) Upgrade `Volcano` to 1.8.1

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45761:
-

Assignee: Dongjoon Hyun

> Upgrade `Volcano` to 1.8.1
> --
>
> Key: SPARK-45761
> URL: https://issues.apache.org/jira/browse/SPARK-45761
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Kubernetes, Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> To bring the latest features and bug fixes, in addition to the test coverage, 
> for Volcano scheduler 1.8.1.
> [https://github.com/volcano-sh/volcano/releases/tag/v1.8.1]
> [volcano adapt k8s v1.27 (volcano-sh/volcano#3101)|https://github.com/volcano-sh/volcano/pull/3101]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45761) Upgrade `Volcano` to 1.8.1

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45761:
--
Description: 
To bring the latest features and bug fixes, in addition to the test coverage, for 
Volcano scheduler 1.8.1.

[https://github.com/volcano-sh/volcano/releases/tag/v1.8.1]

[volcano adapt k8s v1.27 (volcano-sh/volcano#3101)|https://github.com/volcano-sh/volcano/pull/3101]

> Upgrade `Volcano` to 1.8.1
> --
>
> Key: SPARK-45761
> URL: https://issues.apache.org/jira/browse/SPARK-45761
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Kubernetes, Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> To bring the latest features and bug fixes, in addition to the test coverage, 
> for Volcano scheduler 1.8.1.
> [https://github.com/volcano-sh/volcano/releases/tag/v1.8.1]
> [volcano adapt k8s v1.27 (volcano-sh/volcano#3101)|https://github.com/volcano-sh/volcano/pull/3101]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45762) Shuffle managers defined in user jars are not available for some launch modes

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45762:
---
Labels: pull-request-available  (was: )

> Shuffle managers defined in user jars are not available for some launch modes
> -
>
> Key: SPARK-45762
> URL: https://issues.apache.org/jira/browse/SPARK-45762
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Alessandro Bellina
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Starting a spark job in standalone mode with a custom `ShuffleManager` 
> provided in a jar via `--jars` does not work. This can also be experienced in 
> local-cluster mode.
> The approach that works consistently is to copy the jar containing the custom 
> `ShuffleManager` to a specific location in each node then add it to 
> `spark.driver.extraClassPath` and `spark.executor.extraClassPath`, but we 
> would like to move away from setting extra configurations unnecessarily.
> Example:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --jars user-code.jar
> {code}
> This yields `java.lang.ClassNotFoundException` in the executors.
> {code:java}
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1915)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:436)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:425)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.spark.examples.TestShuffleManager
>   at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
>   at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>   at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
>   at java.base/java.lang.Class.forName0(Native Method)
>   at java.base/java.lang.Class.forName(Class.java:467)
>   at 
> org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41)
>   at 
> org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:95)
>   at 
> org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:2574)
>   at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
>   at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:255)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:487)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
>   at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
>   at 
> java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>   ... 4 more
> {code}
> We can change our command to use `extraClassPath`:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --conf spark.driver.extraClassPath=user-code.jar \
>  --conf spark.executor.extraClassPath=user-code.jar
> {code}
> Success after adding the jar to `extraClassPath`:
> {code:java}
> 23/10/26 12:58:26 INFO TransportClientFactory: Successfully created 
> connection to localhost/127.0.0.1:33053 after 7 ms (0 ms spent in bootstraps)
> 23/10/26 12:58:26 WARN TestShuffleManager: Instantiated TestShuffleManager!!
> 23/10/26 12:58:26 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-cb101b05-c4b7-4ba9-8b3d-5b23baa7cb46/executor-5d5335dd-c116-4211-9691-87d8566017fd/blockmgr-2fcb1ab2-d886--8c7f-9dca2c880c2c
> {code}
> We would like to change startup order such that the original command 
> succeeds, without specifying `extraClassPath`:
> {code:java}
> $SPARK_HOME/bin/spark-shell \
>   --master spark://127.0.0.1:7077 \
>   --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
>   --jars user-code.jar
> {code}
> Proposed changes:
> Refactor code so we initialize the `ShuffleManager` later, after jars have 
> been localized. This is especially necessary in the executor, where we would 
> need to move this initialization until after the `replClassLoader` is updated 
> with jars passed in `--jars`.
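
For context, a minimal sketch of the kind of user-supplied shuffle manager involved here (purely illustrative, not the actual `TestShuffleManager` from the report; it assumes the class sits under the `org.apache.spark` namespace so it can extend the package-private sort-based implementation):

{code:scala}
package org.apache.spark.examples

import org.apache.spark.SparkConf
import org.apache.spark.shuffle.sort.SortShuffleManager
import org.slf4j.LoggerFactory

// Delegates all shuffle behaviour to the built-in sort-based manager and only
// logs on construction, so executor logs show whether the class was loaded.
class TestShuffleManager(conf: SparkConf) extends SortShuffleManager(conf) {
  private val log = LoggerFactory.getLogger(classOf[TestShuffleManager])
  log.warn("Instantiated TestShuffleManager!!")
}
{code}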

[jira] [Updated] (SPARK-45761) Upgrade `Volcano` to 1.8.1

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45761:
--
Parent: SPARK-44111
Issue Type: Sub-task  (was: Bug)

> Upgrade `Volcano` to 1.8.1
> --
>
> Key: SPARK-45761
> URL: https://issues.apache.org/jira/browse/SPARK-45761
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Kubernetes, Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45762) Shuffle managers defined in user jars are not available for some launch modes

2023-11-01 Thread Alessandro Bellina (Jira)
Alessandro Bellina created SPARK-45762:
--

 Summary: Shuffle managers defined in user jars are not available 
for some launch modes
 Key: SPARK-45762
 URL: https://issues.apache.org/jira/browse/SPARK-45762
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.5.0
Reporter: Alessandro Bellina
 Fix For: 4.0.0


Starting a spark job in standalone mode with a custom `ShuffleManager` provided 
in a jar via `--jars` does not work. This can also be experienced in 
local-cluster mode.

The approach that works consistently is to copy the jar containing the custom 
`ShuffleManager` to a specific location in each node then add it to 
`spark.driver.extraClassPath` and `spark.executor.extraClassPath`, but we would 
like to move away from setting extra configurations unnecessarily.

Example:
{code:java}
$SPARK_HOME/bin/spark-shell \
  --master spark://127.0.0.1:7077 \
  --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
  --jars user-code.jar
{code}
This yields `java.lang.ClassNotFoundException` in the executors.
{code:java}
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
  at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1915)
  at 
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
  at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:436)
  at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:425)
  at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.examples.TestShuffleManager
  at 
java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
  at 
java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
  at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520)
  at java.base/java.lang.Class.forName0(Native Method)
  at java.base/java.lang.Class.forName(Class.java:467)
  at 
org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41)
  at 
org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36)
  at org.apache.spark.util.Utils$.classForName(Utils.scala:95)
  at 
org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:2574)
  at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366)
  at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:255)
  at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:487)
  at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
  at 
org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
  at 
java.base/java.security.AccessController.doPrivileged(AccessController.java:712)
  at java.base/javax.security.auth.Subject.doAs(Subject.java:439)
  at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
  ... 4 more
{code}
We can change our command to use `extraClassPath`:
{code:java}
$SPARK_HOME/bin/spark-shell \
  --master spark://127.0.0.1:7077 \
  --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
  --conf spark.driver.extraClassPath=user-code.jar \
 --conf spark.executor.extraClassPath=user-code.jar
{code}
Success after adding the jar to `extraClassPath`:
{code:java}
23/10/26 12:58:26 INFO TransportClientFactory: Successfully created connection 
to localhost/127.0.0.1:33053 after 7 ms (0 ms spent in bootstraps)
23/10/26 12:58:26 WARN TestShuffleManager: Instantiated TestShuffleManager!!
23/10/26 12:58:26 INFO DiskBlockManager: Created local directory at 
/tmp/spark-cb101b05-c4b7-4ba9-8b3d-5b23baa7cb46/executor-5d5335dd-c116-4211-9691-87d8566017fd/blockmgr-2fcb1ab2-d886--8c7f-9dca2c880c2c
{code}
We would like to change startup order such that the original command succeeds, 
without specifying `extraClassPath`:
{code:java}
$SPARK_HOME/bin/spark-shell \
  --master spark://127.0.0.1:7077 \
  --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \
  --jars user-code.jar
{code}
Proposed changes:

Refactor code so we initialize the `ShuffleManager` later, after jars have been 
localized. This is especially necessary in the executor, where we would need to 
move this initialization until after the `replClassLoader` is updated with jars 
passed in `--jars`.

Today, the `ShuffleManager` is instantiated at `SparkEnv` creation. Having to 
instantiate the `ShuffleManager` this early doesn't work, because user jars 
have not been localized in all scenarios, and we will fail to load the 
`ShuffleManager`. We propose moving the `ShuffleManager` instantiation to 
`SparkContext` on the driver, and Executor, 

[jira] [Commented] (SPARK-38668) Spark on Kubernetes: add separate pod watcher service to reduce pressure on K8s API server

2023-11-01 Thread Hannah Amundson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781821#comment-17781821
 ] 

Hannah Amundson commented on SPARK-38668:
-

Hello,

I will start working on this now!

> Spark on Kubernetes: add separate pod watcher service to reduce pressure on 
> K8s API server
> --
>
> Key: SPARK-38668
> URL: https://issues.apache.org/jira/browse/SPARK-38668
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.2.1
>Reporter: BoYang
>Priority: Major
>
> The Spark driver listens to all pod events to manage its executor pods. This 
> puts pressure on the Kubernetes API server in a large cluster, because many 
> drivers connect to the API server and watch the pods.
>  
> An alternative is to have a separate service listen for and watch all pod 
> events. Each Spark driver then connects only to that service to get pod 
> events, reducing the load on the Kubernetes API server.
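
A purely illustrative sketch of the fan-out idea (all names are hypothetical; this is not an existing Spark or Kubernetes client API): a single upstream watch feeds per-application subscriptions, so the API server sees one watcher instead of one per driver.

{code:scala}
import java.util.concurrent.ConcurrentHashMap

// Hypothetical event type and relay; real pod events would come from the
// Kubernetes client used by the separate watcher service.
final case class PodEvent(appId: String, podName: String, phase: String)

class PodEventRelay {
  private val subscribers = new ConcurrentHashMap[String, PodEvent => Unit]()

  // Each Spark driver registers a callback for its own application id.
  def subscribe(appId: String)(handler: PodEvent => Unit): Unit =
    subscribers.put(appId, handler)

  def unsubscribe(appId: String): Unit = subscribers.remove(appId)

  // Invoked by the single upstream watcher that talks to the K8s API server.
  def onUpstreamEvent(event: PodEvent): Unit =
    Option(subscribers.get(event.appId)).foreach(_.apply(event))
}
{code}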



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45761) Upgrade `Volcano` to 1.8.1

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45761:
---
Labels: pull-request-available  (was: )

> Upgrade `Volcano` to 1.8.1
> --
>
> Key: SPARK-45761
> URL: https://issues.apache.org/jira/browse/SPARK-45761
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes, Project Infra
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45761) Upgrade `Volcano` to 1.8.1

2023-11-01 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45761:
-

 Summary: Upgrade `Volcano` to 1.8.1
 Key: SPARK-45761
 URL: https://issues.apache.org/jira/browse/SPARK-45761
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Kubernetes, Project Infra
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45728) Upgrade `kubernetes-client` to 6.9.1

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45728:
--
Parent: SPARK-44111
Issue Type: Sub-task  (was: Bug)

> Upgrade `kubernetes-client` to 6.9.1
> 
>
> Key: SPARK-45728
> URL: https://issues.apache.org/jira/browse/SPARK-45728
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build, Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45760) Add With expression to avoid duplicating expressions

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45760:
---
Labels: pull-request-available  (was: )

> Add With expression to avoid duplicating expressions
> 
>
> Key: SPARK-45760
> URL: https://issues.apache.org/jira/browse/SPARK-45760
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45760) Add With expression to avoid duplicating expressions

2023-11-01 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-45760:
---

 Summary: Add With expression to avoid duplicating expressions
 Key: SPARK-45760
 URL: https://issues.apache.org/jira/browse/SPARK-45760
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45327) Upgrade zstd-jni to 1.5.5-6

2023-11-01 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-45327.
--
  Assignee: BingKun Pan
Resolution: Fixed

https://github.com/apache/spark/pull/43113

> Upgrade zstd-jni to 1.5.5-6
> ---
>
> Key: SPARK-45327
> URL: https://issues.apache.org/jira/browse/SPARK-45327
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45753) Support `spark.deploy.driverIdPattern`

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45753.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43615
[https://github.com/apache/spark/pull/43615]

> Support `spark.deploy.driverIdPattern`
> --
>
> Key: SPARK-45753
> URL: https://issues.apache.org/jira/browse/SPARK-45753
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45502) Upgrade Kafka to 3.6.1

2023-11-01 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781789#comment-17781789
 ] 

Dongjoon Hyun commented on SPARK-45502:
---

KAFKA-7109 was the root cause of the revert.

> Upgrade Kafka to 3.6.1
> --
>
> Key: SPARK-45502
> URL: https://issues.apache.org/jira/browse/SPARK-45502
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> Apache Kafka 3.6.0 is released on Oct 10, 2023.
> - https://downloads.apache.org/kafka/3.6.0/RELEASE_NOTES.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45502) Upgrade Kafka to 3.6.1

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45502:
--
Summary: Upgrade Kafka to 3.6.1  (was: Upgrade Kafka to 3.6.0)

> Upgrade Kafka to 3.6.1
> --
>
> Key: SPARK-45502
> URL: https://issues.apache.org/jira/browse/SPARK-45502
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> Apache Kafka 3.6.0 is released on Oct 10, 2023.
> - https://downloads.apache.org/kafka/3.6.0/RELEASE_NOTES.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45502) Upgrade Kafka to 3.6.0

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45502:
-

Assignee: (was: Deng Ziming)

> Upgrade Kafka to 3.6.0
> --
>
> Key: SPARK-45502
> URL: https://issues.apache.org/jira/browse/SPARK-45502
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Apache Kafka 3.6.0 is released on Oct 10, 2023.
> - https://downloads.apache.org/kafka/3.6.0/RELEASE_NOTES.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45502) Upgrade Kafka to 3.6.0

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45502:
--
Fix Version/s: (was: 4.0.0)

> Upgrade Kafka to 3.6.0
> --
>
> Key: SPARK-45502
> URL: https://issues.apache.org/jira/browse/SPARK-45502
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>
> Apache Kafka 3.6.0 is released on Oct 10, 2023.
> - https://downloads.apache.org/kafka/3.6.0/RELEASE_NOTES.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45743) Upgrade dropwizard metrics 4.2.21

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45743:
-

Assignee: Yang Jie

> Upgrade dropwizard metrics 4.2.21
> -
>
> Key: SPARK-45743
> URL: https://issues.apache.org/jira/browse/SPARK-45743
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/dropwizard/metrics/releases/tag/v4.2.21]
> [https://github.com/dropwizard/metrics/releases/tag/v4.2.20]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45743) Upgrade dropwizard metrics 4.2.21

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45743.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43608
[https://github.com/apache/spark/pull/43608]

> Upgrade dropwizard metrics 4.2.21
> -
>
> Key: SPARK-45743
> URL: https://issues.apache.org/jira/browse/SPARK-45743
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [https://github.com/dropwizard/metrics/releases/tag/v4.2.21]
> [https://github.com/dropwizard/metrics/releases/tag/v4.2.20]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45743) Upgrade dropwizard metrics 4.2.21

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45743:
--
Parent: SPARK-44111
Issue Type: Sub-task  (was: Improvement)

> Upgrade dropwizard metrics 4.2.21
> -
>
> Key: SPARK-45743
> URL: https://issues.apache.org/jira/browse/SPARK-45743
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> [https://github.com/dropwizard/metrics/releases/tag/v4.2.21]
> [https://github.com/dropwizard/metrics/releases/tag/v4.2.20]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45759) Custom metrics should be updated after commit too

2023-11-01 Thread Ali Ince (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Ince updated SPARK-45759:
-
Description: 
We have a DataWriter component, which processes records in configurable 
batches, which are accumulated in {{write(T record)}} implementation and sent 
to the persistent store when the configured batch size is reached. Within this 
approach, last batch is handled during {{commit()}} call, as there is no other 
mechanism of knowing if there are more records or not.

We are now adding support for custom metrics, by implementing the 
{{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} 
and {{DataWriter}} implementations. The problem we see is, since 
{{CustomMetrics.updateMetrics}} is only called 
[during|https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L443-L443]
 and [just 
after|https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L451-L451]
 record processing, we do not observe the complete metrics since the last batch 
that is handled during {{commit()}} call is not collected/updated.

We propose to also add a {{CustomMetrics.updateMetrics}} call after 
{{commit()}} is processed successfully, ideally just before the {{run}} function 
exits (maybe just above 
[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]).

  was:
We have a DataWriter component, which processes records in configurable 
batches, which are accumulated in {{write(T record)}} implementation and sent 
to the persistent store when the configured batch size is reached. Within this 
approach, last batch is handled during {{commit()}} call, as there is no other 
mechanism of knowing if there are more records or not.

We are now adding support for custom metrics, by implementing the 
{{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} 
and {{DataWriter}} implementations. The problem we see is, since 
{{CustomMetrics.updateMetrics}} is only called [during|#L443-L443] and [just 
after|#L451-L451] record processing, we do not observe the complete metrics 
since the last batch that is handled during {{commit()}} call is not 
collected/updated.

We propose to also to add {{CustomMetrics.updateMetrics}} call after 
{{commit()}} is processed successfully, ideally just before {{run}} function 
exits (maybe just above 
[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]).


> Custom metrics should be updated after commit too
> -
>
> Key: SPARK-45759
> URL: https://issues.apache.org/jira/browse/SPARK-45759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Ali Ince
>Priority: Minor
>
> We have a DataWriter component, which processes records in configurable 
> batches, which are accumulated in {{write(T record)}} implementation and sent 
> to the persistent store when the configured batch size is reached. Within 
> this approach, last batch is handled during {{commit()}} call, as there is no 
> other mechanism of knowing if there are more records or not.
> We are now adding support for custom metrics, by implementing the 
> {{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} 
> and {{DataWriter}} implementations. The problem we see is, since 
> {{CustomMetrics.updateMetrics}} is only called 
> [during|https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L443-L443]
>  and [just 
> after|https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L451-L451]
>  record processing, we do not observe the complete metrics since the last 
> batch that is handled during {{commit()}} call is not collected/updated.
> We propose to also add a {{CustomMetrics.updateMetrics}} call after 
> {{commit()}} is processed successfully, ideally just before the {{run}} function 
> exits (maybe just above 
> [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]).
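
A hedged sketch of the batching writer pattern described above (the class and method bodies are illustrative, not the reporter's actual code). The rows flushed inside {{commit()}} are exactly the ones that a final metrics poll after commit would still need to pick up:

{code:scala}
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.metric.CustomTaskMetric
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

case class BatchCommitMessage(totalRows: Long) extends WriterCommitMessage

class BatchingDataWriter(batchSize: Int) extends DataWriter[InternalRow] {
  private val buffer = new ArrayBuffer[InternalRow](batchSize)
  private var rowsPersisted = 0L

  override def write(record: InternalRow): Unit = {
    buffer += record.copy()
    if (buffer.size >= batchSize) flush()
  }

  override def commit(): WriterCommitMessage = {
    flush() // the last, partial batch is only persisted here
    BatchCommitMessage(rowsPersisted)
  }

  override def abort(): Unit = buffer.clear()

  override def close(): Unit = ()

  // Without an updateMetrics call after commit(), rows flushed during commit
  // are never reflected in the reported metric.
  override def currentMetricsValues(): Array[CustomTaskMetric] = Array(
    new CustomTaskMetric {
      override def name(): String = "rowsPersisted"
      override def value(): Long = rowsPersisted
    })

  private def flush(): Unit = {
    // a real writer would send `buffer` to the persistent store here
    rowsPersisted += buffer.size
    buffer.clear()
  }
}
{code}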



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

--

[jira] [Updated] (SPARK-45759) Custom metrics should be updated after commit too

2023-11-01 Thread Ali Ince (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Ince updated SPARK-45759:
-
Description: 
We have a DataWriter component, which processes records in configurable 
batches, which are accumulated in {{write(T record)}} implementation and sent 
to the persistent store when the configured batch size is reached. Within this 
approach, last batch is handled during {{commit()}} call, as there is no other 
mechanism of knowing if there are more records or not.

We are now adding support for custom metrics, by implementing the 
{{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} 
and {{DataWriter}} implementations. The problem we see is, since 
{{CustomMetrics.updateMetrics}} is only called [during|#L443-L443] and [just 
after|#L451-L451] record processing, we do not observe the complete metrics 
since the last batch that is handled during {{commit()}} call is not 
collected/updated.

We propose to also to add {{CustomMetrics.updateMetrics}} call after 
{{commit()}} is processed successfully, ideally just before {{run}} function 
exits (maybe just above 
[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]).

  was:
We have a DataWriter component, which processes records in configurable 
batches, which are accumulated in {{write(T record)}} implementation and sent 
to the persistent store when the configured batch size is reached. Within this 
approach, last batch is handled during {{commit()}} call, as there is no other 
mechanism of knowing if there are more records or not.

We are now adding support for custom metrics, by implementing the 
{{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} 
and {{DataWriter}} implementations. The problem we see is, since 
{{CustomMetrics.updateMetrics}} is only called [during|#L443-L443]] and [just 
after|#L451-L451] record processing, we do not observe the complete metrics 
since the last batch that is handled during {{commit()}} call is not 
collected/updated.

We propose to also to add {{CustomMetrics.updateMetrics}} call after 
{{commit()}} is processed successfully, ideally just before {{run}} function 
exits (maybe just above 
[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]).


> Custom metrics should be updated after commit too
> -
>
> Key: SPARK-45759
> URL: https://issues.apache.org/jira/browse/SPARK-45759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Ali Ince
>Priority: Minor
>
> We have a DataWriter component, which processes records in configurable 
> batches, which are accumulated in {{write(T record)}} implementation and sent 
> to the persistent store when the configured batch size is reached. Within 
> this approach, last batch is handled during {{commit()}} call, as there is no 
> other mechanism of knowing if there are more records or not.
> We are now adding support for custom metrics, by implementing the 
> {{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} 
> and {{DataWriter}} implementations. The problem we see is, since 
> {{CustomMetrics.updateMetrics}} is only called [during|#L443-L443] and [just 
> after|#L451-L451] record processing, we do not observe the complete metrics 
> since the last batch that is handled during {{commit()}} call is not 
> collected/updated.
> We propose to also to add {{CustomMetrics.updateMetrics}} call after 
> {{commit()}} is processed successfully, ideally just before {{run}} function 
> exits (maybe just above 
> [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45759) Custom metrics should be updated after commit too

2023-11-01 Thread Ali Ince (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ali Ince updated SPARK-45759:
-
Description: 
We have a DataWriter component, which processes records in configurable 
batches, which are accumulated in {{write(T record)}} implementation and sent 
to the persistent store when the configured batch size is reached. Within this 
approach, last batch is handled during {{commit()}} call, as there is no other 
mechanism of knowing if there are more records or not.

We are now adding support for custom metrics, by implementing the 
{{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} 
and {{DataWriter}} implementations. The problem we see is, since 
{{CustomMetrics.updateMetrics}} is only called [during|#L443-L443]] and [just 
after|#L451-L451] record processing, we do not observe the complete metrics 
since the last batch that is handled during {{commit()}} call is not 
collected/updated.

We propose to also to add {{CustomMetrics.updateMetrics}} call after 
{{commit()}} is processed successfully, ideally just before {{run}} function 
exits (maybe just above 
[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]).

  was:
We have a DataWriter component, which processes records in configurable 
batches, which are accumulated in `write(T record)` implementation and sent to 
the persistent store when the configured batch size is reached. Within this 
approach, last batch is handled during `commit()` call, as there is no other 
mechanism of knowing if there are more records or not.

We are now adding support for custom metrics, by implementing the 
`supportedCustomMetrics()` and `currentMetricsValues()` in the `Write` and 
`DataWriter` implementations. The problem we see is, since 
`CustomMetrics.updateMetrics` is only called 
[during|[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L443-L443]]
 and [just 
after|[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L451-L451|https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L451-L451)]]
 record processing, we do not observe the complete metrics since the last batch 
that is handled during `commit()` call is not collected/updated.

We propose to also to add `CustomMetrics.updateMetrics` call after `commit()` 
is processed successfully, ideally just before `run` function exits (maybe just 
above 
[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]).


> Custom metrics should be updated after commit too
> -
>
> Key: SPARK-45759
> URL: https://issues.apache.org/jira/browse/SPARK-45759
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Ali Ince
>Priority: Minor
>
> We have a DataWriter component, which processes records in configurable 
> batches, which are accumulated in {{write(T record)}} implementation and sent 
> to the persistent store when the configured batch size is reached. Within 
> this approach, last batch is handled during {{commit()}} call, as there is no 
> other mechanism of knowing if there are more records or not.
> We are now adding support for custom metrics, by implementing the 
> {{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} 
> and {{DataWriter}} implementations. The problem we see is, since 
> {{CustomMetrics.updateMetrics}} is only called [during|#L443-L443]] and [just 
> after|#L451-L451] record processing, we do not observe the complete metrics 
> since the last batch that is handled during {{commit()}} call is not 
> collected/updated.
> We propose to also to add {{CustomMetrics.updateMetrics}} call after 
> {{commit()}} is processed successfully, ideally just before {{run}} function 
> exits (maybe just above 
> [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45758) Introduce a mapper for hadoop compression codecs

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45758:
---
Labels: pull-request-available  (was: )

> Introduce a mapper for hadoop compression codecs
> 
>
> Key: SPARK-45758
> URL: https://issues.apache.org/jira/browse/SPARK-45758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Spark supports only some of the Hadoop compression codecs, and the 
> codecs supported by Hadoop and by Spark do not map one-to-one because Spark 
> introduces two pseudo codecs, none and uncompressed.
> There are a lot of magic strings copied from the Hadoop compression codecs, so 
> developers need to maintain their consistency manually. This is error-prone 
> and reduces development efficiency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45759) Custom metrics should be updated after commit too

2023-11-01 Thread Ali Ince (Jira)
Ali Ince created SPARK-45759:


 Summary: Custom metrics should be updated after commit too
 Key: SPARK-45759
 URL: https://issues.apache.org/jira/browse/SPARK-45759
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1
Reporter: Ali Ince


We have a DataWriter component, which processes records in configurable 
batches, which are accumulated in `write(T record)` implementation and sent to 
the persistent store when the configured batch size is reached. Within this 
approach, last batch is handled during `commit()` call, as there is no other 
mechanism of knowing if there are more records or not.

We are now adding support for custom metrics, by implementing the 
`supportedCustomMetrics()` and `currentMetricsValues()` in the `Write` and 
`DataWriter` implementations. The problem we see is, since 
`CustomMetrics.updateMetrics` is only called 
[during|[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L443-L443]]
 and [just 
after|[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L451-L451|https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L451-L451)]]
 record processing, we do not observe the complete metrics since the last batch 
that is handled during `commit()` call is not collected/updated.

We propose to also to add `CustomMetrics.updateMetrics` call after `commit()` 
is processed successfully, ideally just before `run` function exits (maybe just 
above 
[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45758) Introduce a mapper for hadoop compression codecs

2023-11-01 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng updated SPARK-45758:
---
Description: 
Currently, Spark supported partial Hadoop compression codecs, but the Hadoop 
supported compression codecs and spark supported are not completely one-on-one 
due to Spark introduce two fake compression codecs none and uncompress.

There are a lot of magic strings copy from Hadoop compression codecs. This 
issue lead to developers need to manually maintain its consistency. It is easy 
to make mistakes and reduce development efficiency.

> Introduce a mapper for hadoop compression codecs
> 
>
> Key: SPARK-45758
> URL: https://issues.apache.org/jira/browse/SPARK-45758
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>
> Currently, Spark supports only some of the Hadoop compression codecs, and the 
> codecs supported by Hadoop and by Spark do not map one-to-one because Spark 
> introduces two pseudo codecs, none and uncompressed.
> There are a lot of magic strings copied from the Hadoop compression codecs, so 
> developers need to maintain their consistency manually. This is error-prone 
> and reduces development efficiency.
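
A hedged sketch of what such a mapper could look like (the object and method names are illustrative, not the actual API added by this issue): short names, including Spark's two pseudo codecs, map to fully qualified Hadoop codec class names in one place.

{code:scala}
import java.util.Locale

import org.apache.hadoop.io.compress.{BZip2Codec, DeflateCodec, GzipCodec, Lz4Codec, SnappyCodec}

object HadoopCompressionCodecMapper {
  // None means "no Hadoop codec" and covers Spark's none/uncompressed options.
  private val shortNameToClassName: Map[String, Option[String]] = Map(
    "none" -> None,
    "uncompressed" -> None,
    "bzip2" -> Some(classOf[BZip2Codec].getName),
    "deflate" -> Some(classOf[DeflateCodec].getName),
    "gzip" -> Some(classOf[GzipCodec].getName),
    "lz4" -> Some(classOf[Lz4Codec].getName),
    "snappy" -> Some(classOf[SnappyCodec].getName))

  // Resolves a user-facing short name to a codec class name, or None for the
  // pseudo codecs; unknown names fail fast instead of leaking magic strings.
  def codecClassName(shortName: String): Option[String] =
    shortNameToClassName.getOrElse(
      shortName.toLowerCase(Locale.ROOT),
      throw new IllegalArgumentException(s"Unknown compression codec: $shortName"))
}
{code}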



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45758) Introduce a mapper for hadoop compression codecs

2023-11-01 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45758:
--

 Summary: Introduce a mapper for hadoop compression codecs
 Key: SPARK-45758
 URL: https://issues.apache.org/jira/browse/SPARK-45758
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng
Assignee: Jiaan Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45755) Push down limit through Dataset.isEmpty()

2023-11-01 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng reassigned SPARK-45755:
--

Assignee: Yuming Wang

> Push down limit through Dataset.isEmpty()
> -
>
> Key: SPARK-45755
> URL: https://issues.apache.org/jira/browse/SPARK-45755
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
>
> Pushing down LocalLimit cannot optimize the distinct case:
> {code:scala}
>   def isEmpty: Boolean = withAction("isEmpty",
>     withTypedPlan { LocalLimit(Literal(1), select().logicalPlan) }.queryExecution) { plan =>
>     plan.executeTake(1).isEmpty
>   }
> {code}
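
A hypothetical illustration of the affected pattern (dataset and column names are made up): checking emptiness on a de-duplicated Dataset, where the {{LocalLimit(1)}} added by {{isEmpty}} sits above the distinct and, unless it is pushed down, the per-partition de-duplication still has to complete before a single row can be returned.

{code:scala}
// Made-up example, assuming `spark` is an active SparkSession.
val ds = spark.range(0, 1000000L)
  .selectExpr("id % 10 AS k")
  .distinct()

// Only one row is needed to answer this, so pushing the limit below the
// distinct avoids fully aggregating every partition first.
val empty = ds.isEmpty
{code}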



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45755) Push down limit through Dataset.isEmpty()

2023-11-01 Thread Jiaan Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jiaan Geng resolved SPARK-45755.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43617
[https://github.com/apache/spark/pull/43617]

> Push down limit through Dataset.isEmpty()
> -
>
> Key: SPARK-45755
> URL: https://issues.apache.org/jira/browse/SPARK-45755
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Pushing down LocalLimit cannot optimize the distinct case:
> {code:scala}
>   def isEmpty: Boolean = withAction("isEmpty",
>     withTypedPlan { LocalLimit(Literal(1), select().logicalPlan) }.queryExecution) { plan =>
>     plan.executeTake(1).isEmpty
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44896) Consider adding information os_prio, cpu, elapsed, tid, nid, etc., from the jstack tool

2023-11-01 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781694#comment-17781694
 ] 

Kent Yao commented on SPARK-44896:
--

Hi [~hannahkamundson],

Sure, feel free to send a PR for this issue.

Speaking of your project, there is also an ongoing contribution program launched 
by the Apache Kyuubi community. See 
[https://github.com/orgs/apache/projects/296?pane=info]

Thank you,
Kent

> Consider adding information os_prio, cpu, elapsed, tid, nid, etc.,  from the 
> jstack tool
> 
>
> Key: SPARK-44896
> URL: https://issues.apache.org/jira/browse/SPARK-44896
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45751) The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the official website is incorrect

2023-11-01 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-45751.
--
Fix Version/s: 3.3.4
   3.5.1
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 43618
[https://github.com/apache/spark/pull/43618]

> The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the 
> official website is incorrect
> 
>
> Key: SPARK-45751
> URL: https://issues.apache.org/jira/browse/SPARK-45751
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, UI
>Affects Versions: 3.5.0
>Reporter: chenyu
>Assignee: chenyu
>Priority: Trivial
>  Labels: pull-request-available
> Fix For: 3.3.4, 3.5.1, 4.0.0, 3.4.2
>
> Attachments: the default value.png, the value on the website.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45680) ReleaseSession to close Spark Connect session

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45680:
--

Assignee: (was: Apache Spark)

> ReleaseSession to close Spark Connect session
> -
>
> Key: SPARK-45680
> URL: https://issues.apache.org/jira/browse/SPARK-45680
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: Juliusz Sompolski
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45751) The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the official website is incorrect

2023-11-01 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-45751:


Assignee: chenyu

> The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the 
> official website is incorrect
> 
>
> Key: SPARK-45751
> URL: https://issues.apache.org/jira/browse/SPARK-45751
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, UI
>Affects Versions: 3.5.0
>Reporter: chenyu
>Assignee: chenyu
>Priority: Trivial
>  Labels: pull-request-available
> Attachments: the default value.png, the value on the website.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45751) The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the official website is incorrect

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45751:
--

Assignee: (was: Apache Spark)

> The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the 
> official website is incorrect
> 
>
> Key: SPARK-45751
> URL: https://issues.apache.org/jira/browse/SPARK-45751
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, UI
>Affects Versions: 3.5.0
>Reporter: chenyu
>Priority: Trivial
>  Labels: pull-request-available
> Attachments: the default value.png, the value on the website.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45751) The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the official website is incorrect

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45751:
--

Assignee: Apache Spark

> The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the 
> official website is incorrect
> 
>
> Key: SPARK-45751
> URL: https://issues.apache.org/jira/browse/SPARK-45751
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, UI
>Affects Versions: 3.5.0
>Reporter: chenyu
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: pull-request-available
> Attachments: the default value.png, the value on the website.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45022) Provide context for dataset API errors

2023-11-01 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-45022:


Assignee: Max Gekk

> Provide context for dataset API errors
> --
>
> Key: SPARK-45022
> URL: https://issues.apache.org/jira/browse/SPARK-45022
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
>
> SQL failures already provide nice error context when there is a failure:
> {noformat}
> org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. 
> Use `try_divide` to tolerate divisor being 0 and return NULL instead. If 
> necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
> == SQL(line 1, position 1) ==
> a / b
> ^
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)
> ...
> {noformat}
> We could add a similar user friendly error context to Dataset APIs.
> E.g. consider the following Spark app SimpleApp.scala:
> {noformat}
>1  import org.apache.spark.sql.SparkSession
>2  import org.apache.spark.sql.functions._
>3
>4  object SimpleApp {
>5def main(args: Array[String]) {
>6  val spark = SparkSession.builder.appName("Simple Application").config("spark.sql.ansi.enabled", true).getOrCreate()
>7  import spark.implicits._
>8
>9  val c = col("a") / col("b")
>   10
>   11  Seq((1, 0)).toDF("a", "b").select(c).show()
>   12
>   13  spark.stop()
>   14}
>   15  }
> {noformat}
> then the error context could be:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkArithmeticException: 
> [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 
> 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to 
> "false" to bypass this error.
> == Dataset ==
> "div" was called from SimpleApp$.main(SimpleApp.scala:9)
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672
> ...
> {noformat}
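
A minimal, self-contained sketch of the same scenario may help to see the proposed behaviour end to end. It only restates the SimpleApp example above with a try/catch added, so the message carrying the call-site context can be inspected rather than just crashing the job. The object name `ErrorContextDemo`, the `local[1]` master, and the catch structure are illustrative assumptions; the exact wording of the `== Dataset ==` block is defined by the linked pull request, not by this sketch.

{noformat}
import org.apache.spark.SparkArithmeticException
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ErrorContextDemo {
  def main(args: Array[String]): Unit = {
    // ANSI mode turns division by zero into an error instead of returning NULL.
    val spark = SparkSession.builder
      .appName("Error Context Demo")
      .master("local[1]")
      .config("spark.sql.ansi.enabled", true)
      .getOrCreate()
    import spark.implicits._

    // The Dataset API call site that the proposed context should point back to.
    val quotient = col("a") / col("b")

    try {
      Seq((1, 0)).toDF("a", "b").select(quotient).show()
    } catch {
      // Depending on the Spark version and deploy mode, the error may surface
      // directly as SparkArithmeticException or wrapped in another exception.
      case e: SparkArithmeticException => println(e.getMessage)
      case e: Exception                => println(e.getMessage)
    } finally {
      spark.stop()
    }
  }
}
{noformat}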



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45022) Provide context for dataset API errors

2023-11-01 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-45022.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43334
[https://github.com/apache/spark/pull/43334]

> Provide context for dataset API errors
> --
>
> Key: SPARK-45022
> URL: https://issues.apache.org/jira/browse/SPARK-45022
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Peter Toth
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> SQL failures already provide a helpful error context that points at the failing fragment:
> {noformat}
> org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. 
> Use `try_divide` to tolerate divisor being 0 and return NULL instead. If 
> necessary set "spark.sql.ansi.enabled" to "false" to bypass this error.
> == SQL(line 1, position 1) ==
> a / b
> ^
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala)
> ...
> {noformat}
> We could add a similar user-friendly error context to the Dataset APIs.
> For example, consider the following Spark app, SimpleApp.scala:
> {noformat}
>1  import org.apache.spark.sql.SparkSession
>2  import org.apache.spark.sql.functions._
>3
>4  object SimpleApp {
>5def main(args: Array[String]) {
>6  val spark = SparkSession.builder.appName("Simple Application").config("spark.sql.ansi.enabled", true).getOrCreate()
>7  import spark.implicits._
>8
>9  val c = col("a") / col("b")
>   10
>   11  Seq((1, 0)).toDF("a", "b").select(c).show()
>   12
>   13  spark.stop()
>   14}
>   15  }
> {noformat}
> then the error context could be:
> {noformat}
> Exception in thread "main" org.apache.spark.SparkArithmeticException: 
> [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 
> 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to 
> "false" to bypass this error.
> == Dataset ==
> "div" was called from SimpleApp$.main(SimpleApp.scala:9)
>   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201)
>   at 
> org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45174) Support `spark.deploy.maxDrivers`

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45174:
--
Summary: Support `spark.deploy.maxDrivers`  (was: Support 
spark.deploy.maxDrivers)

> Support `spark.deploy.maxDrivers`
> -
>
> Key: SPARK-45174
> URL: https://issues.apache.org/jira/browse/SPARK-45174
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Like `spark.mesos.maxDrivers`, this issue aims to add 
> `spark.deploy.maxDrivers`.
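
As a purely hypothetical usage sketch (the property name comes from the issue title; its default value and exact semantics are defined by the linked pull request, not here), the setting would presumably live in the standalone Master's configuration, by analogy with `spark.mesos.maxDrivers`:

{noformat}
# conf/spark-defaults.conf on the standalone Master host (illustrative value)
# Cap how many drivers the Master runs concurrently; further submissions would
# wait until a running driver finishes and frees a slot.
spark.deploy.maxDrivers   4
{noformat}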



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45497) Add a symbolic link file `spark-examples.jar` in K8s Docker images

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45497:
--
Parent: (was: SPARK-45756)
Issue Type: Improvement  (was: Sub-task)

> Add a symbolic link file `spark-examples.jar` in K8s Docker images
> --
>
> Key: SPARK-45497
> URL: https://issues.apache.org/jira/browse/SPARK-45497
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45497) Add a symbolic link file `spark-examples.jar` in K8s Docker images

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45497:
--
Parent: SPARK-45756
Issue Type: Sub-task  (was: Improvement)

> Add a symbolic link file `spark-examples.jar` in K8s Docker images
> --
>
> Key: SPARK-45497
> URL: https://issues.apache.org/jira/browse/SPARK-45497
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44214) Support Spark Driver Live Log UI

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44214:
--
Parent: SPARK-45756
Issue Type: Sub-task  (was: Improvement)

> Support Spark Driver Live Log UI
> 
>
> Key: SPARK-44214
> URL: https://issues.apache.org/jira/browse/SPARK-44214
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45756) Revisit and Improve Spark Standalone Cluster

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45756:
--
Labels: releasenotes  (was: )

> Revisit and Improve Spark Standalone Cluster
> 
>
> Key: SPARK-45756
> URL: https://issues.apache.org/jira/browse/SPARK-45756
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: releasenotes
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45754) Support `spark.deploy.appIdPattern`

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45754:
-

Assignee: Dongjoon Hyun

> Support `spark.deploy.appIdPattern`
> ---
>
> Key: SPARK-45754
> URL: https://issues.apache.org/jira/browse/SPARK-45754
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45753) Support `spark.deploy.driverIdPattern`

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45753:
-

Assignee: Dongjoon Hyun

> Support `spark.deploy.driverIdPattern`
> --
>
> Key: SPARK-45753
> URL: https://issues.apache.org/jira/browse/SPARK-45753
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45757) Avoid re-computation of NNZ in Binarizer

2023-11-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45757:
---
Labels: pull-request-available  (was: )

> Avoid re-computation of NNZ in Binarizer
> 
>
> Key: SPARK-45757
> URL: https://issues.apache.org/jira/browse/SPARK-45757
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45753) Support `spark.deploy.driverIdPattern`

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45753:
--
Parent: SPARK-45756
Issue Type: Sub-task  (was: Improvement)

> Support `spark.deploy.driverIdPattern`
> --
>
> Key: SPARK-45753
> URL: https://issues.apache.org/jira/browse/SPARK-45753
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45756) Revisit and Improve Spark Standalone Cluster

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45756:
--
Summary: Revisit and Improve Spark Standalone Cluster  (was: Improve Spark 
Standalone Cluster)

> Revisit and Improve Spark Standalone Cluster
> 
>
> Key: SPARK-45756
> URL: https://issues.apache.org/jira/browse/SPARK-45756
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45754) Support `spark.deploy.appIdPattern`

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45754:
--
Parent: SPARK-45756
Issue Type: Sub-task  (was: Improvement)

> Support `spark.deploy.appIdPattern`
> ---
>
> Key: SPARK-45754
> URL: https://issues.apache.org/jira/browse/SPARK-45754
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45749) Fix Spark History Server to sort `Duration` column properly

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45749:
--
Parent: SPARK-45756
Issue Type: Sub-task  (was: Bug)

> Fix Spark History Server to sort `Duration` column properly
> ---
>
> Key: SPARK-45749
> URL: https://issues.apache.org/jira/browse/SPARK-45749
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 3.2.0, 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45757) Avoid re-computation of NNZ in Binarizer

2023-11-01 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-45757:
-

 Summary: Avoid re-computation of NNZ in Binarizer
 Key: SPARK-45757
 URL: https://issues.apache.org/jira/browse/SPARK-45757
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 4.0.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45500) Show the number of abnormally completed drivers in MasterPage

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45500:
--
Parent: SPARK-45756
Issue Type: Sub-task  (was: Improvement)

> Show the number of abnormally completed drivers in MasterPage
> -
>
> Key: SPARK-45500
> URL: https://issues.apache.org/jira/browse/SPARK-45500
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45474) Support top-level filtering in MasterPage JSON API

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45474:
--
Parent: SPARK-45756
Issue Type: Sub-task  (was: Improvement)

> Support top-level filtering in MasterPage JSON API
> --
>
> Key: SPARK-45474
> URL: https://issues.apache.org/jira/browse/SPARK-45474
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45197) Make StandaloneRestServer add JavaModuleOptions to drivers

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45197:
--
Parent: SPARK-45756
Issue Type: Sub-task  (was: Bug)

> Make StandaloneRestServer add JavaModuleOptions to drivers
> --
>
> Key: SPARK-45197
> URL: https://issues.apache.org/jira/browse/SPARK-45197
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45187) Fix WorkerPage to use the same pattern for `logPage` urls

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45187:
--
Parent: SPARK-45756
Issue Type: Sub-task  (was: Bug)

> Fix WorkerPage to use the same pattern for `logPage` urls
> -
>
> Key: SPARK-45187
> URL: https://issues.apache.org/jira/browse/SPARK-45187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.4, 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45197) Make StandaloneRestServer add JavaModuleOptions to drivers

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45197:
--
Parent: (was: SPARK-43831)
Issue Type: Bug  (was: Sub-task)

> Make StandaloneRestServer add JavaModuleOptions to drivers
> --
>
> Key: SPARK-45197
> URL: https://issues.apache.org/jira/browse/SPARK-45197
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45174) Support spark.deploy.maxDrivers

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45174:
--
Parent: SPARK-45756
Issue Type: Sub-task  (was: Improvement)

> Support spark.deploy.maxDrivers
> ---
>
> Key: SPARK-45174
> URL: https://issues.apache.org/jira/browse/SPARK-45174
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Like `spark.mesos.maxDrivers`, this issue aims to add 
> `spark.deploy.maxDrivers`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44857) Fix getBaseURI error in Spark Worker LogPage UI buttons

2023-11-01 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44857:
--
Parent: SPARK-45756
Issue Type: Sub-task  (was: Bug)

> Fix getBaseURI error in Spark Worker LogPage UI buttons
> ---
>
> Key: SPARK-44857
> URL: https://issues.apache.org/jira/browse/SPARK-44857
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, Web UI
>Affects Versions: 3.2.0, 3.2.4, 3.3.2, 3.4.1, 3.5.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4
>
> Attachments: Screenshot 2023-08-17 at 2.38.45 PM.png
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45756) Improve Spark Standalone Cluster

2023-11-01 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45756:
-

 Summary: Improve Spark Standalone Cluster
 Key: SPARK-45756
 URL: https://issues.apache.org/jira/browse/SPARK-45756
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org