[jira] [Updated] (SPARK-45767) Delete `TimeStampedHashMap` and its UT
[ https://issues.apache.org/jira/browse/SPARK-45767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45767: --- Labels: pull-request-available (was: ) > Delete `TimeStampedHashMap` and its UT > -- > > Key: SPARK-45767 > URL: https://issues.apache.org/jira/browse/SPARK-45767 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Trivial > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45767) Delete `TimeStampedHashMap` and its UT
BingKun Pan created SPARK-45767: --- Summary: Delete `TimeStampedHashMap` and its UT Key: SPARK-45767 URL: https://issues.apache.org/jira/browse/SPARK-45767 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36786) SPIP: Improving the compile time performance, by improving a couple of rules, from 24 hrs to under 8 minutes
[ https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781976#comment-17781976 ] Asif commented on SPARK-36786: -- I had put this on the back burner as my changes were on 3.2, so I have to merge them onto the latest branch. Whatever optimizations I did on 3.2 are still applicable, as the drawback still exists, but the changes are going to be a little extensive. If there is interest I can pick it up after some days; right now I am occupied with another SPIP which proposes changes for improving the performance of broadcast hash joins on non-partition-column joins. > SPIP: Improving the compile time performance, by improving a couple of > rules, from 24 hrs to under 8 minutes > - > > Key: SPARK-36786 > URL: https://issues.apache.org/jira/browse/SPARK-36786 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1, 3.1.2 >Reporter: Asif >Priority: Major > Labels: SPIP > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > The aim is to improve the compile-time performance of a query which in > Workday's use case takes > 24 hrs (& eventually fails), to < 8 min. > To explain the problem, I will provide the context. > The query plan in our production system is huge, with nested *case when* > expressions (level of nesting could be > 8), where each *case when* can > have branches sometimes > 1000. > The plan could look like > {quote}Project1 > | > Filter 1 > | > Project2 > | > Filter2 > | > Project3 > | > Filter3 > | > Join > {quote} > Now the optimizer has a Batch of Rules, intended to run at most 100 times. > *Also note that the batch will continue to run till one of the conditions > is satisfied,* > *i.e. either numIter == 100 || inputPlan == outputPlan (idempotency is > achieved).* > One of the early Rules is *PushDownPredicate*, > followed by *CollapseProject*. > > The first issue is the *PushDownPredicate* rule. > It picks one filter at a time & pushes it to the lowest level (I understand > that in 3.1 it pushes through the join, while in 2.4 it stops at the Join), but > in either case it picks 1 filter at a time starting from the top, in each iteration. > *The above comment is no longer true in the 3.1 release, as it now combines > filters, so it now pushes all the encountered filters in a single pass. > But it still materializes the filter on each push by re-aliasing.* > So if there are say 50 projects interspersed with Filters, idempotency > is guaranteed not to be achieved till around 49 iterations. > Moreover, CollapseProject will also be modifying the tree on each iteration as a > filter will get removed within a Project. > Moreover, on each movement of a filter through the project tree, the filter is > re-aliased using the transformUp rule. transformUp is very expensive compared to > transformDown. As the filter keeps getting pushed down, its size increases. > To optimize this rule, 2 things are needed: > # Instead of pushing one filter at a time, collect all the filters as we > traverse the tree in that iteration itself. > # Do not re-alias the filters on each push. Collect the sequence of projects > it has passed through, and when the filters have reached their resting > place, do the re-alias by processing the collected projects in a down-to-up > manner. > This will result in achieving idempotency in a couple of iterations.
> *How reducing the number of iterations helps performance* > There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals > (... there are around 6 more such rules)* which traverse the tree using > transformUp, and they run unnecessarily in each iteration, even when the > expressions in an operator have not changed since the previous runs. > *I have a different proposal, which I will share later, as to how to avoid the > above rules from running unnecessarily, if it can be guaranteed that the > expression is not going to mutate in the operator.* > The cause of our huge compilation time has been identified as the above. > > h2. Q2. What problem is this proposal NOT designed to solve? > It is not going to change any runtime profile. > h2. Q3. How is it done today, and what are the limits of current practice? > As mentioned above, currently PushDownPredicate pushes one filter at a > time & at each Project it materializes the re-aliased filter. This > results in a large number of iterations to achieve idempotency, and the > immediate materialization of the Filter after each Project pass results in > unnecessary tree traversals of the filter expression, that too using transformUp, > and the expression tree of the filter is bound to keep increasing as it is pushed down.
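To make the iteration-count argument above concrete, here is a deliberately simplified, hypothetical sketch (plain Python, not Catalyst's RuleExecutor or the real PushDownPredicate implementation; all names are illustrative) of a fixed-point batch whose only rule relocates a single filter per pass. With ~50 filters interspersed among projects, the batch needs on the order of 50 passes before inputPlan == outputPlan.

{code:python}
# Toy model of a fixed-point optimizer batch; illustrative only, not Spark code.
def run_to_fixed_point(plan, rule, max_iterations=100):
    """Apply `rule` until the plan stops changing (idempotency) or the
    iteration budget runs out, and report how many passes were needed."""
    for i in range(1, max_iterations + 1):
        new_plan = rule(plan)
        if new_plan == plan:
            return plan, i
        plan = new_plan
    return plan, max_iterations

def push_one_filter_to_bottom(plan):
    """Stand-in for the behaviour described above: each pass relocates a
    single Filter all the way down, just above the Join."""
    plan = list(plan)
    join = plan.index("Join")
    for i, op in enumerate(plan):
        if op.startswith("Filter") and any(p.startswith("Project") for p in plan[i + 1:join]):
            moved = plan.pop(i)
            plan.insert(plan.index("Join"), moved)
            return plan                      # only one filter handled per pass
    return plan                              # nothing left to push: fixed point

# 50 Project/Filter pairs stacked above a Join, as in the description.
plan = [op for n in range(50) for op in (f"Project{n}", f"Filter{n}")] + ["Join"]
_, passes = run_to_fixed_point(plan, push_one_filter_to_bottom)
print(passes)   # ~50: roughly one pass per filter, plus one pass to detect no change
{code}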
[jira] [Commented] (SPARK-36786) SPIP: Improving the compile time performance, by improving a couple of rules, from 24 hrs to under 8 minutes
[ https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781966#comment-17781966 ] Abhinav Kumar commented on SPARK-36786: --- [~ashahid7] [~adou...@sqli.com] where are we on this one? > SPIP: Improving the compile time performance, by improving a couple of > rules, from 24 hrs to under 8 minutes > - > > Key: SPARK-36786 > URL: https://issues.apache.org/jira/browse/SPARK-36786 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.1, 3.1.2 >Reporter: Asif >Priority: Major > Labels: SPIP > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > The aim is to improve the compile time performance of query which in > WorkDay's use case takes > 24 hrs ( & eventually fails) , to < 8 min. > To explain the problem, I will provide the context. > The query plan in our production system, is huge, with nested *case when* > expressions ( level of nesting could be > 8) , where each *case when* can > have branches sometimes > 1000. > The plan could look like > {quote}Project1 > | > Filter 1 > | > Project2 > | > Filter2 > | > Project3 > | > Filter3 > | > Join > {quote} > Now the optimizer has a Batch of Rules , intended to run at max 100 times. > *Also note that the, the batch will continue to run till one of the condition > is satisfied* > *i.e either numIter == 100 || inputPlan == outputPlan (idempotency is > achieved)* > One of the early Rule is *PushDownPredicateRule.* > **Followed by **CollapseProject**. > > The first issue is *PushDownPredicate* rule. > It picks one filter at a time & pushes it at lowest level ( I understand > that in 3.1 it pushes through join, while in 2.4 it stops at Join) , but > either case it picks 1 filter at time starting from top, in each iteration. > *The above comment is no longer true in 3.1 release as it now combines > filters. so it does push now all the encountered filters in a single pass. > But it still materializes the filter on each push by realiasing.* > So if there are say 50 projects interspersed with Filters , the idempotency > is guaranteedly not going to get achieved till around 49 iterations. > Moreover, CollapseProject will also be modifying tree on each iteration as a > filter will get removed within Project. > Moreover, on each movement of filter through project tree, the filter is > re-aliased using transformUp rule. transformUp is very expensive compared to > transformDown. As the filter keeps getting pushed down , its size increases. > To optimize this rule , 2 things are needed > # Instead of pushing one filter at a time, collect all the filters as we > traverse the tree in that iteration itself. > # Do not re-alias the filters on each push. Collect the sequence of projects > it has passed through, and when the filters have reached their resting > place, do the re-alias by processing the projects collected in down to up > manner. > This will result in achieving idempotency in a couple of iterations. > *How reducing the number of iterations help in performance* > There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals > ( ... there are around 6 more such rules)* which traverse the tree using > transformUp, and they run unnecessarily in each iteration , even when the > expressions in an operator have not changed since the previous runs. 
> *I have a different proposal which I will share later, as to how to avoid the > above rules from running unnecessarily, if it can be guaranteed that the > expression is not going to mutate in the operator.* > The cause of our huge compilation time has been identified as the above. > > h2. Q2. What problem is this proposal NOT designed to solve? > It is not going to change any runtime profile. > h2. Q3. How is it done today, and what are the limits of current practice? > Like mentioned above , currently PushDownPredicate pushes one filter at a > time & at each Project , it materialized the re-aliased filter. This > results in large number of iterations to achieve idempotency as well as > immediate materialization of Filter after each Project pass,, results in > unnecessary tree traversals of filter expression that too using transformUp. > and the expression tree of filter is bound to keep increasing as it is pushed > down. > h2. Q4. What is new in your approach and why do you think it will be > successful? > In the new approach we push all the filters down in a single pass. And do not > materialize filters as it pass through Project. Instead keep collecting > projects in sequential order and materialize the final filter once i
[jira] [Commented] (SPARK-33164) SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to DataSet.dropColumn(someColumn)
[ https://issues.apache.org/jira/browse/SPARK-33164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781959#comment-17781959 ] Abhinav Kumar commented on SPARK-33164: --- I see value in some of the use cases [~arnaud.nauwynck] mentions. But "SELECT *" has well-documented risks that lead to maintainability issues. Should we still be trying to implement this? > SPIP: add SQL support to "SELECT * (EXCEPT someColumn) FROM .." equivalent to > DataSet.dropColumn(someColumn) > > > Key: SPARK-33164 > URL: https://issues.apache.org/jira/browse/SPARK-33164 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.5, 2.4.6, 2.4.7, 3.0.0, 3.0.1 >Reporter: Arnaud Nauwynck >Priority: Minor > Original Estimate: 120h > Remaining Estimate: 120h > > *Q1.* What are you trying to do? Articulate your objectives using absolutely > no jargon. > I would like to have the extended SQL syntax "SELECT * EXCEPT someColumn FROM > .." > to be able to select all columns except some in a SELECT clause. > It would be similar to SQL syntax from some databases, like Google BigQuery > or PostgreSQL. > https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax > Google the question "select * EXCEPT one column", and you will see that many > developers have the same problem. > example posts: > https://blog.jooq.org/2018/05/14/selecting-all-columns-except-one-in-postgresql/ > https://www.thetopsites.net/article/53001825.shtml > There are several typical examples where it is very helpful: > use-case1: > you add a "count ( * ) countCol" column, and then filter on it using for > example "having countCol = 1" > ... and then you want to select all columns EXCEPT this dummy column which > always is "1" > {noformat} > select * (EXCEPT countCol) > from ( > select count(*) countCol, * >from MyTable >where ... >group by ... having countCol = 1 > ) > {noformat} > > use-case 2: > same with the analytical function "partition over(...) rankCol ... where > rankCol=1" > For example to get the latest row before a given time, in a time-series > table. > These are "time-travel" queries, addressed by frameworks like "DeltaLake" > {noformat} > CREATE table t_updates (update_time timestamp, id string, col1 type1, col2 > type2, ... col42) > pastTime=.. > SELECT * (except rankCol) > FROM ( >SELECT *, > RANK() OVER (PARTITION BY id ORDER BY update_time) rankCol >FROM t_updates >where update_time < pastTime > ) WHERE rankCol = 1 > > {noformat} > > use-case 3: > copy some data from table "t" to a corresponding table "t_snapshot", and back > to "t" > {noformat} >CREATE TABLE t (col1 type1, col2 type2, col3 type3, ... col42 type42) ... > >/* create corresponding table: (snap_id string, col1 type1, col2 type2, > col3 type3, ... col42 type42) */ >CREATE TABLE t_snapshot >AS SELECT '' as snap_id, * FROM t WHERE 1=2 >/* insert data from t to some snapshot */ >INSERT INTO t_snapshot >SELECT 'snap1' as snap_id, * from t > >/* select some data from snapshot table (without snap_id column) .. */ >SELECT * (EXCEPT snap_id) FROM t_snapshot where snap_id='snap1' > > {noformat} > > > *Q2.* What problem is this proposal NOT designed to solve? > It is only SQL syntactic sugar. > It does not change the SQL execution plan or anything complex. > *Q3.* How is it done today, and what are the limits of current practice? > > Today, you can either use the DataSet API, with .dropColumn(someColumn), > or you need to HARD-CODE manually all columns in your SQL.
Therefore your > code is NOT generic (or you are using a SQL meta-code generator?) > *Q4.* What is new in your approach and why do you think it will be successful? > It is NOT new... it is already a proven solution from DataSet.dropColumn(), > PostgreSQL, BigQuery > > *Q5.* Who cares? If you are successful, what difference will it make? > It simplifies the lives of developers, DBAs, data analysts, and end users. > It simplifies development of SQL code, in a more generic way, for many tasks. > *Q6.* What are the risks? > There is VERY limited risk on Spark SQL, because it already exists in the DataSet > API. > It is an extension of SQL syntax, so the risk is annoying some IDE SQL > editors with a new SQL syntax. > *Q7.* How long will it take? > No idea. I guess someone experienced in the Spark SQL internals might do it > relatively "quickly". > It is a kind of syntactic sugar to add to the ANTLR grammar rule, then transform > into the DataSet API. > *Q8.* What are the mid-term and final “exams” to check for success? > The 3 standard use-cases given in question Q1.
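For reference on Q3, the DataFrame-side workaround the proposal compares against already exists today via drop(). A minimal PySpark sketch of use-case 1, with the table name taken from the description and an illustrative grouping column ("id" is an assumption, not from the original):

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Compute the helper aggregate, filter on it, then drop it before returning
# the result -- today's equivalent of SELECT * (EXCEPT countCol).
counted = (
    spark.table("MyTable")                   # table name from use-case 1
         .groupBy("id")                      # illustrative grouping column
         .agg(F.count("*").alias("countCol"))
         .filter(F.col("countCol") == 1)
)
result = counted.drop("countCol")
{code}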
[jira] [Resolved] (SPARK-45761) Upgrade `Volcano` to 1.8.1
[ https://issues.apache.org/jira/browse/SPARK-45761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45761. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43624 [https://github.com/apache/spark/pull/43624] > Upgrade `Volcano` to 1.8.1 > -- > > Key: SPARK-45761 > URL: https://issues.apache.org/jira/browse/SPARK-45761 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes, Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > To bring the latest feature and bug fixes in addition to the test coverage > for Volcano scheduler 1.8.1. > [https://github.com/volcano-sh/volcano/releases/tag/v1.8.1] > > [https://github.com/volcano-sh/volcano/pull/3101 > |https://github.com/volcano-sh/volcano/pull/3101](volcano adapt k8s v1.27 > volcano-sh/volcano#3101) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44419) Support to extract partial filters of datasource v2 table and push them down
[ https://issues.apache.org/jira/browse/SPARK-44419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44419: --- Labels: pull-request-available (was: ) > Support to extract partial filters of datasource v2 table and push them down > > > Key: SPARK-44419 > URL: https://issues.apache.org/jira/browse/SPARK-44419 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0, 3.4.0 >Reporter: caican >Priority: Major > Labels: pull-request-available > > > Run the following sql, and the date predicate in the where clause is not > pushed down and it would cause a full table scan. > > {code:java} > SELECT > id, > data, > date > FROM > testcat.db.table > where > (date = 20221110 and udfStrLen(data) = 8) > or > (date = 2022 and udfStrLen(data) = 8) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
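Until such partial extraction is supported, one workaround sometimes applied by hand is to add the UDF-free conjunct that the OR already implies, so at least that part can be pushed to the v2 source. A sketch using the table and udfStrLen from the description (both assumed to be registered):

{code:python}
# The added IN clause is logically implied by the original predicate, so the
# result set is unchanged, but it gives the optimizer a UDF-free conjunct
# that can be pushed down to the data source.
df = spark.sql("""
    SELECT id, data, date
    FROM testcat.db.table
    WHERE date IN (20221110, 2022)                        -- pushable, added manually
      AND ((date = 20221110 AND udfStrLen(data) = 8)
        OR (date = 2022     AND udfStrLen(data) = 8))
""")
{code}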
[jira] [Updated] (SPARK-44426) optimize adaptive skew join for ExistenceJoin
[ https://issues.apache.org/jira/browse/SPARK-44426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44426: --- Labels: pull-request-available (was: ) > optimize adaptive skew join for ExistenceJoin > - > > Key: SPARK-44426 > URL: https://issues.apache.org/jira/browse/SPARK-44426 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.2, 3.2.0, 3.3.0, 3.4.0 >Reporter: caican >Priority: Major > Labels: pull-request-available > > For this query, InSubQuery would be cast to ExistenceJoin and now > ExistenceJoin does not support automatic data skew for the left table. > {code:java} > SELECT * FROM skewData1 > where > (key1 in (select key2 from skewData2) > or value1 in (select value2 from skewData2){code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45680) ReleaseSession to close Spark Connect session
[ https://issues.apache.org/jira/browse/SPARK-45680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45680: Assignee: Juliusz Sompolski > ReleaseSession to close Spark Connect session > - > > Key: SPARK-45680 > URL: https://issues.apache.org/jira/browse/SPARK-45680 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45680) ReleaseSession to close Spark Connect session
[ https://issues.apache.org/jira/browse/SPARK-45680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45680. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43546 [https://github.com/apache/spark/pull/43546] > ReleaseSession to close Spark Connect session > - > > Key: SPARK-45680 > URL: https://issues.apache.org/jira/browse/SPARK-45680 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Juliusz Sompolski >Assignee: Juliusz Sompolski >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45766) ObjectSerializerPruning fails to align null types in custom serializer 'If' expressions.
[ https://issues.apache.org/jira/browse/SPARK-45766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Piotr Szul updated SPARK-45766: --- Description: We have a custom encoder for union like objects. The our custom serializer uses an expression like: {{If(IsNull(If(.)), Literal(null), NamedStruct()))}} Using this encoder in a SQL expression that applies the `org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning` rule results in the exception below. It's because the expression it transformed by `PushFoldableIntoBranches' rule prior to `ObjectSerializerPruning`, which changes the expression to: {{If(If(.), Literal(null), NamedStruct()))}} which no longer matches the expressions for which null type alignment is performed. See the attached scala repl code for the demonstration of this issue. The exception: java.lang.IllegalArgumentException: requirement failed: All input types must be the same except nullable, containsNull, valueContainsNull flags. The expression is: if (if (assertnotnull(input[0, UnionType, true]).hasValue) isnull(assertnotnull(input[0, UnionType, true]).value) else true) null else named_struct(given, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(knownnotnull(assertnotnull(input[0, UnionType, true])).value).given, true, false, true)). The input types found are StructType(StructField(given,StringType,true),StructField(family,StringType,true)) StructType(StructField(given,StringType,true)). at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1304) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1297) at org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:41) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1309) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1308) at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:41) at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:41) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1313) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1313) at org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:41) at org.apache.spark.sql.catalyst.expressions.Alias.dataType(namedExpressions.scala:166) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.pruneSerializer(objects.scala:209) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.$anonfun$applyOrElse$3(objects.scala:230) at scala.collection.immutable.List.map(List.scala:293) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.applyOrElse(objects.scala:229) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.applyOrElse(objects.scala:217) at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.trees.TreeNode.transformWithPruning(TreeNode.scala:427) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:217) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.apply(objects.scala:125) at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:222) was: We have a custom encoder fo
[jira] [Created] (SPARK-45766) ObjectSerializerPruning fails to align null types in custom serializer 'If' expressions.
Piotr Szul created SPARK-45766: -- Summary: ObjectSerializerPruning fails to align null types in custom serializer 'If' expressions. Key: SPARK-45766 URL: https://issues.apache.org/jira/browse/SPARK-45766 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.0, 3.4.1, 3.3.3 Reporter: Piotr Szul Attachments: prunning_bug.scala We have a custom encoder for union like objects. The our custom serializer uses an expression like: {{If(IsNull(If(.)), Literal(null), NamedStruct()))}} Using this encoder in a SQL expression that applies the `org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning` rule results in the exception below. It's because the expression it transformed by `PushFoldableIntoBranches' rule prior to `ObjectSerializerPruning`, which changes the expression to: {{If(If(.), Literal(null), NamedStruct()))}} which no longer matches the expression for which null type alignment is performed. See the attached scala repl code for the demonstration of this issue. The exception: java.lang.IllegalArgumentException: requirement failed: All input types must be the same except nullable, containsNull, valueContainsNull flags. The expression is: if (if (assertnotnull(input[0, UnionType, true]).hasValue) isnull(assertnotnull(input[0, UnionType, true]).value) else true) null else named_struct(given, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(knownnotnull(assertnotnull(input[0, UnionType, true])).value).given, true, false, true)). The input types found are StructType(StructField(given,StringType,true),StructField(family,StringType,true)) StructType(StructField(given,StringType,true)). at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1304) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1297) at org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:41) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1309) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1308) at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:41) at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:41) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1313) at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1313) at org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:41) at org.apache.spark.sql.catalyst.expressions.Alias.dataType(namedExpressions.scala:166) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.pruneSerializer(objects.scala:209) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.$anonfun$applyOrElse$3(objects.scala:230) at scala.collection.immutable.List.map(List.scala:293) at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.applyOrElse(objects.scala:229) 
at org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.applyOrElse(objects.scala:217) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.trees.TreeNode.transformWithPruning(TreeNode.scala:427) at org.apache.spark.sql.catalyst.optimizer.ObjectSeria
[jira] [Updated] (SPARK-45766) ObjectSerializerPruning fails to align null types in custom serializer 'If' expressions.
[ https://issues.apache.org/jira/browse/SPARK-45766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Piotr Szul updated SPARK-45766: --- Attachment: prunning_bug.scala > ObjectSerializerPruning fails to align null types in custom serializer 'If' > expressions. > > > Key: SPARK-45766 > URL: https://issues.apache.org/jira/browse/SPARK-45766 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Piotr Szul >Priority: Minor > Attachments: prunning_bug.scala > > > We have a custom encoder for union like objects. > The our custom serializer uses an expression like: > {{If(IsNull(If(.)), Literal(null), NamedStruct()))}} > Using this encoder in a SQL expression that applies the > `org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning` rule > results in the exception below. > It's because the expression it transformed by `PushFoldableIntoBranches' rule > prior to `ObjectSerializerPruning`, which changes the expression to: > {{If(If(.), Literal(null), NamedStruct()))}} > which no longer matches the expression for which null type alignment is > performed. > See the attached scala repl code for the demonstration of this issue. > > The exception: > > java.lang.IllegalArgumentException: requirement failed: All input types must > be the same except nullable, containsNull, valueContainsNull flags. The > expression is: if (if (assertnotnull(input[0, UnionType, true]).hasValue) > isnull(assertnotnull(input[0, UnionType, true]).value) else true) null else > named_struct(given, staticinvoke(class > org.apache.spark.unsafe.types.UTF8String, StringType, fromString, > knownnotnull(knownnotnull(assertnotnull(input[0, UnionType, > true])).value).given, true, false, true)). The input types found are > > StructType(StructField(given,StringType,true),StructField(family,StringType,true)) > StructType(StructField(given,StringType,true)). 
> at scala.Predef$.require(Predef.scala:281) > at > org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1304) > at > org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1297) > at > org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:41) > at > org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1309) > at > org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1308) > at > org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:41) > at > org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:41) > at > org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1313) > at > org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1313) > at > org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:41) > at > org.apache.spark.sql.catalyst.expressions.Alias.dataType(namedExpressions.scala:166) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$.pruneSerializer(objects.scala:209) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.$anonfun$applyOrElse$3(objects.scala:230) > at scala.collection.immutable.List.map(List.scala:293) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.applyOrElse(objects.scala:229) > at > org.apache.spark.sql.catalyst.optimizer.ObjectSerializerPruning$$anonfun$apply$8.applyOrElse(objects.scala:217) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.sc
[jira] [Updated] (SPARK-45765) Improve error messages when loading multiple paths in PySpark
[ https://issues.apache.org/jira/browse/SPARK-45765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45765: - Description: Currently, the error message is super confusing when a user tries to load multiple paths incorrectly. For example, `spark.read.format("json").load("p1", "p2")` will produce this error: An error occurred while calling o36.load. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: p2. Please find packages at `[https://spark.apache.org/third-party-projects.html]`. SQLSTATE: 42K02 This can be confusing, but it's a valid error message, as "p2" will be considered as the `format` field of the load() method. was: Currently, the error message is super confusing when a user tries to load multiple paths incorrectly. For example, `spark.read.format("json").load("p1", "p2")` will have this error: An error occurred while calling o36.load. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: p2. Please find packages at `https://spark.apache.org/third-party-projects.html`. SQLSTATE: 42K02 We should fix this. > Improve error messages when loading multiple paths in PySpark > - > > Key: SPARK-45765 > URL: https://issues.apache.org/jira/browse/SPARK-45765 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Currently, the error message is super confusing when a user tries to load > multiple paths incorrectly. > For example, `spark.read.format("json").load("p1", "p2")` will produce this > error: > An error occurred while calling o36.load. > : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] > Failed to find the data source: p2. Please find packages at > `[https://spark.apache.org/third-party-projects.html]`. SQLSTATE: 42K02 > This can be confusing, but it's a valid error message, as "p2" will be > considered as the `format` field of the load() method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
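For context on why "p2" becomes the data source name: in PySpark, DataFrameReader.load() takes the format as its second positional argument, and multiple paths are passed as a single list. A minimal sketch with placeholder paths:

{code:python}
# load(path, format, schema, **options): "p2" above lands in the `format` slot.
# Passing the paths as one list is the intended way to read multiple locations.
df = spark.read.format("json").load(["p1", "p2"])
# or, equivalently for JSON input:
df = spark.read.json(["p1", "p2"])
{code}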
[jira] [Resolved] (SPARK-45765) Improve error messages when loading multiple paths in PySpark
[ https://issues.apache.org/jira/browse/SPARK-45765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang resolved SPARK-45765. -- Resolution: Invalid > Improve error messages when loading multiple paths in PySpark > - > > Key: SPARK-45765 > URL: https://issues.apache.org/jira/browse/SPARK-45765 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Currently, the error message is super confusing when a user tries to load > multiple paths incorrectly. > For example, `spark.read.format("json").load("p1", "p2")` will have this > error: > An error occurred while calling o36.load. > : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] > Failed to find the data source: p2. Please find packages at > `https://spark.apache.org/third-party-projects.html`. SQLSTATE: 42K02 > We should fix this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45765) Improve error messages when loading multiple paths in PySpark
Allison Wang created SPARK-45765: Summary: Improve error messages when loading multiple paths in PySpark Key: SPARK-45765 URL: https://issues.apache.org/jira/browse/SPARK-45765 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Currently, the error message is super confusing when a user tries to load multiple paths incorrectly. For example, `spark.read.format("json").load("p1", "p2")` will have this error: An error occurred while calling o36.load. : org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: p2. Please find packages at `https://spark.apache.org/third-party-projects.html`. SQLSTATE: 42K02 We should fix this. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45756) Revisit and Improve Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45756: - Assignee: Dongjoon Hyun > Revisit and Improve Spark Standalone Cluster > > > Key: SPARK-45756 > URL: https://issues.apache.org/jira/browse/SPARK-45756 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: releasenotes > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45756) Revisit and Improve Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45756. --- Fix Version/s: 4.0.0 Resolution: Fixed > Revisit and Improve Spark Standalone Cluster > > > Key: SPARK-45756 > URL: https://issues.apache.org/jira/browse/SPARK-45756 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: releasenotes > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45639) Support loading Python data sources in DataFrameReader
[ https://issues.apache.org/jira/browse/SPARK-45639?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45639: --- Labels: pull-request-available (was: ) > Support loading Python data sources in DataFrameReader > -- > > Key: SPARK-45639 > URL: https://issues.apache.org/jira/browse/SPARK-45639 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > Allow users to read from a Python data source using > `spark.read.format(...).load()` in PySpark. For example > Users can extend the DataSource and the DataSourceReader classes to create > their own Python data source reader and use them in PySpark: > {code:java} > class MyReader(DataSourceReader): > def read(self, partition): > yield (0, 1) > class MyDataSource(DataSource): > def schema(self): > return "id INT, value INT" > > def reader(self, schema): > return MyReader() > df = spark.read.format("MyDataSource").load() > df.show() > +---+-+ > | id|value| > +---+-+ > | 0| 1| > +---+-+ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43972) Tests never succeed on pyspark 3.4.0 (work OK on pyspark 3.3.2)
[ https://issues.apache.org/jira/browse/SPARK-43972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781917#comment-17781917 ] Jamie commented on SPARK-43972: --- This issue appears to be fixed in pyspark 3.5.0 Here's a run of the same tests: [https://github.com/jamiekt/jstark/actions/runs/6725570531/job/18280243390] that were [run on pyspark 3.5.0|https://github.com/jamiekt/jstark/actions/runs/6725570531/job/18280243390#step:6:53]. > Tests never succeed on pyspark 3.4.0 (work OK on pyspark 3.3.2) > --- > > Key: SPARK-43972 > URL: https://issues.apache.org/jira/browse/SPARK-43972 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 > Environment: Sorry, not sure what I'm supposed to put in this section. >Reporter: Jamie >Priority: Major > > I have a project that uses pyspark. The tests have always run fine on pyspark > versions prior to pyspark 3.4.0 but now fail on that version (which was > released on 2023-04-13). > My project is configured to use the latest available version of pyspark: > {code:json} > dependencies = [ > "pyspark", > "faker" > ] > {code} > [https://github.com/jamiekt/jstark/blob/c1629cee4e4b8fb0b4471f6fc2941f1b0a99a4bf/pyproject.toml#L26-L29] > The tests are run using GitHub Actions. An example of the failing tests is at > [https://github.com/jamiekt/jstark/actions/runs/4977164046], you can see > there that the tests are run upon various combinations of OS & python > version, they are all cancelled after running for over 5 hours. > If I [pin the version of pyspark to > 3.3.2|https://github.com/jamiekt/jstark/commit/5fd7115d3719a7d6ef2547e8e35feb3ed76ee99f] > then the tests all succeed in ~10 minutes, see > [https://github.com/jamiekt/jstark/actions/runs/5061332947] for such a > successful run. > > This can be reproduced by cloning the repository and running only one test. > The project uses hatch for managing environments and dependencies so you > would need that installed ({{{}pipx install hatch{}}}/{{{}brew install > hatch{}}}). I have reproduced the problem on python3.10. > Reproduce the problem by running these commands: > {code:bash} > # force use of python3.10 > export HATCH_PYTHON=/path/to/python3.10 > git clone https://github.com/jamiekt/jstark.git > cd jstark > # following command will create a virtualenv & install all dependencies, > including pyspark 3.4.0 > hatch run pytest -k test_basketweeks_by_product_and_customer > {code} > On my machine this never completes. I need to CTRL+C to crash out of it. I > consider this to be equivalent behaviour to the tests that fail in the GitHub > Actions pipeline after 6 hours. > Now let's checkout the branch which pins pyspark to 3.3.2 and run the same > thing (the hatch environment will get rebuilt with pyspark 3.3.2) > {code:bash} > git checkout try-pyspark3-3-2 > hatch run pytest -k test_basketweeks_by_product_and_customer > {code} > this time it succeeds in ~31seconds: > {code:bash} > ➜ hatch run pytest -k test_basketweeks_by_product_and_customer > == > test session starts > == > platform darwin -- Python 3.10.10, pytest-7.3.1, pluggy-1.0.0 > rootdir: /private/tmp/jstark > plugins: Faker-18.9.0, cov-4.0.0 > collected 79 items / 78 deselected / 1 selected > tests/test_grocery_retailer_feature_generator.py . 
> > >[100%] > = 1 passed, > 78 deselected in 31.30s = > {code} > That particular test constructs a very very complex pyspark dataframe which I > suspect might be contributing to the problem, however the issue here is that > it works on pyspark 3.3.2 but not on pyspark 3.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45763) Improve `MasterPage` to show `Resource` column only when it exists
[ https://issues.apache.org/jira/browse/SPARK-45763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45763. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43628 [https://github.com/apache/spark/pull/43628] > Improve `MasterPage` to show `Resource` column only when it exists > -- > > Key: SPARK-45763 > URL: https://issues.apache.org/jira/browse/SPARK-45763 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45763) Improve `MasterPage` to show `Resource` column only when it exists
[ https://issues.apache.org/jira/browse/SPARK-45763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45763: - Assignee: Dongjoon Hyun > Improve `MasterPage` to show `Resource` column only when it exists > -- > > Key: SPARK-45763 > URL: https://issues.apache.org/jira/browse/SPARK-45763 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45764) Make code block copyable
[ https://issues.apache.org/jira/browse/SPARK-45764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45764: - Description: We should consider adding a copy button next to the pyspark code blocks. For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/] was: We should consider For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/] > Make code block copyable > > > Key: SPARK-45764 > URL: https://issues.apache.org/jira/browse/SPARK-45764 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > We should consider adding a copy button next to the pyspark code blocks. > For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
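A rough sketch of what adopting the linked plugin could look like in the docs' Sphinx conf.py (the package would also need to be added to the documentation build requirements; the file location and prompt-stripping options shown are illustrative, not a vetted configuration for the PySpark docs):

{code:python}
# python/docs/source/conf.py (illustrative location)
extensions = [
    # ... existing Sphinx extensions ...
    "sphinx_copybutton",
]

# Strip leading prompts so the copied text is runnable code rather than ">>> ".
copybutton_prompt_text = r">>> |\.\.\. |\$ "
copybutton_prompt_is_regexp = True
{code}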
[jira] [Commented] (SPARK-45764) Make code block copyable
[ https://issues.apache.org/jira/browse/SPARK-45764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781887#comment-17781887 ] Allison Wang commented on SPARK-45764: -- cc [~podongfeng] WDYT? > Make code block copyable > > > Key: SPARK-45764 > URL: https://issues.apache.org/jira/browse/SPARK-45764 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > We should consider adding a copy button next to the pyspark code blocks. > For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45764) Make code block copyable
Allison Wang created SPARK-45764: Summary: Make code block copyable Key: SPARK-45764 URL: https://issues.apache.org/jira/browse/SPARK-45764 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang We should consider For example this plugin: [https://sphinx-copybutton.readthedocs.io/en/latest/] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45731) Update partition statistics with ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-45731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45731: --- Labels: pull-request-available (was: ) > Update partition statistics with ANALYZE TABLE command > -- > > Key: SPARK-45731 > URL: https://issues.apache.org/jira/browse/SPARK-45731 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Chao Sun >Priority: Major > Labels: pull-request-available > > Currently the {{ANALYZE TABLE}} command only updates table-level stats but not > partition stats, even though it can be applied to both non-partitioned and > partitioned tables. It seems to make sense for it to update partition stats as > well. > Note that users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, > but the syntax is more verbose as they need to specify all the partition > columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
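To illustrate the difference with a hypothetical table `sales` partitioned by `dt` (a sketch of the existing syntax only, not the proposed change):

{code:python}
# Today: updates table-level statistics only, even if `sales` is partitioned.
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

# Per-partition statistics currently require naming the partition columns explicitly.
spark.sql("ANALYZE TABLE sales PARTITION (dt) COMPUTE STATISTICS")
{code}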
[jira] [Resolved] (SPARK-45754) Support `spark.deploy.appIdPattern`
[ https://issues.apache.org/jira/browse/SPARK-45754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45754. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43616 [https://github.com/apache/spark/pull/43616] > Support `spark.deploy.appIdPattern` > --- > > Key: SPARK-45754 > URL: https://issues.apache.org/jira/browse/SPARK-45754 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45763) Improve `MasterPage` to show `Resource` column only when it exists
[ https://issues.apache.org/jira/browse/SPARK-45763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45763: --- Labels: pull-request-available (was: ) > Improve `MasterPage` to show `Resource` column only when it exists > -- > > Key: SPARK-45763 > URL: https://issues.apache.org/jira/browse/SPARK-45763 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45763) Improve `MasterPage` to show `Resource` column only when it exists
Dongjoon Hyun created SPARK-45763: - Summary: Improve `MasterPage` to show `Resource` column only when it exists Key: SPARK-45763 URL: https://issues.apache.org/jira/browse/SPARK-45763 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38473) Use error classes in org.apache.spark.scheduler
[ https://issues.apache.org/jira/browse/SPARK-38473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781827#comment-17781827 ] Hannah Amundson commented on SPARK-38473: - I am working on this ticket now! > Use error classes in org.apache.spark.scheduler > --- > > Key: SPARK-38473 > URL: https://issues.apache.org/jira/browse/SPARK-38473 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45761) Upgrade `Volcano` to 1.8.1
[ https://issues.apache.org/jira/browse/SPARK-45761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45761: - Assignee: Dongjoon Hyun > Upgrade `Volcano` to 1.8.1 > -- > > Key: SPARK-45761 > URL: https://issues.apache.org/jira/browse/SPARK-45761 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes, Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > To bring the latest feature and bug fixes in addition to the test coverage > for Volcano scheduler 1.8.1. > [https://github.com/volcano-sh/volcano/releases/tag/v1.8.1] > > [https://github.com/volcano-sh/volcano/pull/3101 > |https://github.com/volcano-sh/volcano/pull/3101](volcano adapt k8s v1.27 > volcano-sh/volcano#3101) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45761) Upgrade `Volcano` to 1.8.1
[ https://issues.apache.org/jira/browse/SPARK-45761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45761: -- Description: To bring the latest feature and bug fixes in addition to the test coverage for Volcano scheduler 1.8.1. [https://github.com/volcano-sh/volcano/releases/tag/v1.8.1] [https://github.com/volcano-sh/volcano/pull/3101 |https://github.com/volcano-sh/volcano/pull/3101](volcano adapt k8s v1.27 volcano-sh/volcano#3101) > Upgrade `Volcano` to 1.8.1 > -- > > Key: SPARK-45761 > URL: https://issues.apache.org/jira/browse/SPARK-45761 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes, Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > To bring the latest feature and bug fixes in addition to the test coverage > for Volcano scheduler 1.8.1. > [https://github.com/volcano-sh/volcano/releases/tag/v1.8.1] > > [https://github.com/volcano-sh/volcano/pull/3101 > |https://github.com/volcano-sh/volcano/pull/3101](volcano adapt k8s v1.27 > volcano-sh/volcano#3101) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45762) Shuffle managers defined in user jars are not available for some launch modes
[ https://issues.apache.org/jira/browse/SPARK-45762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45762: --- Labels: pull-request-available (was: ) > Shuffle managers defined in user jars are not available for some launch modes > - > > Key: SPARK-45762 > URL: https://issues.apache.org/jira/browse/SPARK-45762 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Alessandro Bellina >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Starting a spark job in standalone mode with a custom `ShuffleManager` > provided in a jar via `--jars` does not work. This can also be experienced in > local-cluster mode. > The approach that works consistently is to copy the jar containing the custom > `ShuffleManager` to a specific location in each node then add it to > `spark.driver.extraClassPath` and `spark.executor.extraClassPath`, but we > would like to move away from setting extra configurations unnecessarily. > Example: > {code:java} > $SPARK_HOME/bin/spark-shell \ > --master spark://127.0.0.1:7077 \ > --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \ > --jars user-code.jar > {code} > This yields `java.lang.ClassNotFoundException` in the executors. > {code:java} > Exception in thread "main" java.lang.reflect.UndeclaredThrowableException > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1915) > at > org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:436) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:425) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.spark.examples.TestShuffleManager > at > java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) > at > java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) > at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) > at java.base/java.lang.Class.forName0(Native Method) > at java.base/java.lang.Class.forName(Class.java:467) > at > org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41) > at > org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36) > at org.apache.spark.util.Utils$.classForName(Utils.scala:95) > at > org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:2574) > at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366) > at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:255) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:487) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62) > at > org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61) > at > java.base/java.security.AccessController.doPrivileged(AccessController.java:712) > at java.base/javax.security.auth.Subject.doAs(Subject.java:439) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) > ... 
4 more > {code} > We can change our command to use `extraClassPath`: > {code:java} > $SPARK_HOME/bin/spark-shell \ > --master spark://127.0.0.1:7077 \ > --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \ > --conf spark.driver.extraClassPath=user-code.jar \ > --conf spark.executor.extraClassPath=user-code.jar > {code} > Success after adding the jar to `extraClassPath`: > {code:java} > 23/10/26 12:58:26 INFO TransportClientFactory: Successfully created > connection to localhost/127.0.0.1:33053 after 7 ms (0 ms spent in bootstraps) > 23/10/26 12:58:26 WARN TestShuffleManager: Instantiated TestShuffleManager!! > 23/10/26 12:58:26 INFO DiskBlockManager: Created local directory at > /tmp/spark-cb101b05-c4b7-4ba9-8b3d-5b23baa7cb46/executor-5d5335dd-c116-4211-9691-87d8566017fd/blockmgr-2fcb1ab2-d886--8c7f-9dca2c880c2c > {code} > We would like to change startup order such that the original command > succeeds, without specifying `extraClassPath`: > {code:java} > $SPARK_HOME/bin/spark-shell \ > --master spark://127.0.0.1:7077 \ > --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \ > --jars user-code.jar > {code} > Proposed changes: > Refactor code so we initialize the `ShuffleManager` later, after jars have > been localized. This is especially necessary in the executor, where we would need to move this > initialization until after the `replClassLoader` is updated with jars passed in `--jars`.
[jira] [Updated] (SPARK-45761) Upgrade `Volcano` to 1.8.1
[ https://issues.apache.org/jira/browse/SPARK-45761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45761: -- Parent: SPARK-44111 Issue Type: Sub-task (was: Bug) > Upgrade `Volcano` to 1.8.1 > -- > > Key: SPARK-45761 > URL: https://issues.apache.org/jira/browse/SPARK-45761 > Project: Spark > Issue Type: Sub-task > Components: Documentation, Kubernetes, Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45762) Shuffle managers defined in user jars are not available for some launch modes
Alessandro Bellina created SPARK-45762: -- Summary: Shuffle managers defined in user jars are not available for some launch modes Key: SPARK-45762 URL: https://issues.apache.org/jira/browse/SPARK-45762 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0 Reporter: Alessandro Bellina Fix For: 4.0.0 Starting a spark job in standalone mode with a custom `ShuffleManager` provided in a jar via `--jars` does not work. This can also be experienced in local-cluster mode. The approach that works consistently is to copy the jar containing the custom `ShuffleManager` to a specific location in each node then add it to `spark.driver.extraClassPath` and `spark.executor.extraClassPath`, but we would like to move away from setting extra configurations unnecessarily. Example: {code:java} $SPARK_HOME/bin/spark-shell \ --master spark://127.0.0.1:7077 \ --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \ --jars user-code.jar {code} This yields `java.lang.ClassNotFoundException` in the executors. {code:java} Exception in thread "main" java.lang.reflect.UndeclaredThrowableException at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1915) at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:436) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:425) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) Caused by: java.lang.ClassNotFoundException: org.apache.spark.examples.TestShuffleManager at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641) at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188) at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:520) at java.base/java.lang.Class.forName0(Native Method) at java.base/java.lang.Class.forName(Class.java:467) at org.apache.spark.util.SparkClassUtils.classForName(SparkClassUtils.scala:41) at org.apache.spark.util.SparkClassUtils.classForName$(SparkClassUtils.scala:36) at org.apache.spark.util.Utils$.classForName(Utils.scala:95) at org.apache.spark.util.Utils$.instantiateSerializerOrShuffleManager(Utils.scala:2574) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:366) at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:255) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:487) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62) at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61) at java.base/java.security.AccessController.doPrivileged(AccessController.java:712) at java.base/javax.security.auth.Subject.doAs(Subject.java:439) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899) ... 
4 more {code} We can change our command to use `extraClassPath`: {code:java} $SPARK_HOME/bin/spark-shell \ --master spark://127.0.0.1:7077 \ --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \ --conf spark.driver.extraClassPath=user-code.jar \ --conf spark.executor.extraClassPath=user-code.jar {code} Success after adding the jar to `extraClassPath`: {code:java} 23/10/26 12:58:26 INFO TransportClientFactory: Successfully created connection to localhost/127.0.0.1:33053 after 7 ms (0 ms spent in bootstraps) 23/10/26 12:58:26 WARN TestShuffleManager: Instantiated TestShuffleManager!! 23/10/26 12:58:26 INFO DiskBlockManager: Created local directory at /tmp/spark-cb101b05-c4b7-4ba9-8b3d-5b23baa7cb46/executor-5d5335dd-c116-4211-9691-87d8566017fd/blockmgr-2fcb1ab2-d886--8c7f-9dca2c880c2c {code} We would like to change startup order such that the original command succeeds, without specifying `extraClassPath`: {code:java} $SPARK_HOME/bin/spark-shell \ --master spark://127.0.0.1:7077 \ --conf spark.shuffle.manager=org.apache.spark.examples.TestShuffleManager \ --jars user-code.jar {code} Proposed changes: Refactor code so we initialize the `ShuffleManager` later, after jars have been localized. This is especially necessary in the executor, where we would need to move this initialization until after the `replClassLoader` is updated with jars passed in `--jars`. Today, the `ShuffleManager` is instantiated at `SparkEnv` creation. Having to instantiate the `ShuffleManager` this early doesn't work, because user jars have not been localized in all scenarios, and we will fail to load the `ShuffleManager`. We propose moving the `ShuffleManager` instantiation to `SparkContext` on the driver, and Executor,
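The reproduction above refers to an `org.apache.spark.examples.TestShuffleManager` provided via `--jars`. As a rough sketch only (not the reporter's actual test class), a user-supplied shuffle manager that just logs its creation and delegates everything to the built-in sort-based implementation could look like this:

{code:scala}
package org.apache.spark.examples

import org.apache.spark.{ShuffleDependency, SparkConf, TaskContext}
import org.apache.spark.internal.Logging
import org.apache.spark.shuffle._
import org.apache.spark.shuffle.sort.SortShuffleManager

// Illustrative only: logs on construction, then delegates to Spark's
// built-in sort-based shuffle manager.
class TestShuffleManager(conf: SparkConf) extends ShuffleManager with Logging {
  logWarning("Instantiated TestShuffleManager!!")

  private val delegate = new SortShuffleManager(conf)

  override def registerShuffle[K, V, C](
      shuffleId: Int, dependency: ShuffleDependency[K, V, C]): ShuffleHandle =
    delegate.registerShuffle(shuffleId, dependency)

  override def getWriter[K, V](
      handle: ShuffleHandle, mapId: Long, context: TaskContext,
      metrics: ShuffleWriteMetricsReporter): ShuffleWriter[K, V] =
    delegate.getWriter(handle, mapId, context, metrics)

  override def getReader[K, C](
      handle: ShuffleHandle, startMapIndex: Int, endMapIndex: Int,
      startPartition: Int, endPartition: Int, context: TaskContext,
      metrics: ShuffleReadMetricsReporter): ShuffleReader[K, C] =
    delegate.getReader(handle, startMapIndex, endMapIndex,
      startPartition, endPartition, context, metrics)

  override def unregisterShuffle(shuffleId: Int): Boolean =
    delegate.unregisterShuffle(shuffleId)

  override def shuffleBlockResolver: ShuffleBlockResolver =
    delegate.shuffleBlockResolver

  override def stop(): Unit = delegate.stop()
}
{code}

Because a class like this ships only in the `--jars` jar, it is not yet on the executor's system classpath at the point where `SparkEnv` currently instantiates the shuffle manager, which is exactly the ordering problem the ticket proposes to fix.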
[jira] [Commented] (SPARK-38668) Spark on Kubernetes: add separate pod watcher service to reduce pressure on K8s API server
[ https://issues.apache.org/jira/browse/SPARK-38668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781821#comment-17781821 ] Hannah Amundson commented on SPARK-38668: - Hello, I will start working on this now! > Spark on Kubernetes: add separate pod watcher service to reduce pressure on > K8s API server > -- > > Key: SPARK-38668 > URL: https://issues.apache.org/jira/browse/SPARK-38668 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.1 >Reporter: BoYang >Priority: Major > > The Spark driver will listen to all pod events to manage its executor pods. This > will put pressure on the Kubernetes API server in a large cluster, because > there will be many drivers connecting to the API server and watching for pods. > > An alternative is to have a separate service listen to and watch all pod > events. Each Spark driver would then connect only to that service to get pod > events. This would reduce the load on the Kubernetes API server. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45761) Upgrade `Volcano` to 1.8.1
[ https://issues.apache.org/jira/browse/SPARK-45761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45761: --- Labels: pull-request-available (was: ) > Upgrade `Volcano` to 1.8.1 > -- > > Key: SPARK-45761 > URL: https://issues.apache.org/jira/browse/SPARK-45761 > Project: Spark > Issue Type: Bug > Components: Documentation, Kubernetes, Project Infra >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45761) Upgrade `Volcano` to 1.8.1
Dongjoon Hyun created SPARK-45761: - Summary: Upgrade `Volcano` to 1.8.1 Key: SPARK-45761 URL: https://issues.apache.org/jira/browse/SPARK-45761 Project: Spark Issue Type: Bug Components: Documentation, Kubernetes, Project Infra Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45728) Upgrade `kubernetes-client` to 6.9.1
[ https://issues.apache.org/jira/browse/SPARK-45728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45728: -- Parent: SPARK-44111 Issue Type: Sub-task (was: Bug) > Upgrade `kubernetes-client` to 6.9.1 > > > Key: SPARK-45728 > URL: https://issues.apache.org/jira/browse/SPARK-45728 > Project: Spark > Issue Type: Sub-task > Components: Build, Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45760) Add With expression to avoid duplicating expressions
[ https://issues.apache.org/jira/browse/SPARK-45760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45760: --- Labels: pull-request-available (was: ) > Add With expression to avoid duplicating expressions > > > Key: SPARK-45760 > URL: https://issues.apache.org/jira/browse/SPARK-45760 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Wenchen Fan >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45760) Add With expression to avoid duplicating expressions
Wenchen Fan created SPARK-45760: --- Summary: Add With expression to avoid duplicating expressions Key: SPARK-45760 URL: https://issues.apache.org/jira/browse/SPARK-45760 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45327) Upgrade zstd-jni to 1.5.5-6
[ https://issues.apache.org/jira/browse/SPARK-45327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-45327. -- Assignee: BingKun Pan Resolution: Fixed https://github.com/apache/spark/pull/43113 > Upgrade zstd-jni to 1.5.5-6 > --- > > Key: SPARK-45327 > URL: https://issues.apache.org/jira/browse/SPARK-45327 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45753) Support `spark.deploy.driverIdPattern`
[ https://issues.apache.org/jira/browse/SPARK-45753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45753. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43615 [https://github.com/apache/spark/pull/43615] > Support `spark.deploy.driverIdPattern` > -- > > Key: SPARK-45753 > URL: https://issues.apache.org/jira/browse/SPARK-45753 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45502) Upgrade Kafka to 3.6.1
[ https://issues.apache.org/jira/browse/SPARK-45502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781789#comment-17781789 ] Dongjoon Hyun commented on SPARK-45502: --- KAFKA-7109 was the root cause of revert. > Upgrade Kafka to 3.6.1 > -- > > Key: SPARK-45502 > URL: https://issues.apache.org/jira/browse/SPARK-45502 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Apache Kafka 3.6.0 is released on Oct 10, 2023. > - https://downloads.apache.org/kafka/3.6.0/RELEASE_NOTES.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45502) Upgrade Kafka to 3.6.1
[ https://issues.apache.org/jira/browse/SPARK-45502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45502: -- Summary: Upgrade Kafka to 3.6.1 (was: Upgrade Kafka to 3.6.0) > Upgrade Kafka to 3.6.1 > -- > > Key: SPARK-45502 > URL: https://issues.apache.org/jira/browse/SPARK-45502 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Apache Kafka 3.6.0 is released on Oct 10, 2023. > - https://downloads.apache.org/kafka/3.6.0/RELEASE_NOTES.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45502) Upgrade Kafka to 3.6.0
[ https://issues.apache.org/jira/browse/SPARK-45502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45502: - Assignee: (was: Deng Ziming) > Upgrade Kafka to 3.6.0 > -- > > Key: SPARK-45502 > URL: https://issues.apache.org/jira/browse/SPARK-45502 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Apache Kafka 3.6.0 is released on Oct 10, 2023. > - https://downloads.apache.org/kafka/3.6.0/RELEASE_NOTES.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45502) Upgrade Kafka to 3.6.0
[ https://issues.apache.org/jira/browse/SPARK-45502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45502: -- Fix Version/s: (was: 4.0.0) > Upgrade Kafka to 3.6.0 > -- > > Key: SPARK-45502 > URL: https://issues.apache.org/jira/browse/SPARK-45502 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > Apache Kafka 3.6.0 is released on Oct 10, 2023. > - https://downloads.apache.org/kafka/3.6.0/RELEASE_NOTES.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45743) Upgrade dropwizard metrics 4.2.21
[ https://issues.apache.org/jira/browse/SPARK-45743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45743: - Assignee: Yang Jie > Upgrade dropwizard metrics 4.2.21 > - > > Key: SPARK-45743 > URL: https://issues.apache.org/jira/browse/SPARK-45743 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > > [https://github.com/dropwizard/metrics/releases/tag/v4.2.21] > [https://github.com/dropwizard/metrics/releases/tag/v4.2.20] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45743) Upgrade dropwizard metrics 4.2.21
[ https://issues.apache.org/jira/browse/SPARK-45743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45743. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43608 [https://github.com/apache/spark/pull/43608] > Upgrade dropwizard metrics 4.2.21 > - > > Key: SPARK-45743 > URL: https://issues.apache.org/jira/browse/SPARK-45743 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [https://github.com/dropwizard/metrics/releases/tag/v4.2.21] > [https://github.com/dropwizard/metrics/releases/tag/v4.2.20] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45743) Upgrade dropwizard metrics 4.2.21
[ https://issues.apache.org/jira/browse/SPARK-45743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45743: -- Parent: SPARK-44111 Issue Type: Sub-task (was: Improvement) > Upgrade dropwizard metrics 4.2.21 > - > > Key: SPARK-45743 > URL: https://issues.apache.org/jira/browse/SPARK-45743 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > [https://github.com/dropwizard/metrics/releases/tag/v4.2.21] > [https://github.com/dropwizard/metrics/releases/tag/v4.2.20] > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45759) Custom metrics should be updated after commit too
[ https://issues.apache.org/jira/browse/SPARK-45759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ali Ince updated SPARK-45759: - Description: We have a DataWriter component, which processes records in configurable batches, which are accumulated in {{write(T record)}} implementation and sent to the persistent store when the configured batch size is reached. Within this approach, last batch is handled during {{commit()}} call, as there is no other mechanism of knowing if there are more records or not. We are now adding support for custom metrics, by implementing the {{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} and {{DataWriter}} implementations. The problem we see is, since {{CustomMetrics.updateMetrics}} is only called [during|https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L443-L443] and [just after|https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L451-L451] record processing, we do not observe the complete metrics since the last batch that is handled during {{commit()}} call is not collected/updated. We propose to also to add {{CustomMetrics.updateMetrics}} call after {{commit()}} is processed successfully, ideally just before {{run}} function exits (maybe just above [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]). was: We have a DataWriter component, which processes records in configurable batches, which are accumulated in {{write(T record)}} implementation and sent to the persistent store when the configured batch size is reached. Within this approach, last batch is handled during {{commit()}} call, as there is no other mechanism of knowing if there are more records or not. We are now adding support for custom metrics, by implementing the {{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} and {{DataWriter}} implementations. The problem we see is, since {{CustomMetrics.updateMetrics}} is only called [during|#L443-L443] and [just after|#L451-L451] record processing, we do not observe the complete metrics since the last batch that is handled during {{commit()}} call is not collected/updated. We propose to also to add {{CustomMetrics.updateMetrics}} call after {{commit()}} is processed successfully, ideally just before {{run}} function exits (maybe just above [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]). > Custom metrics should be updated after commit too > - > > Key: SPARK-45759 > URL: https://issues.apache.org/jira/browse/SPARK-45759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Ali Ince >Priority: Minor > > We have a DataWriter component, which processes records in configurable > batches, which are accumulated in {{write(T record)}} implementation and sent > to the persistent store when the configured batch size is reached. Within > this approach, last batch is handled during {{commit()}} call, as there is no > other mechanism of knowing if there are more records or not. 
> We are now adding support for custom metrics, by implementing the > {{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} > and {{DataWriter}} implementations. The problem we see is, since > {{CustomMetrics.updateMetrics}} is only called > [during|https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L443-L443] > and [just > after|https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L451-L451] > record processing, we do not observe the complete metrics since the last > batch that is handled during {{commit()}} call is not collected/updated. > We propose to also to add {{CustomMetrics.updateMetrics}} call after > {{commit()}} is processed successfully, ideally just before {{run}} function > exits (maybe just above > [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]). -- This message was sent by Atlassian Jira (v8.20.10#820010) --
[jira] [Updated] (SPARK-45759) Custom metrics should be updated after commit too
[ https://issues.apache.org/jira/browse/SPARK-45759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ali Ince updated SPARK-45759: - Description: We have a DataWriter component, which processes records in configurable batches, which are accumulated in {{write(T record)}} implementation and sent to the persistent store when the configured batch size is reached. Within this approach, last batch is handled during {{commit()}} call, as there is no other mechanism of knowing if there are more records or not. We are now adding support for custom metrics, by implementing the {{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} and {{DataWriter}} implementations. The problem we see is, since {{CustomMetrics.updateMetrics}} is only called [during|#L443-L443] and [just after|#L451-L451] record processing, we do not observe the complete metrics since the last batch that is handled during {{commit()}} call is not collected/updated. We propose to also to add {{CustomMetrics.updateMetrics}} call after {{commit()}} is processed successfully, ideally just before {{run}} function exits (maybe just above [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]). was: We have a DataWriter component, which processes records in configurable batches, which are accumulated in {{write(T record)}} implementation and sent to the persistent store when the configured batch size is reached. Within this approach, last batch is handled during {{commit()}} call, as there is no other mechanism of knowing if there are more records or not. We are now adding support for custom metrics, by implementing the {{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} and {{DataWriter}} implementations. The problem we see is, since {{CustomMetrics.updateMetrics}} is only called [during|#L443-L443]] and [just after|#L451-L451] record processing, we do not observe the complete metrics since the last batch that is handled during {{commit()}} call is not collected/updated. We propose to also to add {{CustomMetrics.updateMetrics}} call after {{commit()}} is processed successfully, ideally just before {{run}} function exits (maybe just above [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]). > Custom metrics should be updated after commit too > - > > Key: SPARK-45759 > URL: https://issues.apache.org/jira/browse/SPARK-45759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Ali Ince >Priority: Minor > > We have a DataWriter component, which processes records in configurable > batches, which are accumulated in {{write(T record)}} implementation and sent > to the persistent store when the configured batch size is reached. Within > this approach, last batch is handled during {{commit()}} call, as there is no > other mechanism of knowing if there are more records or not. > We are now adding support for custom metrics, by implementing the > {{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} > and {{DataWriter}} implementations. 
The problem we see is, since > {{CustomMetrics.updateMetrics}} is only called [during|#L443-L443] and [just > after|#L451-L451] record processing, we do not observe the complete metrics > since the last batch that is handled during {{commit()}} call is not > collected/updated. > We propose to also to add {{CustomMetrics.updateMetrics}} call after > {{commit()}} is processed successfully, ideally just before {{run}} function > exits (maybe just above > [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45759) Custom metrics should be updated after commit too
[ https://issues.apache.org/jira/browse/SPARK-45759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ali Ince updated SPARK-45759: - Description: We have a DataWriter component, which processes records in configurable batches, which are accumulated in {{write(T record)}} implementation and sent to the persistent store when the configured batch size is reached. Within this approach, last batch is handled during {{commit()}} call, as there is no other mechanism of knowing if there are more records or not. We are now adding support for custom metrics, by implementing the {{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} and {{DataWriter}} implementations. The problem we see is, since {{CustomMetrics.updateMetrics}} is only called [during|#L443-L443]] and [just after|#L451-L451] record processing, we do not observe the complete metrics since the last batch that is handled during {{commit()}} call is not collected/updated. We propose to also to add {{CustomMetrics.updateMetrics}} call after {{commit()}} is processed successfully, ideally just before {{run}} function exits (maybe just above [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]). was: We have a DataWriter component, which processes records in configurable batches, which are accumulated in `write(T record)` implementation and sent to the persistent store when the configured batch size is reached. Within this approach, last batch is handled during `commit()` call, as there is no other mechanism of knowing if there are more records or not. We are now adding support for custom metrics, by implementing the `supportedCustomMetrics()` and `currentMetricsValues()` in the `Write` and `DataWriter` implementations. The problem we see is, since `CustomMetrics.updateMetrics` is only called [during|[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L443-L443]] and [just after|[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L451-L451|https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L451-L451)]] record processing, we do not observe the complete metrics since the last batch that is handled during `commit()` call is not collected/updated. We propose to also to add `CustomMetrics.updateMetrics` call after `commit()` is processed successfully, ideally just before `run` function exits (maybe just above [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]). > Custom metrics should be updated after commit too > - > > Key: SPARK-45759 > URL: https://issues.apache.org/jira/browse/SPARK-45759 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: Ali Ince >Priority: Minor > > We have a DataWriter component, which processes records in configurable > batches, which are accumulated in {{write(T record)}} implementation and sent > to the persistent store when the configured batch size is reached. 
Within > this approach, last batch is handled during {{commit()}} call, as there is no > other mechanism of knowing if there are more records or not. > We are now adding support for custom metrics, by implementing the > {{supportedCustomMetrics()}} and {{currentMetricsValues()}} in the {{Write}} > and {{DataWriter}} implementations. The problem we see is, since > {{CustomMetrics.updateMetrics}} is only called [during|#L443-L443]] and [just > after|#L451-L451] record processing, we do not observe the complete metrics > since the last batch that is handled during {{commit()}} call is not > collected/updated. > We propose to also to add {{CustomMetrics.updateMetrics}} call after > {{commit()}} is processed successfully, ideally just before {{run}} function > exits (maybe just above > [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45758) Introduce a mapper for hadoop compression codecs
[ https://issues.apache.org/jira/browse/SPARK-45758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45758: --- Labels: pull-request-available (was: ) > Introduce a mapper for hadoop compression codecs > > > Key: SPARK-45758 > URL: https://issues.apache.org/jira/browse/SPARK-45758 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > Labels: pull-request-available > > Currently, Spark supports only a subset of the Hadoop compression codecs, and the > codec names Spark accepts do not map one-to-one onto Hadoop's, because Spark > introduces two pseudo codecs, none and uncompressed. > There are many magic strings copied from the Hadoop compression codecs, so > developers have to keep them consistent by hand, which is error-prone and > reduces development efficiency. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45759) Custom metrics should be updated after commit too
Ali Ince created SPARK-45759: Summary: Custom metrics should be updated after commit too Key: SPARK-45759 URL: https://issues.apache.org/jira/browse/SPARK-45759 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1 Reporter: Ali Ince We have a DataWriter component, which processes records in configurable batches, which are accumulated in `write(T record)` implementation and sent to the persistent store when the configured batch size is reached. Within this approach, last batch is handled during `commit()` call, as there is no other mechanism of knowing if there are more records or not. We are now adding support for custom metrics, by implementing the `supportedCustomMetrics()` and `currentMetricsValues()` in the `Write` and `DataWriter` implementations. The problem we see is, since `CustomMetrics.updateMetrics` is only called [during|[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L443-L443]] and [just after|[https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L451-L451|https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L451-L451)]] record processing, we do not observe the complete metrics since the last batch that is handled during `commit()` call is not collected/updated. We propose to also to add `CustomMetrics.updateMetrics` call after `commit()` is processed successfully, ideally just before `run` function exits (maybe just above [https://github.com/apache/spark/blob/af8907a0873f5ca192b150f28a0c112107594722/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala#L473-L473]). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
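To make the batching pattern in the description concrete, below is a minimal, hypothetical sketch of a DSv2 `DataWriter` that buffers rows and only flushes the final partial batch inside `commit()`. The names `BatchingDataWriter`, `NoopCommitMessage`, and the `recordsWritten` metric are invented for illustration; the point is that a metric updated by the `commit()`-time flush is only visible if the metrics are polled again after the commit, which is what the ticket proposes.

{code:scala}
import scala.collection.mutable.ArrayBuffer

import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.metric.CustomTaskMetric
import org.apache.spark.sql.connector.write.{DataWriter, WriterCommitMessage}

// Hypothetical commit message for the sketch.
case object NoopCommitMessage extends WriterCommitMessage

class BatchingDataWriter(batchSize: Int) extends DataWriter[InternalRow] {
  private val buffer = ArrayBuffer.empty[InternalRow]
  private var recordsWritten = 0L

  // Send the buffered rows to the persistent store (omitted) and count them.
  private def flush(): Unit = {
    recordsWritten += buffer.size
    buffer.clear()
  }

  override def write(record: InternalRow): Unit = {
    buffer += record.copy()
    if (buffer.size >= batchSize) flush()
  }

  override def commit(): WriterCommitMessage = {
    flush() // the last partial batch is only written here ...
    NoopCommitMessage
  }

  override def abort(): Unit = buffer.clear()

  override def close(): Unit = ()

  // ... so the value reported here is only complete if updateMetrics
  // also runs after commit(), as proposed in this ticket.
  override def currentMetricsValues(): Array[CustomTaskMetric] = Array(
    new CustomTaskMetric {
      override def name(): String = "recordsWritten"
      override def value(): Long = recordsWritten
    })
}
{code}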
[jira] [Updated] (SPARK-45758) Introduce a mapper for hadoop compression codecs
[ https://issues.apache.org/jira/browse/SPARK-45758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng updated SPARK-45758: --- Description: Currently, Spark supports only a subset of the Hadoop compression codecs, and the codec names Spark accepts do not map one-to-one onto Hadoop's, because Spark introduces two pseudo codecs, none and uncompressed. There are many magic strings copied from the Hadoop compression codecs, so developers have to keep them consistent by hand, which is error-prone and reduces development efficiency. > Introduce a mapper for hadoop compression codecs > > > Key: SPARK-45758 > URL: https://issues.apache.org/jira/browse/SPARK-45758 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Jiaan Geng >Assignee: Jiaan Geng >Priority: Major > > Currently, Spark supports only a subset of the Hadoop compression codecs, and the > codec names Spark accepts do not map one-to-one onto Hadoop's, because Spark > introduces two pseudo codecs, none and uncompressed. > There are many magic strings copied from the Hadoop compression codecs, so > developers have to keep them consistent by hand, which is error-prone and > reduces development efficiency. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
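As a rough illustration of the kind of mapper the description asks for (the object and method names here are hypothetical, not the API that was ultimately added), the option-string-to-Hadoop-codec mapping could be centralized in one place so the magic strings live in a single object:

{code:scala}
import java.util.Locale

import org.apache.hadoop.io.compress.{BZip2Codec, CompressionCodec, DeflateCodec, GzipCodec, Lz4Codec, SnappyCodec}

// Hypothetical mapper from Spark-facing option values to Hadoop codec classes.
// "none" and "uncompressed" are the two Spark-only values with no Hadoop codec.
object HadoopCodecMapper {
  private val codecs: Map[String, Option[Class[_ <: CompressionCodec]]] = Map(
    "none"         -> None,
    "uncompressed" -> None,
    "bzip2"        -> Some(classOf[BZip2Codec]),
    "deflate"      -> Some(classOf[DeflateCodec]),
    "gzip"         -> Some(classOf[GzipCodec]),
    "lz4"          -> Some(classOf[Lz4Codec]),
    "snappy"       -> Some(classOf[SnappyCodec])
  )

  // Returns the fully qualified Hadoop codec class name, or None for the
  // uncompressed cases; unknown names fail fast instead of leaking downstream.
  def codecClassName(name: String): Option[String] = {
    val key = name.toLowerCase(Locale.ROOT)
    codecs.getOrElse(key,
      throw new IllegalArgumentException(s"Unknown compression codec: $name")
    ).map(_.getName)
  }
}
{code}

With such a mapper, `HadoopCodecMapper.codecClassName("gzip")` would yield `Some("org.apache.hadoop.io.compress.GzipCodec")`, while `"none"` and `"uncompressed"` yield `None`.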
[jira] [Created] (SPARK-45758) Introduce a mapper for hadoop compression codecs
Jiaan Geng created SPARK-45758: -- Summary: Introduce a mapper for hadoop compression codecs Key: SPARK-45758 URL: https://issues.apache.org/jira/browse/SPARK-45758 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Jiaan Geng Assignee: Jiaan Geng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45755) Push down limit through Dataset.isEmpty()
[ https://issues.apache.org/jira/browse/SPARK-45755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng reassigned SPARK-45755: -- Assignee: Yuming Wang > Push down limit through Dataset.isEmpty() > - > > Key: SPARK-45755 > URL: https://issues.apache.org/jira/browse/SPARK-45755 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available > > Push down LocalLimit can not optimize the case of distinct. > {code:scala} > def isEmpty: Boolean = withAction("isEmpty", > withTypedPlan { LocalLimit(Literal(1), select().logicalPlan) > }.queryExecution) { plan => > plan.executeTake(1).isEmpty > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45755) Push down limit through Dataset.isEmpty()
[ https://issues.apache.org/jira/browse/SPARK-45755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jiaan Geng resolved SPARK-45755. Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43617 [https://github.com/apache/spark/pull/43617] > Push down limit through Dataset.isEmpty() > - > > Key: SPARK-45755 > URL: https://issues.apache.org/jira/browse/SPARK-45755 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Push down LocalLimit can not optimize the case of distinct. > {code:scala} > def isEmpty: Boolean = withAction("isEmpty", > withTypedPlan { LocalLimit(Literal(1), select().logicalPlan) > }.queryExecution) { plan => > plan.executeTake(1).isEmpty > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
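A sketch of the query shape the description calls out (illustrative only; `spark` is assumed to be an existing `SparkSession`): when the Dataset already ends in a `distinct()`, the `LocalLimit(1)` that `isEmpty` wraps around the plan sits above the Aggregate introduced by the deduplication, so it does not save the aggregation work the way it does for a plain scan.

{code:scala}
// Illustrative reproduction of the distinct case mentioned in the description.
val deduplicated = spark.range(0, 1000000L)
  .selectExpr("id % 10 AS k")
  .distinct()

// isEmpty() adds LocalLimit(1) on top of the Aggregate from distinct(),
// so answering "is it empty?" still pays for the deduplication.
println(deduplicated.isEmpty) // false
{code}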
[jira] [Commented] (SPARK-44896) Consider adding information os_prio, cpu, elapsed, tid, nid, etc., from the jstack tool
[ https://issues.apache.org/jira/browse/SPARK-44896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781694#comment-17781694 ] Kent Yao commented on SPARK-44896: -- Hi [~hannahkamundson], Sure, feel free to send a PR for this issue. Speaking of your project, there is also an ongoing contribution program launched by the apache kyuubi community. See [https://github.com/orgs/apache/projects/296?pane=info] Thank you Kent > Consider adding information os_prio, cpu, elapsed, tid, nid, etc., from the > jstack tool > > > Key: SPARK-44896 > URL: https://issues.apache.org/jira/browse/SPARK-44896 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45751) The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the official website is incorrect
[ https://issues.apache.org/jira/browse/SPARK-45751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-45751. -- Fix Version/s: 3.3.4 3.5.1 4.0.0 3.4.2 Resolution: Fixed Issue resolved by pull request 43618 [https://github.com/apache/spark/pull/43618] > The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the > official website is incorrect > > > Key: SPARK-45751 > URL: https://issues.apache.org/jira/browse/SPARK-45751 > Project: Spark > Issue Type: Improvement > Components: Spark Core, UI >Affects Versions: 3.5.0 >Reporter: chenyu >Assignee: chenyu >Priority: Trivial > Labels: pull-request-available > Fix For: 3.3.4, 3.5.1, 4.0.0, 3.4.2 > > Attachments: the default value.png, the value on the website.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45680) ReleaseSession to close Spark Connect session
[ https://issues.apache.org/jira/browse/SPARK-45680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45680: -- Assignee: (was: Apache Spark) > ReleaseSession to close Spark Connect session > - > > Key: SPARK-45680 > URL: https://issues.apache.org/jira/browse/SPARK-45680 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Juliusz Sompolski >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45751) The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the official website is incorrect
[ https://issues.apache.org/jira/browse/SPARK-45751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-45751: Assignee: chenyu > The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the > official website is incorrect > > > Key: SPARK-45751 > URL: https://issues.apache.org/jira/browse/SPARK-45751 > Project: Spark > Issue Type: Improvement > Components: Spark Core, UI >Affects Versions: 3.5.0 >Reporter: chenyu >Assignee: chenyu >Priority: Trivial > Labels: pull-request-available > Attachments: the default value.png, the value on the website.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45751) The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the official website is incorrect
[ https://issues.apache.org/jira/browse/SPARK-45751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45751: -- Assignee: (was: Apache Spark) > The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the > official website is incorrect > > > Key: SPARK-45751 > URL: https://issues.apache.org/jira/browse/SPARK-45751 > Project: Spark > Issue Type: Improvement > Components: Spark Core, UI >Affects Versions: 3.5.0 >Reporter: chenyu >Priority: Trivial > Labels: pull-request-available > Attachments: the default value.png, the value on the website.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45751) The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the official website is incorrect
[ https://issues.apache.org/jira/browse/SPARK-45751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45751: -- Assignee: Apache Spark > The default value of ‘spark.executor.logs.rolling.maxRetainedFiles' on the > official website is incorrect > > > Key: SPARK-45751 > URL: https://issues.apache.org/jira/browse/SPARK-45751 > Project: Spark > Issue Type: Improvement > Components: Spark Core, UI >Affects Versions: 3.5.0 >Reporter: chenyu >Assignee: Apache Spark >Priority: Trivial > Labels: pull-request-available > Attachments: the default value.png, the value on the website.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45022) Provide context for dataset API errors
[ https://issues.apache.org/jira/browse/SPARK-45022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-45022: Assignee: Max Gekk > Provide context for dataset API errors > -- > > Key: SPARK-45022 > URL: https://issues.apache.org/jira/browse/SPARK-45022 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > > SQL failures already provide nice error context when there is a failure: > {noformat} > org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. > Use `try_divide` to tolerate divisor being 0 and return NULL instead. If > necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. > == SQL(line 1, position 1) == > a / b > ^ > at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) > at > org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala) > ... > {noformat} > We could add a similar user friendly error context to Dataset APIs. > E.g. consider the following Spark app SimpleApp.scala: > {noformat} >1 import org.apache.spark.sql.SparkSession >2 import org.apache.spark.sql.functions._ >3 >4 object SimpleApp { >5def main(args: Array[String]) { >6 val spark = SparkSession.builder.appName("Simple > Application").config("spark.sql.ansi.enabled", true).getOrCreate() >7 import spark.implicits._ >8 >9 val c = col("a") / col("b") > 10 > 11 Seq((1, 0)).toDF("a", "b").select(c).show() > 12 > 13 spark.stop() > 14} > 15 } > {noformat} > then the error context could be: > {noformat} > Exception in thread "main" org.apache.spark.SparkArithmeticException: > [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being > 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to > "false" to bypass this error. > == Dataset == > "div" was called from SimpleApp$.main(SimpleApp.scala:9) > at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) > at > org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672 > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45022) Provide context for dataset API errors
[ https://issues.apache.org/jira/browse/SPARK-45022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-45022. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43334 [https://github.com/apache/spark/pull/43334] > Provide context for dataset API errors > -- > > Key: SPARK-45022 > URL: https://issues.apache.org/jira/browse/SPARK-45022 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Peter Toth >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > SQL failures already provide nice error context when there is a failure: > {noformat} > org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. > Use `try_divide` to tolerate divisor being 0 and return NULL instead. If > necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. > == SQL(line 1, position 1) == > a / b > ^ > at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) > at > org.apache.spark.sql.errors.QueryExecutionErrors.divideByZeroError(QueryExecutionErrors.scala) > ... > {noformat} > We could add a similar user friendly error context to Dataset APIs. > E.g. consider the following Spark app SimpleApp.scala: > {noformat} >1 import org.apache.spark.sql.SparkSession >2 import org.apache.spark.sql.functions._ >3 >4 object SimpleApp { >5def main(args: Array[String]) { >6 val spark = SparkSession.builder.appName("Simple > Application").config("spark.sql.ansi.enabled", true).getOrCreate() >7 import spark.implicits._ >8 >9 val c = col("a") / col("b") > 10 > 11 Seq((1, 0)).toDF("a", "b").select(c).show() > 12 > 13 spark.stop() > 14} > 15 } > {noformat} > then the error context could be: > {noformat} > Exception in thread "main" org.apache.spark.SparkArithmeticException: > [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being > 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to > "false" to bypass this error. > == Dataset == > "div" was called from SimpleApp$.main(SimpleApp.scala:9) > at > org.apache.spark.sql.errors.QueryExecutionErrors$.divideByZeroError(QueryExecutionErrors.scala:201) > at > org.apache.spark.sql.catalyst.expressions.DivModLike.eval(arithmetic.scala:672 > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45174) Support `spark.deploy.maxDrivers`
[ https://issues.apache.org/jira/browse/SPARK-45174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45174: -- Summary: Support `spark.deploy.maxDrivers` (was: Support spark.deploy.maxDrivers) > Support `spark.deploy.maxDrivers` > - > > Key: SPARK-45174 > URL: https://issues.apache.org/jira/browse/SPARK-45174 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Like `spark.mesos.maxDrivers`, this issue aims to add > `spark.deploy.maxDrivers`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45497) Add a symbolic link file `spark-examples.jar` in K8s Docker images
[ https://issues.apache.org/jira/browse/SPARK-45497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45497: -- Parent: (was: SPARK-45756) Issue Type: Improvement (was: Sub-task) > Add a symbolic link file `spark-examples.jar` in K8s Docker images > -- > > Key: SPARK-45497 > URL: https://issues.apache.org/jira/browse/SPARK-45497 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45497) Add a symbolic link file `spark-examples.jar` in K8s Docker images
[ https://issues.apache.org/jira/browse/SPARK-45497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45497: -- Parent: SPARK-45756 Issue Type: Sub-task (was: Improvement) > Add a symbolic link file `spark-examples.jar` in K8s Docker images > -- > > Key: SPARK-45497 > URL: https://issues.apache.org/jira/browse/SPARK-45497 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44214) Support Spark Driver Live Log UI
[ https://issues.apache.org/jira/browse/SPARK-44214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44214: -- Parent: SPARK-45756 Issue Type: Sub-task (was: Improvement) > Support Spark Driver Live Log UI > > > Key: SPARK-44214 > URL: https://issues.apache.org/jira/browse/SPARK-44214 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Web UI >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45756) Revisit and Improve Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45756: -- Labels: releasenotes (was: ) > Revisit and Improve Spark Standalone Cluster > > > Key: SPARK-45756 > URL: https://issues.apache.org/jira/browse/SPARK-45756 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: releasenotes > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45754) Support `spark.deploy.appIdPattern`
[ https://issues.apache.org/jira/browse/SPARK-45754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45754: - Assignee: Dongjoon Hyun > Support `spark.deploy.appIdPattern` > --- > > Key: SPARK-45754 > URL: https://issues.apache.org/jira/browse/SPARK-45754 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45753) Support `spark.deploy.driverIdPattern`
[ https://issues.apache.org/jira/browse/SPARK-45753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45753: - Assignee: Dongjoon Hyun > Support `spark.deploy.driverIdPattern` > -- > > Key: SPARK-45753 > URL: https://issues.apache.org/jira/browse/SPARK-45753 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45757) Avoid re-computation of NNZ in Binarizer
[ https://issues.apache.org/jira/browse/SPARK-45757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45757: --- Labels: pull-request-available (was: ) > Avoid re-computation of NNZ in Binarizer > > > Key: SPARK-45757 > URL: https://issues.apache.org/jira/browse/SPARK-45757 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45753) Support `spark.deploy.driverIdPattern`
[ https://issues.apache.org/jira/browse/SPARK-45753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45753: -- Parent: SPARK-45756 Issue Type: Sub-task (was: Improvement) > Support `spark.deploy.driverIdPattern` > -- > > Key: SPARK-45753 > URL: https://issues.apache.org/jira/browse/SPARK-45753 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45756) Revisit and Improve Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45756: -- Summary: Revisit and Improve Spark Standalone Cluster (was: Improve Spark Standalone Cluster) > Revisit and Improve Spark Standalone Cluster > > > Key: SPARK-45756 > URL: https://issues.apache.org/jira/browse/SPARK-45756 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45754) Support `spark.deploy.appIdPattern`
[ https://issues.apache.org/jira/browse/SPARK-45754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45754: -- Parent: SPARK-45756 Issue Type: Sub-task (was: Improvement) > Support `spark.deploy.appIdPattern` > --- > > Key: SPARK-45754 > URL: https://issues.apache.org/jira/browse/SPARK-45754 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45749) Fix Spark History Server to sort `Duration` column properly
[ https://issues.apache.org/jira/browse/SPARK-45749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45749: -- Parent: SPARK-45756 Issue Type: Sub-task (was: Bug) > Fix Spark History Server to sort `Duration` column properly > --- > > Key: SPARK-45749 > URL: https://issues.apache.org/jira/browse/SPARK-45749 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Web UI >Affects Versions: 3.2.0, 3.3.2, 3.4.1, 3.5.0, 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45757) Avoid re-computation of NNZ in Binarizer
Ruifeng Zheng created SPARK-45757: - Summary: Avoid re-computation of NNZ in Binarizer Key: SPARK-45757 URL: https://issues.apache.org/jira/browse/SPARK-45757 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45500) Show the number of abnormally completed drivers in MasterPage
[ https://issues.apache.org/jira/browse/SPARK-45500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45500: -- Parent: SPARK-45756 Issue Type: Sub-task (was: Improvement) > Show the number of abnormally completed drivers in MasterPage > - > > Key: SPARK-45500 > URL: https://issues.apache.org/jira/browse/SPARK-45500 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Web UI >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45474) Support top-level filtering in MasterPage JSON API
[ https://issues.apache.org/jira/browse/SPARK-45474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45474: -- Parent: SPARK-45756 Issue Type: Sub-task (was: Improvement) > Support top-level filtering in MasterPage JSON API > -- > > Key: SPARK-45474 > URL: https://issues.apache.org/jira/browse/SPARK-45474 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Web UI >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45197) Make StandaloneRestServer add JavaModuleOptions to drivers
[ https://issues.apache.org/jira/browse/SPARK-45197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45197: -- Parent: SPARK-45756 Issue Type: Sub-task (was: Bug) > Make StandaloneRestServer add JavaModuleOptions to drivers > -- > > Key: SPARK-45197 > URL: https://issues.apache.org/jira/browse/SPARK-45197 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45187) Fix WorkerPage to use the same pattern for `logPage` urls
[ https://issues.apache.org/jira/browse/SPARK-45187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45187: -- Parent: SPARK-45756 Issue Type: Sub-task (was: Bug) > Fix WorkerPage to use the same pattern for `logPage` urls > - > > Key: SPARK-45187 > URL: https://issues.apache.org/jira/browse/SPARK-45187 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.2.4, 3.3.2, 3.4.1, 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45197) Make StandaloneRestServer add JavaModuleOptions to drivers
[ https://issues.apache.org/jira/browse/SPARK-45197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45197: -- Parent: (was: SPARK-43831) Issue Type: Bug (was: Sub-task) > Make StandaloneRestServer add JavaModuleOptions to drivers > -- > > Key: SPARK-45197 > URL: https://issues.apache.org/jira/browse/SPARK-45197 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45174) Support spark.deploy.maxDrivers
[ https://issues.apache.org/jira/browse/SPARK-45174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45174: -- Parent: SPARK-45756 Issue Type: Sub-task (was: Improvement) > Support spark.deploy.maxDrivers > --- > > Key: SPARK-45174 > URL: https://issues.apache.org/jira/browse/SPARK-45174 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Like `spark.mesos.maxDrivers`, this issue aims to add > `spark.deploy.maxDrivers`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44857) Fix getBaseURI error in Spark Worker LogPage UI buttons
[ https://issues.apache.org/jira/browse/SPARK-44857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-44857: -- Parent: SPARK-45756 Issue Type: Sub-task (was: Bug) > Fix getBaseURI error in Spark Worker LogPage UI buttons > --- > > Key: SPARK-44857 > URL: https://issues.apache.org/jira/browse/SPARK-44857 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Web UI >Affects Versions: 3.2.0, 3.2.4, 3.3.2, 3.4.1, 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4 > > Attachments: Screenshot 2023-08-17 at 2.38.45 PM.png > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45756) Improve Spark Standalone Cluster
Dongjoon Hyun created SPARK-45756: - Summary: Improve Spark Standalone Cluster Key: SPARK-45756 URL: https://issues.apache.org/jira/browse/SPARK-45756 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org