[jira] [Commented] (SPARK-39519) Test failure in SPARK-39387 with JDK 11
[ https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557229#comment-17557229 ] Yang Jie commented on SPARK-39519: -- The default -XX:NewRatio is 2; changing it to 3 for the sql/core module, to enlarge the old generation, may be enough. I'm testing it. > Test failure in SPARK-39387 with JDK 11 > --- > > Key: SPARK-39519 > URL: https://issues.apache.org/jira/browse/SPARK-39519 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Yang Jie >Priority: Major > Attachments: image-2022-06-21-21-25-35-951.png, > image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, > image-2022-06-21-21-26-38-146.png > > > {code} > [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due > to overflow *** FAILED *** (3 seconds, 393 milliseconds) > [info] org.apache.spark.SparkException: Job aborted. > [info] at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279) > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111) > [info] at > 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) > [info] at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > {code} > https://github.com/apache/spark/runs/6919076419?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
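The NewRatio change discussed above is a GC-tuning tweak applied through the test JVM options of the sql/core module. A hypothetical sbt fragment, for illustration only (Spark's actual build configures test JVM options in SparkBuild.scala, and its flag list differs):

```scala
// Illustration only, not Spark's actual build code: raise -XX:NewRatio for
// the test JVM. NewRatio is the old/young generation size ratio, so the
// default of 2 gives old:young = 2:1, while 3 enlarges the old area to 3:1,
// leaving more headroom for long-lived objects created by the test suite.
Test / javaOptions += "-XX:NewRatio=3"
```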
[jira] [Resolved] (SPARK-39536) to_date function is returning incorrect value
[ https://issues.apache.org/jira/browse/SPARK-39536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39536. -- Resolution: Invalid > to_date function is returning incorrect value > - > > Key: SPARK-39536 > URL: https://issues.apache.org/jira/browse/SPARK-39536 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.2.1 > Environment: I'm facing this issue in databricks community edition. > I'm using DBR 10.4 LTS. >Reporter: Sridhar Varanasi >Priority: Major > Attachments: to_date_issue.PNG > > > Hi, > > I have a dataframe which has a column containing dates in string format. Now > while converting this to date type using to_date , it's giving incorrect date > format values. Following is the example code. > > > df = spark.createDataFrame( > [("11/25/1991",), ("1/2/1991",), ("11/30/1991",)], > ['date_str'] > ) > > spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY") > > df = (df > .withColumn('new_date' > ,to_date(col('date_str'),'mm/dd/'))) > display(df) > > > In the above dataframe we get the date converted correctly for the 2nd row > but for 1st and 3rd row we are getting incorrect dates post conversion. > > > Could you please look into this issue? > > Thanks, > Sridhar -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
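A likely reason this was closed as Invalid rather than fixed: Spark's datetime patterns are case-sensitive, with 'MM' meaning month-of-year and 'mm' meaning minute-of-hour, so a month/day/year string needs 'MM/dd/yyyy'. Python's strptime makes the same month/minute distinction, which gives a quick way to see the silent-misparse failure mode:

```python
# Case sensitivity of date patterns: month vs. minute.
# Spark: 'MM' = month, 'mm' = minute; strptime: '%m' = month, '%M' = minute.
from datetime import datetime

s = "11/25/1991"

# Correct month directive: parses as November 25, 1991.
ok = datetime.strptime(s, "%m/%d/%Y")
assert (ok.year, ok.month, ok.day) == (1991, 11, 25)

# Minute directive in the month position: no error is raised, but "11"
# silently lands in the minute field and the month falls back to its
# default of January.
wrong = datetime.strptime(s, "%M/%d/%Y")
assert (wrong.month, wrong.minute) == (1, 11)
```

The same kind of silent misparse produces wrong dates in Spark when 'mm' is used where 'MM' was intended.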
[jira] [Updated] (SPARK-39549) How to get access to the data created in different Spark Applications
[ https://issues.apache.org/jira/browse/SPARK-39549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39549: - Component/s: (was: Project Infra) > How to get access to the data created in different Spark Applications > - > > Key: SPARK-39549 > URL: https://issues.apache.org/jira/browse/SPARK-39549 > Project: Spark > Issue Type: Question > Components: Pandas API on Spark, PySpark >Affects Versions: 3.3.0 >Reporter: Chenyang Zhang >Priority: Major > > I am working on a project using PySpark and I am blocked because I want to > share data between different Spark applications. The situation is that we > have a running Java server which handles incoming requests with a thread > pool, and each thread has a corresponding Python process. We want to use > pandas on Spark, but have it so that any of the Python processes can access > the same data in Spark. For example, in one Python process we created a > SparkSession, read some data, and modified it using the pandas-on-Spark API, > and we want to access that data from a different Python process. The core > problem is how to share data between different SparkSessions, or how to let > different Python processes connect to the same SparkSession. I researched this a > bit, but it seems impossible to share data between different Python processes > without using an external DB or connecting to the same SparkSession. Generally, is > this possible, and what would be the recommended way to do this with the least > impact on performance? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39549) How to get access to the data created in different Spark Applications
[ https://issues.apache.org/jira/browse/SPARK-39549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39549. -- Resolution: Invalid For questions, let's use the Spark mailing list. > How to get access to the data created in different Spark Applications > - > > Key: SPARK-39549 > URL: https://issues.apache.org/jira/browse/SPARK-39549 > Project: Spark > Issue Type: Question > Components: Pandas API on Spark, Project Infra, PySpark >Affects Versions: 3.3.0 >Reporter: Chenyang Zhang >Priority: Major > > I am working on a project using PySpark and I am blocked because I want to > share data between different Spark applications. The situation is that we > have a running Java server which handles incoming requests with a thread > pool, and each thread has a corresponding Python process. We want to use > pandas on Spark, but have it so that any of the Python processes can access > the same data in Spark. For example, in one Python process we created a > SparkSession, read some data, and modified it using the pandas-on-Spark API, > and we want to access that data from a different Python process. The core > problem is how to share data between different SparkSessions, or how to let > different Python processes connect to the same SparkSession. I researched this a > bit, but it seems impossible to share data between different Python processes > without using an external DB or connecting to the same SparkSession. Generally, is > this possible, and what would be the recommended way to do this with the least > impact on performance? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39549) How to get access to the data created in different Spark Applications
[ https://issues.apache.org/jira/browse/SPARK-39549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557202#comment-17557202 ] Hyukjin Kwon commented on SPARK-39549: -- You should either write the data to a file or a table and read it in a different Spark application, or implement logic to share one Spark session across applications (e.g., as Zeppelin does). > How to get access to the data created in different Spark Applications > - > > Key: SPARK-39549 > URL: https://issues.apache.org/jira/browse/SPARK-39549 > Project: Spark > Issue Type: Question > Components: Pandas API on Spark, Project Infra, PySpark >Affects Versions: 3.3.0 >Reporter: Chenyang Zhang >Priority: Major > > I am working on a project using PySpark and I am blocked because I want to > share data between different Spark applications. The situation is that we > have a running Java server which handles incoming requests with a thread > pool, and each thread has a corresponding Python process. We want to use > pandas on Spark, but have it so that any of the Python processes can access > the same data in Spark. For example, in one Python process we created a > SparkSession, read some data, and modified it using the pandas-on-Spark API, > and we want to access that data from a different Python process. The core > problem is how to share data between different SparkSessions, or how to let > different Python processes connect to the same SparkSession. I researched this a > bit, but it seems impossible to share data between different Python processes > without using an external DB or connecting to the same SparkSession. Generally, is > this possible, and what would be the recommended way to do this with the least > impact on performance? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
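The first suggestion in the comment above (persist from one application, read from another) can be sketched without a running cluster. In the sketch below, sqlite3 stands in for the shared table; in real Spark code this would be df.write.saveAsTable(...) in the producer application and spark.read.table(...) in the consumer (or Parquet files on shared storage). Function names are hypothetical:

```python
# Sketch of the "write a table in one app, read it in another" hand-off.
import os
import sqlite3
import tempfile

# Shared location both "applications" can reach (a warehouse path in Spark).
shared = os.path.join(tempfile.mkdtemp(), "shared.db")

def producer_app():
    # First application: computes results and persists them to the shared table.
    with sqlite3.connect(shared) as con:
        con.execute("CREATE TABLE results (id INTEGER, score REAL)")
        con.executemany("INSERT INTO results VALUES (?, ?)",
                        [(1, 0.9), (2, 0.7)])

def consumer_app():
    # Second application: reads the persisted results independently.
    with sqlite3.connect(shared) as con:
        return con.execute("SELECT id, score FROM results ORDER BY id").fetchall()

producer_app()
print(consumer_app())  # [(1, 0.9), (2, 0.7)]
```

The design point is that the storage layer, not the SparkSession, is the shared state; each application keeps its own session.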
[jira] [Commented] (SPARK-39551) Add AQE invalid plan check
[ https://issues.apache.org/jira/browse/SPARK-39551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557201#comment-17557201 ] Apache Spark commented on SPARK-39551: -- User 'maryannxue' has created a pull request for this issue: https://github.com/apache/spark/pull/36953 > Add AQE invalid plan check > -- > > Key: SPARK-39551 > URL: https://issues.apache.org/jira/browse/SPARK-39551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wei Xue >Priority: Minor > > AQE logical optimization rules can lead to invalid physical plans as certain > physical plan nodes are not compatible with others. E.g., > `BroadcastExchangeExec` can only work as a direct child of broadcast join > nodes. > Logical optimizations, on the other hand, are not (and should not be) aware > of such restrictions. So a general solution here is to check for invalid > plans and throw exceptions, which can be caught by AQE replanning process. > And if such an exception is captured, AQE can void the current replanning > result and keep using the latest valid plan. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39551) Add AQE invalid plan check
[ https://issues.apache.org/jira/browse/SPARK-39551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557199#comment-17557199 ] Apache Spark commented on SPARK-39551: -- User 'maryannxue' has created a pull request for this issue: https://github.com/apache/spark/pull/36953 > Add AQE invalid plan check > -- > > Key: SPARK-39551 > URL: https://issues.apache.org/jira/browse/SPARK-39551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wei Xue >Priority: Minor > > AQE logical optimization rules can lead to invalid physical plans as certain > physical plan nodes are not compatible with others. E.g., > `BroadcastExchangeExec` can only work as a direct child of broadcast join > nodes. > Logical optimizations, on the other hand, are not (and should not be) aware > of such restrictions. So a general solution here is to check for invalid > plans and throw exceptions, which can be caught by AQE replanning process. > And if such an exception is captured, AQE can void the current replanning > result and keep using the latest valid plan. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39551) Add AQE invalid plan check
[ https://issues.apache.org/jira/browse/SPARK-39551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39551: Assignee: (was: Apache Spark) > Add AQE invalid plan check > -- > > Key: SPARK-39551 > URL: https://issues.apache.org/jira/browse/SPARK-39551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wei Xue >Priority: Minor > > AQE logical optimization rules can lead to invalid physical plans as certain > physical plan nodes are not compatible with others. E.g., > `BroadcastExchangeExec` can only work as a direct child of broadcast join > nodes. > Logical optimizations, on the other hand, are not (and should not be) aware > of such restrictions. So a general solution here is to check for invalid > plans and throw exceptions, which can be caught by AQE replanning process. > And if such an exception is captured, AQE can void the current replanning > result and keep using the latest valid plan. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39551) Add AQE invalid plan check
[ https://issues.apache.org/jira/browse/SPARK-39551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39551: Assignee: Apache Spark > Add AQE invalid plan check > -- > > Key: SPARK-39551 > URL: https://issues.apache.org/jira/browse/SPARK-39551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wei Xue >Assignee: Apache Spark >Priority: Minor > > AQE logical optimization rules can lead to invalid physical plans as certain > physical plan nodes are not compatible with others. E.g., > `BroadcastExchangeExec` can only work as a direct child of broadcast join > nodes. > Logical optimizations, on the other hand, are not (and should not be) aware > of such restrictions. So a general solution here is to check for invalid > plans and throw exceptions, which can be caught by AQE replanning process. > And if such an exception is captured, AQE can void the current replanning > result and keep using the latest valid plan. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39551) Add AQE invalid plan check
Wei Xue created SPARK-39551: --- Summary: Add AQE invalid plan check Key: SPARK-39551 URL: https://issues.apache.org/jira/browse/SPARK-39551 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Wei Xue AQE logical optimization rules can lead to invalid physical plans, as certain physical plan nodes are not compatible with others. E.g., `BroadcastExchangeExec` can only work as a direct child of broadcast join nodes. Logical optimizations, on the other hand, are not (and should not be) aware of such restrictions. So a general solution here is to check for invalid plans and throw exceptions, which can be caught by the AQE replanning process. If such an exception is caught, AQE can discard the current replanning result and keep using the latest valid plan. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
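The validate-and-fall-back flow the description outlines is a generic pattern. A schematic sketch with hypothetical names (this is not Spark's AdaptiveSparkPlanExec code; plans are modeled as plain strings just to show the control flow):

```python
# Schematic of "check for invalid plans and keep the latest valid plan".

class InvalidPlanError(Exception):
    """Raised when a replanned physical plan violates a node constraint."""

def validate(plan):
    # Stand-in check, e.g. a broadcast exchange is only valid directly
    # under a broadcast join node.
    if "broadcast_exchange_without_broadcast_join" in plan:
        raise InvalidPlanError(plan)

def replan(current_plan, optimize):
    candidate = optimize(current_plan)
    try:
        validate(candidate)
    except InvalidPlanError:
        return current_plan   # void the replanning result, keep the last valid plan
    return candidate

plan = "sort_merge_join"
bad = replan(plan, lambda p: p + "+broadcast_exchange_without_broadcast_join")
good = replan(plan, lambda p: p + "+reused_exchange")
assert bad == "sort_merge_join"                    # invalid candidate discarded
assert good == "sort_merge_join+reused_exchange"   # valid candidate adopted
```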
[jira] [Updated] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance
[ https://issues.apache.org/jira/browse/SPARK-39545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-39545: - Description: {{ExpressionSet ++}} method in the master branch a little slower than the branch-3.3 with Scala-2.13 For example, write a microbenchmark as follows and run with Scala 2.13: {code:java} val valuesPerIteration = 10 val benchmark = new Benchmark("Test ExpressionSet ++ ", valuesPerIteration, output = output) val aUpper = AttributeReference("A", IntegerType)(exprId = ExprId(1)) val initialSet = ExpressionSet(aUpper + 1 :: Rand(0) :: Nil) val setToAddWithSameDeterministicExpression = ExpressionSet(aUpper + 1 :: Rand(0) :: Nil) benchmark.addCase("Test ++") { _: Int => for (_ <- 0L until valuesPerIteration) { initialSet ++ setToAddWithSameDeterministicExpression } } benchmark.run() {code} *branch-3.3 result:* {code:java} OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 4.14.0_1-0-0-45 Intel(R) Xeon(R) Gold 6XXXC CPU @ 2.60GHz Test ExpressionSet ++ : Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative Test ++ 14 16 4 7.2 139.1 1.0X {code} *master result :* {code:java} OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 4.14.0_1-0-0-45 Intel(R) Xeon(R) Gold 6XXXC CPU @ 2.60GHz Test ExpressionSet ++ : Best Time(ms) Avg Time(ms) Stdev(ms)Rate(M/s) Per Row(ns) Relative Test ++ 16 19 5 6.1 163.9 1.0X {code} was:ExpressionSet ++ with > Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the > performance > - > > Key: SPARK-39545 > URL: https://issues.apache.org/jira/browse/SPARK-39545 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > {{ExpressionSet ++}} method in the master branch a little slower than the > branch-3.3 with Scala-2.13 > > For example, write a microbenchmark as follows and run with Scala 2.13: > {code:java} > val valuesPerIteration = 10 > val benchmark = new Benchmark("Test ExpressionSet ++ ", > 
valuesPerIteration, output = output) > val aUpper = AttributeReference("A", IntegerType)(exprId = ExprId(1)) > val initialSet = ExpressionSet(aUpper + 1 :: Rand(0) :: Nil) > val setToAddWithSameDeterministicExpression = ExpressionSet(aUpper + 1 :: > Rand(0) :: Nil) > benchmark.addCase("Test ++") { _: Int => > for (_ <- 0L until valuesPerIteration) { > initialSet ++ setToAddWithSameDeterministicExpression > } > } > benchmark.run() {code} > *branch-3.3 result:* > > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 4.14.0_1-0-0-45 > Intel(R) Xeon(R) Gold 6XXXC CPU @ 2.60GHz > Test ExpressionSet ++ : Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > Test ++ 14 16 >4 7.2 139.1 1.0X > {code} > > *master result :* > {code:java} > OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 4.14.0_1-0-0-45 > Intel(R) Xeon(R) Gold 6XXXC CPU @ 2.60GHz > Test ExpressionSet ++ : Best Time(ms) Avg Time(ms) > Stdev(ms)Rate(M/s) Per Row(ns) Relative > > Test ++ 16 19 >5 6.1 163.9 1.0X > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39541) [Yarn] Diagnostics of YARN UI did not display the exception of the driver when the driver exits before registerAM
[ https://issues.apache.org/jira/browse/SPARK-39541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39541: Assignee: (was: Apache Spark) > [Yarn] Diagnostics of YARN UI did not display the exception of the driver when > the driver exits before registerAM > --- > > Key: SPARK-39541 > URL: https://issues.apache.org/jira/browse/SPARK-39541 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.0 >Reporter: liangyongyuan >Priority: Major > > If a job is submitted in YARN cluster mode and the driver exits before > registerAM, the Diagnostics field of the YARN UI does not show the exception > thrown by the driver. The YARN UI only shows: > Application application_xxx failed 1 times (global limit =10; local limit is > =1) due to AM Container for appattempt_xxx_01 exited with exitCode: 13 > > The user must check the Spark log to find the real reason. For example, the Spark log shows: > {code:java} > 2022-06-21,17:58:28,273 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: > User class threw exception: java.lang.ArithmeticException: / by zero > java.lang.ArithmeticException: / by zero > at org.examples.appErrorDemo3$.main(appErrorDemo3.scala:10) > at org.examples.appErrorDemo3.main(appErrorDemo3.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:736) > {code} > > The cause is that a driver that exits before registerAM never calls > unregisterAM, so the YARN UI cannot show the real diagnostic > information. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39541) [Yarn] Diagnostics of YARN UI did not display the exception of the driver when the driver exits before registerAM
[ https://issues.apache.org/jira/browse/SPARK-39541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557190#comment-17557190 ] Apache Spark commented on SPARK-39541: -- User 'lyy-pineapple' has created a pull request for this issue: https://github.com/apache/spark/pull/36952 > [Yarn] Diagnostics of YARN UI did not display the exception of the driver when > the driver exits before registerAM > --- > > Key: SPARK-39541 > URL: https://issues.apache.org/jira/browse/SPARK-39541 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.0 >Reporter: liangyongyuan >Priority: Major > > If a job is submitted in YARN cluster mode and the driver exits before > registerAM, the Diagnostics field of the YARN UI does not show the exception > thrown by the driver. The YARN UI only shows: > Application application_xxx failed 1 times (global limit =10; local limit is > =1) due to AM Container for appattempt_xxx_01 exited with exitCode: 13 > > The user must check the Spark log to find the real reason. For example, the Spark log shows: > {code:java} > 2022-06-21,17:58:28,273 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: > User class threw exception: java.lang.ArithmeticException: / by zero > java.lang.ArithmeticException: / by zero > at org.examples.appErrorDemo3$.main(appErrorDemo3.scala:10) > at org.examples.appErrorDemo3.main(appErrorDemo3.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:736) > {code} > > The cause is that a driver that exits before registerAM never calls > unregisterAM, so the YARN UI cannot show the real diagnostic > information. 
> -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39541) [Yarn] Diagnostics of YARN UI did not display the exception of the driver when the driver exits before registerAM
[ https://issues.apache.org/jira/browse/SPARK-39541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39541: Assignee: Apache Spark > [Yarn] Diagnostics of YARN UI did not display the exception of the driver when > the driver exits before registerAM > --- > > Key: SPARK-39541 > URL: https://issues.apache.org/jira/browse/SPARK-39541 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.0 >Reporter: liangyongyuan >Assignee: Apache Spark >Priority: Major > > If a job is submitted in YARN cluster mode and the driver exits before > registerAM, the Diagnostics field of the YARN UI does not show the exception > thrown by the driver. The YARN UI only shows: > Application application_xxx failed 1 times (global limit =10; local limit is > =1) due to AM Container for appattempt_xxx_01 exited with exitCode: 13 > > The user must check the Spark log to find the real reason. For example, the Spark log shows: > {code:java} > 2022-06-21,17:58:28,273 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: > User class threw exception: java.lang.ArithmeticException: / by zero > java.lang.ArithmeticException: / by zero > at org.examples.appErrorDemo3$.main(appErrorDemo3.scala:10) > at org.examples.appErrorDemo3.main(appErrorDemo3.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:736) > {code} > > The cause is that a driver that exits before registerAM never calls > unregisterAM, so the YARN UI cannot show the real diagnostic > information. 
> -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results
[ https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557179#comment-17557179 ] Yuming Wang commented on SPARK-38614: - Thank you for reporting this issue. Workaround: {code:sql} set spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.LimitPushDownThroughWindow; {code} > After Spark update, df.show() shows incorrect F.percent_rank results > > > Key: SPARK-38614 > URL: https://issues.apache.org/jira/browse/SPARK-38614 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: ZygD >Priority: Major > Labels: correctness > > Expected result is obtained using Spark 3.1.2, but not 3.2.0, 3.2.1 or 3.3.0. > *Minimal reproducible example* > {code:java} > from pyspark.sql import SparkSession, functions as F, Window as W > spark = SparkSession.builder.getOrCreate() > > df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id'))) > df.show(3) > df.show(5) {code} > *Expected result* > {code:java} > +---++ > | id| pr| > +---++ > | 0| 0.0| > | 1|0.01| > | 2|0.02| > +---++ > only showing top 3 rows > +---++ > | id| pr| > +---++ > | 0| 0.0| > | 1|0.01| > | 2|0.02| > | 3|0.03| > | 4|0.04| > +---++ > only showing top 5 rows{code} > *Actual result* > {code:java} > +---+--+ > | id|pr| > +---+--+ > | 0| 0.0| > | 1|0.| > | 2|0.| > +---+--+ > only showing top 3 rows > +---+---+ > | id| pr| > +---+---+ > | 0|0.0| > | 1|0.2| > | 2|0.4| > | 3|0.6| > | 4|0.8| > +---+---+ > only showing top 5 rows{code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
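For context on why this is labeled a correctness issue: percent_rank is defined as (rank - 1) / (n - 1) over the whole window partition, so if the LimitPushDownThroughWindow rule lets a LIMIT shrink the window's input, the denominator shrinks too and different show() calls print different values. A pure-Python check of the expected numbers from the ticket's 101-row example:

```python
# percent_rank over an ordered window of n distinct values:
# pr = (rank - 1) / (n - 1), computed over the WHOLE partition.
def percent_rank(n):
    # With distinct, already-ordered ids, rank - 1 equals the index.
    return [i / (n - 1) for i in range(n)]

full = percent_rank(101)
assert full[:3] == [0.0, 0.01, 0.02]   # what df.show(3) should print

# If a LIMIT k is pushed below the window, the window only sees k rows and
# the denominator drops from 100 to k - 1, changing every value -- which is
# why the buggy output varies with the show() argument.
assert percent_rank(5) == [0.0, 0.25, 0.5, 0.75, 1.0]
```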
[jira] [Updated] (SPARK-39533) Deprecate scoreLabelsWeight in BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-39533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-39533: - Summary: Deprecate scoreLabelsWeight in BinaryClassificationMetrics (was: Remove scoreLabelsWeight in BinaryClassificationMetrics) > Deprecate scoreLabelsWeight in BinaryClassificationMetrics > -- > > Key: SPARK-39533 > URL: https://issues.apache.org/jira/browse/SPARK-39533 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 3.4.0 >Reporter: zhengruifeng >Priority: Minor > > scoreLabelsWeight in BinaryClassificationMetrics is a public variable, > but it should be private; moreover, it is only used once, so it can be moved to the > internal call site. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39533) Deprecate scoreLabelsWeight in BinaryClassificationMetrics
[ https://issues.apache.org/jira/browse/SPARK-39533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng updated SPARK-39533: - Description: scoreLabelsWeight in BinaryClassificationMetrics is a public variable, but it should be private; moreover, it is only used once, so deprecate it now and remove it in 4.0.0 was: scoreLabelsWeight in BinaryClassificationMetrics is a public variable, but it should be private, moveover, it is only used once, so move it to the internal call place. > Deprecate scoreLabelsWeight in BinaryClassificationMetrics > -- > > Key: SPARK-39533 > URL: https://issues.apache.org/jira/browse/SPARK-39533 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 3.4.0 >Reporter: zhengruifeng >Priority: Minor > > scoreLabelsWeight in BinaryClassificationMetrics is a public variable, > but it should be private; moreover, it is only used once, so deprecate it now > and remove it in 4.0.0 -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39540) Upgrade mysql-connector-java to 8.0.28
[ https://issues.apache.org/jira/browse/SPARK-39540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39540. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36938 [https://github.com/apache/spark/pull/36938] > Upgrade mysql-connector-java to 8.0.28 > -- > > Key: SPARK-39540 > URL: https://issues.apache.org/jira/browse/SPARK-39540 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 > > > Improper Handling of Insufficient Permissions or Privileges in MySQL > Connectors Java. > Vulnerability in the MySQL Connectors product of Oracle MySQL (component: > Connector/J). Supported versions that are affected are 8.0.27 and prior. > Difficult to exploit vulnerability allows high privileged attacker with > network access via multiple protocols to compromise MySQL Connectors. > Successful attacks of this vulnerability can result in takeover of MySQL > Connectors. CVSS 3.1 Base Score 6.6 (Confidentiality, Integrity and > Availability impacts). CVSS Vector: > (CVSS:3.1/AV:N/AC:H/PR:H/UI:N/S:U/C:H/I:H/A:H). > [CVE-2022-21363|https://nvd.nist.gov/vuln/detail/CVE-2022-21363] -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39540) Upgrade mysql-connector-java to 8.0.28
[ https://issues.apache.org/jira/browse/SPARK-39540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-39540: - Assignee: Bjørn Jørgensen > Upgrade mysql-connector-java to 8.0.28 > -- > > Key: SPARK-39540 > URL: https://issues.apache.org/jira/browse/SPARK-39540 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > > Improper Handling of Insufficient Permissions or Privileges in MySQL > Connectors Java. > Vulnerability in the MySQL Connectors product of Oracle MySQL (component: > Connector/J). Supported versions that are affected are 8.0.27 and prior. > Difficult to exploit vulnerability allows high privileged attacker with > network access via multiple protocols to compromise MySQL Connectors. > Successful attacks of this vulnerability can result in takeover of MySQL > Connectors. CVSS 3.1 Base Score 6.6 (Confidentiality, Integrity and > Availability impacts). CVSS Vector: > (CVSS:3.1/AV:N/AC:H/PR:H/UI:N/S:U/C:H/I:H/A:H). > [CVE-2022-21363|https://nvd.nist.gov/vuln/detail/CVE-2022-21363] -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39540) Upgrade mysql-connector-java to 8.0.29
[ https://issues.apache.org/jira/browse/SPARK-39540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-39540: -- Summary: Upgrade mysql-connector-java to 8.0.29 (was: Upgrade mysql-connector-java to 8.0.28) > Upgrade mysql-connector-java to 8.0.29 > -- > > Key: SPARK-39540 > URL: https://issues.apache.org/jira/browse/SPARK-39540 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 > > > Improper Handling of Insufficient Permissions or Privileges in MySQL > Connectors Java. > Vulnerability in the MySQL Connectors product of Oracle MySQL (component: > Connector/J). Supported versions that are affected are 8.0.27 and prior. > Difficult to exploit vulnerability allows high privileged attacker with > network access via multiple protocols to compromise MySQL Connectors. > Successful attacks of this vulnerability can result in takeover of MySQL > Connectors. CVSS 3.1 Base Score 6.6 (Confidentiality, Integrity and > Availability impacts). CVSS Vector: > (CVSS:3.1/AV:N/AC:H/PR:H/UI:N/S:U/C:H/I:H/A:H). > [CVE-2022-21363|https://nvd.nist.gov/vuln/detail/CVE-2022-21363] -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results
[ https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38614: Assignee: Apache Spark > After Spark update, df.show() shows incorrect F.percent_rank results > > > Key: SPARK-38614 > URL: https://issues.apache.org/jira/browse/SPARK-38614 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: ZygD >Assignee: Apache Spark >Priority: Major > Labels: correctness > > Expected result is obtained using Spark 3.1.2, but not 3.2.0, 3.2.1 or 3.3.0. > *Minimal reproducible example* > {code:java} > from pyspark.sql import SparkSession, functions as F, Window as W > spark = SparkSession.builder.getOrCreate() > > df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id'))) > df.show(3) > df.show(5) {code} > *Expected result* > {code:java} > +---++ > | id| pr| > +---++ > | 0| 0.0| > | 1|0.01| > | 2|0.02| > +---++ > only showing top 3 rows > +---++ > | id| pr| > +---++ > | 0| 0.0| > | 1|0.01| > | 2|0.02| > | 3|0.03| > | 4|0.04| > +---++ > only showing top 5 rows{code} > *Actual result* > {code:java} > +---+--+ > | id|pr| > +---+--+ > | 0| 0.0| > | 1|0.| > | 2|0.| > +---+--+ > only showing top 3 rows > +---+---+ > | id| pr| > +---+---+ > | 0|0.0| > | 1|0.2| > | 2|0.4| > | 3|0.6| > | 4|0.8| > +---+---+ > only showing top 5 rows{code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results
[ https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557160#comment-17557160 ] Apache Spark commented on SPARK-38614: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/36951 > After Spark update, df.show() shows incorrect F.percent_rank results > > > Key: SPARK-38614 > URL: https://issues.apache.org/jira/browse/SPARK-38614 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: ZygD >Priority: Major > Labels: correctness > > Expected result is obtained using Spark 3.1.2, but not 3.2.0, 3.2.1 or 3.3.0. > *Minimal reproducible example* > {code:java} > from pyspark.sql import SparkSession, functions as F, Window as W > spark = SparkSession.builder.getOrCreate() > > df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id'))) > df.show(3) > df.show(5) {code} > *Expected result* > {code:java} > +---++ > | id| pr| > +---++ > | 0| 0.0| > | 1|0.01| > | 2|0.02| > +---++ > only showing top 3 rows > +---++ > | id| pr| > +---++ > | 0| 0.0| > | 1|0.01| > | 2|0.02| > | 3|0.03| > | 4|0.04| > +---++ > only showing top 5 rows{code} > *Actual result* > {code:java} > +---+--+ > | id|pr| > +---+--+ > | 0| 0.0| > | 1|0.| > | 2|0.| > +---+--+ > only showing top 3 rows > +---+---+ > | id| pr| > +---+---+ > | 0|0.0| > | 1|0.2| > | 2|0.4| > | 3|0.6| > | 4|0.8| > +---+---+ > only showing top 5 rows{code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results
[ https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557159#comment-17557159 ] Apache Spark commented on SPARK-38614: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/36951 > After Spark update, df.show() shows incorrect F.percent_rank results > > > Key: SPARK-38614 > URL: https://issues.apache.org/jira/browse/SPARK-38614 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: ZygD >Priority: Major > Labels: correctness > > Expected result is obtained using Spark 3.1.2, but not 3.2.0, 3.2.1 or 3.3.0. > *Minimal reproducible example* > {code:java} > from pyspark.sql import SparkSession, functions as F, Window as W > spark = SparkSession.builder.getOrCreate() > > df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id'))) > df.show(3) > df.show(5) {code} > *Expected result* > {code:java} > +---++ > | id| pr| > +---++ > | 0| 0.0| > | 1|0.01| > | 2|0.02| > +---++ > only showing top 3 rows > +---++ > | id| pr| > +---++ > | 0| 0.0| > | 1|0.01| > | 2|0.02| > | 3|0.03| > | 4|0.04| > +---++ > only showing top 5 rows{code} > *Actual result* > {code:java} > +---+--+ > | id|pr| > +---+--+ > | 0| 0.0| > | 1|0.| > | 2|0.| > +---+--+ > only showing top 3 rows > +---+---+ > | id| pr| > +---+---+ > | 0|0.0| > | 1|0.2| > | 2|0.4| > | 3|0.6| > | 4|0.8| > +---+---+ > only showing top 5 rows{code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results
[ https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38614: Assignee: (was: Apache Spark) > After Spark update, df.show() shows incorrect F.percent_rank results > > > Key: SPARK-38614 > URL: https://issues.apache.org/jira/browse/SPARK-38614 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0 >Reporter: ZygD >Priority: Major > Labels: correctness > > Expected result is obtained using Spark 3.1.2, but not 3.2.0, 3.2.1 or 3.3.0. > *Minimal reproducible example* > {code:java} > from pyspark.sql import SparkSession, functions as F, Window as W > spark = SparkSession.builder.getOrCreate() > > df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id'))) > df.show(3) > df.show(5) {code} > *Expected result* > {code:java} > +---++ > | id| pr| > +---++ > | 0| 0.0| > | 1|0.01| > | 2|0.02| > +---++ > only showing top 3 rows > +---++ > | id| pr| > +---++ > | 0| 0.0| > | 1|0.01| > | 2|0.02| > | 3|0.03| > | 4|0.04| > +---++ > only showing top 5 rows{code} > *Actual result* > {code:java} > +---+--+ > | id|pr| > +---+--+ > | 0| 0.0| > | 1|0.| > | 2|0.| > +---+--+ > only showing top 3 rows > +---+---+ > | id| pr| > +---+---+ > | 0|0.0| > | 1|0.2| > | 2|0.4| > | 3|0.6| > | 4|0.8| > +---+---+ > only showing top 5 rows{code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
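The "Expected result" in the SPARK-38614 reports above follows the standard percent_rank definition, pr = (rank - 1) / (n - 1), computed over the full window; a correct implementation must give the same values regardless of how many rows show() displays. A plain-Python sketch of that definition (no Spark required; the helper name is ours, not part of any API):

```python
# Plain-Python model of the percent_rank definition the "Expected result"
# above follows: pr = (rank - 1) / (n - 1) over the FULL partition, so
# truncating the display (show(3) vs show(5)) must not change the values.
def percent_rank(values):
    """Return percent_rank per element, assuming values are sorted and distinct."""
    n = len(values)
    if n == 1:
        return [0.0]
    return [i / (n - 1) for i in range(n)]

ranks = percent_rank(list(range(101)))  # ids 0..100, as in spark.range(101)
print(ranks[:5])  # [0.0, 0.01, 0.02, 0.03, 0.04]
```

The bug report shows the 3.2.x/3.3.0 behavior recomputing ranks over only the displayed rows, which this definition rules out.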
[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars
[ https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39494: - Description: Currently, DataFrame creation from a list of scalars is unsupported as below: |>>> spark.createDataFrame([1, 2]) Traceback (most recent call last): ... *raise* TypeError("Can not infer schema for type: %s" % type(row)) TypeError: Can *not* infer schema *for* type: <{*}class{*} '{*}int{*}'>| However, cases below are supported. |>>> spark.createDataFrame([(1,), (2,)]).collect() [Row(_1=1), Row(_1=2)]| |>>> schema StructType([StructField('_1', LongType(), \{*}True\{*})]) >>> spark.createDataFrame([1, 2], schema=schema).collect() [Row(_1=1), Row(_1=2)]| In addition, Spark DataFrame Scala API supports creating a DataFrame from a list of scalars as below: |scala> Seq(1, 2).toDF().collect() res6: Array[org.apache.spark.sql.Row] = Array([1], [2])| To maintain API consistency, we propose to support DataFrame creation from a list of scalars. See more at [https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing.|https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing] was: Currently, DataFrame creation from a list of scalars is unsupported as below: |>>> spark.createDataFrame([1, 2]) Traceback (most recent call last): ... *raise* TypeError("Can not infer schema for type: %s" % type(row)) TypeError: Can *not* infer schema *for* type: <{*}class{*} '{*}int{*}'>| However, cases below are supported. 
|>>> spark.createDataFrame([(1,), (2,)]).collect() [Row(_1=1), Row(_1=2)]| |>>> schema StructType([StructField('_1', LongType(), {*}True{*})]) >>> spark.createDataFrame([1, 2], schema=schema).collect() [Row(_1=1), Row(_1=2)]| In addition, Spark DataFrame Scala API supports creating a DataFrame from a list of scalars as below: |scala> Seq(1, 2).toDF().collect() res6: Array[org.apache.spark.sql.Row] = Array([1], [2]| To maintain API consistency, we propose to support DataFrame creation from a list of scalars. See more at [https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing.|https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing] > Support `createDataFrame` from a list of scalars > > > Key: SPARK-39494 > URL: https://issues.apache.org/jira/browse/SPARK-39494 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, DataFrame creation from a list of scalars is unsupported as below: > |>>> spark.createDataFrame([1, 2]) > Traceback (most recent call last): > ... > *raise* TypeError("Can not infer schema for type: %s" % type(row)) > TypeError: Can *not* infer schema *for* type: <{*}class{*} '{*}int{*}'>| > > However, cases below are supported. > |>>> spark.createDataFrame([(1,), (2,)]).collect() > [Row(_1=1), Row(_1=2)]| > > |>>> schema > StructType([StructField('_1', LongType(), \{*}True\{*})]) > >>> spark.createDataFrame([1, 2], schema=schema).collect() > [Row(_1=1), Row(_1=2)]| > > In addition, Spark DataFrame Scala API supports creating a DataFrame from a > list of scalars as below: > |scala> Seq(1, 2).toDF().collect() > res6: Array[org.apache.spark.sql.Row] = Array([1], [2])| > > To maintain API consistency, we propose to support DataFrame creation from a > list of scalars. 
See more at > [https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing.|https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing] > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
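The proposal above amounts to normalizing a list of bare scalars into the one-column row form that createDataFrame already accepts ([(1,), (2,)]). A minimal sketch of that normalization in plain Python (the helper name is hypothetical, not PySpark API):

```python
# Hypothetical helper illustrating the proposed behavior: bare scalars are
# wrapped as single-element tuples, the form createDataFrame already
# supports, so [1, 2] and [(1,), (2,)] would both yield Row(_1=1), Row(_1=2).
def normalize_scalars(data):
    """Wrap bare scalars as one-element tuples; leave tuple rows unchanged."""
    return [row if isinstance(row, tuple) else (row,) for row in data]

print(normalize_scalars([1, 2]))        # [(1,), (2,)]
print(normalize_scalars([(1,), (2,)]))  # unchanged: [(1,), (2,)]
```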
[jira] [Updated] (SPARK-39496) Inline eval path cannot handle null structs
[ https://issues.apache.org/jira/browse/SPARK-39496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39496: - Fix Version/s: 3.1.3 > Inline eval path cannot handle null structs > --- > > Key: SPARK-39496 > URL: https://issues.apache.org/jira/browse/SPARK-39496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.1, 3.3.0, 3.4.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Fix For: 3.1.3, 3.2.2, 3.4.0, 3.3.1 > > > This issue is somewhat similar to SPARK-39061, but for the eval path rather > than the codegen path. > Example: > {noformat} > set spark.sql.codegen.wholeStage=false; > select inline(array(named_struct('a', 1, 'b', 2), null)); > {noformat} > This results in a NullPointerException: > {noformat} > 22/06/16 15:10:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122) > {noformat} > The next example doesn't require setting {{spark.sql.codegen.wholeStage}} to > {{{}false{}}}: > {noformat} > val dfWide = (Seq((1)) > .toDF("col0") > .selectExpr(Seq.tabulate(99)(x => s"$x as col${x + 1}"): _*)) > val df = (dfWide > .selectExpr("*", "array(named_struct('a', 1, 'b', 2), null) as > struct_array")) > df.selectExpr("*", "inline(struct_array)").collect > {noformat} > The result is similar: > {noformat} > 22/06/16 15:18:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ > 1] > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:80) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_8$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown 
> Source) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122) > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
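The intended behavior for inline() over an array containing a NULL struct — per the related codegen fix in SPARK-39061 — is to emit a row of NULL fields rather than throw. A plain-Python model of that semantics (a sketch of the intent, not Spark's eval-path code):

```python
# Plain-Python model of inline() over array<struct<a:int, b:int>>, assuming
# (as the SPARK-39061 codegen fix established) that a NULL struct element
# should produce a row of NULL fields instead of a NullPointerException.
def inline(array, num_fields=2):
    rows = []
    for struct in array:
        if struct is None:
            rows.append((None,) * num_fields)  # null struct -> row of nulls
        else:
            rows.append(struct)
    return rows

print(inline([(1, 2), None]))  # [(1, 2), (None, None)]
```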
[jira] [Commented] (SPARK-39519) Test failure in SPARK-39387 with JDK 11
[ https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557155#comment-17557155 ] Hyukjin Kwon commented on SPARK-39519: -- Thanks for your investigation. > Test failure in SPARK-39387 with JDK 11 > --- > > Key: SPARK-39519 > URL: https://issues.apache.org/jira/browse/SPARK-39519 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Yang Jie >Priority: Major > Attachments: image-2022-06-21-21-25-35-951.png, > image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, > image-2022-06-21-21-26-38-146.png > > > {code} > [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due > to overflow *** FAILED *** (3 seconds, 393 milliseconds) > [info] org.apache.spark.SparkException: Job aborted. > [info] at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279) > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171) > [info] at > 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) > [info] at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > {code} > https://github.com/apache/spark/runs/6919076419?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars
[ https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39494: - Description: Currently, DataFrame creation from a list of scalars is unsupported as below: |>>> spark.createDataFrame([1, 2]) Traceback (most recent call last): ... *raise* TypeError("Can not infer schema for type: %s" % type(row)) TypeError: Can *not* infer schema *for* type: <{*}class{*} '{*}int{*}'>| However, cases below are supported. |>>> spark.createDataFrame([(1,), (2,)]).collect() [Row(_1=1), Row(_1=2)]| |>>> schema StructType([StructField('_1', LongType(), {*}True{*})]) >>> spark.createDataFrame([1, 2], schema=schema).collect() [Row(_1=1), Row(_1=2)]| In addition, Spark DataFrame Scala API supports creating a DataFrame from a list of scalars as below: |scala> Seq(1, 2).toDF().collect() res6: Array[org.apache.spark.sql.Row] = Array([1], [2]| To maintain API consistency, we propose to support DataFrame creation from a list of scalars. See more at [https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing.|https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing] was: - Support `createDataFrame` from a list of scalars. - Standardize error messages when the input list contains any scalars. > Support `createDataFrame` from a list of scalars > > > Key: SPARK-39494 > URL: https://issues.apache.org/jira/browse/SPARK-39494 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Currently, DataFrame creation from a list of scalars is unsupported as below: > |>>> spark.createDataFrame([1, 2]) > Traceback (most recent call last): > ... > *raise* TypeError("Can not infer schema for type: %s" % type(row)) > TypeError: Can *not* infer schema *for* type: <{*}class{*} '{*}int{*}'>| > > However, cases below are supported. 
> |>>> spark.createDataFrame([(1,), (2,)]).collect() > [Row(_1=1), Row(_1=2)]| > > |>>> schema > StructType([StructField('_1', LongType(), {*}True{*})]) > >>> spark.createDataFrame([1, 2], schema=schema).collect() > [Row(_1=1), Row(_1=2)]| > > In addition, Spark DataFrame Scala API supports creating a DataFrame from a > list of scalars as below: > |scala> Seq(1, 2).toDF().collect() > res6: Array[org.apache.spark.sql.Row] = Array([1], [2]| > > To maintain API consistency, we propose to support DataFrame creation from a > list of scalars. See more at > [https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing.|https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing] > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
[ https://issues.apache.org/jira/browse/SPARK-39550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-39550: - Description: When Arrow Execution is enabled, {code:java} >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") 'true' >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() {'__index_level_0__': 1, '__index_level_1__': 'a'} 1 {'__index_level_0__': 2, '__index_level_1__': 'b'} 1 dtype: int64 {code} When Arrow Execution is disabled, {code:java} >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") 'false' >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() (1, a) 1 (2, b) 1 dtype: int64 {code} Notice how indexes of their results are different. Especially, `value_counts` returns an Index (rather than a MultiIndex), under the hood, a Spark column of StructType (rather than multiple Spark columns), so when Arrow Execution is enabled, Arrow converts the StructType column to a dictionary, where we expect a tuple instead. was: When Arrow Execution is enabled, {code:java} >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") 'true' >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() {'__index_level_0__': 1, '__index_level_1__': 'a'} 1 {'__index_level_0__': 2, '__index_level_1__': 'b'} 1 dtype: int64 {code} When Arrow Execution is disabled, {code:java} >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") 'false' >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() (1, a) 1 (2, b) 1 dtype: int64 {code} Notice how indexes of their results are different. Especially, `value_counts` returns a Index (rather than a MultiIndex), under the hood, a Spark column of StructType (rather than multiple Spark columns), so when Arrow Execution is enabled, Arrow converts the StructType column to a dictionary, where we expect a tuple instad. 
> Fix `MultiIndex.value_counts()` when Arrow Execution is enabled > --- > > Key: SPARK-39550 > URL: https://issues.apache.org/jira/browse/SPARK-39550 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > > When Arrow Execution is enabled, > {code:java} > >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") > 'true' > >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() > {'__index_level_0__': 1, '__index_level_1__': 'a'} 1 > {'__index_level_0__': 2, '__index_level_1__': 'b'} 1 > dtype: int64 > {code} > When Arrow Execution is disabled, > {code:java} > >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") > 'false' > >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() > (1, a) 1 > (2, b) 1 > dtype: int64 {code} > Notice how indexes of their results are different. > Especially, `value_counts` returns an Index (rather than a MultiIndex), under > the hood, a Spark column of StructType (rather than multiple Spark columns), > so when Arrow Execution is enabled, Arrow converts the StructType column to a > dictionary, where we expect a tuple instead. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
[ https://issues.apache.org/jira/browse/SPARK-39550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557145#comment-17557145 ] Xinrong Meng commented on SPARK-39550: -- I am working on that. > Fix `MultiIndex.value_counts()` when Arrow Execution is enabled > --- > > Key: SPARK-39550 > URL: https://issues.apache.org/jira/browse/SPARK-39550 > Project: Spark > Issue Type: Bug > Components: Pandas API on Spark, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > > When Arrow Execution is enabled, > {code:java} > >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") > 'true' > >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() > {'__index_level_0__': 1, '__index_level_1__': 'a'} 1 > {'__index_level_0__': 2, '__index_level_1__': 'b'} 1 > dtype: int64 > {code} > When Arrow Execution is disabled, > {code:java} > >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") > 'false' > >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() > (1, a) 1 > (2, b) 1 > dtype: int64 {code} > Notice how indexes of their results are different. > Especially, `value_counts` returns an Index (rather than a MultiIndex), under > the hood, a Spark column of StructType (rather than multiple Spark columns), > so when Arrow Execution is enabled, Arrow converts the StructType column to a > dictionary, where we expect a tuple instead. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
Xinrong Meng created SPARK-39550: Summary: Fix `MultiIndex.value_counts()` when Arrow Execution is enabled Key: SPARK-39550 URL: https://issues.apache.org/jira/browse/SPARK-39550 Project: Spark Issue Type: Bug Components: Pandas API on Spark, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng When Arrow Execution is enabled, {code:java} >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") 'true' >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() {'__index_level_0__': 1, '__index_level_1__': 'a'} 1 {'__index_level_0__': 2, '__index_level_1__': 'b'} 1 dtype: int64 {code} When Arrow Execution is disabled, {code:java} >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled") 'false' >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts() (1, a) 1 (2, b) 1 dtype: int64 {code} Notice how indexes of their results are different. Especially, `value_counts` returns an Index (rather than a MultiIndex), under the hood, a Spark column of StructType (rather than multiple Spark columns), so when Arrow Execution is enabled, Arrow converts the StructType column to a dictionary, where we expect a tuple instead. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
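The mismatch described in SPARK-39550 is a StructType index value surfacing as a dict on the Arrow path instead of the tuple the non-Arrow path produces. A sketch of the kind of post-conversion step a fix might apply (the helper name is hypothetical, not pandas-on-Spark API):

```python
# Hypothetical post-processing for the Arrow path: a struct row that Arrow
# materialized as a dict is converted to the tuple form the non-Arrow path
# produces. Relies on dicts preserving field (insertion) order, which
# Python guarantees since 3.7, so field order matches the struct schema.
def struct_to_tuple(value):
    return tuple(value.values()) if isinstance(value, dict) else value

arrow_index = [{'__index_level_0__': 1, '__index_level_1__': 'a'},
               {'__index_level_0__': 2, '__index_level_1__': 'b'}]
print([struct_to_tuple(v) for v in arrow_index])  # [(1, 'a'), (2, 'b')]
```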
[jira] [Updated] (SPARK-39549) How to get access to the data created in different Spark Applications
[ https://issues.apache.org/jira/browse/SPARK-39549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chenyang Zhang updated SPARK-39549: --- Priority: Major (was: Critical) > How to get access to the data created in different Spark Applications > - > > Key: SPARK-39549 > URL: https://issues.apache.org/jira/browse/SPARK-39549 > Project: Spark > Issue Type: Question > Components: Pandas API on Spark, Project Infra, PySpark >Affects Versions: 3.3.0 >Reporter: Chenyang Zhang >Priority: Major > > I am working on a project using PySpark and I am blocked because I want to > share data between different Spark applications. The situation is that we > have a running java server which can handle incoming requests with a thread > pool, and each thread has a corresponding python process. We want to use > pandas on Spark, but have it so that any of the python processes can access > the same data in spark. For example, in a python process, we created a > SparkSession, read some data, modified the data using the pandas on Spark API and > we want to get access to that data in a different python process. The core > problem is how to share data between different SparkSessions or how to let > different python processes connect to the same SparkSession. I researched a bit > but it seems impossible to share data between different python processes > without using an external DB or connecting to the same SparkSession. Generally, is > this possible / what would be the recommended way to do this with the least > impact on performance? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39549) How to get access to the data created in different Spark Applications
Chenyang Zhang created SPARK-39549: -- Summary: How to get access to the data created in different Spark Applications Key: SPARK-39549 URL: https://issues.apache.org/jira/browse/SPARK-39549 Project: Spark Issue Type: Question Components: Pandas API on Spark, Project Infra, PySpark Affects Versions: 3.3.0 Reporter: Chenyang Zhang I am working on a project using PySpark and I am blocked because I want to share data between different Spark applications. The situation is that we have a running java server which can handle incoming requests with a thread pool, and each thread has a corresponding python process. We want to use pandas on Spark, but have it so that any of the python processes can access the same data in spark. For example, in a python process, we created a SparkSession, read some data, modified the data using the pandas on Spark API and we want to get access to that data in a different python process. The core problem is how to share data between different SparkSessions or how to let different python processes connect to the same SparkSession. I researched a bit but it seems impossible to share data between different python processes without using an external DB or connecting to the same SparkSession. Generally, is this possible / what would be the recommended way to do this with the least impact on performance? -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38796) Implement the to_number and try_to_number SQL functions according to a new specification
[ https://issues.apache.org/jira/browse/SPARK-38796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557130#comment-17557130 ] Apache Spark commented on SPARK-38796: -- User 'dtenedor' has created a pull request for this issue: https://github.com/apache/spark/pull/36950 > Implement the to_number and try_to_number SQL functions according to a new > specification > > > Key: SPARK-38796 > URL: https://issues.apache.org/jira/browse/SPARK-38796 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 3.3.0 > > > This tracks implementing the 'to_number' and 'try_to_number' SQL function > expressions according to new semantics described below. The former is > equivalent to the latter except that it throws an exception instead of > returning NULL for cases where the input string does not match the format > string. > > --- > > *try_to_number function (expr, fmt):* > Returns 'expr' cast to DECIMAL using formatting 'fmt', or 'NULL' if 'expr' is > not a valid match for the given format. > > Syntax: > { ' [ S ] [ L | $ ] > [ 0 | 9 | G | , ] [...] > [ . | D ] > [ 0 | 9 ] [...] > [ L | $ ] [ PR | MI | S ] ' } > > *Arguments:* > 'expr': A STRING expression representing a number. 'expr' may include leading > or trailing spaces. > 'fmt': A STRING literal, specifying the expected format of 'expr'. > > *Returns:* > A DECIMAL(p, s) where 'p' is the total number of digits ('0' or '9') and 's' > is the number of digits after the decimal point, or 0 if there is none. > > *Format elements allowed (case insensitive):* > * 0 or 9 > Specifies an expected digit between '0' and '9'. > A '0' to the left of the decimal point indicates that 'expr' must have at > least as many digits. A leading '9' indicates that 'expr' may omit these > digits. > 'expr' must not be larger than the number of digits to the left of the > decimal point allowed by the format string. 
> Digits to the right of the decimal point in the format string indicate the > maximum number of digits that 'expr' may have to the right of the decimal point. > * . or D > Specifies the position of the decimal point. > 'expr' does not need to include a decimal point. > * , or G > Specifies the position of the ',' grouping (thousands) separator. > There must be a '0' or '9' to the left of the rightmost grouping separator. > 'expr' must match the grouping separator relevant for the size of the > number. > * $ > Specifies the location of the '$' currency sign. This character may only be > specified once. > * S > Specifies the position of an optional '+' or '-' sign. This character may > only be specified once. > * MI > Specifies that 'expr' has an optional '-' sign at the end, but no '+'. > * PR > Specifies that 'expr' indicates a negative number with wrapping angled > brackets ('<1>'). If 'expr' contains any characters other than '0' through > '9' and those permitted in 'fmt', a 'NULL' is returned. > > *Examples:* > {{-- The format expects:}} > {{-- * an optional sign at the beginning,}} > {{-- * followed by a dollar sign,}} > {{-- * followed by a number between 3 and 6 digits long,}} > {{-- * thousands separators,}} > {{-- * up to two digits beyond the decimal point. 
}} > {{> SELECT try_to_number('-$12,345.67', 'S$999,099.99');}} > {{ -12345.67}} > {{-- The plus sign is optional, and so are fractional digits.}} > {{> SELECT try_to_number('$345', 'S$999,099.99');}} > {{ 345.00}} > {{-- The format requires at least three digits.}} > {{> SELECT try_to_number('$45', 'S$999,099.99');}} > {{ NULL}} > {{-- The format requires at least three digits.}} > {{> SELECT try_to_number('$045', 'S$999,099.99');}} > {{ 45.00}} > {{-- Using brackets to denote negative values}} > {{> SELECT try_to_number('<1234>', '99PR');}} > {{ -1234}}
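The minimum-digits rule (a '0' in the integer part fixes how many digits 'expr' must have) is the subtlest part of the spec above. A simplified pure-Python sketch of those semantics for formats like 'S$999,099.99' — not Spark's implementation, and deliberately ignoring the L, D, G, MI, and PR elements — reproduces the four 'S$999,099.99' examples:

```python
# Simplified sketch of try_to_number semantics for 'S$...9,099.99'-style
# formats: optional leading sign (S), '$', grouped integer digits where
# the leftmost '0' sets the minimum digit count, optional fraction.
# Illustration only; L, D, G, MI, PR are not handled here.
import re
from decimal import Decimal

def try_to_number(expr, fmt):
    int_fmt, _, frac_fmt = fmt.partition('.')
    digit_fmt = int_fmt.lstrip('S').lstrip('$').replace(',', '')
    max_int = len(digit_fmt)
    # leftmost '0' fixes the minimum number of integer digits
    min_int = max_int - digit_fmt.index('0') if '0' in digit_fmt else 1
    s = expr.strip()
    sign = 1
    if fmt.startswith('S') and s[:1] in ('+', '-'):
        sign = -1 if s[0] == '-' else 1
        s = s[1:]
    if '$' in fmt:
        if not s.startswith('$'):
            return None  # format demands a currency sign
        s = s[1:]
    # grouping separators must match the size of the number, per the spec
    m = re.fullmatch(r'(\d{1,3}(?:,\d{3})*)(?:\.(\d+))?', s)
    if not m:
        return None
    int_digits = m.group(1).replace(',', '')
    frac = m.group(2) or ''
    if not (min_int <= len(int_digits) <= max_int) or len(frac) > len(frac_fmt):
        return None
    scale = len(frac_fmt)
    digits = int_digits + ('.' + frac.ljust(scale, '0') if scale else '')
    return sign * Decimal(digits)

print(try_to_number('-$12,345.67', 'S$999,099.99'))  # matches the first example
print(try_to_number('$45', 'S$999,099.99'))          # too few digits -> None
```

Note how '$045' passes while '$45' fails: the '0' at position four of six integer-format digits requires at least three integer digits in the input.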
[jira] [Commented] (SPARK-39496) Inline eval path cannot handle null structs
[ https://issues.apache.org/jira/browse/SPARK-39496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557128#comment-17557128 ] Apache Spark commented on SPARK-39496: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/36949 > Inline eval path cannot handle null structs > --- > > Key: SPARK-39496 > URL: https://issues.apache.org/jira/browse/SPARK-39496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.1, 3.3.0, 3.4.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Fix For: 3.2.2, 3.4.0, 3.3.1 > > > This issue is somewhat similar to SPARK-39061, but for the eval path rather > than the codegen path. > Example: > {noformat} > set spark.sql.codegen.wholeStage=false; > select inline(array(named_struct('a', 1, 'b', 2), null)); > {noformat} > This results in a NullPointerException: > {noformat} > 22/06/16 15:10:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122) > {noformat} > The next example doesn't require setting {{spark.sql.codegen.wholeStage}} to > {{{}false{}}}: > {noformat} > val dfWide = (Seq((1)) > .toDF("col0") > .selectExpr(Seq.tabulate(99)(x => s"$x as col${x + 1}"): _*)) > val df = (dfWide > .selectExpr("*", "array(named_struct('a', 1, 'b', 2), null) as > struct_array")) > df.selectExpr("*", "inline(struct_array)").collect > {noformat} > The result is similar: > {noformat} > 22/06/16 15:18:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ > 1] > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:80) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_8$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122) > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39496) Inline eval path cannot handle null structs
[ https://issues.apache.org/jira/browse/SPARK-39496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557125#comment-17557125 ] Apache Spark commented on SPARK-39496: -- User 'bersprockets' has created a pull request for this issue: https://github.com/apache/spark/pull/36949 > Inline eval path cannot handle null structs > --- > > Key: SPARK-39496 > URL: https://issues.apache.org/jira/browse/SPARK-39496 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.1, 3.3.0, 3.4.0 >Reporter: Bruce Robbins >Assignee: Bruce Robbins >Priority: Major > Fix For: 3.2.2, 3.4.0, 3.3.1 > > > This issue is somewhat similar to SPARK-39061, but for the eval path rather > than the codegen path. > Example: > {noformat} > set spark.sql.codegen.wholeStage=false; > select inline(array(named_struct('a', 1, 'b', 2), null)); > {noformat} > This results in a NullPointerException: > {noformat} > 22/06/16 15:10:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122) > {noformat} > The next example doesn't require setting {{spark.sql.codegen.wholeStage}} to > {{{}false{}}}: > {noformat} > val dfWide = (Seq((1)) > .toDF("col0") > .selectExpr(Seq.tabulate(99)(x => s"$x as col${x + 1}"): _*)) > val df = (dfWide > .selectExpr("*", "array(named_struct('a', 1, 'b', 2), null) as > struct_array")) > df.selectExpr("*", "inline(struct_array)").collect > {noformat} > The result is similar: > {noformat} > 22/06/16 15:18:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ > 1] > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:80) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_8$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122) > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
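The intended semantics behind the fix above: when `inline` encounters a null struct element in the array, it should emit a row of NULLs rather than dereference the struct and hit a NullPointerException. A small sketch of that behavior (structs modeled as Python dicts, NULL as `None`; the field names 'a' and 'b' mirror the repro's `named_struct`, everything else is illustrative):

```python
# Sketch of inline()'s expected null handling: a null struct in the
# input array produces a row of NULLs instead of an NPE.
def inline(struct_array, fields=("a", "b")):
    rows = []
    for struct in struct_array:
        if struct is None:
            # null struct element -> one output row of all NULLs
            rows.append(tuple(None for _ in fields))
        else:
            rows.append(tuple(struct[f] for f in fields))
    return rows

# mirrors: select inline(array(named_struct('a', 1, 'b', 2), null))
print(inline([{"a": 1, "b": 2}, None]))
```

The eval path's bug was precisely the missing `is None` branch equivalent: it unconditionally read the struct's fields.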
[jira] [Commented] (SPARK-39548) CreateView Command with a window clause query hit a wrong window definition not found issue
[ https://issues.apache.org/jira/browse/SPARK-39548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557119#comment-17557119 ] Apache Spark commented on SPARK-39548: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/36947 > CreateView Command with a window clause query hit a wrong window definition > not found issue > --- > > Key: SPARK-39548 > URL: https://issues.apache.org/jira/browse/SPARK-39548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > This query hits a "window definition w2 not found" error in the `WindowSubstitute` > rule. This is a bug, since the w2 definition is defined in the query. > ``` > create or replace temporary view test_temp_view as > with step_1 as ( > select * , min(a) over w2 as min_a_over_w2 from (select 1 as a, 2 as b, 3 as > c) window w2 as (partition by b order by c)) , step_2 as > ( > select *, max(e) over w1 as max_a_over_w1 > from (select 1 as e, 2 as f, 3 as g) > join step_1 on true > window w1 as (partition by f order by g) > ) > select * > from step_2 > ``` > Also, we can move the unresolved window expression check from the > `WindowSubstitute` rule to the `CheckAnalysis` phase.
[jira] [Assigned] (SPARK-39548) CreateView Command with a window clause query hit a wrong window definition not found issue
[ https://issues.apache.org/jira/browse/SPARK-39548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39548: Assignee: Apache Spark > CreateView Command with a window clause query hit a wrong window definition > not found issue > --- > > Key: SPARK-39548 > URL: https://issues.apache.org/jira/browse/SPARK-39548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > > This query hits a "window definition w2 not found" error in the `WindowSubstitute` > rule. This is a bug, since the w2 definition is defined in the query. > ``` > create or replace temporary view test_temp_view as > with step_1 as ( > select * , min(a) over w2 as min_a_over_w2 from (select 1 as a, 2 as b, 3 as > c) window w2 as (partition by b order by c)) , step_2 as > ( > select *, max(e) over w1 as max_a_over_w1 > from (select 1 as e, 2 as f, 3 as g) > join step_1 on true > window w1 as (partition by f order by g) > ) > select * > from step_2 > ``` > Also, we can move the unresolved window expression check from the > `WindowSubstitute` rule to the `CheckAnalysis` phase.
[jira] [Assigned] (SPARK-39548) CreateView Command with a window clause query hit a wrong window definition not found issue
[ https://issues.apache.org/jira/browse/SPARK-39548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39548: Assignee: (was: Apache Spark) > CreateView Command with a window clause query hit a wrong window definition > not found issue > --- > > Key: SPARK-39548 > URL: https://issues.apache.org/jira/browse/SPARK-39548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > This query hits a "window definition w2 not found" error in the `WindowSubstitute` > rule. This is a bug, since the w2 definition is defined in the query. > ``` > create or replace temporary view test_temp_view as > with step_1 as ( > select * , min(a) over w2 as min_a_over_w2 from (select 1 as a, 2 as b, 3 as > c) window w2 as (partition by b order by c)) , step_2 as > ( > select *, max(e) over w1 as max_a_over_w1 > from (select 1 as e, 2 as f, 3 as g) > join step_1 on true > window w1 as (partition by f order by g) > ) > select * > from step_2 > ``` > Also, we can move the unresolved window expression check from the > `WindowSubstitute` rule to the `CheckAnalysis` phase.
[jira] [Updated] (SPARK-39548) CreateView Command with a window clause query hit a wrong window definition not found issue
[ https://issues.apache.org/jira/browse/SPARK-39548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-39548: - Summary: CreateView Command with a window clause query hit a wrong window definition not found issue (was: CreateView Command with a window clause query hit a wrong window definition not found issue.) > CreateView Command with a window clause query hit a wrong window definition > not found issue > --- > > Key: SPARK-39548 > URL: https://issues.apache.org/jira/browse/SPARK-39548 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Rui Wang >Priority: Major > > This query hits a "window definition w2 not found" error in the `WindowSubstitute` > rule. This is a bug, since the w2 definition is defined in the query. > ``` > create or replace temporary view test_temp_view as > with step_1 as ( > select * , min(a) over w2 as min_a_over_w2 from (select 1 as a, 2 as b, 3 as > c) window w2 as (partition by b order by c)) , step_2 as > ( > select *, max(e) over w1 as max_a_over_w1 > from (select 1 as e, 2 as f, 3 as g) > join step_1 on true > window w1 as (partition by f order by g) > ) > select * > from step_2 > ``` > Also, we can move the unresolved window expression check from the > `WindowSubstitute` rule to the `CheckAnalysis` phase.
[jira] [Created] (SPARK-39548) CreateView Command with a window clause query hit a wrong window definition not found issue.
Rui Wang created SPARK-39548: Summary: CreateView Command with a window clause query hit a wrong window definition not found issue. Key: SPARK-39548 URL: https://issues.apache.org/jira/browse/SPARK-39548 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Rui Wang This query hits a "window definition w2 not found" error in the `WindowSubstitute` rule. This is a bug, since the w2 definition is defined in the query. ``` create or replace temporary view test_temp_view as with step_1 as ( select * , min(a) over w2 as min_a_over_w2 from (select 1 as a, 2 as b, 3 as c) window w2 as (partition by b order by c)) , step_2 as ( select *, max(e) over w1 as max_a_over_w1 from (select 1 as e, 2 as f, 3 as g) join step_1 on true window w1 as (partition by f order by g) ) select * from step_2 ``` Also, we can move the unresolved window expression check from the `WindowSubstitute` rule to the `CheckAnalysis` phase.
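The substitution the report describes can be pictured as a per-query-block lookup: each block carries its own WINDOW clause definitions, and an `OVER w` reference must resolve against the definitions of the block it appears in — so `w2`, defined inside `step_1`, must be found when `step_1` is analyzed. A toy sketch of that resolution (names and the dict representation are illustrative, not Spark's analyzer structures):

```python
# Toy named-window substitution: replace "over w" references with the
# window spec defined in the same query block's WINDOW clause.
def substitute_windows(expressions, window_defs):
    resolved = []
    for func, window_name in expressions:
        if window_name not in window_defs:
            # the bug: this fired for w2 even though step_1 defines it
            raise ValueError(f"window definition {window_name} not found")
        resolved.append((func, window_defs[window_name]))
    return resolved

# step_1 resolves w2 against its own WINDOW clause; step_2 resolves w1.
step_1 = substitute_windows([("min(a)", "w2")],
                            {"w2": "partition by b order by c"})
step_2 = substitute_windows([("max(e)", "w1")],
                            {"w1": "partition by f order by g"})
print(step_1, step_2)
```

The reported bug amounts to the lookup for `w2` being performed against the wrong block's definitions during CREATE VIEW analysis.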
[jira] [Commented] (SPARK-39500) Ivy doesn't work correctly on IPv6-only environment
[ https://issues.apache.org/jira/browse/SPARK-39500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557101#comment-17557101 ] Erik Krogen commented on SPARK-39500: - Thanks for clarifying! > Ivy doesn't work correctly on IPv6-only environment > --- > > Key: SPARK-39500 > URL: https://issues.apache.org/jira/browse/SPARK-39500 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > Ivy doesn't work correctly on IPv6. > {code} > SparkSubmitUtils.resolveMavenCoordinates( > "org.apache.logging.log4j:log4j-api:2.17.2", > SparkSubmitUtils.buildIvySettings(None, Some("/tmp/ivy")), > transitive = true) > {code} > {code} > % bin/spark-shell > 22/06/16 22:22:12 WARN Utils: Your hostname, m1ipv6.local resolves to a > loopback address: 127.0.0.1; using 2600:1700:232e:3de0:0:0:0:b instead (on > interface en0) > 22/06/16 22:22:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to > another address > = https://ipv6.repo1.maven.org/maven2/ > =https://maven-central.storage-download.googleapis.com/maven2/ > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 22/06/16 22:22:14 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > Spark context Web UI available at http://unknown1498776019fa.attlocal.net:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1655443334687). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 3.4.0-SNAPSHOT > /_/ > Using Scala version 2.12.16 (OpenJDK 64-Bit Server VM, Java 17.0.3) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> :paste -raw > // Entering paste mode (ctrl-D to finish) > package org.apache.spark.deploy > object Download { > SparkSubmitUtils.resolveMavenCoordinates( > "org.apache.logging.log4j:log4j-api:2.17.2", > SparkSubmitUtils.buildIvySettings(None, Some("/tmp/ivy")), > transitive = true) > } > // Exiting paste mode, now interpreting. > scala> org.apache.spark.deploy.Download > = https://ipv6.repo1.maven.org/maven2/ > =https://maven-central.storage-download.googleapis.com/maven2/ > :: loading settings :: url = > jar:file:/Users/dongjoon/APACHE/spark/assembly/target/scala-2.12/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml > Ivy Default Cache set to: /tmp/ivy/cache > The jars for the packages stored in: /tmp/ivy/jars > org.apache.logging.log4j#log4j-api added as a dependency > :: resolving dependencies :: > org.apache.spark#spark-submit-parent-f47b503f-897e-4b92-95da-3806c32c220f;1.0 > confs: [default] > :: resolution report :: resolve 95ms :: artifacts dl 0ms > :: modules in use: > - > | |modules|| artifacts | > | conf | number| search|dwnlded|evicted|| number|dwnlded| > - > | default | 1 | 0 | 0 | 0 || 0 | 0 | > - > :: problems summary :: > WARNINGS > module not found: org.apache.logging.log4j#log4j-api;2.17.2 > local-m2-cache: tried > > file:/Users/dongjoon/.m2/repository/org/apache/logging/log4j/log4j-api/2.17.2/log4j-api-2.17.2.pom > -- artifact org.apache.logging.log4j#log4j-api;2.17.2!log4j-api.jar: > > file:/Users/dongjoon/.m2/repository/org/apache/logging/log4j/log4j-api/2.17.2/log4j-api-2.17.2.jar > local-ivy-cache: tried > > /tmp/ivy/local/org.apache.logging.log4j/log4j-api/2.17.2/ivys/ivy.xml > -- artifact org.apache.logging.log4j#log4j-api;2.17.2!log4j-api.jar: > > /tmp/ivy/local/org.apache.logging.log4j/log4j-api/2.17.2/jars/log4j-api.jar > ipv6: tried > > https://ipv6.repo1.maven.org/maven2/org/apache/logging/log4j/log4j-api/2.17.2/log4j-api-2.17.2.pom > -- artifact org.apache.logging.log4j#log4j-api;2.17.2!log4j-api.jar: > > 
https://ipv6.repo1.maven.org/maven2/org/apache/logging/log4j/log4j-api/2.17.2/log4j-api-2.17.2.jar > central: tried > > https://maven-central.storage-download.googleapis.com/maven2/org/apache/logging/log4j/log4j-api/2.17.2/log4j-api-2.17.2.pom > -- artifact
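A common root cause behind resolver failures on IPv6-only hosts is URL formation: a bare IPv6 literal contains colons, so it must be wrapped in brackets before being embedded in a URL, or the colons are misparsed as port separators. A small sketch of that pitfall (the host and path are illustrative, not taken from the Ivy logs above):

```python
# Sketch: bare IPv6 literals must be bracketed to form a valid URL host.
from urllib.parse import urlsplit

def repo_url(host, path):
    if ":" in host and not host.startswith("["):
        host = f"[{host}]"  # bracket bare IPv6 literals
    return f"https://{host}{path}"

url = repo_url("2600:1700:232e:3de0::b", "/maven2/")
print(url, urlsplit(url).hostname)
```

With brackets, `urlsplit` recovers the literal as the hostname; without them, everything after the first colon would be misinterpreted.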
[jira] [Commented] (SPARK-39547) V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata
[ https://issues.apache.org/jira/browse/SPARK-39547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557062#comment-17557062 ] Apache Spark commented on SPARK-39547: -- User 'singhpk234' has created a pull request for this issue: https://github.com/apache/spark/pull/36948 > V2SessionCatalog should not throw NoSuchDatabaseException in > loadNamespaceMetadata > -- > > Key: SPARK-39547 > URL: https://issues.apache.org/jira/browse/SPARK-39547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Prashant Singh >Priority: Minor > > DROP NAMESPACE IF EXISTS > {table} > > If a catalog doesn't override `namespaceExists`, it by default uses > `loadNamespaceMetadata`, and in case the `db` does not exist, loadNamespaceMetadata > throws a `NoSuchDatabaseException` which is not caught, so we see failures > even with the `if exists` clause. One such use case we observed: in an iceberg > table, a post-test clean up was now failing with `NoSuchDatabaseException`. 
> > Found `V2SessionCatalog.loadNamespaceMetadata` > was also throwing the same, unlike > `JDBCTableCatalog`. > Ref, a stack trace: > {quote}Database 'db' not found > org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db' > not found > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:219) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:284) > at > org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadNamespaceMetadata(V2SessionCatalog.scala:247) > at > org.apache.iceberg.spark.SparkSessionCatalog.loadNamespaceMetadata(SparkSessionCatalog.java:97) > at > org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:98) > at > org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:40) > at > org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43) > at > org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43) > at > org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > {quote}
[jira] [Assigned] (SPARK-39547) V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata
[ https://issues.apache.org/jira/browse/SPARK-39547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39547: Assignee: Apache Spark > V2SessionCatalog should not throw NoSuchDatabaseException in > loadNamespaceMetadata > -- > > Key: SPARK-39547 > URL: https://issues.apache.org/jira/browse/SPARK-39547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Prashant Singh >Assignee: Apache Spark >Priority: Minor > > DROP NAMESPACE IF EXISTS > {table} > > If a catalog doesn't override `namespaceExists`, it by default uses > `loadNamespaceMetadata`, and in case the `db` does not exist, loadNamespaceMetadata > throws a `NoSuchDatabaseException` which is not caught, so we see failures > even with the `if exists` clause. One such use case we observed: in an iceberg > table, a post-test clean up was now failing with `NoSuchDatabaseException`. > > Found `V2SessionCatalog.loadNamespaceMetadata` > was also throwing the same, unlike > `JDBCTableCatalog`. > Ref, a stack trace: > {quote}Database 'db' not found > org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db' > not found > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:219) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:284) > at > org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadNamespaceMetadata(V2SessionCatalog.scala:247) > at > org.apache.iceberg.spark.SparkSessionCatalog.loadNamespaceMetadata(SparkSessionCatalog.java:97) > at > org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:98) > at > org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:40) > at > 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43) > at > org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43) > at > org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > {quote}
[jira] [Assigned] (SPARK-39547) V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata
[ https://issues.apache.org/jira/browse/SPARK-39547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39547: Assignee: (was: Apache Spark) > V2SessionCatalog should not throw NoSuchDatabaseException in > loadNamespaceMetadata > -- > > Key: SPARK-39547 > URL: https://issues.apache.org/jira/browse/SPARK-39547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Prashant Singh >Priority: Minor > > DROP NAMESPACE IF EXISTS > {table} > > If a catalog doesn't override `namespaceExists`, it by default uses > `loadNamespaceMetadata`, and in case the `db` does not exist, loadNamespaceMetadata > throws a `NoSuchDatabaseException` which is not caught, so we see failures > even with the `if exists` clause. One such use case we observed: in an iceberg > table, a post-test clean up was now failing with `NoSuchDatabaseException`. > > Found `V2SessionCatalog.loadNamespaceMetadata` > was also throwing the same, unlike > `JDBCTableCatalog`. > Ref, a stack trace: > {quote}Database 'db' not found > org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db' > not found > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:219) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:284) > at > org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadNamespaceMetadata(V2SessionCatalog.scala:247) > at > org.apache.iceberg.spark.SparkSessionCatalog.loadNamespaceMetadata(SparkSessionCatalog.java:97) > at > org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:98) > at > org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:40) > at > org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43) > at > 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43) > at > org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > {quote}
[jira] [Commented] (SPARK-39547) V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata
[ https://issues.apache.org/jira/browse/SPARK-39547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557063#comment-17557063 ] Apache Spark commented on SPARK-39547: -- User 'singhpk234' has created a pull request for this issue: https://github.com/apache/spark/pull/36948 > V2SessionCatalog should not throw NoSuchDatabaseException in > loadNamespaceMetadata > -- > > Key: SPARK-39547 > URL: https://issues.apache.org/jira/browse/SPARK-39547 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Prashant Singh >Priority: Minor > > DROP NAMESPACE IF EXISTS > {table} > > If a catalog doesn't override `namespaceExists`, it by default uses > `loadNamespaceMetadata`, and in case the `db` does not exist, loadNamespaceMetadata > throws a `NoSuchDatabaseException` which is not caught, so we see failures > even with the `if exists` clause. One such use case we observed: in an iceberg > table, a post-test clean up was now failing with `NoSuchDatabaseException`. 
> > Found `V2SessionCatalog.loadNamespaceMetadata` > was also throwing the same, unlike > `JDBCTableCatalog`. > Ref, a stack trace: > {quote}Database 'db' not found > org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db' > not found > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:219) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:284) > at > org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadNamespaceMetadata(V2SessionCatalog.scala:247) > at > org.apache.iceberg.spark.SparkSessionCatalog.loadNamespaceMetadata(SparkSessionCatalog.java:97) > at > org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:98) > at > org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:40) > at > org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43) > at > org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43) > at > org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49) > at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > {quote}
[jira] [Created] (SPARK-39547) V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata
Prashant Singh created SPARK-39547: -- Summary: V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata Key: SPARK-39547 URL: https://issues.apache.org/jira/browse/SPARK-39547 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: Prashant Singh

DROP NAMESPACE IF EXISTS {table}

If a catalog doesn't override `namespaceExists`, it by default uses `loadNamespaceMetadata`, and when the database does not exist, `loadNamespaceMetadata` throws a `NoSuchDatabaseException`. That exception is not caught, so we see failures even with the `IF EXISTS` clause. One such case we observed: an Iceberg table's post-test cleanup was now failing with `NoSuchDatabaseException`.

Found that V2SessionCatalog's `loadNamespaceMetadata` was also throwing the same exception, unlike `JDBCTableCatalog`. Reference stack trace:

{quote}Database 'db' not found
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db' not found
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:219)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:284)
at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadNamespaceMetadata(V2SessionCatalog.scala:247)
at org.apache.iceberg.spark.SparkSessionCatalog.loadNamespaceMetadata(SparkSessionCatalog.java:97)
at org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:98)
at org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:40)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) {quote}
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
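The failure path above can be reduced to a small self-contained sketch (hypothetical types, not Spark's actual connector classes): a default `namespaceExists` that delegates to `loadNamespaceMetadata` only honors `IF EXISTS` semantics when it catches the exact exception type the catalog throws.

```java
// Minimal sketch (hypothetical types, NOT Spark's actual classes) of the
// failure mode: the default existence check catches one exception type, but
// the session catalog throws a different one, so the exception escapes and
// DROP NAMESPACE IF EXISTS fails on a missing database.
class NoSuchNamespaceExc extends RuntimeException {
  NoSuchNamespaceExc(String ns) { super("Namespace '" + ns + "' not found"); }
}

class NoSuchDatabaseExc extends RuntimeException {
  NoSuchDatabaseExc(String db) { super("Database '" + db + "' not found"); }
}

interface SupportsNamespacesSketch {
  java.util.Map<String, String> loadNamespaceMetadata(String ns);

  // Default existence check: "exists" means "metadata loads without error".
  // The catch clause is too narrow for a catalog throwing NoSuchDatabaseExc.
  default boolean namespaceExists(String ns) {
    try {
      loadNamespaceMetadata(ns);
      return true;
    } catch (NoSuchNamespaceExc e) {
      return false;
    }
  }
}

// Models V2SessionCatalog: throws the type the default check does NOT catch.
class SessionCatalogSketch implements SupportsNamespacesSketch {
  public java.util.Map<String, String> loadNamespaceMetadata(String ns) {
    throw new NoSuchDatabaseExc(ns);
  }
}

// Models a catalog throwing the expected type: the check quietly returns false.
class WellBehavedCatalogSketch implements SupportsNamespacesSketch {
  public java.util.Map<String, String> loadNamespaceMetadata(String ns) {
    throw new NoSuchNamespaceExc(ns);
  }
}
```

In this model, `namespaceExists` on the well-behaved catalog quietly returns false, while the same call on the session-catalog stand-in propagates the exception up through the drop-namespace path, matching the stack trace above.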
[jira] [Commented] (SPARK-39547) V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata
[ https://issues.apache.org/jira/browse/SPARK-39547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557056#comment-17557056 ] Prashant Singh commented on SPARK-39547: Will post a PR for it shortly.
> V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata
> --
>
> Key: SPARK-39547
> URL: https://issues.apache.org/jira/browse/SPARK-39547
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: Prashant Singh
> Priority: Minor
>
> DROP NAMESPACE IF EXISTS {table}
>
> If a catalog doesn't override `namespaceExists`, it by default uses `loadNamespaceMetadata`, and when the database does not exist, `loadNamespaceMetadata` throws a `NoSuchDatabaseException`. That exception is not caught, so we see failures even with the `IF EXISTS` clause. One such case we observed: an Iceberg table's post-test cleanup was now failing with `NoSuchDatabaseException`.
>
> Found that V2SessionCatalog's `loadNamespaceMetadata` was also throwing the same exception, unlike `JDBCTableCatalog`. Reference stack trace:
> {quote}Database 'db' not found
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db' not found
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:219)
> at org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:284)
> at org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadNamespaceMetadata(V2SessionCatalog.scala:247)
> at org.apache.iceberg.spark.SparkSessionCatalog.loadNamespaceMetadata(SparkSessionCatalog.java:97)
> at org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:98)
> at org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:40)
> at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
> at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
> at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
> at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> {quote}
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38647) Add SupportsReportOrdering mix in interface for Scan
[ https://issues.apache.org/jira/browse/SPARK-38647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-38647. -- Fix Version/s: 3.4.0 Assignee: Enrico Minack Resolution: Fixed > Add SupportsReportOrdering mix in interface for Scan > > > Key: SPARK-38647 > URL: https://issues.apache.org/jira/browse/SPARK-38647 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.0 >Reporter: Enrico Minack >Assignee: Enrico Minack >Priority: Major > Fix For: 3.4.0 > > > As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide > Spark with information about the existing partitioning of data read by a > {{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} > should provide order information. > This prevents Spark from sorting data that already exhibits a certain order > provided by the source. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
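The mix-in idea can be sketched with simplified types (illustrative names, not Spark's actual connector API): a scan that reports its output ordering lets the planner skip a sort when the required ordering is already satisfied.

```java
// Hedged sketch of the SupportsReportOrdering idea. A scan optionally
// reports the columns its data is already sorted by; the planner only
// inserts a sort when that reported ordering does not cover the requirement.
import java.util.List;

interface ScanSketch { }

interface SupportsReportOrderingSketch extends ScanSketch {
  // Column names the source guarantees the data is already sorted by.
  List<String> outputOrdering();
}

class OrderedSourceSketch implements SupportsReportOrderingSketch {
  public List<String> outputOrdering() { return List.of("id", "ts"); }
}

class PlannerSketch {
  // A sort is needed unless the scan reports an ordering whose prefix matches
  // the requirement (data sorted by (id, ts) is also sorted by (id)).
  static boolean needsSort(ScanSketch scan, List<String> required) {
    if (scan instanceof SupportsReportOrderingSketch) {
      List<String> reported = ((SupportsReportOrderingSketch) scan).outputOrdering();
      return reported.size() < required.size()
          || !reported.subList(0, required.size()).equals(required);
    }
    return true; // no ordering information: must sort
  }
}
```

With this model, a query requiring order by `id` skips the sort for `OrderedSourceSketch`, while any scan without the mix-in still gets sorted.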
[jira] [Resolved] (SPARK-39542) Improve YARN client mode to support IPv6
[ https://issues.apache.org/jira/browse/SPARK-39542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-39542. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36939 [https://github.com/apache/spark/pull/36939] > Improve YARN client mode to support IPv6 > > > Key: SPARK-39542 > URL: https://issues.apache.org/jira/browse/SPARK-39542 > Project: Spark > Issue Type: Sub-task > Components: PySpark, YARN >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39542) Improve YARN client mode to support IPv6
[ https://issues.apache.org/jira/browse/SPARK-39542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-39542: - Assignee: Dongjoon Hyun > Improve YARN client mode to support IPv6 > > > Key: SPARK-39542 > URL: https://issues.apache.org/jira/browse/SPARK-39542 > Project: Spark > Issue Type: Sub-task > Components: PySpark, YARN >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39546) Respect port definitions on K8S pod templates for both driver and executor
Oliver Koeth created SPARK-39546: Summary: Respect port definitions on K8S pod templates for both driver and executor Key: SPARK-39546 URL: https://issues.apache.org/jira/browse/SPARK-39546 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.3.0 Reporter: Oliver Koeth

*Description:* Spark on K8S allows opening additional ports for custom purposes on the driver pod via the pod template, but ignores the port specification in the executor pod template. Port specifications from the pod template should be preserved (and extended) for both drivers and executors.

*Scenario:* I want to run functionality in the executor that exposes data on an additional port. In my case, this is monitoring data exposed by Spark's JMX metrics sink via the JMX prometheus exporter java agent https://github.com/prometheus/jmx_exporter -- the java agent opens an extra port inside the container, but for prometheus to detect and scrape the port, it must be exposed in the K8S pod resource.

(More background if desired: this seems to be the "classic" Spark 2 way to expose prometheus metrics. Spark 3 introduced a native equivalent servlet for the driver, but for the executor, only a rather limited set of metrics is forwarded via the driver, and that also follows a completely different naming scheme. So the JMX + exporter approach still turns out to be more useful for me, even in Spark 3.)

Expected behavior: I add the following to my pod template to expose the extra port opened by the JMX exporter java agent:

spec:
  containers:
  - ...
    ports:
    - containerPort: 8090
      name: jmx-prometheus
      protocol: TCP

Observed behavior: the port is exposed for driver pods but not for executor pods.

*Corresponding code:* driver pod creation just adds ports [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala] (currently line 115):

val driverContainer = new ContainerBuilder(pod.container)
  ...
  .addNewPort()
  ...
  .addNewPort()

while executor pod creation replaces the ports [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala] (currently line 211):

val executorContainer = new ContainerBuilder(pod.container)
  ...
  .withPorts(requiredPorts.asJava)

The current handling is inconsistent and unnecessarily limiting. It seems that executor creation could/should just as well preserve ports from the template and add the extra required ports.

*Workaround:* it is possible to work around this limitation by adding a full sidecar container to the executor pod spec which declares the port. Sidecar containers are left unchanged by pod template handling. As all containers in a pod share the same network, it does not matter which container actually declares to expose the port. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
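The driver/executor asymmetry described above boils down to append-vs-replace semantics, which can be modeled without any Kubernetes dependency (a toy sketch with assumed semantics, not the real fabric8 `ContainerBuilder` API; port numbers and names are illustrative):

```java
// Toy model of the two code paths: the driver path appends Spark's required
// ports to whatever the pod template declared, while the executor path
// replaces the template's port list wholesale, dropping user-defined ports.
import java.util.ArrayList;
import java.util.List;

class PortSketch {
  final int containerPort;
  final String name;
  PortSketch(int containerPort, String name) {
    this.containerPort = containerPort;
    this.name = name;
  }
}

class PortHandlingSketch {
  // Driver-style handling (repeated addNewPort): template ports survive.
  static List<PortSketch> driverPorts(List<PortSketch> template, List<PortSketch> required) {
    List<PortSketch> result = new ArrayList<>(template);
    result.addAll(required);
    return result;
  }

  // Executor-style handling (withPorts): template ports are discarded.
  static List<PortSketch> executorPorts(List<PortSketch> template, List<PortSketch> required) {
    return new ArrayList<>(required);
  }
}
```

In this model the driver's port list still contains the template's `jmx-prometheus` port while the executor's does not, matching the observed behavior; the requested fix amounts to making the executor path use the append semantics too.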
[jira] [Commented] (SPARK-39340) DS v2 agg pushdown should allow dots in the name of top-level columns
[ https://issues.apache.org/jira/browse/SPARK-39340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556982#comment-17556982 ] Apache Spark commented on SPARK-39340: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/36945 > DS v2 agg pushdown should allow dots in the name of top-level columns > - > > Key: SPARK-39340 > URL: https://issues.apache.org/jira/browse/SPARK-39340 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.2.2 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37939) Use error classes in the parsing errors of properties
[ https://issues.apache.org/jira/browse/SPARK-37939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-37939: - Fix Version/s: 3.3.1 > Use error classes in the parsing errors of properties > - > > Key: SPARK-37939 > URL: https://issues.apache.org/jira/browse/SPARK-37939 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Max Gekk >Assignee: panbingkun >Priority: Major > Fix For: 3.4.0, 3.3.1 > > > Migrate the following errors in QueryParsingErrors: > * cannotCleanReservedNamespacePropertyError > * cannotCleanReservedTablePropertyError > * invalidPropertyKeyForSetQuotedConfigurationError > * invalidPropertyValueForSetQuotedConfigurationError > * propertiesAndDbPropertiesBothSpecifiedError > to use error classes. Throw an implementation of SparkThrowable. Also write > a test for every error in QueryParsingErrorsSuite. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
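The migration pattern the ticket describes can be sketched as follows; the class and interface names below are illustrative stand-ins, not Spark's actual API, and only one of the listed error methods is shown.

```java
// Hedged sketch of the error-class migration: the thrown exception carries
// a machine-readable error class plus message parameters instead of a
// free-form string, via a SparkThrowable-like interface that per-error
// tests can assert on directly.
import java.util.Map;

interface SparkThrowableSketch {
  String getErrorClass();
}

class ParseErrorSketch extends RuntimeException implements SparkThrowableSketch {
  private final String errorClass;

  ParseErrorSketch(String errorClass, Map<String, String> params) {
    super("[" + errorClass + "] " + params);
    this.errorClass = errorClass;
  }

  public String getErrorClass() { return errorClass; }
}

class QueryParsingErrorsSketch {
  // One of the errors listed above, migrated to the error-class style
  // (the error-class name here is assumed, not Spark's real one).
  static ParseErrorSketch invalidPropertyKeyForSetQuotedConfigurationError(String key, String value) {
    return new ParseErrorSketch("INVALID_PROPERTY_KEY", Map.of("key", key, "value", value));
  }
}
```

A per-error test, as the ticket asks for in QueryParsingErrorsSuite, then reduces to asserting on the error class and message of the returned throwable.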
[jira] [Commented] (SPARK-39195) Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status
[ https://issues.apache.org/jira/browse/SPARK-39195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556916#comment-17556916 ] Apache Spark commented on SPARK-39195: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/36943 > Spark OutputCommitCoordinator should abort stage when committed file not > consistent with task status > > > Key: SPARK-39195 > URL: https://issues.apache.org/jira/browse/SPARK-39195 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39195) Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status
[ https://issues.apache.org/jira/browse/SPARK-39195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556915#comment-17556915 ] Apache Spark commented on SPARK-39195: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/36943 > Spark OutputCommitCoordinator should abort stage when committed file not > consistent with task status > > > Key: SPARK-39195 > URL: https://issues.apache.org/jira/browse/SPARK-39195 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39519) Test failure in SPARK-39387 with JDK 11
[ https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556911#comment-17556911 ] Yang Jie commented on SPARK-39519: -- I will continue to investigate this issue > Test failure in SPARK-39387 with JDK 11 > --- > > Key: SPARK-39519 > URL: https://issues.apache.org/jira/browse/SPARK-39519 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Yang Jie >Priority: Major > Attachments: image-2022-06-21-21-25-35-951.png, > image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, > image-2022-06-21-21-26-38-146.png > > > {code} > [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due > to overflow *** FAILED *** (3 seconds, 393 milliseconds) > [info] org.apache.spark.SparkException: Job aborted. > [info] at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279) > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171) > [info] at > 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) > [info] at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > {code} > https://github.com/apache/spark/runs/6919076419?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39519) Test failure in SPARK-39387 with JDK 11
[ https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556910#comment-17556910 ] Yang Jie commented on SPARK-39519: -- [~hyukjin.kwon] Sorry, I think we should reopen this issue. From the memory dump below, I found that `byte[]` occupies the most memory, and its content is 'X'. Given that, the most suspicious test is still `SPARK-39387: BytesColumnVector should not throw RuntimeException due to overflow`. !image-2022-06-21-21-26-06-586.png! !image-2022-06-21-21-26-26-563.png! !image-2022-06-21-21-26-38-146.png! > Test failure in SPARK-39387 with JDK 11 > --- > > Key: SPARK-39519 > URL: https://issues.apache.org/jira/browse/SPARK-39519 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Yang Jie >Priority: Major > Attachments: image-2022-06-21-21-25-35-951.png, > image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, > image-2022-06-21-21-26-38-146.png > > > {code} > [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due > to overflow *** FAILED *** (3 seconds, 393 milliseconds) > [info] org.apache.spark.SparkException: Job aborted. 
> [info] at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279) > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) > [info] at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > {code} > 
https://github.com/apache/spark/runs/6919076419?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39519) Test failure in SPARK-39387 with JDK 11
[ https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-39519: - Attachment: image-2022-06-21-21-26-26-563.png > Test failure in SPARK-39387 with JDK 11 > --- > > Key: SPARK-39519 > URL: https://issues.apache.org/jira/browse/SPARK-39519 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Yang Jie >Priority: Major > Attachments: image-2022-06-21-21-25-35-951.png, > image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, > image-2022-06-21-21-26-38-146.png > > > {code} > [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due > to overflow *** FAILED *** (3 seconds, 393 milliseconds) > [info] org.apache.spark.SparkException: Job aborted. > [info] at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279) > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171) > [info] at > 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) > [info] at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > {code} > https://github.com/apache/spark/runs/6919076419?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39519) Test failure in SPARK-39387 with JDK 11
[ https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-39519: - Attachment: image-2022-06-21-21-26-38-146.png > Test failure in SPARK-39387 with JDK 11 > --- > > Key: SPARK-39519 > URL: https://issues.apache.org/jira/browse/SPARK-39519 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Yang Jie >Priority: Major > Attachments: image-2022-06-21-21-25-35-951.png, > image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, > image-2022-06-21-21-26-38-146.png > > > {code} > [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due > to overflow *** FAILED *** (3 seconds, 393 milliseconds) > [info] org.apache.spark.SparkException: Job aborted. > [info] at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279) > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171) > [info] at > 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) > [info] at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > {code} > https://github.com/apache/spark/runs/6919076419?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39519) Test failure in SPARK-39387 with JDK 11
[ https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-39519: - Attachment: image-2022-06-21-21-26-06-586.png > Test failure in SPARK-39387 with JDK 11 > --- > > Key: SPARK-39519 > URL: https://issues.apache.org/jira/browse/SPARK-39519 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Yang Jie >Priority: Major > Attachments: image-2022-06-21-21-25-35-951.png, > image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, > image-2022-06-21-21-26-38-146.png > > > {code} > [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due > to overflow *** FAILED *** (3 seconds, 393 milliseconds) > [info] org.apache.spark.SparkException: Job aborted. > [info] at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279) > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171) > [info] at > 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) > [info] at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > {code} > https://github.com/apache/spark/runs/6919076419?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39519) Test failure in SPARK-39387 with JDK 11
[ https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-39519: - Attachment: image-2022-06-21-21-25-35-951.png > Test failure in SPARK-39387 with JDK 11 > --- > > Key: SPARK-39519 > URL: https://issues.apache.org/jira/browse/SPARK-39519 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Yang Jie >Priority: Major > Attachments: image-2022-06-21-21-25-35-951.png > > > {code} > [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due > to overflow *** FAILED *** (3 seconds, 393 milliseconds) > [info] org.apache.spark.SparkException: Job aborted. > [info] at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279) > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) > [info] at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) > [info] at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > {code} > https://github.com/apache/spark/runs/6919076419?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39519) Test failure in SPARK-39387 with JDK 11
[ https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556896#comment-17556896 ] Yang Jie commented on SPARK-39519: -- I got an OOM dump and will analyze it later. > Test failure in SPARK-39387 with JDK 11 > --- > > Key: SPARK-39519 > URL: https://issues.apache.org/jira/browse/SPARK-39519 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Yang Jie >Priority: Major > > {code} > [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due > to overflow *** FAILED *** (3 seconds, 393 milliseconds) > [info] org.apache.spark.SparkException: Job aborted. > [info] at > org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593) > [info] at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279) > [info] at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111) > [info] at > org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111) > [info] at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171) > [info] at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95) > [info] at > org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779) > [info] at > 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) > [info] at > org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > {code} > https://github.com/apache/spark/runs/6919076419?check_suite_focus=true -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance
[ https://issues.apache.org/jira/browse/SPARK-39545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556894#comment-17556894 ] Apache Spark commented on SPARK-39545: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/36942 > Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the > performance > - > > Key: SPARK-39545 > URL: https://issues.apache.org/jira/browse/SPARK-39545 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > ExpressionSet ++ with -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance
[ https://issues.apache.org/jira/browse/SPARK-39545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39545: Assignee: Apache Spark > Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the > performance > - > > Key: SPARK-39545 > URL: https://issues.apache.org/jira/browse/SPARK-39545 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > ExpressionSet ++ with -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance
[ https://issues.apache.org/jira/browse/SPARK-39545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556893#comment-17556893 ] Apache Spark commented on SPARK-39545: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/36942 > Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the > performance > - > > Key: SPARK-39545 > URL: https://issues.apache.org/jira/browse/SPARK-39545 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > ExpressionSet ++ with -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance
[ https://issues.apache.org/jira/browse/SPARK-39545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39545: Assignee: (was: Apache Spark) > Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the > performance > - > > Key: SPARK-39545 > URL: https://issues.apache.org/jira/browse/SPARK-39545 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > ExpressionSet ++ with -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance
[ https://issues.apache.org/jira/browse/SPARK-39545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-39545: - Description: ExpressionSet ++ with > Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the > performance > - > > Key: SPARK-39545 > URL: https://issues.apache.org/jira/browse/SPARK-39545 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > ExpressionSet ++ with -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
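The `ExpressionSet ++` cost behind this proposal can be sketched with a pure-Python analogy. The `ExprSet` class and `canon` helper below are illustrative assumptions, not Spark's actual `ExpressionSet`: the point is only that a generic, element-by-element concatenation re-canonicalizes everything, while an overridden bulk `concat` can copy the already-deduplicated backing structures and process only the new elements.

```python
# Illustrative pure-Python analogy of SPARK-39545 (not Spark's actual code):
# an order-preserving set keyed by a canonical form, with a specialized bulk
# `concat` instead of relying on inherited element-by-element appending.

def canon(expr: str) -> str:
    """Toy canonicalizer: expressions are equal up to case and whitespace."""
    return "".join(expr.lower().split())

class ExprSet:
    def __init__(self, items=()):
        self._base_set = set()    # canonical keys, for O(1) membership checks
        self._originals = []      # original expressions, insertion order
        for item in items:
            self.add(item)

    def add(self, item):
        key = canon(item)
        if key not in self._base_set:
            self._base_set.add(key)
            self._originals.append(item)

    def concat(self, other):
        """Bulk ++ : copy the receiver's backing structures once, then only
        canonicalize the incoming elements."""
        result = ExprSet()
        result._base_set = set(self._base_set)
        result._originals = list(self._originals)
        for item in other:
            result.add(item)
        return result

    def __iter__(self):
        return iter(self._originals)

    def __len__(self):
        return len(self._originals)

s = ExprSet(["a + b", "A+B", "c"])
t = s.concat(["c", "d"])
print(list(t))  # ['a + b', 'c', 'd']
```

In Scala 2.13 the inherited `SetOps.concat` presumably rebuilds the result by adding elements one at a time, which is what the proposed override avoids.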
[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk
[ https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koba updated SPARK-39544: - Issue Type: Bug (was: Improvement) > setPredictionCol for OneVsRest does not persist when saving model to disk > - > > Key: SPARK-39544 > URL: https://issues.apache.org/jira/browse/SPARK-39544 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1, 3.3.0 > Environment: Python 3.6 > Spark 3.2 >Reporter: koba >Priority: Major > > The naming of rawPredictionCol in OneVsRest does not persist after saving and > loading a trained model. This becomes an issue when I try to stack multiple > One Vs Rest models in a pipeline. Code example below. > {code:java} > from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel > data_path = "/sample_multiclass_classification_data.txt" > df = spark.read.format("libsvm").load(data_path) > lr = LinearSVC(regParam=0.01) > # set the name of rawPrediction column > ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction') > print(ovr.getRawPredictionCol()) > model = ovr.fit(df) > model_path = 'temp' + "/ovr_model" > # save and read back in > model.write().overwrite().save(model_path) > model2 = OneVsRestModel.load(model_path) > model2.getRawPredictionCol() > Output: > raw_prediction > 'rawPrediction' {code} > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance
Yang Jie created SPARK-39545: Summary: Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance Key: SPARK-39545 URL: https://issues.apache.org/jira/browse/SPARK-39545 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk
[ https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] koba updated SPARK-39544: - Description: The naming of rawPredictionCol in OneVsRest does not persist after saving and loading a trained model. This becomes an issue when I try to stack multiple One Vs Rest models in a pipeline. Code example below. {code:java} from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel data_path = "/sample_multiclass_classification_data.txt" df = spark.read.format("libsvm").load(data_path) lr = LinearSVC(regParam=0.01) # set the name of rawPrediction column ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction') print(ovr.getRawPredictionCol()) model = ovr.fit(df) model_path = 'temp' + "/ovr_model" # save and read back in model.write().overwrite().save(model_path) model2 = OneVsRestModel.load(model_path) model2.getRawPredictionCol() Output: raw_prediction 'rawPrediction' {code} was: The naming of rawPredictionCol in OneVsRest does not persist after saving and loading a trained model. This becomes an issue when I try to stack multiple One Vs Rest models in a pipeline. Code example below. 
{code:java} from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel data_path = "/sample_multiclass_classification_data.txt" df = spark.read.format("libsvm").load(data_path) lr = LinearSVC(regParam=0.01) # set the name of rawPrediction column ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction') print(ovr.getRawPredictionCol()) model = ovr.fit(df) model_path = 'temp' + "/ovr_model" # save and read back in model.write().overwrite().save(model_path) model2 = OneVsRestModel.load(model_path) model2.getRawPredictionCol() Output: raw_prediction 'rawPrediction' {code} > setPredictionCol for OneVsRest does not persist when saving model to disk > - > > Key: SPARK-39544 > URL: https://issues.apache.org/jira/browse/SPARK-39544 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, > 3.2.1, 3.3.0 > Environment: Python 3.6 > Spark 3.2 >Reporter: koba >Priority: Major > > The naming of rawPredictionCol in OneVsRest does not persist after saving and > loading a trained model. This becomes an issue when I try to stack multiple > One Vs Rest models in a pipeline. Code example below. 
> {code:java} > from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel > data_path = "/sample_multiclass_classification_data.txt" > df = spark.read.format("libsvm").load(data_path) > lr = LinearSVC(regParam=0.01) > # set the name of rawPrediction column > ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction') > print(ovr.getRawPredictionCol()) > model = ovr.fit(df) > model_path = 'temp' + "/ovr_model" > # save and read back in > model.write().overwrite().save(model_path) > model2 = OneVsRestModel.load(model_path) > model2.getRawPredictionCol() > Output: > raw_prediction > 'rawPrediction' {code} > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk
koba created SPARK-39544: Summary: setPredictionCol for OneVsRest does not persist when saving model to disk Key: SPARK-39544 URL: https://issues.apache.org/jira/browse/SPARK-39544 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.3.0, 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0, 3.0.3, 3.0.2, 3.0.1, 3.0.0 Environment: Python 3.6 Spark 3.2 Reporter: koba The naming of `rawPredictionCol` in `OneVsRest` does not persist after saving and loading a trained model. This becomes an issue when I try to stack multiple One Vs Rest models in a pipeline. Code example below. {code:java} from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel data_path = "/sample_multiclass_classification_data.txt" df = spark.read.format("libsvm").load(data_path) lr = LinearSVC(regParam=0.01) # set the name of rawPrediction column ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction') print(ovr.getRawPredictionCol()) model = ovr.fit(df) model_path = 'temp' + "/ovr_model" # save and read back in model.write().overwrite().save(model_path) model2 = OneVsRestModel.load(model_path) model2.getRawPredictionCol() Output: raw_prediction 'rawPrediction' {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
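The failure mode reported above can be mimicked outside Spark. Everything below is a toy sketch with invented names, not PySpark's actual persistence code: it only shows that when a model writer serializes a subset of params into its metadata, any param left out (such as `rawPredictionCol`) silently reverts to its default after a save/load round trip.

```python
# Toy illustration of the SPARK-39544 failure mode (hypothetical names, not
# PySpark's persistence code): a writer that omits a param from the saved
# metadata causes the loaded model to fall back to that param's default.
import json

DEFAULTS = {"predictionCol": "prediction", "rawPredictionCol": "rawPrediction"}

class ToyModel:
    def __init__(self, **params):
        self.params = {**DEFAULTS, **params}

    def save(self, persisted_keys):
        # Bug analogue: only `persisted_keys` make it into the metadata.
        return json.dumps({k: v for k, v in self.params.items()
                           if k in persisted_keys})

    @classmethod
    def load(cls, metadata):
        return cls(**json.loads(metadata))

m = ToyModel(rawPredictionCol="raw_prediction")
# Writer that forgets rawPredictionCol -> the custom name is lost on load:
lost = ToyModel.load(m.save({"predictionCol"}))
print(lost.params["rawPredictionCol"])   # rawPrediction (default again)
# Writer that persists it -> the custom name survives the round trip:
kept = ToyModel.load(m.save({"predictionCol", "rawPredictionCol"}))
print(kept.params["rawPredictionCol"])   # raw_prediction
```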
[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv
[ https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556859#comment-17556859 ] Hyukjin Kwon commented on SPARK-38292: -- you could try to leverage an approach like https://github.com/apache/spark/pull/36294 to set empty or null values as non-existent values. > Support `na_filter` for pyspark.pandas.read_csv > --- > > Key: SPARK-38292 > URL: https://issues.apache.org/jira/browse/SPARK-38292 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > pandas supports the `na_filter` parameter for the `read_csv` function. > (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) > We also want to support this to follow the behavior of pandas. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv
[ https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556858#comment-17556858 ] Hyukjin Kwon commented on SPARK-38292: -- can we control this via the CSV options, e.g., emptyValue or nullValue? > Support `na_filter` for pyspark.pandas.read_csv > --- > > Key: SPARK-38292 > URL: https://issues.apache.org/jira/browse/SPARK-38292 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > pandas supports the `na_filter` parameter for the `read_csv` function. > (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) > We also want to support this to follow the behavior of pandas. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
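Following the suggestion in these comments, one conceivable shape for the fix is to translate `na_filter=False` into Spark CSV reader options. The helper below is a hypothetical sketch of that mapping; the function name, the sentinel `nullValue`, and the exact options chosen are assumptions, not Spark's implementation.

```python
# Hypothetical sketch (not Spark's code) of mapping pandas' na_filter flag
# onto Spark CSV reader options, along the lines suggested in the comments.

def csv_options_for_na_filter(na_filter: bool, base_options=None) -> dict:
    options = dict(base_options or {})
    if not na_filter:
        # With na_filter=False, pandas keeps empty fields as empty strings
        # instead of converting them to NaN; a Spark-side approximation is
        # to stop treating "" as a null/empty marker.
        options.setdefault("nullValue", "\u0000")  # sentinel unlikely to occur
        options.setdefault("emptyValue", "")
    return options

print(csv_options_for_na_filter(True))   # {}
print(csv_options_for_na_filter(False))
```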
[jira] [Assigned] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1
[ https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39543: Assignee: Apache Spark > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1 > > > Key: SPARK-39543 > URL: https://issues.apache.org/jira/browse/SPARK-39543 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: yikf >Assignee: Apache Spark >Priority: Minor > Fix For: 3.4.0 > > > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1, to support something such as compressed formats, example: > spark.range(0, 100).writeTo("t1").option("compression", > "zstd").using("parquet").create > *before* > gen: part-0-644a65ed-0e7a-43d5-8d30-b610a0fb19dc-c000.snappy.parquet > *after* > gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ... > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1
[ https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556857#comment-17556857 ] Apache Spark commented on SPARK-39543: -- User 'Yikf' has created a pull request for this issue: https://github.com/apache/spark/pull/36941 > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1 > > > Key: SPARK-39543 > URL: https://issues.apache.org/jira/browse/SPARK-39543 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: yikf >Priority: Minor > Fix For: 3.4.0 > > > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1, to support something such as compressed formats, example: > spark.range(0, 100).writeTo("t1").option("compression", > "zstd").using("parquet").create > *before* > gen: part-0-644a65ed-0e7a-43d5-8d30-b610a0fb19dc-c000.snappy.parquet > *after* > gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ... > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1
[ https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39543: Assignee: (was: Apache Spark) > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1 > > > Key: SPARK-39543 > URL: https://issues.apache.org/jira/browse/SPARK-39543 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: yikf >Priority: Minor > Fix For: 3.4.0 > > > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1, to support something such as compressed formats, example: > spark.range(0, 100).writeTo("t1").option("compression", > "zstd").using("parquet").create > *before* > gen: part-0-644a65ed-0e7a-43d5-8d30-b610a0fb19dc-c000.snappy.parquet > *after* > gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ... > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1
[ https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yikf updated SPARK-39543: - Description: The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1, to support something such as compressed formats, example: spark.range(0, 100).writeTo("t1").option("compression", "zstd").using("parquet").create *before* gen: part-0-644a65ed-0e7a-43d5-8d30-b610a0fb19dc-c000.snappy.parquet *after* gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ... was: The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1, to support something such as compressed formats, example: *before* *after* `spark.range(0, 100).writeTo("t1").option("compression", "zstd").using("parquet").create` gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ... > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1 > > > Key: SPARK-39543 > URL: https://issues.apache.org/jira/browse/SPARK-39543 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: yikf >Priority: Minor > Fix For: 3.4.0 > > > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1, to support something such as compressed formats, example: > spark.range(0, 100).writeTo("t1").option("compression", > "zstd").using("parquet").create > *before* > gen: part-0-644a65ed-0e7a-43d5-8d30-b610a0fb19dc-c000.snappy.parquet > *after* > gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ... > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv
[ https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556851#comment-17556851 ] pralabhkumar commented on SPARK-38292: -- [~itholic] [~hyukjin.kwon] I'd like to discuss the logic. The difference appears with na_filter=False when there are missing values, e.g.: 22,,1980-09-26 33,,1980-09-26 Pandas with na_filter=False reads the data as-is; Spark, however, reads missing values as null. This happens because of univocity-parsers, which reads a missing value as null. Proposed approach for na_filter: once the file is read in namespace.py via reader.csv(path), replace the missing values with empty strings (df.fillna("")). We also need to change the datatype of the columns to string (as pandas does). Please let me know if this is the correct direction, and I'll create a PR. > Support `na_filter` for pyspark.pandas.read_csv > --- > > Key: SPARK-38292 > URL: https://issues.apache.org/jira/browse/SPARK-38292 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Haejoon Lee >Priority: Major > > pandas supports the `na_filter` parameter for the `read_csv` function. > (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) > We also want to support this to follow the behavior of pandas.
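The semantic difference discussed above can be illustrated without Spark or pandas. The sketch below is a deliberately simplified toy model of pandas' `na_filter` flag (the real option also handles configurable NA tokens, quoting, and dtype inference):

```python
import csv
import io
import math

def read_csv_rows(text, na_filter=True):
    """Toy model of pandas.read_csv's na_filter flag: with na_filter=True,
    empty fields become NaN; with na_filter=False they stay as strings."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if na_filter:
            rows.append([math.nan if field == "" else field for field in record])
        else:
            rows.append(list(record))  # keep empty fields as "" verbatim
    return rows

data = "22,,1980-09-26\n33,,1980-09-26\n"
filtered = read_csv_rows(data, na_filter=True)     # middle field becomes NaN
unfiltered = read_csv_rows(data, na_filter=False)  # middle field stays ""
```

The proposed fix mirrors the `na_filter=False` branch: after Spark has read the file (producing nulls), replace nulls with empty strings and keep the columns as strings.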
[jira] [Updated] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1
[ https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yikf updated SPARK-39543: - Description: The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1, to support something such as compressed formats, example: *before* *after* `spark.range(0, 100).writeTo("t1").option("compression", "zstd").using("parquet").create` gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ... was: The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1, to support something such as compressed formats, example: *before* *after* > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1 > > > Key: SPARK-39543 > URL: https://issues.apache.org/jira/browse/SPARK-39543 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: yikf >Priority: Minor > Fix For: 3.4.0 > > > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1, to support something such as compressed formats, example: > *before* > > *after* > `spark.range(0, 100).writeTo("t1").option("compression", > "zstd").using("parquet").create` > gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ... > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39515) Improve/recover scheduled jobs in GitHub Actions
[ https://issues.apache.org/jira/browse/SPARK-39515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39515: Assignee: Hyukjin Kwon > Improve/recover scheduled jobs in GitHub Actions > > > Key: SPARK-39515 > URL: https://issues.apache.org/jira/browse/SPARK-39515 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Blocker > > There are five problems to address. > *First*, the scheduled jobs are broken as below: > https://github.com/apache/spark/actions/runs/2513261706 > https://github.com/apache/spark/actions/runs/2512750310 > https://github.com/apache/spark/actions/runs/2509238648 > https://github.com/apache/spark/actions/runs/2508246903 > https://github.com/apache/spark/actions/runs/2507327914 > https://github.com/apache/spark/actions/runs/2506654808 > https://github.com/apache/spark/actions/runs/2506143939 > https://github.com/apache/spark/actions/runs/2502449498 > https://github.com/apache/spark/actions/runs/2501400490 > https://github.com/apache/spark/actions/runs/2500407628 > https://github.com/apache/spark/actions/runs/2499722093 > https://github.com/apache/spark/actions/runs/2499196539 > https://github.com/apache/spark/actions/runs/2496544415 > https://github.com/apache/spark/actions/runs/2495444227 > https://github.com/apache/spark/actions/runs/2493402272 > https://github.com/apache/spark/actions/runs/2492759618 > https://github.com/apache/spark/actions/runs/2492227816 > See also https://github.com/apache/spark/pull/36899 or > https://github.com/apache/spark/pull/36890 > In the master branch, seems like at least Hadoop 2 build is broken currently. > *Second*, it is very difficult to navigate scheduled jobs now. We should use > https://github.com/apache/spark/actions/workflows/build_and_test.yml?query=event%3Aschedule > link and manually search one by one. 
> Since GitHub added the feature to reuse other workflows, we should leverage > this feature, see also > https://github.com/apache/spark/blob/master/.github/workflows/build_and_test_ansi.yml > and https://docs.github.com/en/actions/using-workflows/reusing-workflows. > Once we can separate them, each will be defined as a separate workflow. > Namely, each scheduled job should be classified under "All workflows" at > https://github.com/apache/spark/actions so other developers can easily track > them. > *Third*, we should set the scheduled jobs for branch-3.3, see also > https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L78-L83 > for the branch-3.2 job. > *Fourth*, we should improve the duplicated-test-skipping logic. See also > https://github.com/apache/spark/pull/36413#issuecomment-1157205469 and > https://github.com/apache/spark/pull/36888 > *Fifth*, we should probably replace the base image > (https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L302, > https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage) with a plain > Ubuntu image with a Docker image cache. See also > https://github.com/docker/build-push-action/blob/master/docs/advanced/cache.md
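The second point, separating each scheduled job into its own workflow via GitHub's reusable-workflows feature, could look roughly like the following. The file name, cron schedule, and `branch` input are illustrative assumptions, not Spark's actual configuration:

```yaml
# Hypothetical caller workflow: a scheduled job defined as its own workflow
# that reuses the main build via GitHub's workflow_call mechanism, so it
# shows up as a distinct entry under "All workflows".
name: "Build and test (branch-3.3, scheduled)"
on:
  schedule:
    - cron: "0 4 * * *"   # illustrative schedule
jobs:
  build:
    uses: ./.github/workflows/build_and_test.yml
    with:
      branch: branch-3.3   # hypothetical input
```

For this to work, the reused workflow would need to declare `on: workflow_call` with a matching `branch` input, as described in the reusing-workflows documentation linked above.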
[jira] [Updated] (SPARK-39074) Fail on uploading test files, not when downloading them
[ https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39074: - Parent: (was: SPARK-39515) Issue Type: Bug (was: Sub-task) > Fail on uploading test files, not when downloading them > --- > > Key: SPARK-39074 > URL: https://issues.apache.org/jira/browse/SPARK-39074 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Minor > > The CI workflow "Report test results" fails when there are no artifacts to be > downloaded from the triggering workflow. In some situations, the triggering > workflow is not skipped, but all test jobs are skipped in case no code > changes are detected. > In that situation, no test files are uploaded, which makes the triggered > workflow fail. > Downloading no test files can have two reasons: > 1. No tests have been executed or no test files have been generated. > 2. No code has been built and tested deliberately. > You want to be notified in the first situation to fix the CI. Therefore, CI > should fail when code is built and tests are run but no test result files are > been found. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39529) Refactor and merge all related job selection logic into precondition
[ https://issues.apache.org/jira/browse/SPARK-39529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556837#comment-17556837 ] Apache Spark commented on SPARK-39529: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/36940 > Refactor and merge all related job selection logic into precondition > - > > Key: SPARK-39529 > URL: https://issues.apache.org/jira/browse/SPARK-39529 > Project: Spark > Issue Type: Sub-task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently there are three pieces of logic that choose which build to run: > first, configure-jobs; > second, precondition; > third, the type of job (whether or not it is scheduled). > We should merge them all into precondition.
[jira] [Updated] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1
[ https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yikf updated SPARK-39543: - Summary: The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1 (was: The option of DataFrameWriterV2 should be passed to storage Properties if fallback to v1) > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1 > > > Key: SPARK-39543 > URL: https://issues.apache.org/jira/browse/SPARK-39543 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: yikf >Priority: Minor > Fix For: 3.4.0 > > > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1, to support something such as compressed formats, example: > *before* > > *after* > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage Properties if fallback to v1
[ https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yikf updated SPARK-39543: - Summary: The option of DataFrameWriterV2 should be passed to storage Properties if fallback to v1 (was: The option of DataFrameWriterV2 should be passed to storage Properties) > The option of DataFrameWriterV2 should be passed to storage Properties if > fallback to v1 > > > Key: SPARK-39543 > URL: https://issues.apache.org/jira/browse/SPARK-39543 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: yikf >Priority: Minor > Fix For: 3.4.0 > > > The option of DataFrameWriterV2 should be passed to storage Properties, to > support something such as compressed formats, example: > **before** > > **after** > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage Properties if fallback to v1
[ https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yikf updated SPARK-39543: - Description: The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1, to support something such as compressed formats, example: *before* *after* was: The option of DataFrameWriterV2 should be passed to storage Properties, to support something such as compressed formats, example: **before** **after** > The option of DataFrameWriterV2 should be passed to storage Properties if > fallback to v1 > > > Key: SPARK-39543 > URL: https://issues.apache.org/jira/browse/SPARK-39543 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: yikf >Priority: Minor > Fix For: 3.4.0 > > > The option of DataFrameWriterV2 should be passed to storage properties if > fallback to v1, to support something such as compressed formats, example: > *before* > > *after* > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage Properties
yikf created SPARK-39543: Summary: The option of DataFrameWriterV2 should be passed to storage Properties Key: SPARK-39543 URL: https://issues.apache.org/jira/browse/SPARK-39543 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: yikf Fix For: 3.4.0 The option of DataFrameWriterV2 should be passed to storage Properties, to support something such as compressed formats, example: **before** **after** -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39542) Improve YARN client mode to support IPv6
[ https://issues.apache.org/jira/browse/SPARK-39542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556818#comment-17556818 ] Apache Spark commented on SPARK-39542: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/36939 > Improve YARN client mode to support IPv6 > > > Key: SPARK-39542 > URL: https://issues.apache.org/jira/browse/SPARK-39542 > Project: Spark > Issue Type: Sub-task > Components: PySpark, YARN >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39542) Improve YARN client mode to support IPv6
[ https://issues.apache.org/jira/browse/SPARK-39542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39542: Assignee: (was: Apache Spark) > Improve YARN client mode to support IPv6 > > > Key: SPARK-39542 > URL: https://issues.apache.org/jira/browse/SPARK-39542 > Project: Spark > Issue Type: Sub-task > Components: PySpark, YARN >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39542) Improve YARN client mode to support IPv6
[ https://issues.apache.org/jira/browse/SPARK-39542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39542: Assignee: Apache Spark > Improve YARN client mode to support IPv6 > > > Key: SPARK-39542 > URL: https://issues.apache.org/jira/browse/SPARK-39542 > Project: Spark > Issue Type: Sub-task > Components: PySpark, YARN >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39542) Improve YARN client mode to support IPv6
[ https://issues.apache.org/jira/browse/SPARK-39542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-39542: -- Component/s: PySpark > Improve YARN client mode to support IPv6 > > > Key: SPARK-39542 > URL: https://issues.apache.org/jira/browse/SPARK-39542 > Project: Spark > Issue Type: Sub-task > Components: PySpark, YARN >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39542) Improve YARN client mode to support IPv6
Dongjoon Hyun created SPARK-39542: - Summary: Improve YARN client mode to support IPv6 Key: SPARK-39542 URL: https://issues.apache.org/jira/browse/SPARK-39542 Project: Spark Issue Type: Sub-task Components: YARN Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39541) [Yarn] Diagnostics of yarn UI did not display the exception of driver when driver exit before registerAM
[ https://issues.apache.org/jira/browse/SPARK-39541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556806#comment-17556806 ] liangyongyuan commented on SPARK-39541: --- I want to try to solve this problem. I already have a solution and have tested it. > [Yarn] Diagnostics of yarn UI did not display the exception of driver when > driver exit before registerAM > --- > > Key: SPARK-39541 > URL: https://issues.apache.org/jira/browse/SPARK-39541 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.3.0 >Reporter: liangyongyuan >Priority: Major > > If a job is submitted in YARN cluster mode and the driver exits before > registerAM, the Diagnostics section of the YARN UI does not show the exception > thrown by the driver. The YARN UI only shows: > Application application_xxx failed 1 times (global limit =10; local limit is > =1) due to AM Container for appattempt_xxx_01 exited with exitCode: 13 > > The user must check the Spark log to find the real reason. For example, the Spark > log shows: {code:java} > 2022-06-21,17:58:28,273 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: > User class threw exception: java.lang.ArithmeticException: / by zero > java.lang.ArithmeticException: / by zero > at org.examples.appErrorDemo3$.main(appErrorDemo3.scala:10) > at org.examples.appErrorDemo3.main(appErrorDemo3.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:736) > {code} > > The reason for this issue is that a driver that exits before registerAM never > calls unregisterAM, so the YARN UI cannot show the real diagnostic > information.
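The fix direction implied by the issue can be sketched as follows. This is a hypothetical, simplified model, not Spark's actual ApplicationMaster: the idea is to record the user class's failure so the real cause can be surfaced as YARN diagnostics even when the AM exits before registration.

```python
# Hypothetical sketch (not Spark's actual ApplicationMaster): capture the
# user class's exception so the real cause can be reported as diagnostics
# even when the driver dies before registerAM/unregisterAM ever run.
EXIT_EXC_USER_CLASS = 13  # the generic exit code YARN shows today

class ToyApplicationMaster:
    def __init__(self):
        self.registered = False
        self.diagnostics = None

    def run_user_class(self, user_main):
        try:
            user_main()
        except Exception as exc:
            # Record the real failure before the process exits.
            self.diagnostics = f"User class threw exception: {exc!r}"

    def final_report(self):
        # With recorded diagnostics, the YARN UI can show the real cause
        # instead of only "exited with exitCode: 13".
        if self.diagnostics is not None:
            return self.diagnostics
        return f"exited with exitCode: {EXIT_EXC_USER_CLASS}"

am = ToyApplicationMaster()
am.run_user_class(lambda: 1 // 0)  # mirrors the "/ by zero" in the report
report = am.final_report()
```

The real fix would additionally need to deliver the recorded diagnostics to the ResourceManager on the early-exit path; how that is wired up is left open here.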
[jira] [Created] (SPARK-39541) [Yarn] Diagnostics of yarn UI did not display the exception of driver when driver exit before registerAM
liangyongyuan created SPARK-39541: - Summary: [Yarn] Diagnostics of yarn UI did not display the exception of driver when driver exit before registerAM Key: SPARK-39541 URL: https://issues.apache.org/jira/browse/SPARK-39541 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 3.3.0 Reporter: liangyongyuan If a job is submitted in YARN cluster mode and the driver exits before registerAM, the Diagnostics section of the YARN UI does not show the exception thrown by the driver. The YARN UI only shows: Application application_xxx failed 1 times (global limit =10; local limit is =1) due to AM Container for appattempt_xxx_01 exited with exitCode: 13 The user must check the Spark log to find the real reason. For example, the Spark log shows: {code:java} 2022-06-21,17:58:28,273 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: User class threw exception: java.lang.ArithmeticException: / by zero java.lang.ArithmeticException: / by zero at org.examples.appErrorDemo3$.main(appErrorDemo3.scala:10) at org.examples.appErrorDemo3.main(appErrorDemo3.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:736) {code} The reason for this issue is that a driver that exits before registerAM never calls unregisterAM, so the YARN UI cannot show the real diagnostic information.