[jira] [Commented] (SPARK-39519) Test failure in SPARK-39387 with JDK 11

2022-06-21 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557229#comment-17557229
 ] 

Yang Jie commented on SPARK-39519:
--

The default -XX:NewRatio is 2; changing it to 3 for the sql/core module to 
enlarge the old generation may be enough. I'm testing it.
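
Concretely, the option under test is the following (where exactly this is wired 
into the sql/core test JVM arguments in the build is not shown here):
{noformat}
-XX:NewRatio=3
{noformat}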

 

> Test failure in SPARK-39387 with JDK 11
> ---
>
> Key: SPARK-39519
> URL: https://issues.apache.org/jira/browse/SPARK-39519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Yang Jie
>Priority: Major
> Attachments: image-2022-06-21-21-25-35-951.png, 
> image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, 
> image-2022-06-21-21-26-38-146.png
>
>
> {code}
> [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due 
> to overflow *** FAILED *** (3 seconds, 393 milliseconds)
> [info]   org.apache.spark.SparkException: Job aborted.
> [info]   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
> {code}
> https://github.com/apache/spark/runs/6919076419?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39536) to_date function is returning incorrect value

2022-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39536.
--
Resolution: Invalid

> to_date function is returning incorrect value
> -
>
> Key: SPARK-39536
> URL: https://issues.apache.org/jira/browse/SPARK-39536
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.1
> Environment: I'm facing this issue in databricks community edition. 
> I'm using DBR 10.4 LTS.
>Reporter: Sridhar Varanasi
>Priority: Major
> Attachments: to_date_issue.PNG
>
>
> Hi,
>  
> I have a dataframe with a column containing dates in string format. While 
> converting this column to date type using to_date, it gives incorrect date 
> values. Following is the example code.
>  
>  
> from pyspark.sql.functions import col, to_date
> 
> df = spark.createDataFrame(
>     [("11/25/1991",), ("1/2/1991",), ("11/30/1991",)], 
>     ['date_str']
> )
>  
> spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
>  
> df = df.withColumn('new_date', to_date(col('date_str'), 'mm/dd/'))
> display(df)
>  
>  
> In the above dataframe the date is converted correctly for the 2nd row, but 
> for the 1st and 3rd rows we get incorrect dates after conversion.
>  
>  
> Could you please look into this issue?
>  
> Thanks,
> Sridhar
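
A likely explanation for the Invalid resolution, for reference: in Spark's 
datetime patterns, lowercase 'mm' means minute-of-hour, while month-of-year is 
uppercase 'MM' (the pattern in the report also appears truncated). A minimal 
sketch of a corrected call, assuming the intended pattern was month/day/year; 
this snippet is not part of the original report:
{code:python}
from pyspark.sql.functions import col, to_date

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

df = spark.createDataFrame(
    [("11/25/1991",), ("1/2/1991",), ("11/30/1991",)],
    ['date_str']
)

# 'MM' is month-of-year; lowercase 'mm' would be parsed as minute-of-hour.
df = df.withColumn('new_date', to_date(col('date_str'), 'MM/dd/yyyy'))
df.show()
{code}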



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39549) How to get access to the data created in different Spark Applications

2022-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39549:
-
Component/s: (was: Project Infra)

> How to get access to the data created in different Spark Applications
> -
>
> Key: SPARK-39549
> URL: https://issues.apache.org/jira/browse/SPARK-39549
> Project: Spark
>  Issue Type: Question
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.3.0
>Reporter: Chenyang Zhang
>Priority: Major
>
> I am working on a project using PySpark and I am blocked because I want to 
> share data between different Spark applications. The situation is that we 
> have a running Java server which handles incoming requests with a thread 
> pool, and each thread has a corresponding Python process. We want to use 
> pandas on Spark, but have it so that any of the Python processes can access 
> the same data in Spark. For example, in one Python process we created a 
> SparkSession, read some data, and modified the data using the pandas-on-Spark 
> API, and now we want to access that data from a different Python process. The 
> core problem is how to share data between different SparkSessions, or how to 
> let different Python processes connect to the same SparkSession. I researched 
> a bit, but it seems impossible to share data between different Python 
> processes without using an external DB or connecting to the same 
> SparkSession. Generally, is this possible, and what would be the recommended 
> way to do this with the least impact on performance?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39549) How to get access to the data created in different Spark Applications

2022-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-39549.
--
Resolution: Invalid

For questions, let's leverage the Spark mailing list.

> How to get access to the data created in different Spark Applications
> -
>
> Key: SPARK-39549
> URL: https://issues.apache.org/jira/browse/SPARK-39549
> Project: Spark
>  Issue Type: Question
>  Components: Pandas API on Spark, Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Chenyang Zhang
>Priority: Major
>
> I am working on a project using PySpark and I am blocked because I want to 
> share data between different Spark applications. The situation is that we 
> have a running Java server which handles incoming requests with a thread 
> pool, and each thread has a corresponding Python process. We want to use 
> pandas on Spark, but have it so that any of the Python processes can access 
> the same data in Spark. For example, in one Python process we created a 
> SparkSession, read some data, and modified the data using the pandas-on-Spark 
> API, and now we want to access that data from a different Python process. The 
> core problem is how to share data between different SparkSessions, or how to 
> let different Python processes connect to the same SparkSession. I researched 
> a bit, but it seems impossible to share data between different Python 
> processes without using an external DB or connecting to the same 
> SparkSession. Generally, is this possible, and what would be the recommended 
> way to do this with the least impact on performance?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39549) How to get access to the data created in different Spark Applications

2022-06-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557202#comment-17557202
 ] 

Hyukjin Kwon commented on SPARK-39549:
--

You should either write the data into a file or a table and read it in a 
different Spark application, or implement logic to share one Spark session 
(e.g., as Zeppelin does).
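
A minimal sketch of the first option, assuming a shared storage location that 
both applications can reach (the path and column name below are made up):
{code:python}
import pyspark.pandas as ps

# Application / Python process 1: prepare the data with the pandas-on-Spark
# API and persist it to a shared location (it could also be saved as a table).
psdf = ps.DataFrame({"value": [1, 2, 3]})
psdf.to_parquet("/shared/path/my_dataset")

# Application / Python process 2: a different SparkSession reads it back.
psdf2 = ps.read_parquet("/shared/path/my_dataset")
print(psdf2.head())
{code}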

> How to get access to the data created in different Spark Applications
> -
>
> Key: SPARK-39549
> URL: https://issues.apache.org/jira/browse/SPARK-39549
> Project: Spark
>  Issue Type: Question
>  Components: Pandas API on Spark, Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Chenyang Zhang
>Priority: Major
>
> I am working on a project using PySpark and I am blocked because I want to 
> share data between different Spark applications. The situation is that we 
> have a running Java server which handles incoming requests with a thread 
> pool, and each thread has a corresponding Python process. We want to use 
> pandas on Spark, but have it so that any of the Python processes can access 
> the same data in Spark. For example, in one Python process we created a 
> SparkSession, read some data, and modified the data using the pandas-on-Spark 
> API, and now we want to access that data from a different Python process. The 
> core problem is how to share data between different SparkSessions, or how to 
> let different Python processes connect to the same SparkSession. I researched 
> a bit, but it seems impossible to share data between different Python 
> processes without using an external DB or connecting to the same 
> SparkSession. Generally, is this possible, and what would be the recommended 
> way to do this with the least impact on performance?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39551) Add AQE invalid plan check

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557201#comment-17557201
 ] 

Apache Spark commented on SPARK-39551:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/36953

> Add AQE invalid plan check
> --
>
> Key: SPARK-39551
> URL: https://issues.apache.org/jira/browse/SPARK-39551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wei Xue
>Priority: Minor
>
> AQE logical optimization rules can lead to invalid physical plans as certain 
> physical plan nodes are not compatible with others. E.g., 
> `BroadcastExchangeExec` can only work as a direct child of broadcast join 
> nodes.
> Logical optimizations, on the other hand, are not (and should not be) aware 
> of such restrictions. So a general solution here is to check for invalid 
> plans and throw exceptions, which can be caught by AQE replanning process. 
> And if such an exception is captured, AQE can void the current replanning 
> result and keep using the latest valid plan.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39551) Add AQE invalid plan check

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557199#comment-17557199
 ] 

Apache Spark commented on SPARK-39551:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/36953

> Add AQE invalid plan check
> --
>
> Key: SPARK-39551
> URL: https://issues.apache.org/jira/browse/SPARK-39551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wei Xue
>Priority: Minor
>
> AQE logical optimization rules can lead to invalid physical plans as certain 
> physical plan nodes are not compatible with others. E.g., 
> `BroadcastExchangeExec` can only work as a direct child of broadcast join 
> nodes.
> Logical optimizations, on the other hand, are not (and should not be) aware 
> of such restrictions. So a general solution here is to check for invalid 
> plans and throw exceptions, which can be caught by AQE replanning process. 
> And if such an exception is captured, AQE can void the current replanning 
> result and keep using the latest valid plan.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39551) Add AQE invalid plan check

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39551:


Assignee: (was: Apache Spark)

> Add AQE invalid plan check
> --
>
> Key: SPARK-39551
> URL: https://issues.apache.org/jira/browse/SPARK-39551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wei Xue
>Priority: Minor
>
> AQE logical optimization rules can lead to invalid physical plans as certain 
> physical plan nodes are not compatible with others. E.g., 
> `BroadcastExchangeExec` can only work as a direct child of broadcast join 
> nodes.
> Logical optimizations, on the other hand, are not (and should not be) aware 
> of such restrictions. So a general solution here is to check for invalid 
> plans and throw exceptions, which can be caught by AQE replanning process. 
> And if such an exception is captured, AQE can void the current replanning 
> result and keep using the latest valid plan.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39551) Add AQE invalid plan check

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39551:


Assignee: Apache Spark

> Add AQE invalid plan check
> --
>
> Key: SPARK-39551
> URL: https://issues.apache.org/jira/browse/SPARK-39551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wei Xue
>Assignee: Apache Spark
>Priority: Minor
>
> AQE logical optimization rules can lead to invalid physical plans as certain 
> physical plan nodes are not compatible with others. E.g., 
> `BroadcastExchangeExec` can only work as a direct child of broadcast join 
> nodes.
> Logical optimizations, on the other hand, are not (and should not be) aware 
> of such restrictions. So a general solution here is to check for invalid 
> plans and throw exceptions, which can be caught by AQE replanning process. 
> And if such an exception is captured, AQE can void the current replanning 
> result and keep using the latest valid plan.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39551) Add AQE invalid plan check

2022-06-21 Thread Wei Xue (Jira)
Wei Xue created SPARK-39551:
---

 Summary: Add AQE invalid plan check
 Key: SPARK-39551
 URL: https://issues.apache.org/jira/browse/SPARK-39551
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Wei Xue


AQE logical optimization rules can lead to invalid physical plans as certain 
physical plan nodes are not compatible with others. E.g., 
`BroadcastExchangeExec` can only work as a direct child of broadcast join nodes.

Logical optimizations, on the other hand, are not (and should not be) aware of 
such restrictions. So a general solution here is to check for invalid plans and 
throw exceptions, which can be caught by AQE replanning process. And if such an 
exception is captured, AQE can void the current replanning result and keep 
using the latest valid plan.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance

2022-06-21 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-39545:
-
Description: 
The {{ExpressionSet ++}} method in the master branch is a little slower than in 
branch-3.3 with Scala 2.13.

 

For example, write a microbenchmark as follows and run with Scala 2.13:
{code:java}
val valuesPerIteration = 10
val benchmark = new Benchmark("Test ExpressionSet ++ ", valuesPerIteration, 
output = output)
val aUpper = AttributeReference("A", IntegerType)(exprId = ExprId(1))
val initialSet = ExpressionSet(aUpper + 1 :: Rand(0) :: Nil)
val setToAddWithSameDeterministicExpression = ExpressionSet(aUpper + 1 :: 
Rand(0) :: Nil)

benchmark.addCase("Test ++") { _: Int =>
  for (_ <- 0L until valuesPerIteration) {
initialSet ++ setToAddWithSameDeterministicExpression
  }
}

benchmark.run() {code}
*branch-3.3 result:*
 
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 4.14.0_1-0-0-45
Intel(R) Xeon(R) Gold 6XXXC CPU @ 2.60GHz
Test ExpressionSet ++ :                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Test ++                                              14             16           4          7.2         139.1       1.0X
 {code}
 
*master result :*
{code:java}
OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 4.14.0_1-0-0-45
Intel(R) Xeon(R) Gold 6XXXC CPU @ 2.60GHz
Test ExpressionSet ++ :                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Test ++                                              16             19           5          6.1         163.9       1.0X
 {code}

  was:ExpressionSet ++ with 


> Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the 
> performance
> -
>
> Key: SPARK-39545
> URL: https://issues.apache.org/jira/browse/SPARK-39545
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> The {{ExpressionSet ++}} method in the master branch is a little slower than 
> in branch-3.3 with Scala 2.13.
>  
> For example, write a microbenchmark as follows and run with Scala 2.13:
> {code:java}
> val valuesPerIteration = 10
> val benchmark = new Benchmark("Test ExpressionSet ++ ", 
> valuesPerIteration, output = output)
> val aUpper = AttributeReference("A", IntegerType)(exprId = ExprId(1))
> val initialSet = ExpressionSet(aUpper + 1 :: Rand(0) :: Nil)
> val setToAddWithSameDeterministicExpression = ExpressionSet(aUpper + 1 :: 
> Rand(0) :: Nil)
> benchmark.addCase("Test ++") { _: Int =>
>   for (_ <- 0L until valuesPerIteration) {
> initialSet ++ setToAddWithSameDeterministicExpression
>   }
> }
> benchmark.run() {code}
> *branch-3.3 result:*
>  
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 4.14.0_1-0-0-45
> Intel(R) Xeon(R) Gold 6XXXC CPU @ 2.60GHz
> Test ExpressionSet ++ :                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------------------------------
> Test ++                                              14             16           4          7.2         139.1       1.0X
>  {code}
>  
> *master result :*
> {code:java}
> OpenJDK 64-Bit Server VM 1.8.0_312-b07 on Linux 4.14.0_1-0-0-45
> Intel(R) Xeon(R) Gold 6XXXC CPU @ 2.60GHz
> Test ExpressionSet ++ :                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
> ------------------------------------------------------------------------------------------------------------------------
> Test ++                                              16             19           5          6.1         163.9       1.0X
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39541) [Yarn] Diagnostics of yarn UI did not display the exception of driver when driver exits before registerAM

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39541:


Assignee: (was: Apache Spark)

> [Yarn] Diagnostics of yarn UI did not display the exception of driver when 
> driver exits before registerAM
> ---
>
> Key: SPARK-39541
> URL: https://issues.apache.org/jira/browse/SPARK-39541
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.3.0
>Reporter: liangyongyuan
>Priority: Major
>
> If a job is submitted in YARN cluster mode and the driver exits before 
> registerAM, the Diagnostics section of the YARN UI does not show the exception 
> thrown by the driver. The YARN UI only shows:
> Application application_xxx failed 1 times (global limit =10; local limit is 
> =1) due to AM Container for appattempt_xxx_01 exited with exitCode: 13
>  
> The user must check the Spark log to find the real reason. For example, the 
> Spark log shows:
> {code:java}
> 2022-06-21,17:58:28,273 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: 
> User class threw exception: java.lang.ArithmeticException: / by zero
> java.lang.ArithmeticException: / by zero
>   at org.examples.appErrorDemo3$.main(appErrorDemo3.scala:10)
>   at org.examples.appErrorDemo3.main(appErrorDemo3.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:736)
>  {code}
>  
> The reason for this issue is that if the driver exits before registerAM, it 
> never calls unregisterAM, so the YARN UI cannot show the real diagnostic 
> information.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39541) [Yarn] Diagnostics of yarn UI did not display the exception of driver when driver exits before registerAM

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557190#comment-17557190
 ] 

Apache Spark commented on SPARK-39541:
--

User 'lyy-pineapple' has created a pull request for this issue:
https://github.com/apache/spark/pull/36952

> [Yarn] Diagnostics of yarn UI did not display the exception of driver when 
> driver exits before registerAM
> ---
>
> Key: SPARK-39541
> URL: https://issues.apache.org/jira/browse/SPARK-39541
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.3.0
>Reporter: liangyongyuan
>Priority: Major
>
> If a job is submitted in YARN cluster mode and the driver exits before 
> registerAM, the Diagnostics section of the YARN UI does not show the exception 
> thrown by the driver. The YARN UI only shows:
> Application application_xxx failed 1 times (global limit =10; local limit is 
> =1) due to AM Container for appattempt_xxx_01 exited with exitCode: 13
>  
> The user must check the Spark log to find the real reason. For example, the 
> Spark log shows:
> {code:java}
> 2022-06-21,17:58:28,273 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: 
> User class threw exception: java.lang.ArithmeticException: / by zero
> java.lang.ArithmeticException: / by zero
>   at org.examples.appErrorDemo3$.main(appErrorDemo3.scala:10)
>   at org.examples.appErrorDemo3.main(appErrorDemo3.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:736)
>  {code}
>  
> The reason for this issue is that if the driver exits before registerAM, it 
> never calls unregisterAM, so the YARN UI cannot show the real diagnostic 
> information.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39541) [Yarn] Diagnostics of yarn UI did not display the exception of driver when driver exits before registerAM

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39541:


Assignee: Apache Spark

> [Yarn] Diagnostics of yarn UI did not display the exception of driver when 
> driver exits before registerAM
> ---
>
> Key: SPARK-39541
> URL: https://issues.apache.org/jira/browse/SPARK-39541
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.3.0
>Reporter: liangyongyuan
>Assignee: Apache Spark
>Priority: Major
>
> If a job is submitted in YARN cluster mode and the driver exits before 
> registerAM, the Diagnostics section of the YARN UI does not show the exception 
> thrown by the driver. The YARN UI only shows:
> Application application_xxx failed 1 times (global limit =10; local limit is 
> =1) due to AM Container for appattempt_xxx_01 exited with exitCode: 13
>  
> The user must check the Spark log to find the real reason. For example, the 
> Spark log shows:
> {code:java}
> 2022-06-21,17:58:28,273 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: 
> User class threw exception: java.lang.ArithmeticException: / by zero
> java.lang.ArithmeticException: / by zero
>   at org.examples.appErrorDemo3$.main(appErrorDemo3.scala:10)
>   at org.examples.appErrorDemo3.main(appErrorDemo3.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:736)
>  {code}
>  
> The reason for this issue is that if the driver exits before registerAM, it 
> never calls unregisterAM, so the YARN UI cannot show the real diagnostic 
> information.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results

2022-06-21 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557179#comment-17557179
 ] 

Yuming Wang commented on SPARK-38614:
-

Thank you for reporting this issue. Workaround:
{code:sql}
set 
spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.LimitPushDownThroughWindow;
{code}
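
The same rule exclusion can be set from PySpark before running the query (a 
sketch using the standard runtime config API; the rule name is as above):
{code:python}
spark.conf.set(
    "spark.sql.optimizer.excludedRules",
    "org.apache.spark.sql.catalyst.optimizer.LimitPushDownThroughWindow",
)
{code}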


> After Spark update, df.show() shows incorrect F.percent_rank results
> 
>
> Key: SPARK-38614
> URL: https://issues.apache.org/jira/browse/SPARK-38614
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: ZygD
>Priority: Major
>  Labels: correctness
>
> Expected result is obtained using Spark 3.1.2, but not 3.2.0, 3.2.1 or 3.3.0.
> *Minimal reproducible example*
> {code:java}
> from pyspark.sql import SparkSession, functions as F, Window as W
> spark = SparkSession.builder.getOrCreate()
>  
> df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id')))
> df.show(3)
> df.show(5) {code}
> *Expected result*
> {code:java}
> +---++
> | id|  pr|
> +---++
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> +---++
> only showing top 3 rows
> +---++
> | id|  pr|
> +---++
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> |  3|0.03|
> |  4|0.04|
> +---++
> only showing top 5 rows{code}
> *Actual result*
> {code:java}
> +---+--+
> | id|pr|
> +---+--+
> |  0|   0.0|
> |  1|0.|
> |  2|0.|
> +---+--+
> only showing top 3 rows
> +---+---+
> | id| pr|
> +---+---+
> |  0|0.0|
> |  1|0.2|
> |  2|0.4|
> |  3|0.6|
> |  4|0.8|
> +---+---+
> only showing top 5 rows{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39533) Deprecate scoreLabelsWeight in BinaryClassificationMetrics

2022-06-21 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-39533:
-
Summary: Deprecate scoreLabelsWeight in BinaryClassificationMetrics  (was: 
Remove scoreLabelsWeight in BinaryClassificationMetrics)

> Deprecate scoreLabelsWeight in BinaryClassificationMetrics
> --
>
> Key: SPARK-39533
> URL: https://issues.apache.org/jira/browse/SPARK-39533
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Minor
>
> scoreLabelsWeight in BinaryClassificationMetrics is a public variable,
> but it should be private; moreover, it is only used once, so move it to the 
> internal call site.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39533) Deprecate scoreLabelsWeight in BinaryClassificationMetrics

2022-06-21 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-39533:
-
Description: 
scoreLabelsWeight in BinaryClassificationMetrics is a public variable,

but it should be private; moreover, it is only used once, so deprecate it now 
and remove it in 4.0.0.

  was:
scoreLabelsWeight in BinaryClassificationMetrics is a public variable,

but it should be private; moreover, it is only used once, so move it to the 
internal call site.


> Deprecate scoreLabelsWeight in BinaryClassificationMetrics
> --
>
> Key: SPARK-39533
> URL: https://issues.apache.org/jira/browse/SPARK-39533
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 3.4.0
>Reporter: zhengruifeng
>Priority: Minor
>
> scoreLabelsWeight in BinaryClassificationMetrics is a public variable,
> but it should be private; moreover, it is only used once, so deprecate it now 
> and remove it in 4.0.0.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39540) Upgrade mysql-connector-java to 8.0.28

2022-06-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-39540.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36938
[https://github.com/apache/spark/pull/36938]

> Upgrade mysql-connector-java to 8.0.28
> --
>
> Key: SPARK-39540
> URL: https://issues.apache.org/jira/browse/SPARK-39540
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.4.0
>
>
> Improper Handling of Insufficient Permissions or Privileges in MySQL 
> Connectors Java.
> Vulnerability in the MySQL Connectors product of Oracle MySQL (component: 
> Connector/J). Supported versions that are affected are 8.0.27 and prior. 
> Difficult to exploit vulnerability allows high privileged attacker with 
> network access via multiple protocols to compromise MySQL Connectors. 
> Successful attacks of this vulnerability can result in takeover of MySQL 
> Connectors. CVSS 3.1 Base Score 6.6 (Confidentiality, Integrity and 
> Availability impacts). CVSS Vector: 
> (CVSS:3.1/AV:N/AC:H/PR:H/UI:N/S:U/C:H/I:H/A:H).
> [CVE-2022-21363|https://nvd.nist.gov/vuln/detail/CVE-2022-21363] 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39540) Upgrade mysql-connector-java to 8.0.28

2022-06-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-39540:
-

Assignee: Bjørn Jørgensen

> Upgrade mysql-connector-java to 8.0.28
> --
>
> Key: SPARK-39540
> URL: https://issues.apache.org/jira/browse/SPARK-39540
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
>
> Improper Handling of Insufficient Permissions or Privileges in MySQL 
> Connectors Java.
> Vulnerability in the MySQL Connectors product of Oracle MySQL (component: 
> Connector/J). Supported versions that are affected are 8.0.27 and prior. 
> Difficult to exploit vulnerability allows high privileged attacker with 
> network access via multiple protocols to compromise MySQL Connectors. 
> Successful attacks of this vulnerability can result in takeover of MySQL 
> Connectors. CVSS 3.1 Base Score 6.6 (Confidentiality, Integrity and 
> Availability impacts). CVSS Vector: 
> (CVSS:3.1/AV:N/AC:H/PR:H/UI:N/S:U/C:H/I:H/A:H).
> [CVE-2022-21363|https://nvd.nist.gov/vuln/detail/CVE-2022-21363] 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39540) Upgrade mysql-connector-java to 8.0.29

2022-06-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39540:
--
Summary: Upgrade mysql-connector-java to 8.0.29  (was: Upgrade 
mysql-connector-java to 8.0.28)

> Upgrade mysql-connector-java to 8.0.29
> --
>
> Key: SPARK-39540
> URL: https://issues.apache.org/jira/browse/SPARK-39540
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Bjørn Jørgensen
>Priority: Major
> Fix For: 3.4.0
>
>
> Improper Handling of Insufficient Permissions or Privileges in MySQL 
> Connectors Java.
> Vulnerability in the MySQL Connectors product of Oracle MySQL (component: 
> Connector/J). Supported versions that are affected are 8.0.27 and prior. 
> Difficult to exploit vulnerability allows high privileged attacker with 
> network access via multiple protocols to compromise MySQL Connectors. 
> Successful attacks of this vulnerability can result in takeover of MySQL 
> Connectors. CVSS 3.1 Base Score 6.6 (Confidentiality, Integrity and 
> Availability impacts). CVSS Vector: 
> (CVSS:3.1/AV:N/AC:H/PR:H/UI:N/S:U/C:H/I:H/A:H).
> [CVE-2022-21363|https://nvd.nist.gov/vuln/detail/CVE-2022-21363] 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38614:


Assignee: Apache Spark

> After Spark update, df.show() shows incorrect F.percent_rank results
> 
>
> Key: SPARK-38614
> URL: https://issues.apache.org/jira/browse/SPARK-38614
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: ZygD
>Assignee: Apache Spark
>Priority: Major
>  Labels: correctness
>
> Expected result is obtained using Spark 3.1.2, but not 3.2.0, 3.2.1 or 3.3.0.
> *Minimal reproducible example*
> {code:java}
> from pyspark.sql import SparkSession, functions as F, Window as W
> spark = SparkSession.builder.getOrCreate()
>  
> df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id')))
> df.show(3)
> df.show(5) {code}
> *Expected result*
> {code:java}
> +---++
> | id|  pr|
> +---++
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> +---++
> only showing top 3 rows
> +---++
> | id|  pr|
> +---++
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> |  3|0.03|
> |  4|0.04|
> +---++
> only showing top 5 rows{code}
> *Actual result*
> {code:java}
> +---+--+
> | id|pr|
> +---+--+
> |  0|   0.0|
> |  1|0.|
> |  2|0.|
> +---+--+
> only showing top 3 rows
> +---+---+
> | id| pr|
> +---+---+
> |  0|0.0|
> |  1|0.2|
> |  2|0.4|
> |  3|0.6|
> |  4|0.8|
> +---+---+
> only showing top 5 rows{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557160#comment-17557160
 ] 

Apache Spark commented on SPARK-38614:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/36951

> After Spark update, df.show() shows incorrect F.percent_rank results
> 
>
> Key: SPARK-38614
> URL: https://issues.apache.org/jira/browse/SPARK-38614
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: ZygD
>Priority: Major
>  Labels: correctness
>
> Expected result is obtained using Spark 3.1.2, but not 3.2.0, 3.2.1 or 3.3.0.
> *Minimal reproducible example*
> {code:java}
> from pyspark.sql import SparkSession, functions as F, Window as W
> spark = SparkSession.builder.getOrCreate()
>  
> df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id')))
> df.show(3)
> df.show(5) {code}
> *Expected result*
> {code:java}
> +---++
> | id|  pr|
> +---++
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> +---++
> only showing top 3 rows
> +---++
> | id|  pr|
> +---++
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> |  3|0.03|
> |  4|0.04|
> +---++
> only showing top 5 rows{code}
> *Actual result*
> {code:java}
> +---+--+
> | id|pr|
> +---+--+
> |  0|   0.0|
> |  1|0.|
> |  2|0.|
> +---+--+
> only showing top 3 rows
> +---+---+
> | id| pr|
> +---+---+
> |  0|0.0|
> |  1|0.2|
> |  2|0.4|
> |  3|0.6|
> |  4|0.8|
> +---+---+
> only showing top 5 rows{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557159#comment-17557159
 ] 

Apache Spark commented on SPARK-38614:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/36951

> After Spark update, df.show() shows incorrect F.percent_rank results
> 
>
> Key: SPARK-38614
> URL: https://issues.apache.org/jira/browse/SPARK-38614
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: ZygD
>Priority: Major
>  Labels: correctness
>
> Expected result is obtained using Spark 3.1.2, but not 3.2.0, 3.2.1 or 3.3.0.
> *Minimal reproducible example*
> {code:java}
> from pyspark.sql import SparkSession, functions as F, Window as W
> spark = SparkSession.builder.getOrCreate()
>  
> df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id')))
> df.show(3)
> df.show(5) {code}
> *Expected result*
> {code:java}
> +---++
> | id|  pr|
> +---++
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> +---++
> only showing top 3 rows
> +---++
> | id|  pr|
> +---++
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> |  3|0.03|
> |  4|0.04|
> +---++
> only showing top 5 rows{code}
> *Actual result*
> {code:java}
> +---+--+
> | id|pr|
> +---+--+
> |  0|   0.0|
> |  1|0.|
> |  2|0.|
> +---+--+
> only showing top 3 rows
> +---+---+
> | id| pr|
> +---+---+
> |  0|0.0|
> |  1|0.2|
> |  2|0.4|
> |  3|0.6|
> |  4|0.8|
> +---+---+
> only showing top 5 rows{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38614) After Spark update, df.show() shows incorrect F.percent_rank results

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38614:


Assignee: (was: Apache Spark)

> After Spark update, df.show() shows incorrect F.percent_rank results
> 
>
> Key: SPARK-38614
> URL: https://issues.apache.org/jira/browse/SPARK-38614
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.2.0, 3.2.1, 3.3.0
>Reporter: ZygD
>Priority: Major
>  Labels: correctness
>
> Expected result is obtained using Spark 3.1.2, but not 3.2.0, 3.2.1 or 3.3.0.
> *Minimal reproducible example*
> {code:java}
> from pyspark.sql import SparkSession, functions as F, Window as W
> spark = SparkSession.builder.getOrCreate()
>  
> df = spark.range(101).withColumn('pr', F.percent_rank().over(W.orderBy('id')))
> df.show(3)
> df.show(5) {code}
> *Expected result*
> {code:java}
> +---++
> | id|  pr|
> +---++
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> +---++
> only showing top 3 rows
> +---++
> | id|  pr|
> +---++
> |  0| 0.0|
> |  1|0.01|
> |  2|0.02|
> |  3|0.03|
> |  4|0.04|
> +---++
> only showing top 5 rows{code}
> *Actual result*
> {code:java}
> +---+--+
> | id|pr|
> +---+--+
> |  0|   0.0|
> |  1|0.|
> |  2|0.|
> +---+--+
> only showing top 3 rows
> +---+---+
> | id| pr|
> +---+---+
> |  0|0.0|
> |  1|0.2|
> |  2|0.4|
> |  3|0.6|
> |  4|0.8|
> +---+---+
> only showing top 5 rows{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars

2022-06-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39494:
-
Description: 
Currently, DataFrame creation from a list of scalars is unsupported as below:
|>>> spark.createDataFrame([1, 2])
Traceback (most recent call last):
...
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <class 'int'>|

 

However, cases below are supported.
|>>> spark.createDataFrame([(1,), (2,)]).collect()
[Row(_1=1), Row(_1=2)]|

 
|>>> schema
StructType([StructField('_1', LongType(), True)])
>>> spark.createDataFrame([1, 2], schema=schema).collect()
[Row(_1=1), Row(_1=2)]|

 

In addition, Spark DataFrame Scala API supports creating a DataFrame from a  
list of scalars as below:
|scala> Seq(1, 2).toDF().collect()
res6: Array[org.apache.spark.sql.Row] = Array([1], [2])|

 

To maintain API consistency, we propose to support DataFrame creation from a 
list of scalars. See more at 

[https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing.|https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]

 

  was:
Currently, DataFrame creation from a list of scalars is unsupported as below:
|>>> spark.createDataFrame([1, 2])
Traceback (most recent call last):
...
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <class 'int'>|

 

However, cases below are supported.
|>>> spark.createDataFrame([(1,), (2,)]).collect()
[Row(_1=1), Row(_1=2)]|

 
|>>> schema
StructType([StructField('_1', LongType(), True)])
>>> spark.createDataFrame([1, 2], schema=schema).collect()
[Row(_1=1), Row(_1=2)]|

 

In addition, Spark DataFrame Scala API supports creating a DataFrame from a  
list of scalars as below:
|scala> Seq(1, 2).toDF().collect()
res6: Array[org.apache.spark.sql.Row] = Array([1], [2]|

 

To maintain API consistency, we propose to support DataFrame creation from a 
list of scalars. See more at 

[https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing.|https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]

 


> Support `createDataFrame` from a list of scalars
> 
>
> Key: SPARK-39494
> URL: https://issues.apache.org/jira/browse/SPARK-39494
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, DataFrame creation from a list of scalars is unsupported as below:
> |>>> spark.createDataFrame([1, 2])
> Traceback (most recent call last):
> ...
>     raise TypeError("Can not infer schema for type: %s" % type(row))
> TypeError: Can not infer schema for type: <class 'int'>|
>  
> However, cases below are supported.
> |>>> spark.createDataFrame([(1,), (2,)]).collect()
> [Row(_1=1), Row(_1=2)]|
>  
> |>>> schema
> StructType([StructField('_1', LongType(), True)])
> >>> spark.createDataFrame([1, 2], schema=schema).collect()
> [Row(_1=1), Row(_1=2)]|
>  
> In addition, Spark DataFrame Scala API supports creating a DataFrame from a  
> list of scalars as below:
> |scala> Seq(1, 2).toDF().collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1], [2])|
>  
> To maintain API consistency, we propose to support DataFrame creation from a 
> list of scalars. See more at 
> [https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing.|https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39496) Inline eval path cannot handle null structs

2022-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39496:
-
Fix Version/s: 3.1.3

> Inline eval path cannot handle null structs
> ---
>
> Key: SPARK-39496
> URL: https://issues.apache.org/jira/browse/SPARK-39496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.1.3, 3.2.2, 3.4.0, 3.3.1
>
>
> This issue is somewhat similar to SPARK-39061, but for the eval path rather 
> than the codegen path.
> Example:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> select inline(array(named_struct('a', 1, 'b', 2), null));
> {noformat}
> This results in a NullPointerException:
> {noformat}
> 22/06/16 15:10:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122)
> {noformat}
> The next example doesn't require setting {{spark.sql.codegen.wholeStage}} to 
> {{false}}:
> {noformat}
> val dfWide = (Seq((1))
>   .toDF("col0")
>   .selectExpr(Seq.tabulate(99)(x => s"$x as col${x + 1}"): _*))
> val df = (dfWide
>   .selectExpr("*", "array(named_struct('a', 1, 'b', 2), null) as struct_array"))
> df.selectExpr("*", "inline(struct_array)").collect
> {noformat}
> The result is similar:
> {noformat}
> 22/06/16 15:18:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 
> 1]
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_8$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39519) Test failure in SPARK-39387 with JDK 11

2022-06-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557155#comment-17557155
 ] 

Hyukjin Kwon commented on SPARK-39519:
--

Thanks for your investigation.

> Test failure in SPARK-39387 with JDK 11
> ---
>
> Key: SPARK-39519
> URL: https://issues.apache.org/jira/browse/SPARK-39519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Yang Jie
>Priority: Major
> Attachments: image-2022-06-21-21-25-35-951.png, 
> image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, 
> image-2022-06-21-21-26-38-146.png
>
>
> {code}
> [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due 
> to overflow *** FAILED *** (3 seconds, 393 milliseconds)
> [info]   org.apache.spark.SparkException: Job aborted.
> [info]   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
> {code}
> https://github.com/apache/spark/runs/6919076419?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39494) Support `createDataFrame` from a list of scalars

2022-06-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39494:
-
Description: 
Currently, DataFrame creation from a list of scalars is unsupported as below:
|>>> spark.createDataFrame([1, 2])
Traceback (most recent call last):
...
    raise TypeError("Can not infer schema for type: %s" % type(row))
TypeError: Can not infer schema for type: <class 'int'>|

 

However, cases below are supported.
|>>> spark.createDataFrame([(1,), (2,)]).collect()
[Row(_1=1), Row(_1=2)]|

 
|>>> schema
StructType([StructField('_1', LongType(), True)])
>>> spark.createDataFrame([1, 2], schema=schema).collect()
[Row(_1=1), Row(_1=2)]|

 

In addition, Spark DataFrame Scala API supports creating a DataFrame from a  
list of scalars as below:
|scala> Seq(1, 2).toDF().collect()
res6: Array[org.apache.spark.sql.Row] = Array([1], [2]|

 

To maintain API consistency, we propose to support DataFrame creation from a 
list of scalars. See more at 

[https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing.|https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]

 

  was:
- Support `createDataFrame` from a list of scalars.

- Standardize error messages when the input list contains any scalars.


> Support `createDataFrame` from a list of scalars
> 
>
> Key: SPARK-39494
> URL: https://issues.apache.org/jira/browse/SPARK-39494
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Currently, DataFrame creation from a list of scalars is unsupported as below:
> |>>> spark.createDataFrame([1, 2])
> Traceback (most recent call last):
> ...
>     raise TypeError("Can not infer schema for type: %s" % type(row))
> TypeError: Can not infer schema for type: <class 'int'>|
>  
> However, cases below are supported.
> |>>> spark.createDataFrame([(1,), (2,)]).collect()
> [Row(_1=1), Row(_1=2)]|
>  
> |>>> schema
> StructType([StructField('_1', LongType(), True)])
> >>> spark.createDataFrame([1, 2], schema=schema).collect()
> [Row(_1=1), Row(_1=2)]|
>  
> In addition, Spark DataFrame Scala API supports creating a DataFrame from a  
> list of scalars as below:
> |scala> Seq(1, 2).toDF().collect()
> res6: Array[org.apache.spark.sql.Row] = Array([1], [2]|
>  
> To maintain API consistency, we propose to support DataFrame creation from a 
> list of scalars. See more at 
> [https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing.|https://docs.google.com/document/d/1Rd20PVbVxNrLfOmDtetVRxkgJQhgAAtJp6XAAZfGQgc/edit?usp=sharing]
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled

2022-06-21 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-39550:
-
Description: 
 

When Arrow Execution is enabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'true'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
{'__index_level_0__': 1, '__index_level_1__': 'a'}    1
{'__index_level_0__': 2, '__index_level_1__': 'b'}    1
dtype: int64
{code}
When Arrow Execution is disabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'false'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
(1, a)    1
(2, b)    1
dtype: int64 {code}
Notice how indexes of their results are different.

Specifically, `value_counts` returns an Index (rather than a MultiIndex), which 
under the hood is a single Spark column of StructType (rather than multiple Spark 
columns). When Arrow Execution is enabled, Arrow converts that StructType column 
to a dictionary, whereas we expect a tuple.
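
A consolidated repro sketch of the two sessions above (a hedged example, assuming
pyspark.pandas is importable and a plain local session):
{code:python}
import pyspark.pandas as ps
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
mi = ps.MultiIndex.from_arrays([[1, 2], ["a", "b"]])

for enabled in ("true", "false"):
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", enabled)
    # With Arrow enabled the index entries come back as dicts such as
    # {'__index_level_0__': 1, ...}; with Arrow disabled they are tuples like (1, 'a').
    print(f"arrow={enabled}")
    print(mi.value_counts())
{code}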

 

  was:
 

When Arrow Execution is enabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'true'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
{'__index_level_0__': 1, '__index_level_1__': 'a'}    1
{'__index_level_0__': 2, '__index_level_1__': 'b'}    1
dtype: int64
{code}
When Arrow Execution is disabled,

 

 
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'false'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
(1, a)    1
(2, b)    1
dtype: int64 {code}
Notice how indexes of their results are different.

 

Especially, `value_counts` returns an Index (rather than a MultiIndex), under 
the hood, a Spark column of StructType (rather than multiple Spark columns), so 
when Arrow Execution is enabled, Arrow converts the StructType column to a 
dictionary, where we expect a tuple instead.

 


> Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
> ---
>
> Key: SPARK-39550
> URL: https://issues.apache.org/jira/browse/SPARK-39550
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> When Arrow Execution is enabled,
> {code:java}
> >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
> 'true'
> >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
> {'__index_level_0__': 1, '__index_level_1__': 'a'}    1
> {'__index_level_0__': 2, '__index_level_1__': 'b'}    1
> dtype: int64
> {code}
> When Arrow Execution is disabled,
> {code:java}
> >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
> 'false'
> >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
> (1, a)    1
> (2, b)    1
> dtype: int64 {code}
> Notice how indexes of their results are different.
> Especially, `value_counts` returns an Index (rather than a MultiIndex), under 
> the hood, a Spark column of StructType (rather than multiple Spark columns), 
> so when Arrow Execution is enabled, Arrow converts the StructType column to a 
> dictionary, where we expect a tuple instead.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled

2022-06-21 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557145#comment-17557145
 ] 

Xinrong Meng commented on SPARK-39550:
--

I am working on that.

> Fix `MultiIndex.value_counts()` when Arrow Execution is enabled
> ---
>
> Key: SPARK-39550
> URL: https://issues.apache.org/jira/browse/SPARK-39550
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark, PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
>  
> When Arrow Execution is enabled,
> {code:java}
> >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
> 'true'
> >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
> {'__index_level_0__': 1, '__index_level_1__': 'a'}    1
> {'__index_level_0__': 2, '__index_level_1__': 'b'}    1
> dtype: int64
> {code}
> When Arrow Execution is disabled,
> {code:java}
> >>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
> 'false'
> >>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
> (1, a)    1
> (2, b)    1
> dtype: int64 {code}
> Notice how indexes of their results are different.
> Especially, `value_counts` returns an Index (rather than a MultiIndex), under 
> the hood, a Spark column of StructType (rather than multiple Spark columns), 
> so when Arrow Execution is enabled, Arrow converts the StructType column to a 
> dictionary, where we expect a tuple instead.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39550) Fix `MultiIndex.value_counts()` when Arrow Execution is enabled

2022-06-21 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-39550:


 Summary: Fix `MultiIndex.value_counts()` when Arrow Execution is 
enabled
 Key: SPARK-39550
 URL: https://issues.apache.org/jira/browse/SPARK-39550
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark, PySpark
Affects Versions: 3.4.0
Reporter: Xinrong Meng


 

When Arrow Execution is enabled,
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'true'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
{'__index_level_0__': 1, '__index_level_1__': 'a'}    1
{'__index_level_0__': 2, '__index_level_1__': 'b'}    1
dtype: int64
{code}
When Arrow Execution is disabled,

 

 
{code:java}
>>> spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")
'false'
>>> ps.MultiIndex.from_arrays([[1,2], ['a','b']]).value_counts()
(1, a)    1
(2, b)    1
dtype: int64 {code}
Notice how indexes of their results are different.

 

Specifically, `value_counts` returns an Index (rather than a MultiIndex), which 
under the hood is a single Spark column of StructType (rather than multiple Spark 
columns). When Arrow Execution is enabled, Arrow converts that StructType column 
to a dictionary, whereas we expect a tuple.

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39549) How to get access to the data created in different Spark Applications

2022-06-21 Thread Chenyang Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenyang Zhang updated SPARK-39549:
---
Priority: Major  (was: Critical)

> How to get access to the data created in different Spark Applications
> -
>
> Key: SPARK-39549
> URL: https://issues.apache.org/jira/browse/SPARK-39549
> Project: Spark
>  Issue Type: Question
>  Components: Pandas API on Spark, Project Infra, PySpark
>Affects Versions: 3.3.0
>Reporter: Chenyang Zhang
>Priority: Major
>
> I am working on a project using PySpark and I am blocked because I want to 
> share data between different Spark applications. The situation is that we 
> have a running Java server which handles incoming requests with a thread 
> pool, and each thread has a corresponding Python process. We want to use 
> pandas on Spark, but have it so that any of the Python processes can access 
> the same data in Spark. For example, in one Python process we created a 
> SparkSession, read some data, and modified the data using the pandas-on-Spark 
> API, and we want to access that data from a different Python process. The core 
> problem is how to share data between different SparkSessions, or how to let 
> different Python processes connect to the same SparkSession. I researched a bit, 
> but it seems impossible to share data between different Python processes 
> without using an external DB or connecting to the same SparkSession. Generally, 
> is this possible, and what would be the recommended way to do this with the 
> least impact on performance?



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39549) How to get access to the data created in different Spark Applications

2022-06-21 Thread Chenyang Zhang (Jira)
Chenyang Zhang created SPARK-39549:
--

 Summary: How to get access to the data created in different Spark 
Applications
 Key: SPARK-39549
 URL: https://issues.apache.org/jira/browse/SPARK-39549
 Project: Spark
  Issue Type: Question
  Components: Pandas API on Spark, Project Infra, PySpark
Affects Versions: 3.3.0
Reporter: Chenyang Zhang


I am working on a project using PySpark and I am blocked because I want to 
share data between different Spark applications. The situation is that we have 
a running Java server which handles incoming requests with a thread pool, and 
each thread has a corresponding Python process. We want to use pandas on Spark, 
but have it so that any of the Python processes can access the same data in 
Spark. For example, in one Python process we created a SparkSession, read some 
data, and modified the data using the pandas-on-Spark API, and we want to 
access that data from a different Python process. The core problem is how to 
share data between different SparkSessions, or how to let different Python 
processes connect to the same SparkSession. I researched a bit, but it seems 
impossible to share data between different Python processes without using an 
external DB or connecting to the same SparkSession. Generally, is this 
possible, and what would be the recommended way to do this with the least 
impact on performance?
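
One commonly suggested pattern, sketched here in a hedged way with illustrative
(assumed) paths rather than anything from the question itself, is to persist the
dataset to storage that every application can reach and read it back from the
other process:
{code:python}
import pyspark.pandas as ps

# Process A: write the pandas-on-Spark data to a shared location.
psdf = ps.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
psdf.to_parquet("/shared/warehouse/my_dataset")  # hypothetical shared path

# Process B (a different SparkSession in another Python process): read it back.
psdf2 = ps.read_parquet("/shared/warehouse/my_dataset")
print(psdf2.sort_values("id").head())
{code}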



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38796) Implement the to_number and try_to_number SQL functions according to a new specification

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557130#comment-17557130
 ] 

Apache Spark commented on SPARK-38796:
--

User 'dtenedor' has created a pull request for this issue:
https://github.com/apache/spark/pull/36950

> Implement the to_number and try_to_number SQL functions according to a new 
> specification
> 
>
> Key: SPARK-38796
> URL: https://issues.apache.org/jira/browse/SPARK-38796
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Daniel
>Priority: Major
> Fix For: 3.3.0
>
>
> This tracks implementing the 'to_number' and 'try_to_number' SQL function 
> expressions according to new semantics described below. The former is 
> equivalent to the latter except that it throws an exception instead of 
> returning NULL for cases where the input string does not match the format 
> string.
>  
> ---
>  
> *try_to_number function (expr, fmt):*
> Returns 'expr' cast to DECIMAL using formatting 'fmt', or 'NULL' if 'expr' is 
> not a valid match for the given format.
>  
> Syntax:
> { ' [ S ] [ L | $ ]
> [ 0 | 9 | G | , ] [...]
> [ . | D ]
> [ 0 | 9 ] [...]
> [ L | $ ] [ PR | MI | S ] ' }
>  
> *Arguments:*
> 'expr': A STRING expression representing a number. 'expr' may include leading 
> or trailing spaces.
> 'fmt': A STRING literal, specifying the expected format of 'expr'.
>  
> *Returns:*
> A DECIMAL(p, s) where 'p' is the total number of digits ('0' or '9') and 's' 
> is the number of digits after the decimal point, or 0 if there is none.
>  
> *Format elements allowed (case insensitive):*
>  * 0 or 9
>   Specifies an expected digit between '0' and '9'. 
>   A '0' to the left of the decimal point indicates that 'expr' must have at 
> least as many digits. A leading '9' indicates that 'expr' may omit these 
> digits.
>   'expr' must not contain more digits to the left of the decimal point than 
> the format string allows.
>   Digits to the right of the decimal point in the format string indicate the 
> most digits that 'expr' may have to the right of the decimal point.
>  * . or D
>   Specifies the position of the decimal point.
>   'expr' does not need to include a decimal point.
>  * , or G
>   Specifies the position of the ',' grouping (thousands) separator.
>   There must be a '0' or '9' to the left of the rightmost grouping separator. 
>   'expr' must match the grouping separator relevant for the size of the 
> number. 
>  * $
>   Specifies the location of the '$' currency sign. This character may only be 
> specified once.
>  * S 
>   Specifies the position of an optional '+' or '-' sign. This character may 
> only be specified once.
>  * MI
>   Specifies that 'expr' has an optional '-' sign at the end, but no '+'.
>  * PR
>   Specifies that 'expr' indicates a negative number with wrapping angled 
> brackets ('<1>'). If 'expr' contains any characters other than '0' through 
> '9' and those permitted in 'fmt' a 'NULL' is returned.
>  
> *Examples:*
> -- The format expects:
> --  * an optional sign at the beginning,
> --  * followed by a dollar sign,
> --  * followed by a number between 3 and 6 digits long,
> --  * thousands separators,
> --  * up to two digits beyond the decimal point.
> > SELECT try_to_number('-$12,345.67', 'S$999,099.99');
>  -12345.67
> -- The plus sign is optional, and so are fractional digits.
> > SELECT try_to_number('$345', 'S$999,099.99');
>  345.00
> -- The format requires at least three digits.
> > SELECT try_to_number('$45', 'S$999,099.99');
>  NULL
> -- The format requires at least three digits.
> > SELECT try_to_number('$045', 'S$999,099.99');
>  45.00
> -- Using brackets to denote negative values.
> > SELECT try_to_number('<1234>', '99PR');
>  -1234
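
A short PySpark sketch of the same calls, assuming a Spark build where the
functions are available (3.3 or later) and a local session named `spark`; the
expected values in the comments follow the examples above:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# try_to_number returns NULL when the input does not match the format...
spark.sql("SELECT try_to_number('$45', 'S$999,099.99') AS v").show()          # v = NULL

# ...while a matching input yields a DECIMAL with the declared precision/scale.
spark.sql("SELECT try_to_number('-$12,345.67', 'S$999,099.99') AS v").show()  # v = -12345.67
{code}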



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39496) Inline eval path cannot handle null structs

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557128#comment-17557128
 ] 

Apache Spark commented on SPARK-39496:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/36949

> Inline eval path cannot handle null structs
> ---
>
> Key: SPARK-39496
> URL: https://issues.apache.org/jira/browse/SPARK-39496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.2.2, 3.4.0, 3.3.1
>
>
> This issue is somewhat similar to SPARK-39061, but for the eval path rather 
> than the codegen path.
> Example:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> select inline(array(named_struct('a', 1, 'b', 2), null));
> {noformat}
> This results in a NullPointerException:
> {noformat}
> 22/06/16 15:10:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122)
> {noformat}
> The next example doesn't require setting {{spark.sql.codegen.wholeStage}} to 
> {{false}}:
> {noformat}
> val dfWide = (Seq((1))
>   .toDF("col0")
>   .selectExpr(Seq.tabulate(99)(x => s"$x as col${x + 1}"): _*))
> val df = (dfWide
>   .selectExpr("*", "array(named_struct('a', 1, 'b', 2), null) as 
> struct_array"))
> df.selectExpr("*", "inline(struct_array)").collect
> {noformat}
> The result is similar:
> {noformat}
> 22/06/16 15:18:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 
> 1]
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_8$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122)
> {noformat}
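
Until a fix lands, one hedged workaround sketch (not taken from the report) is to
drop the null entries before calling inline, using the filter higher-order
function; a minimal PySpark example:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()
spark.conf.set("spark.sql.codegen.wholeStage", "false")

# Filtering out null structs before inline avoids the NullPointerException above.
spark.sql("""
  SELECT inline(filter(array(named_struct('a', 1, 'b', 2), NULL), x -> x IS NOT NULL))
""").show()
{code}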



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39496) Inline eval path cannot handle null structs

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39496?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557125#comment-17557125
 ] 

Apache Spark commented on SPARK-39496:
--

User 'bersprockets' has created a pull request for this issue:
https://github.com/apache/spark/pull/36949

> Inline eval path cannot handle null structs
> ---
>
> Key: SPARK-39496
> URL: https://issues.apache.org/jira/browse/SPARK-39496
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.1, 3.3.0, 3.4.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
> Fix For: 3.2.2, 3.4.0, 3.3.1
>
>
> This issue is somewhat similar to SPARK-39061, but for the eval path rather 
> than the codegen path.
> Example:
> {noformat}
> set spark.sql.codegen.wholeStage=false;
> select inline(array(named_struct('a', 1, 'b', 2), null));
> {noformat}
> This results in a NullPointerException:
> {noformat}
> 22/06/16 15:10:06 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122)
> {noformat}
> The next example doesn't require setting {{spark.sql.codegen.wholeStage}} to 
> {{false}}:
> {noformat}
> val dfWide = (Seq((1))
>   .toDF("col0")
>   .selectExpr(Seq.tabulate(99)(x => s"$x as col${x + 1}"): _*))
> val df = (dfWide
>   .selectExpr("*", "array(named_struct('a', 1, 'b', 2), null) as 
> struct_array"))
> df.selectExpr("*", "inline(struct_array)").collect
> {noformat}
> The result is similar:
> {noformat}
> 22/06/16 15:18:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 
> 1]
> java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:80)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_8$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$11(GenerateExec.scala:122)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39548) CreateView Command with a window clause query hit a wrong window definition not found issue

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557119#comment-17557119
 ] 

Apache Spark commented on SPARK-39548:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/36947

> CreateView Command with a window clause query hit a wrong window definition 
> not found issue
> ---
>
> Key: SPARK-39548
> URL: https://issues.apache.org/jira/browse/SPARK-39548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Priority: Major
>
> This query hits a "window definition w2 not found" error in the `WindowSubstitute` 
> rule; however, this is a bug since the w2 definition is defined in the query.
> ```
> create or replace temporary view test_temp_view as
> with step_1 as (
> select * , min(a) over w2 as min_a_over_w2 from (select 1 as a, 2 as b, 3 as 
> c) window w2 as (partition by b order by c)) , step_2 as
> (
> select *, max(e) over w1 as max_a_over_w1
> from (select 1 as e, 2 as f, 3 as g)
> join step_1 on true
> window w1 as (partition by f order by g)
> )
> select *
> from step_2
> ```
> Also, we can move the unresolved window expression check from the 
> `WindowSubstitute` rule to the `CheckAnalysis` phase.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39548) CreateView Command with a window clause query hit a wrong window definition not found issue

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39548:


Assignee: Apache Spark

> CreateView Command with a window clause query hit a wrong window definition 
> not found issue
> ---
>
> Key: SPARK-39548
> URL: https://issues.apache.org/jira/browse/SPARK-39548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>
> This query hits a "window definition w2 not found" error in the `WindowSubstitute` 
> rule; however, this is a bug since the w2 definition is defined in the query.
> ```
> create or replace temporary view test_temp_view as
> with step_1 as (
> select * , min(a) over w2 as min_a_over_w2 from (select 1 as a, 2 as b, 3 as 
> c) window w2 as (partition by b order by c)) , step_2 as
> (
> select *, max(e) over w1 as max_a_over_w1
> from (select 1 as e, 2 as f, 3 as g)
> join step_1 on true
> window w1 as (partition by f order by g)
> )
> select *
> from step_2
> ```
> Also, we can move the unresolved window expression check from the 
> `WindowSubstitute` rule to the `CheckAnalysis` phase.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39548) CreateView Command with a window clause query hit a wrong window definition not found issue

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39548:


Assignee: (was: Apache Spark)

> CreateView Command with a window clause query hit a wrong window definition 
> not found issue
> ---
>
> Key: SPARK-39548
> URL: https://issues.apache.org/jira/browse/SPARK-39548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Priority: Major
>
> This query hits a "window definition w2 not found" error in the `WindowSubstitute` 
> rule; however, this is a bug since the w2 definition is defined in the query.
> ```
> create or replace temporary view test_temp_view as
> with step_1 as (
> select * , min(a) over w2 as min_a_over_w2 from (select 1 as a, 2 as b, 3 as 
> c) window w2 as (partition by b order by c)) , step_2 as
> (
> select *, max(e) over w1 as max_a_over_w1
> from (select 1 as e, 2 as f, 3 as g)
> join step_1 on true
> window w1 as (partition by f order by g)
> )
> select *
> from step_2
> ```
> Also, we can move the unresolved window expression check from the 
> `WindowSubstitute` rule to the `CheckAnalysis` phase.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39548) CreateView Command with a window clause query hit a wrong window definition not found issue

2022-06-21 Thread Rui Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rui Wang updated SPARK-39548:
-
Summary: CreateView Command with a window clause query hit a wrong window 
definition not found issue  (was: CreateView Command with a window clause query 
hit a wrong window definition not found issue.)

> CreateView Command with a window clause query hit a wrong window definition 
> not found issue
> ---
>
> Key: SPARK-39548
> URL: https://issues.apache.org/jira/browse/SPARK-39548
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Rui Wang
>Priority: Major
>
> This query hits a "window definition w2 not found" error in the `WindowSubstitute` 
> rule; however, this is a bug since the w2 definition is defined in the query.
> ```
> create or replace temporary view test_temp_view as
> with step_1 as (
> select * , min(a) over w2 as min_a_over_w2 from (select 1 as a, 2 as b, 3 as 
> c) window w2 as (partition by b order by c)) , step_2 as
> (
> select *, max(e) over w1 as max_a_over_w1
> from (select 1 as e, 2 as f, 3 as g)
> join step_1 on true
> window w1 as (partition by f order by g)
> )
> select *
> from step_2
> ```
> Also, we can move the unresolved window expression check from the 
> `WindowSubstitute` rule to the `CheckAnalysis` phase.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39548) CreateView Command with a window clause query hit a wrong window definition not found issue.

2022-06-21 Thread Rui Wang (Jira)
Rui Wang created SPARK-39548:


 Summary: CreateView Command with a window clause query hit a wrong 
window definition not found issue.
 Key: SPARK-39548
 URL: https://issues.apache.org/jira/browse/SPARK-39548
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Rui Wang


This query hits a "window definition w2 not found" error in the `WindowSubstitute` 
rule; however, this is a bug since the w2 definition is defined in the query.

```
create or replace temporary view test_temp_view as
with step_1 as (
select * , min(a) over w2 as min_a_over_w2 from (select 1 as a, 2 as b, 3 as c) 
window w2 as (partition by b order by c)) , step_2 as
(
select *, max(e) over w1 as max_a_over_w1
from (select 1 as e, 2 as f, 3 as g)
join step_1 on true
window w1 as (partition by f order by g)
)
select *
from step_2
```


Also, we can move the unresolved window expression check from the 
`WindowSubstitute` rule to the `CheckAnalysis` phase.
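
A hedged repro wrapper (assuming a plain local session) that submits the
statement above through spark.sql; on affected versions the reported error
surfaces when the view is created:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# CREATE VIEW is analyzed eagerly, which is where the window definition lookup happens.
spark.sql("""
create or replace temporary view test_temp_view as
with step_1 as (
  select *, min(a) over w2 as min_a_over_w2
  from (select 1 as a, 2 as b, 3 as c) window w2 as (partition by b order by c)),
step_2 as (
  select *, max(e) over w1 as max_a_over_w1
  from (select 1 as e, 2 as f, 3 as g)
  join step_1 on true
  window w1 as (partition by f order by g))
select * from step_2
""")
{code}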




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39500) Ivy doesn't work correctly on IPv6-only environment

2022-06-21 Thread Erik Krogen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557101#comment-17557101
 ] 

Erik Krogen commented on SPARK-39500:
-

Thanks for clarifying!

> Ivy doesn't work correctly on IPv6-only environment
> ---
>
> Key: SPARK-39500
> URL: https://issues.apache.org/jira/browse/SPARK-39500
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Ivy doesn't work correctly on IPv6.
> {code}
>   SparkSubmitUtils.resolveMavenCoordinates(
> "org.apache.logging.log4j:log4j-api:2.17.2",
> SparkSubmitUtils.buildIvySettings(None, Some("/tmp/ivy")),
> transitive = true)
> {code}
> {code}
> % bin/spark-shell
> 22/06/16 22:22:12 WARN Utils: Your hostname, m1ipv6.local resolves to a 
> loopback address: 127.0.0.1; using 2600:1700:232e:3de0:0:0:0:b instead (on 
> interface en0)
> 22/06/16 22:22:12 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> = https://ipv6.repo1.maven.org/maven2/
> =https://maven-central.storage-download.googleapis.com/maven2/
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 22/06/16 22:22:14 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Spark context Web UI available at http://unknown1498776019fa.attlocal.net:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1655443334687).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 3.4.0-SNAPSHOT
>   /_/
> Using Scala version 2.12.16 (OpenJDK 64-Bit Server VM, Java 17.0.3)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :paste -raw
> // Entering paste mode (ctrl-D to finish)
> package org.apache.spark.deploy
> object Download {
>   SparkSubmitUtils.resolveMavenCoordinates(
> "org.apache.logging.log4j:log4j-api:2.17.2",
> SparkSubmitUtils.buildIvySettings(None, Some("/tmp/ivy")),
> transitive = true)
> }
> // Exiting paste mode, now interpreting.
> scala> org.apache.spark.deploy.Download
> = https://ipv6.repo1.maven.org/maven2/
> =https://maven-central.storage-download.googleapis.com/maven2/
> :: loading settings :: url = 
> jar:file:/Users/dongjoon/APACHE/spark/assembly/target/scala-2.12/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> Ivy Default Cache set to: /tmp/ivy/cache
> The jars for the packages stored in: /tmp/ivy/jars
> org.apache.logging.log4j#log4j-api added as a dependency
> :: resolving dependencies :: 
> org.apache.spark#spark-submit-parent-f47b503f-897e-4b92-95da-3806c32c220f;1.0
> confs: [default]
> :: resolution report :: resolve 95ms :: artifacts dl 0ms
> :: modules in use:
> -
> |  |modules||   artifacts   |
> |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
> -
> |  default |   1   |   0   |   0   |   0   ||   0   |   0   |
> -
> :: problems summary ::
>  WARNINGS
> module not found: org.apache.logging.log4j#log4j-api;2.17.2
>  local-m2-cache: tried
>   
> file:/Users/dongjoon/.m2/repository/org/apache/logging/log4j/log4j-api/2.17.2/log4j-api-2.17.2.pom
>   -- artifact org.apache.logging.log4j#log4j-api;2.17.2!log4j-api.jar:
>   
> file:/Users/dongjoon/.m2/repository/org/apache/logging/log4j/log4j-api/2.17.2/log4j-api-2.17.2.jar
>  local-ivy-cache: tried
>   
> /tmp/ivy/local/org.apache.logging.log4j/log4j-api/2.17.2/ivys/ivy.xml
>   -- artifact org.apache.logging.log4j#log4j-api;2.17.2!log4j-api.jar:
>   
> /tmp/ivy/local/org.apache.logging.log4j/log4j-api/2.17.2/jars/log4j-api.jar
>  ipv6: tried
>   
> https://ipv6.repo1.maven.org/maven2/org/apache/logging/log4j/log4j-api/2.17.2/log4j-api-2.17.2.pom
>   -- artifact org.apache.logging.log4j#log4j-api;2.17.2!log4j-api.jar:
>   
> https://ipv6.repo1.maven.org/maven2/org/apache/logging/log4j/log4j-api/2.17.2/log4j-api-2.17.2.jar
>  central: tried
>   
> https://maven-central.storage-download.googleapis.com/maven2/org/apache/logging/log4j/log4j-api/2.17.2/log4j-api-2.17.2.pom
>   -- artifact 

[jira] [Commented] (SPARK-39547) V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557062#comment-17557062
 ] 

Apache Spark commented on SPARK-39547:
--

User 'singhpk234' has created a pull request for this issue:
https://github.com/apache/spark/pull/36948

> V2SessionCatalog should not throw NoSuchDatabaseException in 
> loadNamespaceMetadata
> --
>
> Key: SPARK-39547
> URL: https://issues.apache.org/jira/browse/SPARK-39547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Prashant Singh
>Priority: Minor
>
> DROP NAMESPACE IF EXISTS
> {table}
>  
> If a catalog doesn't override `namespaceExists`, it by default uses 
> `loadNamespaceMetadata`, and when the `db` does not exist, loadNamespaceMetadata 
> throws a `NoSuchDatabaseException` that is not caught, so we see failures 
> even with the `if exists` clause. One such use case we observed: in an Iceberg 
> table, a post-test cleanup was failing with `NoSuchDatabaseException`.
>  
> We found that V2SessionCatalog's `loadNamespaceMetadata` was also throwing this, 
> unlike `JDBCTableCatalog`.
> Reference stack trace:
> {quote}Database 'db' not found
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db' 
> not found
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:219)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:284)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadNamespaceMetadata(V2SessionCatalog.scala:247)
> at 
> org.apache.iceberg.spark.SparkSessionCatalog.loadNamespaceMetadata(SparkSessionCatalog.java:97)
> at 
> org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:98)
> at 
> org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:40)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39547) V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39547:


Assignee: Apache Spark

> V2SessionCatalog should not throw NoSuchDatabaseException in 
> loadNamespaceMetadata
> --
>
> Key: SPARK-39547
> URL: https://issues.apache.org/jira/browse/SPARK-39547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Prashant Singh
>Assignee: Apache Spark
>Priority: Minor
>
> DROP NAMESPACE IF EXISTS
> {table}
>  
> If a catalog doesn't override `namespaceExists`, it by default uses 
> `loadNamespaceMetadata`, and when the `db` does not exist, loadNamespaceMetadata 
> throws a `NoSuchDatabaseException` that is not caught, so we see failures 
> even with the `if exists` clause. One such use case we observed: in an Iceberg 
> table, a post-test cleanup was failing with `NoSuchDatabaseException`.
>  
> We found that V2SessionCatalog's `loadNamespaceMetadata` was also throwing this, 
> unlike `JDBCTableCatalog`.
> Reference stack trace:
> {quote}Database 'db' not found
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db' 
> not found
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:219)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:284)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadNamespaceMetadata(V2SessionCatalog.scala:247)
> at 
> org.apache.iceberg.spark.SparkSessionCatalog.loadNamespaceMetadata(SparkSessionCatalog.java:97)
> at 
> org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:98)
> at 
> org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:40)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39547) V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39547:


Assignee: (was: Apache Spark)

> V2SessionCatalog should not throw NoSuchDatabaseException in 
> loadNamespaceMetadata
> --
>
> Key: SPARK-39547
> URL: https://issues.apache.org/jira/browse/SPARK-39547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Prashant Singh
>Priority: Minor
>
> DROP NAMESPACE IF EXISTS
> {table}
>  
> If a catalog doesn't override `namespaceExists`, it by default uses 
> `loadNamespaceMetadata`, and when the `db` does not exist, loadNamespaceMetadata 
> throws a `NoSuchDatabaseException` that is not caught, so we see failures 
> even with the `if exists` clause. One such use case we observed: in an Iceberg 
> table, a post-test cleanup was failing with `NoSuchDatabaseException`.
>  
> We found that V2SessionCatalog's `loadNamespaceMetadata` was also throwing this, 
> unlike `JDBCTableCatalog`.
> Reference stack trace:
> {quote}Database 'db' not found
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db' 
> not found
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:219)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:284)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadNamespaceMetadata(V2SessionCatalog.scala:247)
> at 
> org.apache.iceberg.spark.SparkSessionCatalog.loadNamespaceMetadata(SparkSessionCatalog.java:97)
> at 
> org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:98)
> at 
> org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:40)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39547) V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557063#comment-17557063
 ] 

Apache Spark commented on SPARK-39547:
--

User 'singhpk234' has created a pull request for this issue:
https://github.com/apache/spark/pull/36948

> V2SessionCatalog should not throw NoSuchDatabaseException in 
> loadNamespaceMetadata
> --
>
> Key: SPARK-39547
> URL: https://issues.apache.org/jira/browse/SPARK-39547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Prashant Singh
>Priority: Minor
>
> DROP NAMESPACE IF EXISTS
> {table}
>  
> If a catalog doesn't override `namespaceExists`, it by default uses 
> `loadNamespaceMetadata`, and when the `db` does not exist, loadNamespaceMetadata 
> throws a `NoSuchDatabaseException` that is not caught, so we see failures 
> even with the `if exists` clause. One such use case we observed: in an Iceberg 
> table, a post-test cleanup was failing with `NoSuchDatabaseException`.
>  
> We found that V2SessionCatalog's `loadNamespaceMetadata` was also throwing this, 
> unlike `JDBCTableCatalog`.
> Reference stack trace:
> {quote}Database 'db' not found
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db' 
> not found
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:219)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:284)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadNamespaceMetadata(V2SessionCatalog.scala:247)
> at 
> org.apache.iceberg.spark.SparkSessionCatalog.loadNamespaceMetadata(SparkSessionCatalog.java:97)
> at 
> org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:98)
> at 
> org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:40)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39547) V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata

2022-06-21 Thread Prashant Singh (Jira)
Prashant Singh created SPARK-39547:
--

 Summary: V2SessionCatalog should not throw NoSuchDatabaseException 
in loadNamespaceMetadata
 Key: SPARK-39547
 URL: https://issues.apache.org/jira/browse/SPARK-39547
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Prashant Singh


DROP NAMESPACE IF EXISTS

{table}

 

If a catalog doesn't override `namespaceExists`, it by default uses 
`loadNamespaceMetadata`, and when the `db` does not exist, loadNamespaceMetadata 
throws a `NoSuchDatabaseException` that is not caught, so we see failures even 
with the `if exists` clause. One such use case we observed: in an Iceberg table, 
a post-test cleanup was failing with `NoSuchDatabaseException`.

 

We found that V2SessionCatalog's `loadNamespaceMetadata` was also throwing this, 
unlike `JDBCTableCatalog`.

Reference stack trace:
{quote}Database 'db' not found
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db' 
not found
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:219)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:284)
at 
org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadNamespaceMetadata(V2SessionCatalog.scala:247)
at 
org.apache.iceberg.spark.SparkSessionCatalog.loadNamespaceMetadata(SparkSessionCatalog.java:97)
at 
org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:98)
at 
org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:40)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
{quote}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39547) V2SessionCatalog should not throw NoSuchDatabaseException in loadNamespaceMetadata

2022-06-21 Thread Prashant Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557056#comment-17557056
 ] 

Prashant Singh commented on SPARK-39547:


will post a pr for it shortly.

> V2SessionCatalog should not throw NoSuchDatabaseException in 
> loadNamespaceMetadata
> --
>
> Key: SPARK-39547
> URL: https://issues.apache.org/jira/browse/SPARK-39547
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Prashant Singh
>Priority: Minor
>
> DROP NAMESPACE IF EXISTS
> {table}
>  
> If a catalog doesn't override `namespaceExists`, it by default uses 
> `loadNamespaceMetadata`, and when the `db` does not exist, loadNamespaceMetadata 
> throws a `NoSuchDatabaseException` that is not caught, so we see failures 
> even with the `if exists` clause. One such use case we observed: in an Iceberg 
> table, a post-test cleanup was failing with `NoSuchDatabaseException`.
>  
> We found that V2SessionCatalog's `loadNamespaceMetadata` was also throwing this, 
> unlike `JDBCTableCatalog`.
> Reference stack trace:
> {quote}Database 'db' not found
> org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'db' 
> not found
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireDbExists(SessionCatalog.scala:219)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getDatabaseMetadata(SessionCatalog.scala:284)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2SessionCatalog.loadNamespaceMetadata(V2SessionCatalog.scala:247)
> at 
> org.apache.iceberg.spark.SparkSessionCatalog.loadNamespaceMetadata(SparkSessionCatalog.java:97)
> at 
> org.apache.spark.sql.connector.catalog.SupportsNamespaces.namespaceExists(SupportsNamespaces.java:98)
> at 
> org.apache.spark.sql.execution.datasources.v2.DropNamespaceExec.run(DropNamespaceExec.scala:40)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
> at 
> org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
> at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38647) Add SupportsReportOrdering mix in interface for Scan

2022-06-21 Thread Chao Sun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun resolved SPARK-38647.
--
Fix Version/s: 3.4.0
 Assignee: Enrico Minack
   Resolution: Fixed

> Add SupportsReportOrdering mix in interface for Scan
> 
>
> Key: SPARK-38647
> URL: https://issues.apache.org/jira/browse/SPARK-38647
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Enrico Minack
>Assignee: Enrico Minack
>Priority: Major
> Fix For: 3.4.0
>
>
> As {{SupportsReportPartitioning}} allows implementations of {{Scan}} to provide 
> Spark with information about the existing partitioning of data read by a 
> {{DataSourceV2}}, a similar mix-in interface {{SupportsReportOrdering}} 
> should provide ordering information.
> This prevents Spark from sorting data that already exhibits a certain order 
> provided by the source.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-39542) Improve YARN client mode to support IPv6

2022-06-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-39542.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 36939
[https://github.com/apache/spark/pull/36939]

> Improve YARN client mode to support IPv6
> 
>
> Key: SPARK-39542
> URL: https://issues.apache.org/jira/browse/SPARK-39542
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, YARN
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39542) Improve YARN client mode to support IPv6

2022-06-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-39542:
-

Assignee: Dongjoon Hyun

> Improve YARN client mode to support IPv6
> 
>
> Key: SPARK-39542
> URL: https://issues.apache.org/jira/browse/SPARK-39542
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, YARN
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39546) Respect port definitions on K8S pod templates for both driver and executor

2022-06-21 Thread Oliver Koeth (Jira)
Oliver Koeth created SPARK-39546:


 Summary: Respect port definitions on K8S pod templates for both 
driver and executor
 Key: SPARK-39546
 URL: https://issues.apache.org/jira/browse/SPARK-39546
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes
Affects Versions: 3.3.0
Reporter: Oliver Koeth


*Description:*

Spark on K8S allows opening additional ports for custom purposes on the driver 
pod via the pod template, but ignores the port specification in the executor 
pod template. Port specifications from the pod template should be preserved 
(and extended) for both drivers and executors.

*Scenario:*

I want to run functionality in the executor that exposes data on an additional 
port. In my case, this is monitoring data exposed by Spark's JMX metrics sink 
via the JMX prometheus exporter java agent 
https://github.com/prometheus/jmx_exporter -- the java agent opens an extra 
port inside the container, but for prometheus to detect and scrape the port, it 
must be exposed in the K8S pod resource.
(More background if desired: This seems to be the "classic" Spark 2 way to 
expose prometheus metrics. Spark 3 introduced a native equivalent servlet for 
the driver, but for the executor, only a rather limited set of metrics is 
forwarded via the driver, and that also follows a completely different naming 
scheme. So the JMX + exporter approach still turns out to be more useful for 
me, even in Spark 3)

*Expected behavior:*

I add the following to my pod template to expose the extra port opened by the 
JMX exporter java agent

spec:
  containers:
  - ...
    ports:
    - containerPort: 8090
      name: jmx-prometheus
      protocol: TCP

*Observed behavior:*

The port is exposed for driver pods but not for executor pods


*Corresponding code:*

driver pod creation just adds ports
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicDriverFeatureStep.scala]
 (currently line 115)

val driverContainer = new ContainerBuilder(pod.container)
...
  .addNewPort()
...
  .addNewPort()

while executor pod creation replaces the ports
[https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/BasicExecutorFeatureStep.scala]
 (currently line 211)

val executorContainer = new ContainerBuilder(pod.container)
...
  .withPorts(requiredPorts.asJava)


The current handling is inconsistent and unnecessarily limiting. It seems that 
the executor creation could/should just as well preserve ports from the template 
and add the extra required ports.


*Workaround:*

It is possible to work around this limitation by adding a full sidecar 
container to the executor pod spec which declares the port. Sidecar containers 
are left unchanged by pod template handling.
As all containers in a pod share the same network, it does not matter which 
container actually declares to expose the port.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39340) DS v2 agg pushdown should allow dots in the name of top-level columns

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556982#comment-17556982
 ] 

Apache Spark commented on SPARK-39340:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/36945

> DS v2 agg pushdown should allow dots in the name of top-level columns
> -
>
> Key: SPARK-39340
> URL: https://issues.apache.org/jira/browse/SPARK-39340
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-37939) Use error classes in the parsing errors of properties

2022-06-21 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-37939:
-
Fix Version/s: 3.3.1

> Use error classes in the parsing errors of properties
> -
>
> Key: SPARK-37939
> URL: https://issues.apache.org/jira/browse/SPARK-37939
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: panbingkun
>Priority: Major
> Fix For: 3.4.0, 3.3.1
>
>
> Migrate the following errors in QueryParsingErrors:
> * cannotCleanReservedNamespacePropertyError
> * cannotCleanReservedTablePropertyError
> * invalidPropertyKeyForSetQuotedConfigurationError
> * invalidPropertyValueForSetQuotedConfigurationError
> * propertiesAndDbPropertiesBothSpecifiedError
> to use error classes. Throw an implementation of SparkThrowable. Also write 
> a test per every error in QueryParsingErrorsSuite.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39195) Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556916#comment-17556916
 ] 

Apache Spark commented on SPARK-39195:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/36943

> Spark OutputCommitCoordinator should abort stage when committed file not 
> consistent with task status
> 
>
> Key: SPARK-39195
> URL: https://issues.apache.org/jira/browse/SPARK-39195
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39195) Spark OutputCommitCoordinator should abort stage when committed file not consistent with task status

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556915#comment-17556915
 ] 

Apache Spark commented on SPARK-39195:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/36943

> Spark OutputCommitCoordinator should abort stage when committed file not 
> consistent with task status
> 
>
> Key: SPARK-39195
> URL: https://issues.apache.org/jira/browse/SPARK-39195
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39519) Test failure in SPARK-39387 with JDK 11

2022-06-21 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556911#comment-17556911
 ] 

Yang Jie commented on SPARK-39519:
--

I will continue to investigate this issue

 

> Test failure in SPARK-39387 with JDK 11
> ---
>
> Key: SPARK-39519
> URL: https://issues.apache.org/jira/browse/SPARK-39519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Yang Jie
>Priority: Major
> Attachments: image-2022-06-21-21-25-35-951.png, 
> image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, 
> image-2022-06-21-21-26-38-146.png
>
>
> {code}
> [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due 
> to overflow *** FAILED *** (3 seconds, 393 milliseconds)
> [info]   org.apache.spark.SparkException: Job aborted.
> [info]   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
> {code}
> https://github.com/apache/spark/runs/6919076419?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39519) Test failure in SPARK-39387 with JDK 11

2022-06-21 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556910#comment-17556910
 ] 

Yang Jie commented on SPARK-39519:
--

[~hyukjin.kwon] Sorry, I think we should reopen this issue. From the memory 
dump below, I found that `byte[]` occupies the most memory and its content is 
'X'. Based on this, the most suspicious test is still `SPARK-39387: 
BytesColumnVector should not throw RuntimeException due to overflow`.

 

 

!image-2022-06-21-21-26-06-586.png!

!image-2022-06-21-21-26-26-563.png!

!image-2022-06-21-21-26-38-146.png!

 

 

> Test failure in SPARK-39387 with JDK 11
> ---
>
> Key: SPARK-39519
> URL: https://issues.apache.org/jira/browse/SPARK-39519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Yang Jie
>Priority: Major
> Attachments: image-2022-06-21-21-25-35-951.png, 
> image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, 
> image-2022-06-21-21-26-38-146.png
>
>
> {code}
> [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due 
> to overflow *** FAILED *** (3 seconds, 393 milliseconds)
> [info]   org.apache.spark.SparkException: Job aborted.
> [info]   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
> {code}
> https://github.com/apache/spark/runs/6919076419?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39519) Test failure in SPARK-39387 with JDK 11

2022-06-21 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-39519:
-
Attachment: image-2022-06-21-21-26-26-563.png

> Test failure in SPARK-39387 with JDK 11
> ---
>
> Key: SPARK-39519
> URL: https://issues.apache.org/jira/browse/SPARK-39519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Yang Jie
>Priority: Major
> Attachments: image-2022-06-21-21-25-35-951.png, 
> image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, 
> image-2022-06-21-21-26-38-146.png
>
>
> {code}
> [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due 
> to overflow *** FAILED *** (3 seconds, 393 milliseconds)
> [info]   org.apache.spark.SparkException: Job aborted.
> [info]   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
> {code}
> https://github.com/apache/spark/runs/6919076419?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39519) Test failure in SPARK-39387 with JDK 11

2022-06-21 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-39519:
-
Attachment: image-2022-06-21-21-26-38-146.png

> Test failure in SPARK-39387 with JDK 11
> ---
>
> Key: SPARK-39519
> URL: https://issues.apache.org/jira/browse/SPARK-39519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Yang Jie
>Priority: Major
> Attachments: image-2022-06-21-21-25-35-951.png, 
> image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, 
> image-2022-06-21-21-26-38-146.png
>
>
> {code}
> [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due 
> to overflow *** FAILED *** (3 seconds, 393 milliseconds)
> [info]   org.apache.spark.SparkException: Job aborted.
> [info]   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
> {code}
> https://github.com/apache/spark/runs/6919076419?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39519) Test failure in SPARK-39387 with JDK 11

2022-06-21 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-39519:
-
Attachment: image-2022-06-21-21-26-06-586.png

> Test failure in SPARK-39387 with JDK 11
> ---
>
> Key: SPARK-39519
> URL: https://issues.apache.org/jira/browse/SPARK-39519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Yang Jie
>Priority: Major
> Attachments: image-2022-06-21-21-25-35-951.png, 
> image-2022-06-21-21-26-06-586.png, image-2022-06-21-21-26-26-563.png, 
> image-2022-06-21-21-26-38-146.png
>
>
> {code}
> [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due 
> to overflow *** FAILED *** (3 seconds, 393 milliseconds)
> [info]   org.apache.spark.SparkException: Job aborted.
> [info]   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
> {code}
> https://github.com/apache/spark/runs/6919076419?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39519) Test failure in SPARK-39387 with JDK 11

2022-06-21 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-39519:
-
Attachment: image-2022-06-21-21-25-35-951.png

> Test failure in SPARK-39387 with JDK 11
> ---
>
> Key: SPARK-39519
> URL: https://issues.apache.org/jira/browse/SPARK-39519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Yang Jie
>Priority: Major
> Attachments: image-2022-06-21-21-25-35-951.png
>
>
> {code}
> [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due 
> to overflow *** FAILED *** (3 seconds, 393 milliseconds)
> [info]   org.apache.spark.SparkException: Job aborted.
> [info]   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
> {code}
> https://github.com/apache/spark/runs/6919076419?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39519) Test failure in SPARK-39387 with JDK 11

2022-06-21 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556896#comment-17556896
 ] 

Yang Jie commented on SPARK-39519:
--

I got an OOM dump and will analyze it later.

 

> Test failure in SPARK-39387 with JDK 11
> ---
>
> Key: SPARK-39519
> URL: https://issues.apache.org/jira/browse/SPARK-39519
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Yang Jie
>Priority: Major
>
> {code}
> [info] - SPARK-39387: BytesColumnVector should not throw RuntimeException due 
> to overflow *** FAILED *** (3 seconds, 393 milliseconds)
> [info]   org.apache.spark.SparkException: Job aborted.
> [info]   at 
> org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:593)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:279)
> [info]   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:171)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:95)
> [info]   at 
> org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
> [info]   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
> [info]   at 
> org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:94)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
> [info]   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
> {code}
> https://github.com/apache/spark/runs/6919076419?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556894#comment-17556894
 ] 

Apache Spark commented on SPARK-39545:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36942

> Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the 
> performance
> -
>
> Key: SPARK-39545
> URL: https://issues.apache.org/jira/browse/SPARK-39545
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> ExpressionSet ++ with 
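
The description above is truncated, so as a rough, purely conceptual sketch 
(plain Python, not Spark's actual ExpressionSet API; all names below are 
invented): a set type whose `add` does per-element canonicalization work makes 
the default element-by-element `++`/`concat` expensive, and overriding the 
bulk operation to reuse the already-processed internal state avoids repeating 
that work.

{code:python}
# Conceptual sketch only; class and method names are invented for illustration.
class DedupSet:
    def __init__(self):
        self._canonical = set()   # canonical keys already computed
        self._originals = []      # insertion-ordered original values

    @staticmethod
    def _canonicalize(value):
        # Stand-in for per-element normalization work (expensive in general).
        return value.strip().lower()

    def add(self, value):
        key = self._canonicalize(value)
        if key not in self._canonical:
            self._canonical.add(key)
            self._originals.append(value)

    def concat_naive(self, other):
        # Default-style concat: re-canonicalizes every element of both sets.
        result = DedupSet()
        for value in self._originals + other._originals:
            result.add(value)
        return result

    def concat_fast(self, other):
        # Overridden concat: reuse this set's already-canonicalized state and
        # only process the elements coming from `other`.
        result = DedupSet()
        result._canonical = set(self._canonical)
        result._originals = list(self._originals)
        for value in other._originals:
            result.add(value)
        return result
{code}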



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39545:


Assignee: Apache Spark

> Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the 
> performance
> -
>
> Key: SPARK-39545
> URL: https://issues.apache.org/jira/browse/SPARK-39545
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> ExpressionSet ++ with 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556893#comment-17556893
 ] 

Apache Spark commented on SPARK-39545:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/36942

> Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the 
> performance
> -
>
> Key: SPARK-39545
> URL: https://issues.apache.org/jira/browse/SPARK-39545
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> ExpressionSet ++ with 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39545:


Assignee: (was: Apache Spark)

> Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the 
> performance
> -
>
> Key: SPARK-39545
> URL: https://issues.apache.org/jira/browse/SPARK-39545
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> ExpressionSet ++ with 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance

2022-06-21 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie updated SPARK-39545:
-
Description: ExpressionSet ++ with 

> Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the 
> performance
> -
>
> Key: SPARK-39545
> URL: https://issues.apache.org/jira/browse/SPARK-39545
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> ExpressionSet ++ with 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Issue Type: Bug  (was: Improvement)

> setPredictionCol for OneVsRest does not persist when saving model to disk
> -
>
> Key: SPARK-39544
> URL: https://issues.apache.org/jira/browse/SPARK-39544
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1, 3.3.0
> Environment: Python 3.6
> Spark 3.2
>Reporter: koba
>Priority: Major
>
> The naming of rawPredictionCol in OneVsRest does not persist after saving and 
> loading a trained model. This becomes an issue when I try to stack multiple 
> One Vs Rest models in a pipeline. Code example below. 
> {code:java}
> from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel
> data_path = "/sample_multiclass_classification_data.txt"
> df = spark.read.format("libsvm").load(data_path)
> lr = LinearSVC(regParam=0.01)
> # set the name of rawPrediction column
> ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
> print(ovr.getRawPredictionCol())
> model = ovr.fit(df)
> model_path = 'temp' + "/ovr_model"
> # save and read back in
> model.write().overwrite().save(model_path)
> model2 = OneVsRestModel.load(model_path)
> model2.getRawPredictionCol()
> Output:
> raw_prediction
> 'rawPrediction' {code}
>  
>  
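
A possible interim workaround (my suggestion, not from the report, and 
assuming the loaded OneVsRestModel still exposes the rawPredictionCol setter) 
is to re-apply the column name after loading until the persistence issue is 
fixed:

{code:python}
from pyspark.ml.classification import OneVsRestModel

# Hypothetical path; reuse whatever path the model was saved to.
model2 = OneVsRestModel.load("temp/ovr_model")

# Re-apply the configured column name, since the saved/loaded model falls back
# to the default 'rawPrediction'.
model2 = model2.setRawPredictionCol("raw_prediction")
print(model2.getRawPredictionCol())  # raw_prediction
{code}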



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39545) Override `concat` method for `ExpressionSet` in Scala 2.13 to improve the performance

2022-06-21 Thread Yang Jie (Jira)
Yang Jie created SPARK-39545:


 Summary: Override `concat` method for `ExpressionSet` in Scala 
2.13 to improve the performance
 Key: SPARK-39545
 URL: https://issues.apache.org/jira/browse/SPARK-39545
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Description: 
The naming of rawPredictionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:java}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel

data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)
lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
print(ovr.getRawPredictionCol())

model = ovr.fit(df)
model_path = 'temp' + "/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

Output:
raw_prediction
'rawPrediction' {code}
 

 

  was:
The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:java}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel

data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
print(ovr.getRawPredictionCol())model = ovr.fit(df)model_path = 'temp' + 
"/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

Output:
raw_prediction
'rawPrediction' {code}
 

 


> setPredictionCol for OneVsRest does not persist when saving model to disk
> -
>
> Key: SPARK-39544
> URL: https://issues.apache.org/jira/browse/SPARK-39544
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1, 3.3.0
> Environment: Python 3.6
> Spark 3.2
>Reporter: koba
>Priority: Major
>
> The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
> loading a trained model. This becomes an issue when I try to stack multiple 
> One Vs Rest models in a pipeline. Code example below. 
> {code:java}
> from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel
> data_path = "/sample_multiclass_classification_data.txt"
> df = spark.read.format("libsvm").load(data_path)lr = LinearSVC(regParam=0.01)
> # set the name of rawPrediction column
> ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
> print(ovr.getRawPredictionCol())
> model = ovr.fit(df)model_path = 'temp' + "/ovr_model"
> # save and read back in
> model.write().overwrite().save(model_path)
> model2 = OneVsRestModel.load(model_path)
> model2.getRawPredictionCol()
> Output:
> raw_prediction
> 'rawPrediction' {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Description: 
The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:java}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel

data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)
lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
print(ovr.getRawPredictionCol())

model = ovr.fit(df)model_path = 'temp' + "/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

Output:
raw_prediction
'rawPrediction' {code}
 

 

  was:
The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:java}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel

data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
print(ovr.getRawPredictionCol())

model = ovr.fit(df)model_path = 'temp' + "/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

Output:
raw_prediction
'rawPrediction' {code}
 

 


> setPredictionCol for OneVsRest does not persist when saving model to disk
> -
>
> Key: SPARK-39544
> URL: https://issues.apache.org/jira/browse/SPARK-39544
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1, 3.3.0
> Environment: Python 3.6
> Spark 3.2
>Reporter: koba
>Priority: Major
>
> The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
> loading a trained model. This becomes an issue when I try to stack multiple 
> One Vs Rest models in a pipeline. Code example below. 
> {code:java}
> from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel
> data_path = "/sample_multiclass_classification_data.txt"
> df = spark.read.format("libsvm").load(data_path)
> lr = LinearSVC(regParam=0.01)
> # set the name of rawPrediction column
> ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
> print(ovr.getRawPredictionCol())
> model = ovr.fit(df)model_path = 'temp' + "/ovr_model"
> # save and read back in
> model.write().overwrite().save(model_path)
> model2 = OneVsRestModel.load(model_path)
> model2.getRawPredictionCol()
> Output:
> raw_prediction
> 'rawPrediction' {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Description: 
The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:java}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel

data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
print(ovr.getRawPredictionCol())model = ovr.fit(df)model_path = 'temp' + 
"/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

Output:
raw_prediction
'rawPrediction' {code}
 

 

  was:
The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:java}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel
data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
print(ovr.getRawPredictionCol())model = ovr.fit(df)model_path = 'temp' + 
"/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

Output:
raw_prediction
'rawPrediction' {code}
 

 


> setPredictionCol for OneVsRest does not persist when saving model to disk
> -
>
> Key: SPARK-39544
> URL: https://issues.apache.org/jira/browse/SPARK-39544
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1, 3.3.0
> Environment: Python 3.6
> Spark 3.2
>Reporter: koba
>Priority: Major
>
> The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
> loading a trained model. This becomes an issue when I try to stack multiple 
> One Vs Rest models in a pipeline. Code example below. 
> {code:java}
> from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel
> data_path = "/sample_multiclass_classification_data.txt"
> df = spark.read.format("libsvm").load(data_path)lr = LinearSVC(regParam=0.01)
> # set the name of rawPrediction column
> ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
> print(ovr.getRawPredictionCol())model = ovr.fit(df)model_path = 'temp' + 
> "/ovr_model"
> # save and read back in
> model.write().overwrite().save(model_path)
> model2 = OneVsRestModel.load(model_path)
> model2.getRawPredictionCol()
> Output:
> raw_prediction
> 'rawPrediction' {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Description: 
The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 
{code:java}
from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel
data_path = "/sample_multiclass_classification_data.txt"
df = spark.read.format("libsvm").load(data_path)lr = LinearSVC(regParam=0.01)

# set the name of rawPrediction column
ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
print(ovr.getRawPredictionCol())model = ovr.fit(df)model_path = 'temp' + 
"/ovr_model"

# save and read back in
model.write().overwrite().save(model_path)
model2 = OneVsRestModel.load(model_path)
model2.getRawPredictionCol()

Output:
raw_prediction
'rawPrediction' {code}
 

 

  was:
The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 

{{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
{{data_path = "/sample_multiclass_classification_data.txt"}}
{{{}df = spark.read.format("libsvm").load(data_path){}}}{{{}lr = 
LinearSVC(regParam=0.01){}}}
{{# set the name of rawPrediction column}}
{{ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')}}
{{{}print(ovr.getRawPredictionCol()){}}}{{{}model = 
ovr.fit(df){}}}{{{}model_path = 'temp' + "/ovr_model"{}}}
{{model.write().overwrite().save(model_path)}}
{{model2 = OneVsRestModel.load(model_path)}}
{{model2.getRawPredictionCol()}}

{{Output:}}

{{raw_prediction }}

{{'rawPrediction'}}

 


> setPredictionCol for OneVsRest does not persist when saving model to disk
> -
>
> Key: SPARK-39544
> URL: https://issues.apache.org/jira/browse/SPARK-39544
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1, 3.3.0
> Environment: Python 3.6
> Spark 3.2
>Reporter: koba
>Priority: Major
>
> The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
> loading a trained model. This becomes an issue when I try to stack multiple 
> One Vs Rest models in a pipeline. Code example below. 
> {code:java}
> from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel
> data_path = "/sample_multiclass_classification_data.txt"
> df = spark.read.format("libsvm").load(data_path)lr = LinearSVC(regParam=0.01)
> # set the name of rawPrediction column
> ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')
> print(ovr.getRawPredictionCol())model = ovr.fit(df)model_path = 'temp' + 
> "/ovr_model"
> # save and read back in
> model.write().overwrite().save(model_path)
> model2 = OneVsRestModel.load(model_path)
> model2.getRawPredictionCol()
> Output:
> raw_prediction
> 'rawPrediction' {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Description: 
The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 

{{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
{{data_path = "/sample_multiclass_classification_data.txt"}}
{{{}df = spark.read.format("libsvm").load(data_path){}}}{{{}lr = 
LinearSVC(regParam=0.01){}}}
{{# set the name of rawPrediction column}}
{{ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')}}
{{{}print(ovr.getRawPredictionCol()){}}}{{{}model = 
ovr.fit(df){}}}{{{}model_path = 'temp' + "/ovr_model"{}}}
{{model.write().overwrite().save(model_path)}}
{{model2 = OneVsRestModel.load(model_path)}}
{{model2.getRawPredictionCol()}}

{{Output:}}

{{raw_prediction }}

{{'rawPrediction'}}

 

  was:
The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 

{{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
{{data_path = "/sample_multiclass_classification_data.txt"}}
{{{}df = spark.read.format("libsvm").load(data_path){}}}{{{}lr = 
LinearSVC(regParam=0.01){}}}
{{# set the name of rawPrediction column}}
{{ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')}}
{{{}print(ovr.getRawPredictionCol()){}}}{{{}model = 
ovr.fit(df){}}}{{{}model_path = 'temp' + "/ovr_model"{}}}
{{model.write().overwrite().save(model_path)}}
{{model2 = OneVsRestModel.load(model_path)}}
{{model2.getRawPredictionCol()}}

{{Output:}}

{{raw_prediction }}{{'rawPrediction'}}

 


> setPredictionCol for OneVsRest does not persist when saving model to disk
> -
>
> Key: SPARK-39544
> URL: https://issues.apache.org/jira/browse/SPARK-39544
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1, 3.3.0
> Environment: Python 3.6
> Spark 3.2
>Reporter: koba
>Priority: Major
>
> The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
> loading a trained model. This becomes an issue when I try to stack multiple 
> One Vs Rest models in a pipeline. Code example below. 
> {{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
> {{data_path = "/sample_multiclass_classification_data.txt"}}
> {{{}df = spark.read.format("libsvm").load(data_path){}}}{{{}lr = 
> LinearSVC(regParam=0.01){}}}
> {{# set the name of rawPrediction column}}
> {{ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')}}
> {{{}print(ovr.getRawPredictionCol()){}}}{{{}model = 
> ovr.fit(df){}}}{{{}model_path = 'temp' + "/ovr_model"{}}}
> {{model.write().overwrite().save(model_path)}}
> {{model2 = OneVsRestModel.load(model_path)}}
> {{model2.getRawPredictionCol()}}
> {{Output:}}
> {{raw_prediction }}
> {{'rawPrediction'}}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

koba updated SPARK-39544:
-
Description: 
The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
loading a trained model. This becomes an issue when I try to stack multiple One 
Vs Rest models in a pipeline. Code example below. 

{{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
{{data_path = "/sample_multiclass_classification_data.txt"}}
{{{}df = spark.read.format("libsvm").load(data_path){}}}{{{}lr = 
LinearSVC(regParam=0.01){}}}
{{# set the name of rawPrediction column}}
{{ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')}}
{{{}print(ovr.getRawPredictionCol()){}}}{{{}model = 
ovr.fit(df){}}}{{{}model_path = 'temp' + "/ovr_model"{}}}
{{model.write().overwrite().save(model_path)}}
{{model2 = OneVsRestModel.load(model_path)}}
{{model2.getRawPredictionCol()}}

{{Output:}}

{{raw_prediction }}{{'rawPrediction'}}

 

  was:
The naming of `rawPredcitionCol` in `OneVsRest` does not persist after saving 
and loading a trained model. This becomes an issue when I try to stack multiple 
One Vs Rest models in a pipeline. Code example below. 

{{```}}

{{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
{{data_path = "/sample_multiclass_classification_data.txt"}}
{{{}df = spark.read.format("libsvm").load(data_path){}}}{{{}lr = 
LinearSVC(regParam=0.01){}}}
{{# set the name of rawPrediction column}}
{{ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')}}
{{{}print(ovr.getRawPredictionCol()){}}}{{{}model = 
ovr.fit(df){}}}{{{}model_path = 'temp' + "/ovr_model"{}}}
{{model.write().overwrite().save(model_path)}}
{{model2 = OneVsRestModel.load(model_path)}}
{{model2.getRawPredictionCol()}}

{{Output:}}

{{raw_prediction }}{{'rawPrediction'}}

{{```}}


> setPredictionCol for OneVsRest does not persist when saving model to disk
> -
>
> Key: SPARK-39544
> URL: https://issues.apache.org/jira/browse/SPARK-39544
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0, 
> 3.2.1, 3.3.0
> Environment: Python 3.6
> Spark 3.2
>Reporter: koba
>Priority: Major
>
> The naming of rawPredcitionCol in OneVsRest does not persist after saving and 
> loading a trained model. This becomes an issue when I try to stack multiple 
> One Vs Rest models in a pipeline. Code example below. 
> {{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
> {{data_path = "/sample_multiclass_classification_data.txt"}}
> {{{}df = spark.read.format("libsvm").load(data_path){}}}{{{}lr = 
> LinearSVC(regParam=0.01){}}}
> {{# set the name of rawPrediction column}}
> {{ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')}}
> {{{}print(ovr.getRawPredictionCol()){}}}{{{}model = 
> ovr.fit(df){}}}{{{}model_path = 'temp' + "/ovr_model"{}}}
> {{model.write().overwrite().save(model_path)}}
> {{model2 = OneVsRestModel.load(model_path)}}
> {{model2.getRawPredictionCol()}}
> {{Output:}}
> {{raw_prediction }}{{'rawPrediction'}}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39544) setPredictionCol for OneVsRest does not persist when saving model to disk

2022-06-21 Thread koba (Jira)
koba created SPARK-39544:


 Summary: setPredictionCol for OneVsRest does not persist when 
saving model to disk
 Key: SPARK-39544
 URL: https://issues.apache.org/jira/browse/SPARK-39544
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.3.0, 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0, 3.0.3, 3.0.2, 
3.0.1, 3.0.0
 Environment: Python 3.6

Spark 3.2
Reporter: koba


The naming of `rawPredcitionCol` in `OneVsRest` does not persist after saving 
and loading a trained model. This becomes an issue when I try to stack multiple 
One Vs Rest models in a pipeline. Code example below. 

{{```}}

{{from pyspark.ml.classification import LinearSVC, OneVsRest, OneVsRestModel}}
{{data_path = "/sample_multiclass_classification_data.txt"}}
{{{}df = spark.read.format("libsvm").load(data_path){}}}{{{}lr = 
LinearSVC(regParam=0.01){}}}
{{# set the name of rawPrediction column}}
{{ovr = OneVsRest(classifier=lr, rawPredictionCol = 'raw_prediction')}}
{{{}print(ovr.getRawPredictionCol()){}}}{{{}model = 
ovr.fit(df){}}}{{{}model_path = 'temp' + "/ovr_model"{}}}
{{model.write().overwrite().save(model_path)}}
{{model2 = OneVsRestModel.load(model_path)}}
{{model2.getRawPredictionCol()}}

{{Output:}}

{{raw_prediction }}{{'rawPrediction'}}

{{```}}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv

2022-06-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556859#comment-17556859
 ] 

Hyukjin Kwon commented on SPARK-38292:
--

You could try to leverage an approach like 
https://github.com/apache/spark/pull/36294 to treat empty or null values as 
non-existent values.
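
For reference, a small sketch of the two CSV reader options mentioned below 
(the file path and sentinel strings are placeholders); the question for 
`na_filter` is essentially whether these knobs can be wired so that nothing is 
turned into null:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path and sentinels; the point is only that the CSV reader already
# exposes options controlling what gets parsed as null vs. as an empty value.
df = (
    spark.read
    .option("header", True)
    .option("nullValue", "NA")   # the string representation of a null value
    .option("emptyValue", "")    # the string representation of an empty value
    .csv("/path/to/data.csv")
)
df.printSchema()
{code}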

> Support `na_filter` for pyspark.pandas.read_csv
> ---
>
> Key: SPARK-38292
> URL: https://issues.apache.org/jira/browse/SPARK-38292
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas support `na_filter` parameter for `read_csv` function. 
> (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
> We also want to support this to follow the behavior of pandas.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv

2022-06-21 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556858#comment-17556858
 ] 

Hyukjin Kwon commented on SPARK-38292:
--

can we control the options e.g., emptyValue or nullValue in CSV?

> Support `na_filter` for pyspark.pandas.read_csv
> ---
>
> Key: SPARK-38292
> URL: https://issues.apache.org/jira/browse/SPARK-38292
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas support `na_filter` parameter for `read_csv` function. 
> (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
> We also want to support this to follow the behavior of pandas.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39543:


Assignee: Apache Spark

> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1
> 
>
> Key: SPARK-39543
> URL: https://issues.apache.org/jira/browse/SPARK-39543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: yikf
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.4.0
>
>
> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1, to support something such as compressed formats, example:
> spark.range(0, 100).writeTo("t1").option("compression", 
> "zstd").using("parquet").create
> *before*
> gen: part-0-644a65ed-0e7a-43d5-8d30-b610a0fb19dc-c000.snappy.parquet
> *after*
> gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ...
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556857#comment-17556857
 ] 

Apache Spark commented on SPARK-39543:
--

User 'Yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/36941

> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1
> 
>
> Key: SPARK-39543
> URL: https://issues.apache.org/jira/browse/SPARK-39543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.4.0
>
>
> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1, to support something such as compressed formats, example:
> spark.range(0, 100).writeTo("t1").option("compression", 
> "zstd").using("parquet").create
> *before*
> gen: part-0-644a65ed-0e7a-43d5-8d30-b610a0fb19dc-c000.snappy.parquet
> *after*
> gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ...
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39543:


Assignee: (was: Apache Spark)

> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1
> 
>
> Key: SPARK-39543
> URL: https://issues.apache.org/jira/browse/SPARK-39543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.4.0
>
>
> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1, to support something such as compressed formats, example:
> spark.range(0, 100).writeTo("t1").option("compression", 
> "zstd").using("parquet").create
> *before*
> gen: part-0-644a65ed-0e7a-43d5-8d30-b610a0fb19dc-c000.snappy.parquet
> *after*
> gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ...
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1

2022-06-21 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf updated SPARK-39543:
-
Description: 
The option of DataFrameWriterV2 should be passed to storage properties if 
fallback to v1, to support something such as compressed formats, example:

spark.range(0, 100).writeTo("t1").option("compression", 
"zstd").using("parquet").create

*before*

gen: part-0-644a65ed-0e7a-43d5-8d30-b610a0fb19dc-c000.snappy.parquet

*after*

gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ...

 

  was:
The option of DataFrameWriterV2 should be passed to storage properties if 
fallback to v1, to support something such as compressed formats, example:

*before*

 

*after*

`spark.range(0, 100).writeTo("t1").option("compression", 
"zstd").using("parquet").create`

gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ...

 


> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1
> 
>
> Key: SPARK-39543
> URL: https://issues.apache.org/jira/browse/SPARK-39543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.4.0
>
>
> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1, to support something such as compressed formats, example:
> spark.range(0, 100).writeTo("t1").option("compression", 
> "zstd").using("parquet").create
> *before*
> gen: part-0-644a65ed-0e7a-43d5-8d30-b610a0fb19dc-c000.snappy.parquet
> *after*
> gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ...
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38292) Support `na_filter` for pyspark.pandas.read_csv

2022-06-21 Thread pralabhkumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556851#comment-17556851
 ] 

pralabhkumar commented on SPARK-38292:
--

[~itholic] [~hyukjin.kwon] 

I would like to discuss the logic.

The difference arises with na_filter = False when there are missing values, e.g.

22,,1980-09-26

33,,1980-09-26

With na_filter = False, pandas reads the data as-is, whereas Spark reads the 
missing values as null. This happens because univocity-parsers, which Spark 
uses, parses missing values as null.

Approach for the na_filter = False case:

Once the file is read in namespace.py via reader.csv(path), replace the missing 
values with empty strings (df.fillna("")). We also need to change the datatype 
of the columns to string (as pandas does).

Please let me know if this is the correct direction, and I'll create a PR.
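
To make the proposal concrete, here is a hedged sketch of that post-read fix-up, outside of namespace.py; the input path and read options are illustrative assumptions.

{code:python}
# Hedged sketch of the approach described above for na_filter=False:
# read the CSV through Spark (where univocity-parsers turns missing fields
# into nulls), then cast every column to string and replace the nulls with
# empty strings, mimicking pandas' as-is behaviour.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

path = "/tmp/example.csv"  # hypothetical file containing rows like 22,,1980-09-26
sdf = spark.read.csv(path, header=False, inferSchema=True)

sdf = sdf.select([F.col(c).cast("string") for c in sdf.columns]).fillna("")
sdf.show()
{code}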

> Support `na_filter` for pyspark.pandas.read_csv
> ---
>
> Key: SPARK-38292
> URL: https://issues.apache.org/jira/browse/SPARK-38292
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Haejoon Lee
>Priority: Major
>
> pandas support `na_filter` parameter for `read_csv` function. 
> (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html)
> We also want to support this to follow the behavior of pandas.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1

2022-06-21 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf updated SPARK-39543:
-
Description: 
The option of DataFrameWriterV2 should be passed to storage properties if 
fallback to v1, to support something such as compressed formats, example:

*before*

 

*after*

`spark.range(0, 100).writeTo("t1").option("compression", 
"zstd").using("parquet").create`

gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ...

 

  was:
The option of DataFrameWriterV2 should be passed to storage properties if 
fallback to v1, to support something such as compressed formats, example:

*before*

 

*after*

 

 


> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1
> 
>
> Key: SPARK-39543
> URL: https://issues.apache.org/jira/browse/SPARK-39543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.4.0
>
>
> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1, to support something such as compressed formats, example:
> *before*
>  
> *after*
> `spark.range(0, 100).writeTo("t1").option("compression", 
> "zstd").using("parquet").create`
> gen: part-0-6eb9d1ae-8fdb-4428-aea3-bd6553954cdd-c000.zstd.parquet ...
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39515) Improve/recover scheduled jobs in GitHub Actions

2022-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-39515:


Assignee: Hyukjin Kwon

> Improve/recover scheduled jobs in GitHub Actions
> 
>
> Key: SPARK-39515
> URL: https://issues.apache.org/jira/browse/SPARK-39515
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> There are five problems to address.
> *First*, the scheduled jobs are broken as below:
> https://github.com/apache/spark/actions/runs/2513261706
> https://github.com/apache/spark/actions/runs/2512750310
> https://github.com/apache/spark/actions/runs/2509238648
> https://github.com/apache/spark/actions/runs/2508246903
> https://github.com/apache/spark/actions/runs/2507327914
> https://github.com/apache/spark/actions/runs/2506654808
> https://github.com/apache/spark/actions/runs/2506143939
> https://github.com/apache/spark/actions/runs/2502449498
> https://github.com/apache/spark/actions/runs/2501400490
> https://github.com/apache/spark/actions/runs/2500407628
> https://github.com/apache/spark/actions/runs/2499722093
> https://github.com/apache/spark/actions/runs/2499196539
> https://github.com/apache/spark/actions/runs/2496544415
> https://github.com/apache/spark/actions/runs/2495444227
> https://github.com/apache/spark/actions/runs/2493402272
> https://github.com/apache/spark/actions/runs/2492759618
> https://github.com/apache/spark/actions/runs/2492227816
> See also https://github.com/apache/spark/pull/36899 or 
> https://github.com/apache/spark/pull/36890
> In the master branch, seems like at least Hadoop 2 build is broken currently.
> *Second*, it is very difficult to navigate scheduled jobs now. We should use 
> https://github.com/apache/spark/actions/workflows/build_and_test.yml?query=event%3Aschedule
>  link and manually search one by one.
> Since GitHub added the feature to import other workflow, we should leverage 
> this feature, see also 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test_ansi.yml
>  and https://docs.github.com/en/actions/using-workflows/reusing-workflows. 
> Once we can separate them, it will be defined as a separate workflow.
> Namely, each scheduled job should be classified under "All workflows" at 
> https://github.com/apache/spark/actions so other developers can easily track 
> them.
> *Third*, we should set the scheduled jobs for branch-3.3, see also 
> https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L78-L83
>  for branch-3.2 job.
> *Forth*, we should improve duplicated test skipping logic. See also 
> https://github.com/apache/spark/pull/36413#issuecomment-1157205469 and 
> https://github.com/apache/spark/pull/36888
> *Fifth*, we should probably replace the base image 
> (https://github.com/apache/spark/blob/master/.github/workflows/build_and_test.yml#L302,
>  https://github.com/dongjoon-hyun/ApacheSparkGitHubActionImage) to plain 
> ubunto image w/ Docker image cache. See also 
> https://github.com/docker/build-push-action/blob/master/docs/advanced/cache.md



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39074) Fail on uploading test files, not when downloading them

2022-06-21 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-39074:
-
Parent: (was: SPARK-39515)
Issue Type: Bug  (was: Sub-task)

> Fail on uploading test files, not when downloading them
> ---
>
> Key: SPARK-39074
> URL: https://issues.apache.org/jira/browse/SPARK-39074
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Minor
>
> The CI workflow "Report test results" fails when there are no artifacts to be 
> downloaded from the triggering workflow. In some situations, the triggering 
> workflow is not skipped, but all test jobs are skipped in case no code 
> changes are detected.
> In that situation, no test files are uploaded, which makes the triggered 
> workflow fail.
> Downloading no test files can have two reasons:
> 1. No tests have been executed or no test files have been generated.
> 2. No code has been built and tested deliberately.
> You want to be notified in the first situation to fix the CI. Therefore, CI 
> should fail when code is built and tests are run but no test result files are 
> been found.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39529) Refactor and merge all related job selection logic into precondition

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556837#comment-17556837
 ] 

Apache Spark commented on SPARK-39529:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36940

> Refactor and merge all related job selection logic into precondition 
> -
>
> Key: SPARK-39529
> URL: https://issues.apache.org/jira/browse/SPARK-39529
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently there are three logics that choose which build to run.
> First is configure-jobs
> Second is precondition
> Third is the type of job (if it's scheduled or not).
> We should merge all to precondition.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39529) Refactor and merge all related job selection logic into precondition

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556838#comment-17556838
 ] 

Apache Spark commented on SPARK-39529:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/36940

> Refactor and merge all related job selection logic into precondition 
> -
>
> Key: SPARK-39529
> URL: https://issues.apache.org/jira/browse/SPARK-39529
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently there are three logics that choose which build to run.
> First is configure-jobs
> Second is precondition
> Third is the type of job (if it's scheduled or not).
> We should merge all to precondition.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage properties if fallback to v1

2022-06-21 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf updated SPARK-39543:
-
Summary: The option of DataFrameWriterV2 should be passed to storage 
properties if fallback to v1  (was: The option of DataFrameWriterV2 should be 
passed to storage Properties if fallback to v1)

> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1
> 
>
> Key: SPARK-39543
> URL: https://issues.apache.org/jira/browse/SPARK-39543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.4.0
>
>
> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1, to support something such as compressed formats, example:
> *before*
>  
> *after*
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage Properties if fallback to v1

2022-06-21 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf updated SPARK-39543:
-
Summary: The option of DataFrameWriterV2 should be passed to storage 
Properties if fallback to v1  (was: The option of DataFrameWriterV2 should be 
passed to storage Properties)

> The option of DataFrameWriterV2 should be passed to storage Properties if 
> fallback to v1
> 
>
> Key: SPARK-39543
> URL: https://issues.apache.org/jira/browse/SPARK-39543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.4.0
>
>
> The option of DataFrameWriterV2 should be passed to storage Properties, to 
> support something such as compressed formats, example:
> **before**
>  
> **after**
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage Properties if fallback to v1

2022-06-21 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf updated SPARK-39543:
-
Description: 
The option of DataFrameWriterV2 should be passed to storage properties if 
fallback to v1, to support something such as compressed formats, example:

*before*

 

*after*

 

 

  was:
The option of DataFrameWriterV2 should be passed to storage Properties, to 
support something such as compressed formats, example:

**before**

 

**after**

 

 


> The option of DataFrameWriterV2 should be passed to storage Properties if 
> fallback to v1
> 
>
> Key: SPARK-39543
> URL: https://issues.apache.org/jira/browse/SPARK-39543
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.4.0
>
>
> The option of DataFrameWriterV2 should be passed to storage properties if 
> fallback to v1, to support something such as compressed formats, example:
> *before*
>  
> *after*
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39543) The option of DataFrameWriterV2 should be passed to storage Properties

2022-06-21 Thread yikf (Jira)
yikf created SPARK-39543:


 Summary: The option of DataFrameWriterV2 should be passed to 
storage Properties
 Key: SPARK-39543
 URL: https://issues.apache.org/jira/browse/SPARK-39543
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: yikf
 Fix For: 3.4.0


The option of DataFrameWriterV2 should be passed to storage Properties, to 
support something such as compressed formats, example:

**before**

 

**after**

 

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39542) Improve YARN client mode to support IPv6

2022-06-21 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556818#comment-17556818
 ] 

Apache Spark commented on SPARK-39542:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36939

> Improve YARN client mode to support IPv6
> 
>
> Key: SPARK-39542
> URL: https://issues.apache.org/jira/browse/SPARK-39542
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, YARN
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39542) Improve YARN client mode to support IPv6

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39542:


Assignee: (was: Apache Spark)

> Improve YARN client mode to support IPv6
> 
>
> Key: SPARK-39542
> URL: https://issues.apache.org/jira/browse/SPARK-39542
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, YARN
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39542) Improve YARN client mode to support IPv6

2022-06-21 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39542:


Assignee: Apache Spark

> Improve YARN client mode to support IPv6
> 
>
> Key: SPARK-39542
> URL: https://issues.apache.org/jira/browse/SPARK-39542
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, YARN
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39542) Improve YARN client mode to support IPv6

2022-06-21 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39542:
--
Component/s: PySpark

> Improve YARN client mode to support IPv6
> 
>
> Key: SPARK-39542
> URL: https://issues.apache.org/jira/browse/SPARK-39542
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, YARN
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39542) Improve YARN client mode to support IPv6

2022-06-21 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-39542:
-

 Summary: Improve YARN client mode to support IPv6
 Key: SPARK-39542
 URL: https://issues.apache.org/jira/browse/SPARK-39542
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39541) [Yarn] Diagnostics of yarn UI did not display the exception of driver when driver exit before regiserAM

2022-06-21 Thread liangyongyuan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17556806#comment-17556806
 ] 

liangyongyuan commented on SPARK-39541:
---

I want to try to solve this problem. I already have a solution and have tested 
it.

> [Yarn] Diagnostics of yarn UI did not display the exception of driver when 
> driver exit before regiserAM
> ---
>
> Key: SPARK-39541
> URL: https://issues.apache.org/jira/browse/SPARK-39541
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.3.0
>Reporter: liangyongyuan
>Priority: Major
>
> If a job is submitted in YARN cluster mode and the driver exits before 
> registerAM, the Diagnostics section of the YARN UI does not show the exception 
> thrown by the driver. The YARN UI only shows:
> Application application_xxx failed 1 times (global limit =10; local limit is 
> =1) due to AM Container for appattempt_xxx_01 exited with exitCode: 13
>  
> The user must check the Spark log to find the real reason. For example, the Spark log shows:
> {code:java}
> 2022-06-21,17:58:28,273 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: 
> User class threw exception: java.lang.ArithmeticException: / by zero
> java.lang.ArithmeticException: / by zero
>   at org.examples.appErrorDemo3$.main(appErrorDemo3.scala:10)
>   at org.examples.appErrorDemo3.main(appErrorDemo3.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:736)
>  {code}
>  
> The root cause is that if the driver exits before registerAM, unregisterAM is 
> never called, so the YARN UI cannot show the real diagnostic information.
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39541) [Yarn] Diagnostics of yarn UI did not display the exception of driver when driver exit before regiserAM

2022-06-21 Thread liangyongyuan (Jira)
liangyongyuan created SPARK-39541:
-

 Summary: [Yarn] Diagnostics of yarn UI did not display the 
exception of driver when driver exit before regiserAM
 Key: SPARK-39541
 URL: https://issues.apache.org/jira/browse/SPARK-39541
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 3.3.0
Reporter: liangyongyuan


If a job is submitted in YARN cluster mode and the driver exits before 
registerAM, the Diagnostics section of the YARN UI does not show the exception 
thrown by the driver. The YARN UI only shows:

Application application_xxx failed 1 times (global limit =10; local limit is 
=1) due to AM Container for appattempt_xxx_01 exited with exitCode: 13

 

The user must check the Spark log to find the real reason. For example, the Spark log shows:
{code:java}
2022-06-21,17:58:28,273 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: 
User class threw exception: java.lang.ArithmeticException: / by zero
java.lang.ArithmeticException: / by zero
at org.examples.appErrorDemo3$.main(appErrorDemo3.scala:10)
at org.examples.appErrorDemo3.main(appErrorDemo3.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:736)
 {code}
 

The root cause is that if the driver exits before registerAM, unregisterAM is 
never called, so the YARN UI cannot show the real diagnostic information.
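
As a hedged illustration of that failure mode (the demo in the log above is a Scala class; this PySpark analogue is only an assumption for illustration), a driver that fails before a SparkContext exists would make the AM exit before registerAM is ever reached:

{code:python}
# Hedged illustration, not the original demo: the application throws before a
# SparkSession/SparkContext is created, so in YARN cluster mode the AM exits
# before registering and the YARN UI shows only the generic exit-code message.
from pyspark.sql import SparkSession

def main():
    broken = 1 / 0  # fails here, before the SparkContext exists
    spark = SparkSession.builder.getOrCreate()
    spark.range(10).show()

if __name__ == "__main__":
    main()
{code}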

 



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


