[jira] [Updated] (SPARK-39357) pmCache memory leak caused by IsolatedClassLoader
[ https://issues.apache.org/jira/browse/SPARK-39357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tianshuang updated SPARK-39357: --- Attachment: JVM Heap Long Lived Pool.jpg JVM Heap.jpg
> pmCache memory leak caused by IsolatedClassLoader
> -
>
> Key: SPARK-39357
> URL: https://issues.apache.org/jira/browse/SPARK-39357
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4, 3.2.1
> Reporter: tianshuang
> Priority: Major
> Attachments: JVM Heap Long Lived Pool.jpg, JVM Heap.jpg, Xnip2022-06-01_23-09-35.jpg, Xnip2022-06-01_23-19-35.jpeg, Xnip2022-06-01_23-32-39.jpg
>
> I found this bug in Spark 2.4.4; because the related code has not changed, it still exists on master. A brief description of the bug follows.
> In May 2015, [SPARK-6907|https://github.com/apache/spark/commit/daa70bf135f23381f5f410aa95a1c0e5a2888568] introduced an isolated classloader for HiveMetastore to support loading multiple Hive versions, but it broke the [RawStore cleanup mechanism|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/ThreadFactoryWithGarbageCleanup.java#L27-L42]. The `ThreadWithGarbageCleanup` class used by the `HiveServer2-Handler-Pool`, `HiveServer2-Background-Pool`, and `HiveServer2-HttpHandler-Pool` thread pools is loaded by the AppClassLoader, so the line `RawStore threadLocalRawStore = HiveMetaStore.HMSHandler.getRawStore();` in its source reads the static `threadLocalMS` field of the `HiveMetaStore.HMSHandler` class loaded by the AppClassLoader. During thread execution, however, the metastore `client` is created by the isolated classloader, so when a `RawStore` instance is obtained through `HiveMetaStore.HMSHandler#getMSForConf`, the `ms` instance is stored in the static `threadLocalMS` field of the `HMSHandler` class loaded by `IsolatedClassLoader$$anon$1`. The set and the get therefore operate on different `threadLocalMS` instances, so `ThreadWithGarbageCleanup#cacheThreadLocalRawStore` obtains a null `RawStore`, the subsequent `RawStore` cleanup logic never takes effect, `RawStore#shutdown` is never called, and the `pmCache` of `JDOPersistenceManagerFactory` leaks memory. A long-running Spark ThriftServer ends up with frequent GCs and poor performance.
> I analyzed the heap dump using MAT and executed the following OQL: `SELECT * FROM INSTANCEOF java.lang.Class c WHERE c.@displayName.contains("class org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler ")`. Two instances of the `HMSHandler` *Class* can be found in the heap, and each holds its own static `threadLocalMS` instance.
> Executing `select * from org.datanucleus.api.jdo.JDOPersistenceManagerFactory` shows that the `pmCache` of the `JDOPersistenceManagerFactory` instance has over 200,000 elements and consumes 1.3 GB of memory.
> Executing `SELECT * FROM INSTANCEOF java.lang.Class c WHERE c.@displayName.contains("class org.apache.hive.service.server.ThreadFactoryWithGarbageCleanup")` shows that the static `threadRawStoreMap` of `ThreadFactoryWithGarbageCleanup` contains no elements, which confirms the analysis above: `HMSHandler.getRawStore()` in `ThreadWithGarbageCleanup#cacheThreadLocalRawStore` reads the `threadLocalMS` of the `HMSHandler` loaded by the AppClassLoader rather than the `threadLocalMS` of the `HMSHandler` loaded by `IsolatedClassLoader$$anon$1`.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
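The core mechanism in the report above is that a static field belongs to a class *as defined by a particular classloader*, so two loaders each get their own `threadLocalMS`. A rough, hypothetical Python analogy (module-level state standing in for the static field, and two `importlib` loads of the same source standing in for the AppClassLoader and the isolated classloader; all names here are invented for illustration):

```python
import importlib.util
import os
import tempfile
import textwrap

# A tiny "class" with static-like module-level state (stand-in for
# HMSHandler.threadLocalMS).
source = textwrap.dedent("""
    _thread_local_ms = None

    def set_ms(ms):
        global _thread_local_ms
        _thread_local_ms = ms

    def get_ms():
        return _thread_local_ms
""")
path = os.path.join(tempfile.mkdtemp(), "hms_handler.py")
with open(path, "w") as f:
    f.write(source)

def load_copy(alias):
    # Each call executes the same source into a fresh module object,
    # like two classloaders each defining their own HMSHandler class.
    spec = importlib.util.spec_from_file_location(alias, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

app_copy = load_copy("hms_handler_app")       # "AppClassLoader" copy
iso_copy = load_copy("hms_handler_isolated")  # "IsolatedClassLoader" copy

iso_copy.set_ms("RawStore instance")  # the setter runs in the isolated copy...
print(app_copy.get_ms())              # ...but the app copy still sees None
```

The cleanup thread is in the position of `app_copy` here: its getter returns `None` even though the isolated copy holds a live `RawStore`, so the `shutdown()` call that would empty `pmCache` never happens.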
[jira] [Updated] (SPARK-39357) pmCache memory leak caused by IsolatedClassLoader
[ https://issues.apache.org/jira/browse/SPARK-39357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tianshuang updated SPARK-39357: --- Description: I found this bug in Spark 2.4.4; because the related code has not changed, it still exists on master. A brief description of the bug follows.
In May 2015, [SPARK-6907|https://github.com/apache/spark/commit/daa70bf135f23381f5f410aa95a1c0e5a2888568] introduced an isolated classloader for HiveMetastore to support loading multiple Hive versions, but it broke the [RawStore cleanup mechanism|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/ThreadFactoryWithGarbageCleanup.java#L27-L42]. The `ThreadWithGarbageCleanup` class used by the `HiveServer2-Handler-Pool`, `HiveServer2-Background-Pool`, and `HiveServer2-HttpHandler-Pool` thread pools is loaded by the AppClassLoader, so the line `RawStore threadLocalRawStore = HiveMetaStore.HMSHandler.getRawStore();` in its source reads the static `threadLocalMS` field of the `HiveMetaStore.HMSHandler` class loaded by the AppClassLoader. During thread execution, however, the metastore `client` is created by the isolated classloader, so when a `RawStore` instance is obtained through `HiveMetaStore.HMSHandler#getMSForConf`, the `ms` instance is stored in the static `threadLocalMS` field of the `HMSHandler` class loaded by `IsolatedClassLoader$$anon$1`. The set and the get therefore operate on different `threadLocalMS` instances, so `ThreadWithGarbageCleanup#cacheThreadLocalRawStore` obtains a null `RawStore`, the subsequent `RawStore` cleanup logic never takes effect, `RawStore#shutdown` is never called, and the `pmCache` of `JDOPersistenceManagerFactory` leaks memory. A long-running Spark ThriftServer ends up with frequent GCs and poor performance.
I analyzed the heap dump using MAT and executed the following OQL: `SELECT * FROM INSTANCEOF java.lang.Class c WHERE c.@displayName.contains("class org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler ")`. Two instances of the `HMSHandler` *Class* can be found in the heap, and each holds its own static `threadLocalMS` instance.
Executing `select * from org.datanucleus.api.jdo.JDOPersistenceManagerFactory` shows that the `pmCache` of the `JDOPersistenceManagerFactory` instance has over 200,000 elements and consumes 1.3 GB of memory.
Executing `SELECT * FROM INSTANCEOF java.lang.Class c WHERE c.@displayName.contains("class org.apache.hive.service.server.ThreadFactoryWithGarbageCleanup")` shows that the static `threadRawStoreMap` of `ThreadFactoryWithGarbageCleanup` contains no elements, which confirms the analysis above: `HMSHandler.getRawStore()` in `ThreadWithGarbageCleanup#cacheThreadLocalRawStore` reads the `threadLocalMS` of the `HMSHandler` loaded by the AppClassLoader rather than the `threadLocalMS` of the `HMSHandler` loaded by `IsolatedClassLoader$$anon$1`.
[jira] [Updated] (SPARK-39369) Use JAVA_OPTS for AppVeyor build to increase the memory properly
[ https://issues.apache.org/jira/browse/SPARK-39369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39369: - Description: https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704 The AppVeyor build is failing because of a lack of memory. We should use JAVA_OPTS to configure the memory properly. was: https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704 The AppVeyor build is failing because of a lack of memory. We should increase it to make the build pass.
> Use JAVA_OPTS for AppVeyor build to increase the memory properly
>
> Key: SPARK-39369
> URL: https://issues.apache.org/jira/browse/SPARK-39369
> Project: Spark
> Issue Type: Test
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704
> The AppVeyor build is failing because of a lack of memory. We should use JAVA_OPTS to configure the memory properly.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
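A minimal sketch of the JAVA_OPTS approach, with hypothetical heap values (nothing here is taken from the actual AppVeyor configuration; the real flags and sizes would live in appveyor.yml):

```shell
# Hypothetical: export JAVA_OPTS so the JVMs spawned during the build
# pick up a larger heap instead of the default.
export JAVA_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=512m"
```

The advantage over a build-tool-specific flag is that any child JVM honoring JAVA_OPTS inherits the setting.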
[jira] [Updated] (SPARK-39369) Use JAVA_OPTS for AppVeyor build to increase the memory properly
[ https://issues.apache.org/jira/browse/SPARK-39369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39369: - Summary: Use JAVA_OPTS for AppVeyor build to increase the memory properly (was: Increase memory in AppVeyor build)
> Use JAVA_OPTS for AppVeyor build to increase the memory properly
>
> Key: SPARK-39369
> URL: https://issues.apache.org/jira/browse/SPARK-39369
> Project: Spark
> Issue Type: Test
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704
> The AppVeyor build is failing because of a lack of memory. We should increase it to make the build pass.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37623) Support ANSI Aggregate Function: regr_intercept
[ https://issues.apache.org/jira/browse/SPARK-37623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-37623. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36708 [https://github.com/apache/spark/pull/36708]
> Support ANSI Aggregate Function: regr_intercept
> ---
>
> Key: SPARK-37623
> URL: https://issues.apache.org/jira/browse/SPARK-37623
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: jiaan.geng
> Assignee: jiaan.geng
> Priority: Major
> Fix For: 3.4.0
>
> REGR_INTERCEPT is an ANSI aggregate function; many databases support it.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
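As a sketch of the semantics (not Spark's implementation): ANSI `REGR_INTERCEPT(y, x)` is `avg(y) - slope * avg(x)` with `slope = covar_pop(y, x) / var_pop(x)`, computed over the pairs where both sides are non-null:

```python
# Hedged sketch of ANSI REGR_INTERCEPT semantics over (y, x) pairs:
# intercept = avg(y) - slope * avg(x), slope = covar_pop(y, x) / var_pop(x).
def regr_intercept(pairs):
    pairs = [(y, x) for (y, x) in pairs if y is not None and x is not None]
    n = len(pairs)
    if n == 0:
        return None
    avg_y = sum(y for y, _ in pairs) / n
    avg_x = sum(x for _, x in pairs) / n
    covar_pop = sum((y - avg_y) * (x - avg_x) for y, x in pairs) / n
    var_pop_x = sum((x - avg_x) ** 2 for _, x in pairs) / n
    if var_pop_x == 0:
        return None  # regression line undefined when x is constant
    return avg_y - (covar_pop / var_pop_x) * avg_x

# Points on the line y = 2x pass through the origin, so the intercept is 0:
print(regr_intercept([(2.0, 1.0), (4.0, 2.0), (6.0, 3.0)]))  # 0.0
```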
[jira] [Assigned] (SPARK-37623) Support ANSI Aggregate Function: regr_intercept
[ https://issues.apache.org/jira/browse/SPARK-37623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-37623: Assignee: jiaan.geng
> Support ANSI Aggregate Function: regr_intercept
> ---
>
> Key: SPARK-37623
> URL: https://issues.apache.org/jira/browse/SPARK-37623
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: jiaan.geng
> Assignee: jiaan.geng
> Priority: Major
>
> REGR_INTERCEPT is an ANSI aggregate function; many databases support it.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39259) Timestamps returned by now() and equivalent functions are not consistent in subqueries
[ https://issues.apache.org/jira/browse/SPARK-39259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39259: - Fix Version/s: 3.3.0
> Timestamps returned by now() and equivalent functions are not consistent in subqueries
> --
>
> Key: SPARK-39259
> URL: https://issues.apache.org/jira/browse/SPARK-39259
> Project: Spark
> Issue Type: Bug
> Components: Optimizer
> Affects Versions: 3.2.1
> Environment: Reproduced in the Spark Shell on the current 3.4.0 snapshot
> Reporter: Jan-Ole Sasse
> Assignee: Jan-Ole Sasse
> Priority: Major
> Fix For: 3.3.0, 3.4.0
>
> Timestamp evaluation is not consistent across subqueries. As an example, in the Spark shell:
> {code:java}
> sql("SELECT * FROM (SELECT 1) WHERE now() IN (SELECT now())").collect() {code}
> returns an empty result.
> The root cause is that [ComputeCurrentTime|https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L74] does not recurse into subqueries.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
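The fix amounts to replacing every current-time expression in the plan, including those inside subqueries, with a single literal captured once per query. A hypothetical sketch of that idea (nested lists standing in for plan nodes and subqueries; this is not Catalyst's API):

```python
import datetime

def compute_current_time(node, now=None):
    # Capture the timestamp once at the root, then reuse it everywhere,
    # recursing into child nodes and subqueries alike.
    if now is None:
        now = datetime.datetime.now()
    if node == "now()":
        return now
    if isinstance(node, list):
        return [compute_current_time(child, now) for child in node]
    return node

plan = ["filter", "now()", ["subquery", "now()"]]
rewritten = compute_current_time(plan)
# Both occurrences, outer query and subquery, got the same literal,
# so a predicate like `now() IN (SELECT now())` can evaluate to true:
assert rewritten[1] == rewritten[2][1]
```

The bug is what happens without the recursion step: the outer `now()` and the subquery's `now()` get captured at different moments, so the membership test fails.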
[jira] [Commented] (SPARK-39371) Review and fix issues in Scala/Java API docs of Core module
[ https://issues.apache.org/jira/browse/SPARK-39371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545748#comment-17545748 ] Apache Spark commented on SPARK-39371: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/36757 > Review and fix issues in Scala/Java API docs of Core module > --- > > Key: SPARK-39371 > URL: https://issues.apache.org/jira/browse/SPARK-39371 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yuanjian Li >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39371) Review and fix issues in Scala/Java API docs of Core module
[ https://issues.apache.org/jira/browse/SPARK-39371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39371: Assignee: (was: Apache Spark) > Review and fix issues in Scala/Java API docs of Core module > --- > > Key: SPARK-39371 > URL: https://issues.apache.org/jira/browse/SPARK-39371 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yuanjian Li >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39371) Review and fix issues in Scala/Java API docs of Core module
[ https://issues.apache.org/jira/browse/SPARK-39371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545747#comment-17545747 ] Apache Spark commented on SPARK-39371: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/36757 > Review and fix issues in Scala/Java API docs of Core module > --- > > Key: SPARK-39371 > URL: https://issues.apache.org/jira/browse/SPARK-39371 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yuanjian Li >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39371) Review and fix issues in Scala/Java API docs of Core module
[ https://issues.apache.org/jira/browse/SPARK-39371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39371: Assignee: Apache Spark > Review and fix issues in Scala/Java API docs of Core module > --- > > Key: SPARK-39371 > URL: https://issues.apache.org/jira/browse/SPARK-39371 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yuanjian Li >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39371) Review and fix issues in Scala/Java API docs of Core module
Yuanjian Li created SPARK-39371: --- Summary: Review and fix issues in Scala/Java API docs of Core module Key: SPARK-39371 URL: https://issues.apache.org/jira/browse/SPARK-39371 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.3.0 Reporter: Yuanjian Li -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39320) Add the MEDIAN() function
[ https://issues.apache.org/jira/browse/SPARK-39320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39320. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36714 [https://github.com/apache/spark/pull/36714] > Add the MEDIAN() function > - > > Key: SPARK-39320 > URL: https://issues.apache.org/jira/browse/SPARK-39320 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.0 > > > Add the MEDIAN() function which can be implemented as *a specific case of > PERCENTILE_CONT where the percentile value defaults to 0.5.* -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
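The PERCENTILE_CONT relationship mentioned above can be sketched directly (a hedged illustration of the semantics, not Spark's implementation): the ANSI continuous percentile linearly interpolates between the two sorted values straddling the target position, and MEDIAN is the 0.5 case.

```python
import math

# Continuous percentile: interpolate between the neighbors of the
# fractional row position fraction * (n - 1) in the sorted input.
def percentile_cont(values, fraction):
    xs = sorted(values)
    if not xs:
        return None
    pos = fraction * (len(xs) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    weight = pos - lo
    return xs[lo] * (1 - weight) + xs[hi] * weight

def median(values):
    # MEDIAN as a specific case of PERCENTILE_CONT with fraction 0.5.
    return percentile_cont(values, 0.5)

print(median([1, 2, 3, 4]))  # 2.5
print(median([7, 1, 3]))     # 3.0
```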
[jira] [Assigned] (SPARK-39320) Add the MEDIAN() function
[ https://issues.apache.org/jira/browse/SPARK-39320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39320: Assignee: jiaan.geng > Add the MEDIAN() function > - > > Key: SPARK-39320 > URL: https://issues.apache.org/jira/browse/SPARK-39320 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: jiaan.geng >Priority: Major > > Add the MEDIAN() function which can be implemented as *a specific case of > PERCENTILE_CONT where the percentile value defaults to 0.5.* -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39370) Inline type hints in PySpark
[ https://issues.apache.org/jira/browse/SPARK-39370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39370. -- Fix Version/s: 3.3.0 Resolution: Done > Inline type hints in PySpark > > > Key: SPARK-39370 > URL: https://issues.apache.org/jira/browse/SPARK-39370 > Project: Spark > Issue Type: Epic > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Hyukjin Kwon >Priority: Critical > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37396) Inline type hint files for files in python/pyspark/mllib
[ https://issues.apache.org/jira/browse/SPARK-37396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37396: - Epic Link: SPARK-39370 > Inline type hint files for files in python/pyspark/mllib > > > Key: SPARK-37396 > URL: https://issues.apache.org/jira/browse/SPARK-37396 > Project: Spark > Issue Type: Umbrella > Components: MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Currently there are type hint stub files ({{*.pyi}}) to show the expected > types for functions, but we can also take advantage of static type checking > within the functions by inlining the type hints. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37093) Inline type hints python/pyspark/streaming
[ https://issues.apache.org/jira/browse/SPARK-37093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37093: - Epic Link: SPARK-39370 > Inline type hints python/pyspark/streaming > -- > > Key: SPARK-37093 > URL: https://issues.apache.org/jira/browse/SPARK-37093 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37395) Inline type hint files for files in python/pyspark/ml
[ https://issues.apache.org/jira/browse/SPARK-37395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37395: - Epic Link: SPARK-39370 > Inline type hint files for files in python/pyspark/ml > - > > Key: SPARK-37395 > URL: https://issues.apache.org/jira/browse/SPARK-37395 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Currently there are type hint stub files ({{*.pyi}}) to show the expected > types for functions, but we can also take advantage of static type checking > within the functions by inlining the type hints. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36845) Inline type hint files for files in python/pyspark/sql
[ https://issues.apache.org/jira/browse/SPARK-36845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36845: - Epic Link: SPARK-39370 > Inline type hint files for files in python/pyspark/sql > -- > > Key: SPARK-36845 > URL: https://issues.apache.org/jira/browse/SPARK-36845 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Xinrong Meng >Priority: Major > > Currently there are type hint stub files ({{*.pyi}}) to show the expected > types for functions, but we can also take advantage of static type checking > within the functions by inlining the type hints. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37094) Inline type hints for files in python/pyspark
[ https://issues.apache.org/jira/browse/SPARK-37094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37094: - Epic Link: SPARK-39370 > Inline type hints for files in python/pyspark > - > > Key: SPARK-37094 > URL: https://issues.apache.org/jira/browse/SPARK-37094 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39370) Inline type hints in PySpark
Hyukjin Kwon created SPARK-39370: Summary: Inline type hints in PySpark Key: SPARK-39370 URL: https://issues.apache.org/jira/browse/SPARK-39370 Project: Spark Issue Type: Epic Components: PySpark Affects Versions: 3.2.1, 3.3.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39366) BlockInfoManager should not release write locks on task end
[ https://issues.apache.org/jira/browse/SPARK-39366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39366: Assignee: Herman van Hövell (was: Apache Spark)
> BlockInfoManager should not release write locks on task end
> ---
>
> Key: SPARK-39366
> URL: https://issues.apache.org/jira/browse/SPARK-39366
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: Herman van Hövell
> Assignee: Herman van Hövell
> Priority: Major
>
> The BlockInfoManager releases all locks held by a task when the task is done, including its write locks. The problem is that a thread other than the main task thread might still be modifying the block; once the write lock is released the block appears readable, and a reader might observe it in a partial or non-existent state.
> Fortunately this is not as big a problem as it appears, because the BlockManager (the only place where we write blocks) is well behaved and always puts the block in a consistent state. The errors caused by this are therefore only transient.
> Given that the write code is well behaved, we don't need to release the write locks on task end. We should remove that behavior.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39366) BlockInfoManager should not release write locks on task end
[ https://issues.apache.org/jira/browse/SPARK-39366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39366: Assignee: Apache Spark (was: Herman van Hövell)
> BlockInfoManager should not release write locks on task end
> ---
>
> Key: SPARK-39366
> URL: https://issues.apache.org/jira/browse/SPARK-39366
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: Herman van Hövell
> Assignee: Apache Spark
> Priority: Major
>
> The BlockInfoManager releases all locks held by a task when the task is done, including its write locks. The problem is that a thread other than the main task thread might still be modifying the block; once the write lock is released the block appears readable, and a reader might observe it in a partial or non-existent state.
> Fortunately this is not as big a problem as it appears, because the BlockManager (the only place where we write blocks) is well behaved and always puts the block in a consistent state. The errors caused by this are therefore only transient.
> Given that the write code is well behaved, we don't need to release the write locks on task end. We should remove that behavior.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39366) BlockInfoManager should not release write locks on task end
[ https://issues.apache.org/jira/browse/SPARK-39366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545728#comment-17545728 ] Apache Spark commented on SPARK-39366: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/36751
> BlockInfoManager should not release write locks on task end
> ---
>
> Key: SPARK-39366
> URL: https://issues.apache.org/jira/browse/SPARK-39366
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: Herman van Hövell
> Assignee: Herman van Hövell
> Priority: Major
>
> The BlockInfoManager releases all locks held by a task when the task is done, including its write locks. The problem is that a thread other than the main task thread might still be modifying the block; once the write lock is released the block appears readable, and a reader might observe it in a partial or non-existent state.
> Fortunately this is not as big a problem as it appears, because the BlockManager (the only place where we write blocks) is well behaved and always puts the block in a consistent state. The errors caused by this are therefore only transient.
> Given that the write code is well behaved, we don't need to release the write locks on task end. We should remove that behavior.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
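The proposed behavior in the issue above can be sketched with a toy lock registry (invented class shape, not Spark's actual BlockInfoManager API): task-end cleanup releases read locks but deliberately leaves write locks with their owner, since another thread of the task may still be writing the block.

```python
class BlockLockTable:
    """Toy lock registry; the real BlockInfoManager tracks far more state."""

    def __init__(self):
        self.read_locks = {}   # task_id -> set of block_ids
        self.write_locks = {}  # task_id -> set of block_ids

    def lock_for_reading(self, task_id, block_id):
        self.read_locks.setdefault(task_id, set()).add(block_id)

    def lock_for_writing(self, task_id, block_id):
        self.write_locks.setdefault(task_id, set()).add(block_id)

    def release_locks_on_task_end(self, task_id):
        # Release read locks only: a write lock may protect a block that a
        # non-main thread of the task is still materializing, and releasing
        # it would let readers observe a partial or non-existent block.
        return self.read_locks.pop(task_id, set())

table = BlockLockTable()
table.lock_for_writing(1, "rdd_0_0")
table.lock_for_reading(1, "broadcast_0")
released = table.release_locks_on_task_end(1)
print(sorted(released))              # ['broadcast_0'] - read lock released
print(sorted(table.write_locks[1]))  # ['rdd_0_0'] - write lock still held
```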
[jira] [Assigned] (SPARK-39369) Increase memory in AppVeyor build
[ https://issues.apache.org/jira/browse/SPARK-39369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39369: Assignee: (was: Apache Spark)
> Increase memory in AppVeyor build
> -
>
> Key: SPARK-39369
> URL: https://issues.apache.org/jira/browse/SPARK-39369
> Project: Spark
> Issue Type: Test
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704
> The AppVeyor build is failing because of a lack of memory. We should increase it to make the build pass.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39369) Increase memory in AppVeyor build
[ https://issues.apache.org/jira/browse/SPARK-39369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545727#comment-17545727 ] Apache Spark commented on SPARK-39369: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/36756
> Increase memory in AppVeyor build
> -
>
> Key: SPARK-39369
> URL: https://issues.apache.org/jira/browse/SPARK-39369
> Project: Spark
> Issue Type: Test
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704
> The AppVeyor build is failing because of a lack of memory. We should increase it to make the build pass.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39369) Increase memory in AppVeyor build
[ https://issues.apache.org/jira/browse/SPARK-39369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39369: Assignee: Apache Spark > Increase memory in AppVeyor build > - > > Key: SPARK-39369 > URL: https://issues.apache.org/jira/browse/SPARK-39369 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704 > The AppVeyor build is failing because of a lack of memory. We should > increase the memory to make the build pass. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29260) Enable supported Hive metastore versions once it supports altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545726#comment-17545726 ] Yuming Wang commented on SPARK-29260: - Please backport HIVE-8472 to your Hive metastore service if you see this error message: {noformat} Hive metastore does not support altering database location {noformat} > Enable supported Hive metastore versions once it supports altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Chao Sun >Priority: Major > Fix For: 3.4.0 > > > Hive 3.x is supported currently. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29260) Enable supported Hive metastore versions once it supports altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-29260. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36750 [https://github.com/apache/spark/pull/36750] > Enable supported Hive metastore versions once it supports altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Chao Sun >Priority: Major > Fix For: 3.4.0 > > > Hive 3.x is supported currently. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29260) Enable supported Hive metastore versions once it supports altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-29260: --- Assignee: Chao Sun > Enable supported Hive metastore versions once it supports altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Chao Sun >Priority: Major > > Hive 3.x is supported currently. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39351) ShowCreateTable should redact properties
[ https://issues.apache.org/jira/browse/SPARK-39351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39351. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36736 [https://github.com/apache/spark/pull/36736] > ShowCreateTable should redact properties > > > Key: SPARK-39351 > URL: https://issues.apache.org/jira/browse/SPARK-39351 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.4.0 > > > ShowCreateTable should redact properties -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39351) ShowCreateTable should redact properties
[ https://issues.apache.org/jira/browse/SPARK-39351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39351: Assignee: angerszhu > ShowCreateTable should redact properties > > > Key: SPARK-39351 > URL: https://issues.apache.org/jira/browse/SPARK-39351 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > ShowCreateTable should redact properties -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39369) Increase memory in AppVeyor build
Hyukjin Kwon created SPARK-39369: Summary: Increase memory in AppVeyor build Key: SPARK-39369 URL: https://issues.apache.org/jira/browse/SPARK-39369 Project: Spark Issue Type: Test Components: Project Infra Affects Versions: 3.4.0 Reporter: Hyukjin Kwon https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704 The AppVeyor build is failing because of a lack of memory. We should increase the memory to make the build pass. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39356) Add option to skip initial message in Pregel API
[ https://issues.apache.org/jira/browse/SPARK-39356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39356: - Target Version/s: (was: 3.3.0) > Add option to skip initial message in Pregel API > > > Key: SPARK-39356 > URL: https://issues.apache.org/jira/browse/SPARK-39356 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 3.2.1 >Reporter: Aaron Zolnai-Lucas >Priority: Minor > Labels: graphx, pregel > > The current (3.2.1) [Pregel > API|https://github.com/apache/spark/blob/5a3ba9b0b301a3b0c43f8d0d88e2b6bdce57d0e6/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala#L117] > takes a parameter {{initialMsg: A}}, where {{A}} (with a {{scala.reflect.ClassTag}} context bound) is > the message type for the Pregel iterations. At the start of the iterative > process, the user-supplied vertex update method {{vprog}} is called with the > initial message. > However, in some cases, the start point for a message-passing scheme is best > described by starting with the {{message}} phase rather than the {{vprog}} > phase, and in many cases the first message depends on individual vertex data > (instead of a static initial message). In these cases, users are forced to > add boilerplate to their {{vprog}} function to check if the message received > is the {{initialMessage}} and ignore the message (leave the node state > unchanged) if it is. This leads to less efficient (due to the extra iteration and > check) and less readable code. > > My proposed solution is to change {{initialMsg}} to a parameter of type > {{Option[A]}} with default {{None}}, and then inside the {{Pregel.apply}} > function, set: > {code:scala} > var g = initialMsg match { > case Some(msg) => graph.mapVertices((vid, vdata) => vprog(vid, vdata, msg)) > case _ => graph > } > {code} > This way, the user chooses whether to start the iteration from the > {{message}} or {{vprog}} phase. I believe this small change could improve > user code readability and efficiency. 
> Note: The signature of {{GraphOps.pregel}} would have to be changed to match > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
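The proposal above could be sketched against the current {{Pregel.apply}} signature roughly as follows (a hedged sketch, not shipped code: only the {{initialMsg}} handling differs from the existing API, and the exact parameter order and default placement are assumptions):

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

object PregelSketch {
  // Sketch of the proposed signature: initialMsg becomes Option[A] = None.
  def apply[VD: ClassTag, ED: ClassTag, A: ClassTag](
      graph: Graph[VD, ED],
      initialMsg: Option[A] = None,
      maxIterations: Int = Int.MaxValue,
      activeDirection: EdgeDirection = EdgeDirection.Either)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): Graph[VD, ED] = {
    // Run the initial vprog round only when an initial message was supplied;
    // with None, iteration starts directly from the sendMsg phase.
    val g = initialMsg match {
      case Some(msg) => graph.mapVertices((vid, vdata) => vprog(vid, vdata, msg))
      case None      => graph
    }
    // ... the rest of the Pregel loop would proceed unchanged from here ...
    g
  }
}
```

Because the new parameter defaults to `None`, existing callers would only need to wrap their initial message in `Some(...)`, which is the backward-compatibility cost the note above refers to.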
[jira] (SPARK-39336) Redact table/partition properties
[ https://issues.apache.org/jira/browse/SPARK-39336 ] Hyukjin Kwon deleted comment on SPARK-39336: -- was (Author: apachespark): User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/36751 > Redact table/partition properties > - > > Key: SPARK-39336 > URL: https://issues.apache.org/jira/browse/SPARK-39336 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39368) Move RewritePredicateSubquery into InjectRuntimeFilter
[ https://issues.apache.org/jira/browse/SPARK-39368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39368: Assignee: Apache Spark > Move RewritePredicateSubquery into InjectRuntimeFilter > -- > > Key: SPARK-39368 > URL: https://issues.apache.org/jira/browse/SPARK-39368 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39368) Move RewritePredicateSubquery into InjectRuntimeFilter
[ https://issues.apache.org/jira/browse/SPARK-39368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39368: Assignee: (was: Apache Spark) > Move RewritePredicateSubquery into InjectRuntimeFilter > -- > > Key: SPARK-39368 > URL: https://issues.apache.org/jira/browse/SPARK-39368 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39368) Move RewritePredicateSubquery into InjectRuntimeFilter
[ https://issues.apache.org/jira/browse/SPARK-39368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545721#comment-17545721 ] Apache Spark commented on SPARK-39368: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/36755 > Move RewritePredicateSubquery into InjectRuntimeFilter > -- > > Key: SPARK-39368 > URL: https://issues.apache.org/jira/browse/SPARK-39368 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39368) Move RewritePredicateSubquery into InjectRuntimeFilter
[ https://issues.apache.org/jira/browse/SPARK-39368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545720#comment-17545720 ] Apache Spark commented on SPARK-39368: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/36755 > Move RewritePredicateSubquery into InjectRuntimeFilter > -- > > Key: SPARK-39368 > URL: https://issues.apache.org/jira/browse/SPARK-39368 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39368) Move RewritePredicateSubquery into InjectRuntimeFilter
[ https://issues.apache.org/jira/browse/SPARK-39368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545719#comment-17545719 ] Yuming Wang commented on SPARK-39368: - https://github.com/apache/spark/pull/36755 > Move RewritePredicateSubquery into InjectRuntimeFilter > -- > > Key: SPARK-39368 > URL: https://issues.apache.org/jira/browse/SPARK-39368 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39368) Move RewritePredicateSubquery into InjectRuntimeFilter
Yuming Wang created SPARK-39368: --- Summary: Move RewritePredicateSubquery into InjectRuntimeFilter Key: SPARK-39368 URL: https://issues.apache.org/jira/browse/SPARK-39368 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37649: - Description: pandas API on Spark currently sets {{compute.default_index_type}} to {{sequence}}, which relies on sending all data to one executor, which easily causes OOM. We should switch to the {{distributed-sequence}} type that truly distributes the data. With this change, we can now leverage https://issues.apache.org/jira/browse/SPARK-36559 and https://issues.apache.org/jira/browse/SPARK-36338 by default, and end users will benefit from a significant performance improvement. was: pandas API on Spark currently sets {{compute.default_index_type}} to {{sequence}} which relies on sending all data to one executor that easily causes OOM. We should better switch to {{distributed-sequence}} type that truly distributes the data. > Switch default index to distributed-sequence by default in pandas API on Spark > -- > > Key: SPARK-37649 > URL: https://issues.apache.org/jira/browse/SPARK-37649 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: release-notes > Fix For: 3.3.0 > > > pandas API on Spark currently sets {{compute.default_index_type}} to > {{sequence}}, which relies on sending all data to one executor, which easily > causes OOM. > We should switch to the {{distributed-sequence}} type that truly > distributes the data. > With this change, we can now leverage > https://issues.apache.org/jira/browse/SPARK-36559 and > https://issues.apache.org/jira/browse/SPARK-36338 by default, and end users > will benefit from a significant performance improvement. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37649: - Priority: Critical (was: Major) > Switch default index to distributed-sequence by default in pandas API on Spark > -- > > Key: SPARK-37649 > URL: https://issues.apache.org/jira/browse/SPARK-37649 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Labels: release-notes > Fix For: 3.3.0 > > > pandas API on Spark currently sets {{compute.default_index_type}} to > {{sequence}}, which relies on sending all data to one executor, which easily > causes OOM. > We should switch to the {{distributed-sequence}} type that truly > distributes the data. > With this change, we can now leverage > https://issues.apache.org/jira/browse/SPARK-36559 and > https://issues.apache.org/jira/browse/SPARK-36338 by default, and end users > will benefit from a significant performance improvement. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39367) Review and fix issues in Scala/Java API docs of SQL module
[ https://issues.apache.org/jira/browse/SPARK-39367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39367. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 36754 [https://github.com/apache/spark/pull/36754] > Review and fix issues in Scala/Java API docs of SQL module > -- > > Key: SPARK-39367 > URL: https://issues.apache.org/jira/browse/SPARK-39367 > Project: Spark > Issue Type: Task > Components: Documentation, SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39367) Review and fix issues in Scala/Java API docs of SQL module
[ https://issues.apache.org/jira/browse/SPARK-39367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39367: Assignee: Gengliang Wang (was: Apache Spark) > Review and fix issues in Scala/Java API docs of SQL module > -- > > Key: SPARK-39367 > URL: https://issues.apache.org/jira/browse/SPARK-39367 > Project: Spark > Issue Type: Task > Components: Documentation, SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39367) Review and fix issues in Scala/Java API docs of SQL module
[ https://issues.apache.org/jira/browse/SPARK-39367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39367: Assignee: Apache Spark (was: Gengliang Wang) > Review and fix issues in Scala/Java API docs of SQL module > -- > > Key: SPARK-39367 > URL: https://issues.apache.org/jira/browse/SPARK-39367 > Project: Spark > Issue Type: Task > Components: Documentation, SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39367) Review and fix issues in Scala/Java API docs of SQL module
[ https://issues.apache.org/jira/browse/SPARK-39367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545626#comment-17545626 ] Apache Spark commented on SPARK-39367: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/36754 > Review and fix issues in Scala/Java API docs of SQL module > -- > > Key: SPARK-39367 > URL: https://issues.apache.org/jira/browse/SPARK-39367 > Project: Spark > Issue Type: Task > Components: Documentation, SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39367) Review and fix issues in Scala/Java API docs of SQL module
Gengliang Wang created SPARK-39367: -- Summary: Review and fix issues in Scala/Java API docs of SQL module Key: SPARK-39367 URL: https://issues.apache.org/jira/browse/SPARK-39367 Project: Spark Issue Type: Task Components: Documentation, SQL Affects Versions: 3.3.0 Reporter: Gengliang Wang Assignee: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39259) Timestamps returned by now() and equivalent functions are not consistent in subqueries
[ https://issues.apache.org/jira/browse/SPARK-39259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545609#comment-17545609 ] Apache Spark commented on SPARK-39259: -- User 'olaky' has created a pull request for this issue: https://github.com/apache/spark/pull/36752 > Timestamps returned by now() and equivalent functions are not consistent in > subqueries > -- > > Key: SPARK-39259 > URL: https://issues.apache.org/jira/browse/SPARK-39259 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 > Environment: Reproduced in the Spark Shell on the current 3.4.0 > snapshot >Reporter: Jan-Ole Sasse >Assignee: Jan-Ole Sasse >Priority: Major > Fix For: 3.4.0 > > > Timestamp evaluation is not consistent across subqueries. As an example, in > the Spark Shell > > {code:java} > sql("SELECT * FROM (SELECT 1) WHERE now() IN (SELECT now())").collect() {code} > returns an empty result. > > The root cause is that > [ComputeCurrentTime|https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L74] > does not recurse into subqueries -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
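The root cause can be modeled in a few lines (plain Python, not Catalyst code; the classes below are a toy stand-in for an expression tree): a rule that replaces now() with one fixed literal but does not descend into subquery expressions leaves the inner now() unevaluated, so the two timestamps can disagree.

```python
# Toy expression-tree model of the ComputeCurrentTime bug: the rule must
# substitute the same literal timestamp for every now(), including those
# nested inside subqueries.

class Now:            # stands in for an unevaluated current_timestamp()
    pass

class Subquery:       # wraps a nested plan with its own expressions
    def __init__(self, children):
        self.children = children

def compute_current_time(exprs, ts, recurse_into_subqueries):
    out = []
    for e in exprs:
        if isinstance(e, Now):
            out.append(ts)    # replace with one fixed literal
        elif isinstance(e, Subquery):
            if recurse_into_subqueries:
                out.append(Subquery(compute_current_time(e.children, ts, True)))
            else:
                out.append(e)  # bug: the Now() inside the subquery survives
        else:
            out.append(e)
    return out

plan = [Now(), Subquery([Now()])]   # models: WHERE now() IN (SELECT now())

buggy = compute_current_time(plan, ts=1000, recurse_into_subqueries=False)
fixed = compute_current_time(plan, ts=1000, recurse_into_subqueries=True)
```

In the buggy variant the outer now() becomes a literal while the inner one is still an unevaluated `Now()`, which mirrors why the query above can return an empty result.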
[jira] [Commented] (SPARK-39259) Timestamps returned by now() and equivalent functions are not consistent in subqueries
[ https://issues.apache.org/jira/browse/SPARK-39259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545608#comment-17545608 ] Apache Spark commented on SPARK-39259: -- User 'olaky' has created a pull request for this issue: https://github.com/apache/spark/pull/36753 > Timestamps returned by now() and equivalent functions are not consistent in > subqueries > -- > > Key: SPARK-39259 > URL: https://issues.apache.org/jira/browse/SPARK-39259 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 > Environment: Reproduced in the Spark Shell on the current 3.4.0 > snapshot >Reporter: Jan-Ole Sasse >Assignee: Jan-Ole Sasse >Priority: Major > Fix For: 3.4.0 > > > Timestamp evaluation is not consistent across subqueries. As an example, in > the Spark Shell > > {code:java} > sql("SELECT * FROM (SELECT 1) WHERE now() IN (SELECT now())").collect() {code} > returns an empty result. > > The root cause is that > [ComputeCurrentTime|https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L74] > does not recurse into subqueries -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39295) Improve documentation of pandas API support list.
[ https://issues.apache.org/jira/browse/SPARK-39295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39295: - Fix Version/s: 3.3.0 > Improve documentation of pandas API support list. > - > > Key: SPARK-39295 > URL: https://issues.apache.org/jira/browse/SPARK-39295 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyunwoo Park >Assignee: Hyunwoo Park >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > The description provided in the supported pandas API list document or the > code comment needs improvement. Also, there are cases where the link for a > function property provided in the document is broken, so it needs to > be corrected. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39259) Timestamps returned by now() and equivalent functions are not consistent in subqueries
[ https://issues.apache.org/jira/browse/SPARK-39259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39259. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36654 [https://github.com/apache/spark/pull/36654] > Timestamps returned by now() and equivalent functions are not consistent in > subqueries > -- > > Key: SPARK-39259 > URL: https://issues.apache.org/jira/browse/SPARK-39259 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 > Environment: Reproduced in the Spark Shell on the current 3.4.0 > snapshot >Reporter: Jan-Ole Sasse >Assignee: Jan-Ole Sasse >Priority: Major > Fix For: 3.4.0 > > > Timestamp evaluation is not consistent across subqueries. As an example, in > the Spark Shell > > {code:java} > sql("SELECT * FROM (SELECT 1) WHERE now() IN (SELECT now())").collect() {code} > returns an empty result. > > The root cause is that > [ComputeCurrentTime|https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L74] > does not recurse into subqueries -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39259) Timestamps returned by now() and equivalent functions are not consistent in subqueries
[ https://issues.apache.org/jira/browse/SPARK-39259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39259: Assignee: Jan-Ole Sasse > Timestamps returned by now() and equivalent functions are not consistent in > subqueries > -- > > Key: SPARK-39259 > URL: https://issues.apache.org/jira/browse/SPARK-39259 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 > Environment: Reproduced in the Spark Shell on the current 3.4.0 > snapshot >Reporter: Jan-Ole Sasse >Assignee: Jan-Ole Sasse >Priority: Major > > Timestamp evaluation is not consistent across subqueries. As an example, in > the Spark Shell > > {code:java} > sql("SELECT * FROM (SELECT 1) WHERE now() IN (SELECT now())").collect() {code} > returns an empty result. > > The root cause is that > [ComputeCurrentTime|https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L74] > does not recurse into subqueries -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39336) Redact table/partition properties
[ https://issues.apache.org/jira/browse/SPARK-39336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545592#comment-17545592 ] Apache Spark commented on SPARK-39336: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/36751 > Redact table/partition properties > - > > Key: SPARK-39336 > URL: https://issues.apache.org/jira/browse/SPARK-39336 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39336) Redact table/partition properties
[ https://issues.apache.org/jira/browse/SPARK-39336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39336: Assignee: Apache Spark > Redact table/partition properties > - > > Key: SPARK-39336 > URL: https://issues.apache.org/jira/browse/SPARK-39336 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39336) Redact table/partition properties
[ https://issues.apache.org/jira/browse/SPARK-39336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39336: Assignee: (was: Apache Spark) > Redact table/partition properties > - > > Key: SPARK-39336 > URL: https://issues.apache.org/jira/browse/SPARK-39336 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-39283: --- Fix Version/s: 3.3.0 (was: 3.3.1) > Spark tasks stuck forever due to deadlock between TaskMemoryManager and > UnsafeExternalSorter > > > Key: SPARK-39283 > URL: https://issues.apache.org/jira/browse/SPARK-39283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.1.2 >Reporter: Sandeep Pal >Assignee: Sandeep Pal >Priority: Critical > Labels: Deadlock, spark3.0 > Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2 > > Attachments: DeadlockSparkTasks.png > > > We are seeing this deadlock between {{TaskMemoryManager}} and > {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry is > successful, but sometimes we have to resort to hacky workarounds to break the > deadlock, such as shutting down the worker machines explicitly. > Below is the thread dump from the Spark UI showing the deadlock: > !DeadlockSparkTasks.png! > > I believe there was a related Jira about a similar deadlock between the same > threads, and it was resolved. > https://issues.apache.org/jira/browse/SPARK-27338 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
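As general background on this class of bug (illustrative only — this is not the actual SPARK-39283 fix, which lives in Spark's memory-management code): a cyclic wait between two lock holders cannot form if every thread acquires the locks in one global order.

```python
import threading

# Minimal sketch of the lock-ordering discipline that prevents this kind of
# deadlock. Two threads each need both locks; because both follow the same
# global acquisition order, neither can hold the second lock while waiting
# for the first one held by the other, so no cycle can form.

lock_a = threading.Lock()   # e.g. a memory-manager lock (illustrative name)
lock_b = threading.Lock()   # e.g. a sorter lock (illustrative name)
ORDERED = [lock_a, lock_b]  # the single global acquisition order

results = []

def worker(name):
    for lock in ORDERED:      # always acquire in the global order
        lock.acquire()
    try:
        results.append(name)  # critical section needing both locks
    finally:
        for lock in reversed(ORDERED):
            lock.release()

threads = [threading.Thread(target=worker, args=(n,)) for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If the two workers instead acquired the locks in opposite orders, each could end up holding one lock while blocking on the other, which is exactly the cycle visible in the attached thread dump.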
[jira] [Resolved] (SPARK-39361) Stop using Log4J2's extended throwable logging pattern in default logging configurations
[ https://issues.apache.org/jira/browse/SPARK-39361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-39361. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 36747 [https://github.com/apache/spark/pull/36747] > Stop using Log4J2's extended throwable logging pattern in default logging > configurations > > > Key: SPARK-39361 > URL: https://issues.apache.org/jira/browse/SPARK-39361 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 3.3.0 > > > This PR addresses a performance problem in Log4J 2 related to exception > logging: in certain scenarios I observed that Log4J2's exception stacktrace > logging can be ~10x slower than Log4J 1. > The problem stems from a new log pattern format in Log4J2 called ["extended > exception"|https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternExtendedException], > which enriches the regular stacktrace string with information on the name of > the JAR files that contained the classes in each stack frame. > Log4J queries the classloader to determine the source JAR for each class. > This isn't cheap, but this information is cached and reused in future > exception logging calls. In certain scenarios involving runtime-generated > classes, this lookup will fail and the failed lookup result will _not_ be > cached. As a result, expensive classloading operations will be performed > every time such an exception is logged. In addition to being very slow, these > operations take out a lock on the classloader and thus can cause severe lock > contention if multiple threads are logging errors. This issue is described in > more detail in a comment on a Log4J2 JIRA and in a linked blogpost: > https://issues.apache.org/jira/browse/LOG4J2-2391?focusedCommentId=16667140&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16667140 > . 
Spark frequently uses generated classes and lambdas and thus Spark > executor logs will almost always trigger this edge-case and suffer from poor > performance. > By default, if you do not specify an explicit exception format in your > logging pattern then Log4J2 will add this "extended exception" pattern (see > PatternLayout's {{alwaysWriteExceptions}} flag in Log4J's documentation, plus > [the code implementing that > flag|https://github.com/apache/logging-log4j2/blob/d6c8ab0863c551cdf0f8a5b1966ab45e3cddf572/log4j-core/src/main/java/org/apache/logging/log4j/core/pattern/PatternParser.java#L206-L209] > in Log4J2). > In this PR, I have updated Spark's default Log4J2 configurations so that each > pattern layout includes an explicit {{%ex}} so that it uses the normal > (non-extended) exception logging format. > Although it's true that any program logging exceptions at a high rate should > probably just fix the source of the exceptions, I think it's still a good > idea for us to try to fix this out-of-the-box performance difference so that > users' existing workloads do not regress when upgrading to 3.3.0. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
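The fix described above amounts to appending an explicit {{%ex}} conversion so that PatternLayout stops injecting the "extended exception" converter. A minimal illustrative sketch of what such a log4j2.properties layout might look like (property names and the date pattern are illustrative; check Spark's shipped default configuration for the exact lines):

```properties
# Appending an explicit %ex disables Log4J2's implicit "extended exception"
# conversion (the alwaysWriteExceptions behavior), so stacktraces are logged
# without the per-frame JAR lookup.
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex
```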
[jira] [Commented] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545512#comment-17545512 ] Apache Spark commented on SPARK-29260: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/36750 > Enable supported Hive metastore versions once it support altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > Hive 3.x is currently supported. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-29260: Assignee: Apache Spark > Enable supported Hive metastore versions once it support altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Hive 3.x is currently supported. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-29260: Assignee: (was: Apache Spark) > Enable supported Hive metastore versions once it support altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > Hive 3.x is currently supported. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545511#comment-17545511 ] Apache Spark commented on SPARK-29260: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/36750 > Enable supported Hive metastore versions once it support altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > Hive 3.x is currently supported. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39366) BlockInfoManager should not release write locks on task end
Herman van Hövell created SPARK-39366: - Summary: BlockInfoManager should not release write locks on task end Key: SPARK-39366 URL: https://issues.apache.org/jira/browse/SPARK-39366 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: Herman van Hövell Assignee: Herman van Hövell The BlockInfoManager releases all locks held by a task when the task is done. It also releases write locks. The problem with that is that a thread (other than the main task thread) might still be modifying the block. By releasing the write lock, the block now appears readable, and a reader might observe the block in a partial or non-existent state. Fortunately, this is not as severe a problem as it appears, because the BlockManager (the only place where we write blocks) is well behaved and always puts the block in a consistent state, so the errors caused by this are only transient. Given that the write code is well behaved, we don't need to release write locks on task end. We should remove that behavior. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
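The behavior change proposed above can be sketched abstractly. The following Python model is purely illustrative (it is not Spark's BlockInfoManager, and all names are made up): on task completion only read locks are released, while write locks stay with the writer until it explicitly releases them.

```python
from collections import defaultdict

class LockRegistry:
    """Toy model of per-task block locks (illustrative, not Spark's API)."""

    def __init__(self):
        self.read_locks = defaultdict(set)   # task_id -> {block_id}
        self.write_locks = defaultdict(set)  # task_id -> {block_id}

    def acquire_read(self, task_id, block_id):
        self.read_locks[task_id].add(block_id)

    def acquire_write(self, task_id, block_id):
        self.write_locks[task_id].add(block_id)

    def release_all_for_task(self, task_id):
        # On task end, release only the read locks; a helper thread may
        # still be writing the blocks behind the task's write locks, so
        # those are intentionally left in place.
        return self.read_locks.pop(task_id, set())

reg = LockRegistry()
reg.acquire_read(1, "rdd_0_0")
reg.acquire_write(1, "rdd_0_1")
print(sorted(reg.release_all_for_task(1)))  # read lock released
print(sorted(reg.write_locks[1]))           # write lock retained
```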
[jira] [Updated] (SPARK-39365) Truncate fragment of query context if it is too long
[ https://issues.apache.org/jira/browse/SPARK-39365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-39365: --- Summary: Truncate fragment of query context if it is too long (was: Truncate fragment if it is too long) > Truncate fragment of query context if it is too long > > > Key: SPARK-39365 > URL: https://issues.apache.org/jira/browse/SPARK-39365 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39365) Truncate fragment if it is too long
Gengliang Wang created SPARK-39365: -- Summary: Truncate fragment if it is too long Key: SPARK-39365 URL: https://issues.apache.org/jira/browse/SPARK-39365 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
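The intent of the ticket above can be illustrated with a tiny, hypothetical truncation helper (Python, purely illustrative; Spark's actual limit and formatting live in its SQL query-context code):

```python
def truncate_fragment(fragment: str, max_len: int = 100) -> str:
    # Keep short fragments intact; abbreviate over-long ones so the
    # query context attached to an error stays readable.
    if len(fragment) <= max_len:
        return fragment
    return fragment[:max_len] + "..."

print(truncate_fragment("SELECT 1"))  # short fragment: unchanged
print(len(truncate_fragment("x" * 500)))  # long fragment: capped
```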
[jira] [Created] (SPARK-39364) Logs are still "in_progress" when we read HDFS files
Marek Czuma created SPARK-39364: --- Summary: Logs are still "in_progress" when we read HDFS files Key: SPARK-39364 URL: https://issues.apache.org/jira/browse/SPARK-39364 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.2 Reporter: Marek Czuma When I read a file from HDFS using the Hadoop FileSystem API after initializing the SparkSession, the logs remain "in_progress" in the event log. In the Spark History Server, logs from this application are empty (no jobs, no stages, etc.). To be clear: the job finishes successfully, but the Spark History Server shows this strange situation. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39363) fix spark.kubernetes.memoryOverheadFactor deprecation warning
[ https://issues.apache.org/jira/browse/SPARK-39363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545479#comment-17545479 ] Thomas Graves commented on SPARK-39363: --- [~Kimahriman] > fix spark.kubernetes.memoryOverheadFactor deprecation warning > - > > Key: SPARK-39363 > URL: https://issues.apache.org/jira/browse/SPARK-39363 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Thomas Graves >Priority: Major > > see [https://github.com/apache/spark/pull/36744] for details. > > It looks like we deprecated {{spark.kubernetes.memoryOverheadFactor}}, but it > has a default value, which leads to it printing warnings all the time. > {code:java} > 22/06/01 23:53:49 WARN SparkConf: The configuration key > 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 > and may be removed in the future. Please use > spark.driver.memoryOverheadFactor and > spark.executor.memoryOverheadFactor{code} > We should remove the default value if possible. It should only be used as a > fallback, but we should be able to use the default from > {{spark.driver.memoryOverheadFactor}}. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39363) fix spark.kubernetes.memoryOverheadFactor deprecation warning
[ https://issues.apache.org/jira/browse/SPARK-39363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-39363: -- Description: see [https://github.com/apache/spark/pull/36744] for details. It looks like we deprecated {{spark.kubernetes.memoryOverheadFactor}}, but it has a default value, which leads to it printing warnings all the time. {code:java} 22/06/01 23:53:49 WARN SparkConf: The configuration key 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 and may be removed in the future. Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor{code} We should remove the default value if possible. It should only be used as a fallback, but we should be able to use the default from {{spark.driver.memoryOverheadFactor}}. was: see [https://github.com/apache/spark/pull/36744] for details. It looks like we deprecated {{spark.kubernetes.memoryOverheadFactor, but it has a default value which leads it to printing warnings all the time.}} {{}} {code:java} 22/06/01 23:53:49 WARN SparkConf: The configuration key 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 and may be removed in the future. Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor{code} {{}} {{We should remove the default value if possible. It should only be used as fallback but we should be able to use the default from }} spark.driver.memoryOverheadFactor. {{{}{}}}{{{}{}}} > fix spark.kubernetes.memoryOverheadFactor deprecation warning > - > > Key: SPARK-39363 > URL: https://issues.apache.org/jira/browse/SPARK-39363 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Thomas Graves >Priority: Major > > see [https://github.com/apache/spark/pull/36744] for details.
> > It looks like we deprecated {{spark.kubernetes.memoryOverheadFactor}}, but it > has a default value, which leads to it printing warnings all the time. > {code:java} > 22/06/01 23:53:49 WARN SparkConf: The configuration key > 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 > and may be removed in the future. Please use > spark.driver.memoryOverheadFactor and > spark.executor.memoryOverheadFactor{code} > We should remove the default value if possible. It should only be used as a > fallback, but we should be able to use the default from > {{spark.driver.memoryOverheadFactor}}. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39363) fix spark.kubernetes.memoryOverheadFactor deprecation warning
Thomas Graves created SPARK-39363: - Summary: fix spark.kubernetes.memoryOverheadFactor deprecation warning Key: SPARK-39363 URL: https://issues.apache.org/jira/browse/SPARK-39363 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.4.0 Reporter: Thomas Graves see [https://github.com/apache/spark/pull/36744] for details. It looks like we deprecated {{spark.kubernetes.memoryOverheadFactor}}, but it has a default value, which leads to it printing warnings all the time. {code:java} 22/06/01 23:53:49 WARN SparkConf: The configuration key 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 and may be removed in the future. Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor{code} We should remove the default value if possible. It should only be used as a fallback, but we should be able to use the default from {{spark.driver.memoryOverheadFactor}}. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
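As a hedged illustration of the workaround users can apply today (the factor values below are illustrative, not recommendations), setting the non-deprecated keys explicitly in spark-defaults.conf sidesteps the deprecated Kubernetes-specific key:

```properties
# Prefer the non-deprecated keys introduced in Spark 3.3:
spark.driver.memoryOverheadFactor=0.10
spark.executor.memoryOverheadFactor=0.10
# Leave spark.kubernetes.memoryOverheadFactor unset to avoid the warning;
# the ticket proposes dropping its default so it acts only as a fallback.
```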
[jira] [Assigned] (SPARK-38807) Error when starting spark shell on Windows system
[ https://issues.apache.org/jira/browse/SPARK-38807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-38807: Assignee: Ming Li > Error when starting spark shell on Windows system > - > > Key: SPARK-38807 > URL: https://issues.apache.org/jira/browse/SPARK-38807 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Ming Li >Assignee: Ming Li >Priority: Major > Fix For: 3.2.2, 3.3.1 > > > Using the release version of spark-3.2.1 and the default configuration, > starting spark shell on Windows system fails. (spark 3.1.2 doesn't show this > issue) > Here is the stack trace of the exception: > {code:java} > 22/04/06 21:47:45 ERROR SparkContext: Error initializing SparkContext. > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > ... > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.net.URISyntaxException: Illegal character in path at index > 30: spark://192.168.X.X:56964/F:\classes > at java.net.URI$Parser.fail(URI.java:2845) > at java.net.URI$Parser.checkChars(URI.java:3018) > at java.net.URI$Parser.parseHierarchical(URI.java:3102) > at java.net.URI$Parser.parse(URI.java:3050) > at java.net.URI.(URI.java:588) > at > 
org.apache.spark.repl.ExecutorClassLoader.(ExecutorClassLoader.scala:57) > ... 70 more > 22/04/06 21:47:45 ERROR Utils: Uncaught exception in thread main > java.lang.NullPointerException > ... {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38807) Error when starting spark shell on Windows system
[ https://issues.apache.org/jira/browse/SPARK-38807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38807. -- Fix Version/s: 3.3.1 3.2.2 Resolution: Fixed Issue resolved by pull request 36447 [https://github.com/apache/spark/pull/36447] > Error when starting spark shell on Windows system > - > > Key: SPARK-38807 > URL: https://issues.apache.org/jira/browse/SPARK-38807 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Ming Li >Priority: Major > Fix For: 3.3.1, 3.2.2 > > > Using the release version of spark-3.2.1 and the default configuration, > starting spark shell on Windows system fails. (spark 3.1.2 doesn't show this > issue) > Here is the stack trace of the exception: > {code:java} > 22/04/06 21:47:45 ERROR SparkContext: Error initializing SparkContext. > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > ... 
> at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.net.URISyntaxException: Illegal character in path at index > 30: spark://192.168.X.X:56964/F:\classes > at java.net.URI$Parser.fail(URI.java:2845) > at java.net.URI$Parser.checkChars(URI.java:3018) > at java.net.URI$Parser.parseHierarchical(URI.java:3102) > at java.net.URI$Parser.parse(URI.java:3050) > at java.net.URI.(URI.java:588) > at > org.apache.spark.repl.ExecutorClassLoader.(ExecutorClassLoader.scala:57) > ... 70 more > 22/04/06 21:47:45 ERROR Utils: Uncaught exception in thread main > java.lang.NullPointerException > ... {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
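The stack trace pinpoints the cause: a raw Windows path ({{F:\classes}}) is spliced into a {{spark://}} URI, and both the drive colon and the backslash are illegal as raw characters in a URI path (RFC 3986), which is exactly what java.net.URI rejects. A small illustrative Python snippet (standard library only, not Spark code) shows what a percent-encoded, URI-safe form of such a path segment would look like:

```python
from urllib.parse import quote

# ':' and '\' must be percent-encoded before being placed in a URI path;
# leaving them raw is what triggers "Illegal character in path" above.
raw = r"F:\classes"
encoded = quote(raw)
print(encoded)
```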
[jira] [Commented] (SPARK-33121) Spark Streaming 3.1.1 hangs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545437#comment-17545437 ] Stephen Kestle commented on SPARK-33121: I seem to have encountered this problem without streaming in version 3.2.1. My large session has many complex stages and writes unique data. I've enabled re-runs, and so do an anti-join on already written data in the (jdbc) database. At this stage, it seems to be applying the anti-join right at the end after all the complex computations and reduces millions of rows to 0. Because I have many executors, I coalesce down to 30 partitions for writing to jdbc. I'm getting hundreds of {{RejectedExecutionExceptions}} - it seems to me that the coalescing starts, and simultaneously it determines 0 rows and finishes the write and exits, resulting in non-graceful shutdown. Calling {{sc.stop()}} does nothing, but {{df.cache()}} before coalescing and writing does. Should this be reported as a separate ticket? I asked on [gitter|https://gitter.im/spark-scala/Lobby?at=6298a15306a77e1e18684826] too, and thought this actually did seem similar enough to comment. > Spark Streaming 3.1.1 hangs on shutdown > --- > > Key: SPARK-33121 > URL: https://issues.apache.org/jira/browse/SPARK-33121 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 3.1.1 >Reporter: Dmitry Tverdokhleb >Priority: Major > Labels: Streaming, hang, shutdown > > Hi. I am trying to migrate from spark 2.4.5 to 3.1.1 and there is a problem > in graceful shutdown. > Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true". 
> Here is the code: > {code:java} > inputStream.foreachRDD { > rdd => > rdd.foreachPartition { _ => > Thread.sleep(5000) > } > } > {code} > I send a SIGTERM signal to stop the Spark streaming job, and after the sleep an > exception arises: > {noformat} > streaming-agg-tds-data_1 | java.util.concurrent.RejectedExecutionException: > Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from > java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, > active threads = 0, queued tasks = 0, completed tasks = 1] > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) > streaming-agg-tds-data_1 | at > org.apache.spark.executor.Executor.launchTask(Executor.scala:270) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) > streaming-agg-tds-data_1 | at > scala.collection.Iterator.foreach(Iterator.scala:941) > streaming-agg-tds-data_1 | at > scala.collection.Iterator.foreach$(Iterator.scala:941) > streaming-agg-tds-data_1 | at > scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > streaming-agg-tds-data_1 | at > scala.collection.IterableLike.foreach(IterableLike.scala:74) > streaming-agg-tds-data_1 | at > scala.collection.IterableLike.foreach$(IterableLike.scala:73) > streaming-agg-tds-data_1 | at > scala.collection.AbstractIterable.foreach(Iterable.scala:56) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) > streaming-agg-tds-data_1 | at >
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > streaming-agg-tds-data_1 | at java.lang.Thread.run(Thread.java:748) > streaming-agg-tds-data_1 | 2021-04-22 13:33:41 WA
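The failure mode in the trace above, offering a task to an executor pool that has already terminated, can be mimicked with a plain thread pool. This Python analogue is illustrative only (Java's ThreadPoolExecutor throws RejectedExecutionException; Python's executor raises RuntimeError):

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)
pool.submit(lambda: None).result()  # a task completes normally
pool.shutdown(wait=True)            # pool is now terminated

try:
    # Submitting after shutdown mirrors Spark launching a task on the
    # terminated pool during a non-graceful stop.
    pool.submit(lambda: None)
except RuntimeError as exc:
    print(f"rejected: {exc}")
```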
[jira] [Resolved] (SPARK-39354) The analysis exception is incorrect
[ https://issues.apache.org/jira/browse/SPARK-39354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39354. -- Fix Version/s: 3.3.0 3.4.0 Resolution: Fixed Issue resolved by pull request 36746 [https://github.com/apache/spark/pull/36746] > The analysis exception is incorrect > --- > > Key: SPARK-39354 > URL: https://issues.apache.org/jira/browse/SPARK-39354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0, 3.4.0 > > > {noformat} > scala> spark.sql("create table t1(user_id int, auct_end_dt date) using > parquet;") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("select * from t1 join t2 on t1.user_id = t2.user_id where > t1.auct_end_dt >= Date_sub('2020-12-27', 90)").show > org.apache.spark.sql.AnalysisException: cannot resolve > 'date_sub('2020-12-27', 90)' due to data type mismatch: argument 1 requires > date type, however, ''2020-12-27'' is of string type.; line 1 pos 76 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.RemoveTempResolvedColumn$.$anonfun$apply$82(Analyzer.scala:4334) > at > org.apache.spark.sql.catalyst.analysis.RemoveTempResolvedColumn$.$anonfun$apply$82$adapted(Analyzer.scala:4327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:365) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:364) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:364) > {noformat} > The analysis exception should be: > {noformat} > org.apache.spark.sql.AnalysisException: Table or view not found: t2 > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39354) The analysis exception is incorrect
[ https://issues.apache.org/jira/browse/SPARK-39354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39354: Assignee: Yang Jie > The analysis exception is incorrect > --- > > Key: SPARK-39354 > URL: https://issues.apache.org/jira/browse/SPARK-39354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Yang Jie >Priority: Minor > > {noformat} > scala> spark.sql("create table t1(user_id int, auct_end_dt date) using > parquet;") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("select * from t1 join t2 on t1.user_id = t2.user_id where > t1.auct_end_dt >= Date_sub('2020-12-27', 90)").show > org.apache.spark.sql.AnalysisException: cannot resolve > 'date_sub('2020-12-27', 90)' due to data type mismatch: argument 1 requires > date type, however, ''2020-12-27'' is of string type.; line 1 pos 76 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.RemoveTempResolvedColumn$.$anonfun$apply$82(Analyzer.scala:4334) > at > org.apache.spark.sql.catalyst.analysis.RemoveTempResolvedColumn$.$anonfun$apply$82$adapted(Analyzer.scala:4327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:365) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:364) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:364) > {noformat} > The analysis exception should be: > {noformat} > org.apache.spark.sql.AnalysisException: Table or view not found: t2 > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39362) Datasource v2 scan not pruned in tpcds q10
Zhen Wang created SPARK-39362: - Summary: Datasource v2 scan not pruned in tpcds q10 Key: SPARK-39362 URL: https://issues.apache.org/jira/browse/SPARK-39362 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Environment: Spark 3.2 + Iceberg 3.2 Reporter: Zhen Wang {code:java} createAndInitTable("id INT, dep STRING"); createAndInitTable( wideTableName, "id INT, c1 STRING, c2 STRING, c3 STRING, c4 STRING, c5 STRING", null); String q1 = String.format("select dep from %s t1 where id > 1 and exists " + "(select * from %s t2 where t1.id = t2.id)", tableName, wideTableName); Assert.assertEquals("should prune scan", 1, scanCols); {code} It looks like {{select *}} in subqueries is not fully projected when V2ScanRelationPushDown is applied; the projection is applied later, but the scan is not updated after that. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39346) Convert asserts/illegal state exception to internal errors on each phase
[ https://issues.apache.org/jira/browse/SPARK-39346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39346: - Fix Version/s: 3.3.1 > Convert asserts/illegal state exception to internal errors on each phase > > > Key: SPARK-39346 > URL: https://issues.apache.org/jira/browse/SPARK-39346 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0, 3.3.1 > > > Wrap assert/illegal state exception by internal errors on each phase of query > execution. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
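The pattern the ticket above describes, catching asserts/illegal-state failures at each query-execution phase and re-raising them as internal errors, can be sketched generically. This Python sketch is hypothetical (Spark's actual change wraps Java/Scala exceptions into a SparkException with an internal-error class; the names below are made up):

```python
def run_phase(phase_name, phase_fn, *args):
    # Convert an assertion failure raised inside a query phase into a
    # generic internal error, so users see a stable error category
    # instead of a raw AssertionError.
    try:
        return phase_fn(*args)
    except AssertionError as e:
        raise RuntimeError(f"[INTERNAL_ERROR] during {phase_name}: {e}") from e

def broken_analysis(plan):
    assert False, "unresolved operator"

try:
    run_phase("analysis", broken_analysis, "plan")
except RuntimeError as e:
    print(e)
```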
[jira] [Commented] (SPARK-39295) Improve documentation of pandas API support list.
[ https://issues.apache.org/jira/browse/SPARK-39295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545361#comment-17545361 ] Apache Spark commented on SPARK-39295: -- User 'beobest2' has created a pull request for this issue: https://github.com/apache/spark/pull/36749 > Improve documentation of pandas API support list. > - > > Key: SPARK-39295 > URL: https://issues.apache.org/jira/browse/SPARK-39295 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyunwoo Park >Assignee: Hyunwoo Park >Priority: Major > Fix For: 3.4.0 > > > The descriptions provided in the supported pandas API list document and the > code comments need improvement. Also, in some cases the link for a function > property provided in the document is broken, so it needs to be corrected. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33320) ExecutorMetrics are not written to CSV and StatsD sinks
[ https://issues.apache.org/jira/browse/SPARK-33320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545339#comment-17545339 ] oskarryn edited comment on SPARK-33320 at 6/2/22 8:48 AM: -- I'm afraid the ExecutorMetrics namespace is only available from Spark 3 onwards. On Spark 2.4.8 I'm also missing ExecutorMetrics. Looking into the code, the parameter `spark.metrics.executorMetricsSource.enabled` was added in Spark 3. Also, the class ExecutorMetrics isn't even defined on Spark's 2.4 branch. was (Author: oskarryn): Executors should forward metrics to the driver and we should see ExecutorMetrics namespace in the driver component metrics. I'm afraid this feature is available from Spark 3 only. While on Spark 2.4.8 I'm also missing ExecutorMetrics. Looking into the code, the parameter `spark.metrics.executorMetricsSource.enabled` was added from Spark 3. Also, the class ExecutorMetrics isn't even defined on Spark's 2.4 branch. > ExecutorMetrics are not written to CSV and StatsD sinks > --- > > Key: SPARK-33320 > URL: https://issues.apache.org/jira/browse/SPARK-33320 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: I was using Spark 2.4.4 on EMR with YARN.
The relevant > part of the config is below: > {noformat} > spark.metrics.executorMetricsSource.enabled=true > spark.eventLog.logStageExecutorMetrics=true > spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink > spark.metrics.conf.*.sink.servlet.class=org.apache.spark.metrics.sink.MetricsServlet > spark.metrics.conf.*.sink.servlet.path=/home/hadoop/metrics/json > spark.metrics.conf.*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink > spark.metrics.conf.*.sink.statsd.host=localhost > spark.metrics.conf.*.sink.statsd.port=8125 > spark.metrics.conf.*.sink.statsd.period=10 > spark.metrics.conf.*.sink.statsd.unit=seconds > spark.metrics.conf.*.sink.statsd.prefix=spark > master.sink.servlet.path=/home/hadoop/metrics/master/json > applications.sink.servlet.path=/home/hadoop/metrics/applications/json > {noformat} >Reporter: Peter Podlovics >Priority: Major > > Metrics from the {{ExecutorMetrics}} namespace are not written to the CSV and > StatsD sinks, even though some of them is available through the REST API > (e.g.: {{memoryMetrics.usedOnHeapStorageMemory}}). > I couldn't find the {{ExecutorMetrics}} either on the driver or the workers. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
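The `*` instance prefix in the metrics configuration above acts as a default that every metrics instance (driver, master, executor, etc.) inherits, with instance-specific keys taking precedence. The following is a simplified sketch of that resolution rule, not Spark's actual MetricsConfig implementation:

```python
# Simplified sketch of how metrics properties with the "*" instance prefix
# are resolved: "*"-prefixed entries are defaults shared by all instances,
# and entries with a concrete instance name override them.

def resolve_instance_config(props: dict, instance: str) -> dict:
    """Return the effective sink/source properties for one metrics instance."""
    resolved = {}
    for key, value in props.items():
        prefix, _, rest = key.partition(".")
        if prefix == "*":
            resolved[rest] = value              # shared default
    for key, value in props.items():
        prefix, _, rest = key.partition(".")
        if prefix == instance:
            resolved[rest] = value              # instance-specific wins
    return resolved

props = {
    "*.sink.servlet.path": "/home/hadoop/metrics/json",
    "*.sink.statsd.port": "8125",
    "master.sink.servlet.path": "/home/hadoop/metrics/master/json",
}
master_conf = resolve_instance_config(props, "master")
# master inherits sink.statsd.port from "*" but overrides sink.servlet.path
```

This matches the config above, where `master.sink.servlet.path` overrides the wildcard servlet path for the master instance only.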
[jira] [Resolved] (SPARK-38675) Race condition in BlockInfoManager during unlock
[ https://issues.apache.org/jira/browse/SPARK-38675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38675. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35991 [https://github.com/apache/spark/pull/35991] > Race condition in BlockInfoManager during unlock > > > Key: SPARK-38675 > URL: https://issues.apache.org/jira/browse/SPARK-38675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Blocker > Fix For: 3.3.0 > > > There is a race condition between unlock and releaseAllLocksForTask in the > block manager. This can lead to negative reader counts (which trip an > assertion).
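The negative-reader-count scenario described above can be illustrated with a toy sequential reconstruction (hypothetical class and method names, not Spark's actual BlockInfoManager code): when a task's read lock is released both by `unlock` and by `releaseAllLocksForTask`, the reader count is decremented twice.

```python
# Toy reconstruction of the double-release scenario behind SPARK-38675
# (hypothetical names; Spark's real code is Scala and far more involved).

class BlockInfo:
    def __init__(self):
        self.reader_count = 0

class ToyBlockInfoManager:
    def __init__(self):
        self.blocks = {}
        self.locks_by_task = {}   # task_id -> block_ids holding read locks

    def lock_for_reading(self, task_id, block_id):
        info = self.blocks.setdefault(block_id, BlockInfo())
        info.reader_count += 1
        self.locks_by_task.setdefault(task_id, []).append(block_id)

    def unlock(self, task_id, block_id):
        # Bug sketch: decrements unconditionally, without checking whether
        # release_all_locks_for_task already released this task's lock.
        self.blocks[block_id].reader_count -= 1

    def release_all_locks_for_task(self, task_id):
        for block_id in self.locks_by_task.pop(task_id, []):
            self.blocks[block_id].reader_count -= 1

mgr = ToyBlockInfoManager()
mgr.lock_for_reading(task_id=1, block_id="rdd_0_0")
mgr.release_all_locks_for_task(task_id=1)  # end-of-task cleanup wins the race
mgr.unlock(task_id=1, block_id="rdd_0_0")  # late unlock decrements again
# reader_count is now -1: the condition the assertion in Spark trips on
```

In real Spark the two calls run on different threads; the sequential ordering above just makes the interleaving that loses the race deterministic.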
[jira] [Commented] (SPARK-33320) ExecutorMetrics are not written to CSV and StatsD sinks
[ https://issues.apache.org/jira/browse/SPARK-33320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545339#comment-17545339 ] oskarryn commented on SPARK-33320: -- I have the same issue on Spark 2.4.8. Supposedly executors forward metrics to the driver and we should see the ExecutorMetrics namespace in the driver component metrics, but sadly no executor metrics are available there. > ExecutorMetrics are not written to CSV and StatsD sinks > --- > > Key: SPARK-33320 > URL: https://issues.apache.org/jira/browse/SPARK-33320 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: I was using Spark 2.4.4 on EMR with YARN. The relevant > part of the config is below: > {noformat} > spark.metrics.executorMetricsSource.enabled=true > spark.eventLog.logStageExecutorMetrics=true > spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink > spark.metrics.conf.*.sink.servlet.class=org.apache.spark.metrics.sink.MetricsServlet > spark.metrics.conf.*.sink.servlet.path=/home/hadoop/metrics/json > spark.metrics.conf.*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink > spark.metrics.conf.*.sink.statsd.host=localhost > spark.metrics.conf.*.sink.statsd.port=8125 > spark.metrics.conf.*.sink.statsd.period=10 > spark.metrics.conf.*.sink.statsd.unit=seconds > spark.metrics.conf.*.sink.statsd.prefix=spark > master.sink.servlet.path=/home/hadoop/metrics/master/json > applications.sink.servlet.path=/home/hadoop/metrics/applications/json > {noformat} >Reporter: Peter Podlovics >Priority: Major > > Metrics from the {{ExecutorMetrics}} namespace are not written to the CSV and > StatsD sinks, even though some of them are available through the REST API > (e.g.: {{memoryMetrics.usedOnHeapStorageMemory}}). > I couldn't find the {{ExecutorMetrics}} either on the driver or the workers.
[jira] [Commented] (SPARK-38961) Enhance to automatically generate the pandas API support list
[ https://issues.apache.org/jira/browse/SPARK-38961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545296#comment-17545296 ] Apache Spark commented on SPARK-38961: -- User 'beobest2' has created a pull request for this issue: https://github.com/apache/spark/pull/36748 > Enhance to automatically generate the pandas API support list > - > > Key: SPARK-38961 > URL: https://issues.apache.org/jira/browse/SPARK-38961 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Hyunwoo Park >Priority: Major > Fix For: 3.4.0 > > > Currently, the supported pandas API list is manually maintained, so it would > be better to generate the list automatically to reduce the maintenance cost.
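The idea behind SPARK-38961 — replacing a hand-maintained support list with an auto-generated one — can be sketched by introspection: enumerate the public methods of the reference API and check which ones the implementation actually provides. This toy sketch uses stand-in classes; the actual tooling in the linked PR differs.

```python
# Toy sketch of auto-generating an API support list by introspection
# (illustrative only; stand-in classes instead of pandas / pyspark.pandas).
import inspect

def support_list(reference_cls, impl_cls):
    """Map each public method of reference_cls to whether impl_cls provides it."""
    supported = {}
    for name, member in inspect.getmembers(reference_cls):
        if name.startswith("_") or not callable(member):
            continue  # skip private/dunder attributes and non-callables
        supported[name] = callable(getattr(impl_cls, name, None))
    return supported

# Stand-ins for, e.g., pandas.DataFrame vs. pyspark.pandas.DataFrame:
class ReferenceFrame:
    def head(self): ...
    def merge(self): ...
    def to_feather(self): ...

class ImplFrame:
    def head(self): ...
    def merge(self): ...

coverage = support_list(ReferenceFrame, ImplFrame)
# coverage == {"head": True, "merge": True, "to_feather": False}
```

A table like this, regenerated on each release, keeps the documentation in sync with the implementation instead of relying on manual updates.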