[jira] [Updated] (SPARK-39357) pmCache memory leak caused by IsolatedClassLoader
[ https://issues.apache.org/jira/browse/SPARK-39357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tianshuang updated SPARK-39357: --- Attachment: JVM Heap Long Lived Pool.jpg JVM Heap.jpg
> pmCache memory leak caused by IsolatedClassLoader
> -
>
> Key: SPARK-39357
> URL: https://issues.apache.org/jira/browse/SPARK-39357
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4, 3.2.1
> Reporter: tianshuang
> Priority: Major
> Attachments: JVM Heap Long Lived Pool.jpg, JVM Heap.jpg, Xnip2022-06-01_23-09-35.jpg, Xnip2022-06-01_23-19-35.jpeg, Xnip2022-06-01_23-32-39.jpg
>
> I found this bug in Spark 2.4.4; because the related code has not changed, it still exists on master. A brief description of the bug follows.
> In May 2015, [SPARK-6907|https://github.com/apache/spark/commit/daa70bf135f23381f5f410aa95a1c0e5a2888568] introduced an isolated classloader for HiveMetastore to support loading multiple Hive versions, but it broke the [RawStore cleanup mechanism|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/ThreadFactoryWithGarbageCleanup.java#L27-L42]. The `ThreadWithGarbageCleanup` class used by the `HiveServer2-Handler-Pool`, `HiveServer2-Background-Pool`, and `HiveServer2-HttpHandler-Pool` thread pools is loaded by the AppClassLoader, so the line `RawStore threadLocalRawStore = HiveMetaStore.HMSHandler.getRawStore();` in its source reads the static `threadLocalMS` field of the `HiveMetaStore.HMSHandler` class loaded by the AppClassLoader. During thread execution, however, the metastore `client` is created by the isolated classloader, so when a `RawStore` instance is obtained through `HiveMetaStore.HMSHandler#getMSForConf`, the `ms` instance is stored in the static `threadLocalMS` field of the `HMSHandler` class loaded by `IsolatedClassLoader$$anon$1`. The set and the get therefore operate on different `threadLocalMS` instances, so `ThreadWithGarbageCleanup#cacheThreadLocalRawStore` obtains a null `RawStore`, the subsequent `RawStore` cleanup logic never takes effect, `RawStore#shutdown` is never called, and the `pmCache` of `JDOPersistenceManagerFactory` leaks memory. A long-running Spark ThriftServer ends up with frequent GCs and poor performance.
> I analyzed the heap dump using MAT and executed the following OQL: `SELECT * FROM INSTANCEOF java.lang.Class c WHERE c.@displayName.contains("class org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler ")`. Two instances of the `HMSHandler` *Class* can be found in the heap, and each holds its own static `threadLocalMS` instance.
> Executing `select * from org.datanucleus.api.jdo.JDOPersistenceManagerFactory` shows that the `pmCache` of the `JDOPersistenceManagerFactory` instance has over 200,000 elements and consumes 1.3 GB of memory.
> Executing `SELECT * FROM INSTANCEOF java.lang.Class c WHERE c.@displayName.contains("class org.apache.hive.service.server.ThreadFactoryWithGarbageCleanup")` shows that the static `threadRawStoreMap` of `ThreadFactoryWithGarbageCleanup` contains no elements, which confirms the analysis above: `HMSHandler.getRawStore()` in `ThreadWithGarbageCleanup#cacheThreadLocalRawStore` reads the `threadLocalMS` of the `HMSHandler` loaded by the AppClassLoader rather than the `threadLocalMS` of the `HMSHandler` loaded by `IsolatedClassLoader$$anon$1`.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
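The core mechanism in the report above is that a static field belongs to a class *as defined by a particular classloader*, so two loaders each get their own `threadLocalMS`. A rough, hypothetical Python analogy (module-level state standing in for the static field, and two `importlib` loads of the same source standing in for the AppClassLoader and the isolated classloader; all names here are invented for illustration):

```python
import importlib.util
import os
import tempfile
import textwrap

# A tiny "class" with static-like module-level state (stand-in for
# HMSHandler.threadLocalMS).
source = textwrap.dedent("""
    _thread_local_ms = None

    def set_ms(ms):
        global _thread_local_ms
        _thread_local_ms = ms

    def get_ms():
        return _thread_local_ms
""")
path = os.path.join(tempfile.mkdtemp(), "hms_handler.py")
with open(path, "w") as f:
    f.write(source)

def load_copy(alias):
    # Each call executes the same source into a fresh module object,
    # like two classloaders each defining their own HMSHandler class.
    spec = importlib.util.spec_from_file_location(alias, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

app_copy = load_copy("hms_handler_app")       # "AppClassLoader" copy
iso_copy = load_copy("hms_handler_isolated")  # "IsolatedClassLoader" copy

iso_copy.set_ms("RawStore instance")  # the setter runs in the isolated copy...
print(app_copy.get_ms())              # ...but the app copy still sees None
```

The cleanup thread is in the position of `app_copy` here: its getter returns `None` even though the isolated copy holds a live `RawStore`, so the `shutdown()` call that would empty `pmCache` never happens.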
[jira] [Updated] (SPARK-39357) pmCache memory leak caused by IsolatedClassLoader
[ https://issues.apache.org/jira/browse/SPARK-39357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] tianshuang updated SPARK-39357: --- Description: I found this bug in Spark 2.4.4; because the related code has not changed, it still exists on master. A brief description of the bug follows.
In May 2015, [SPARK-6907|https://github.com/apache/spark/commit/daa70bf135f23381f5f410aa95a1c0e5a2888568] introduced an isolated classloader for HiveMetastore to support loading multiple Hive versions, but it broke the [RawStore cleanup mechanism|https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/java/org/apache/hive/service/server/ThreadFactoryWithGarbageCleanup.java#L27-L42]. The `ThreadWithGarbageCleanup` class used by the `HiveServer2-Handler-Pool`, `HiveServer2-Background-Pool`, and `HiveServer2-HttpHandler-Pool` thread pools is loaded by the AppClassLoader, so the line `RawStore threadLocalRawStore = HiveMetaStore.HMSHandler.getRawStore();` in its source reads the static `threadLocalMS` field of the `HiveMetaStore.HMSHandler` class loaded by the AppClassLoader. During thread execution, however, the metastore `client` is created by the isolated classloader, so when a `RawStore` instance is obtained through `HiveMetaStore.HMSHandler#getMSForConf`, the `ms` instance is stored in the static `threadLocalMS` field of the `HMSHandler` class loaded by `IsolatedClassLoader$$anon$1`. The set and the get therefore operate on different `threadLocalMS` instances, so `ThreadWithGarbageCleanup#cacheThreadLocalRawStore` obtains a null `RawStore`, the subsequent `RawStore` cleanup logic never takes effect, `RawStore#shutdown` is never called, and the `pmCache` of `JDOPersistenceManagerFactory` leaks memory. A long-running Spark ThriftServer ends up with frequent GCs and poor performance.
I analyzed the heap dump using MAT and executed the following OQL: `SELECT * FROM INSTANCEOF java.lang.Class c WHERE c.@displayName.contains("class org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler ")`. Two instances of the `HMSHandler` *Class* can be found in the heap, and each holds its own static `threadLocalMS` instance.
Executing `select * from org.datanucleus.api.jdo.JDOPersistenceManagerFactory` shows that the `pmCache` of the `JDOPersistenceManagerFactory` instance has over 200,000 elements and consumes 1.3 GB of memory.
Executing `SELECT * FROM INSTANCEOF java.lang.Class c WHERE c.@displayName.contains("class org.apache.hive.service.server.ThreadFactoryWithGarbageCleanup")` shows that the static `threadRawStoreMap` of `ThreadFactoryWithGarbageCleanup` contains no elements, which confirms the analysis above: `HMSHandler.getRawStore()` in `ThreadWithGarbageCleanup#cacheThreadLocalRawStore` reads the `threadLocalMS` of the `HMSHandler` loaded by the AppClassLoader rather than the `threadLocalMS` of the `HMSHandler` loaded by `IsolatedClassLoader$$anon$1`.
[jira] [Updated] (SPARK-39369) Use JAVA_OPTS for AppVeyor build to increase the memory properly
[ https://issues.apache.org/jira/browse/SPARK-39369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39369: - Description: https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704 The AppVeyor build is failing because of a lack of memory. We should use JAVA_OPTS to configure the memory properly. was: https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704 The AppVeyor build is failing because of a lack of memory. We should increase it to make the build pass.
> Use JAVA_OPTS for AppVeyor build to increase the memory properly
>
> Key: SPARK-39369
> URL: https://issues.apache.org/jira/browse/SPARK-39369
> Project: Spark
> Issue Type: Test
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704
> The AppVeyor build is failing because of a lack of memory. We should use JAVA_OPTS to configure the memory properly.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
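A minimal sketch of the JAVA_OPTS approach, with hypothetical heap values (nothing here is taken from the actual AppVeyor configuration; the real flags and sizes would live in appveyor.yml):

```shell
# Hypothetical: export JAVA_OPTS so the JVMs spawned during the build
# pick up a larger heap instead of the default.
export JAVA_OPTS="-Xmx4g -XX:ReservedCodeCacheSize=512m"
```

The advantage over a build-tool-specific flag is that any child JVM honoring JAVA_OPTS inherits the setting.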
[jira] [Updated] (SPARK-39369) Use JAVA_OPTS for AppVeyor build to increase the memory properly
[ https://issues.apache.org/jira/browse/SPARK-39369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39369: - Summary: Use JAVA_OPTS for AppVeyor build to increase the memory properly (was: Increase memory in AppVeyor build)
> Use JAVA_OPTS for AppVeyor build to increase the memory properly
>
> Key: SPARK-39369
> URL: https://issues.apache.org/jira/browse/SPARK-39369
> Project: Spark
> Issue Type: Test
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704
> The AppVeyor build is failing because of a lack of memory. We should increase it to make the build pass.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37623) Support ANSI Aggregate Function: regr_intercept
[ https://issues.apache.org/jira/browse/SPARK-37623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-37623. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36708 [https://github.com/apache/spark/pull/36708]
> Support ANSI Aggregate Function: regr_intercept
> ---
>
> Key: SPARK-37623
> URL: https://issues.apache.org/jira/browse/SPARK-37623
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: jiaan.geng
> Assignee: jiaan.geng
> Priority: Major
> Fix For: 3.4.0
>
> REGR_INTERCEPT is an ANSI aggregate function; many databases support it.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
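As a sketch of the semantics (not Spark's implementation): ANSI `REGR_INTERCEPT(y, x)` is `avg(y) - slope * avg(x)` with `slope = covar_pop(y, x) / var_pop(x)`, computed over the pairs where both sides are non-null:

```python
# Hedged sketch of ANSI REGR_INTERCEPT semantics over (y, x) pairs:
# intercept = avg(y) - slope * avg(x), slope = covar_pop(y, x) / var_pop(x).
def regr_intercept(pairs):
    pairs = [(y, x) for (y, x) in pairs if y is not None and x is not None]
    n = len(pairs)
    if n == 0:
        return None
    avg_y = sum(y for y, _ in pairs) / n
    avg_x = sum(x for _, x in pairs) / n
    covar_pop = sum((y - avg_y) * (x - avg_x) for y, x in pairs) / n
    var_pop_x = sum((x - avg_x) ** 2 for _, x in pairs) / n
    if var_pop_x == 0:
        return None  # regression line undefined when x is constant
    return avg_y - (covar_pop / var_pop_x) * avg_x

# Points on the line y = 2x pass through the origin, so the intercept is 0:
print(regr_intercept([(2.0, 1.0), (4.0, 2.0), (6.0, 3.0)]))  # 0.0
```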
[jira] [Assigned] (SPARK-37623) Support ANSI Aggregate Function: regr_intercept
[ https://issues.apache.org/jira/browse/SPARK-37623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-37623: Assignee: jiaan.geng
> Support ANSI Aggregate Function: regr_intercept
> ---
>
> Key: SPARK-37623
> URL: https://issues.apache.org/jira/browse/SPARK-37623
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.3.0
> Reporter: jiaan.geng
> Assignee: jiaan.geng
> Priority: Major
>
> REGR_INTERCEPT is an ANSI aggregate function; many databases support it.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39259) Timestamps returned by now() and equivalent functions are not consistent in subqueries
[ https://issues.apache.org/jira/browse/SPARK-39259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39259: - Fix Version/s: 3.3.0
> Timestamps returned by now() and equivalent functions are not consistent in subqueries
> --
>
> Key: SPARK-39259
> URL: https://issues.apache.org/jira/browse/SPARK-39259
> Project: Spark
> Issue Type: Bug
> Components: Optimizer
> Affects Versions: 3.2.1
> Environment: Reproduced in the Spark Shell on the current 3.4.0 snapshot
> Reporter: Jan-Ole Sasse
> Assignee: Jan-Ole Sasse
> Priority: Major
> Fix For: 3.3.0, 3.4.0
>
> Timestamp evaluation is not consistent across subqueries. As an example, in the Spark shell:
> {code:java}
> sql("SELECT * FROM (SELECT 1) WHERE now() IN (SELECT now())").collect() {code}
> returns an empty result.
> The root cause is that [ComputeCurrentTime|https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L74] does not recurse into subqueries.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
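The fix amounts to replacing every current-time expression in the plan, including those inside subqueries, with a single literal captured once per query. A hypothetical sketch of that idea (nested lists standing in for plan nodes and subqueries; this is not Catalyst's API):

```python
import datetime

def compute_current_time(node, now=None):
    # Capture the timestamp once at the root, then reuse it everywhere,
    # recursing into child nodes and subqueries alike.
    if now is None:
        now = datetime.datetime.now()
    if node == "now()":
        return now
    if isinstance(node, list):
        return [compute_current_time(child, now) for child in node]
    return node

plan = ["filter", "now()", ["subquery", "now()"]]
rewritten = compute_current_time(plan)
# Both occurrences, outer query and subquery, got the same literal,
# so a predicate like `now() IN (SELECT now())` can evaluate to true:
assert rewritten[1] == rewritten[2][1]
```

The bug is what happens without the recursion step: the outer `now()` and the subquery's `now()` get captured at different moments, so the membership test fails.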
[jira] [Commented] (SPARK-39371) Review and fix issues in Scala/Java API docs of Core module
[ https://issues.apache.org/jira/browse/SPARK-39371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545748#comment-17545748 ] Apache Spark commented on SPARK-39371: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/36757 > Review and fix issues in Scala/Java API docs of Core module > --- > > Key: SPARK-39371 > URL: https://issues.apache.org/jira/browse/SPARK-39371 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yuanjian Li >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39371) Review and fix issues in Scala/Java API docs of Core module
[ https://issues.apache.org/jira/browse/SPARK-39371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39371: Assignee: (was: Apache Spark) > Review and fix issues in Scala/Java API docs of Core module > --- > > Key: SPARK-39371 > URL: https://issues.apache.org/jira/browse/SPARK-39371 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yuanjian Li >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39371) Review and fix issues in Scala/Java API docs of Core module
[ https://issues.apache.org/jira/browse/SPARK-39371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545747#comment-17545747 ] Apache Spark commented on SPARK-39371: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/36757 > Review and fix issues in Scala/Java API docs of Core module > --- > > Key: SPARK-39371 > URL: https://issues.apache.org/jira/browse/SPARK-39371 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yuanjian Li >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39371) Review and fix issues in Scala/Java API docs of Core module
[ https://issues.apache.org/jira/browse/SPARK-39371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39371: Assignee: Apache Spark > Review and fix issues in Scala/Java API docs of Core module > --- > > Key: SPARK-39371 > URL: https://issues.apache.org/jira/browse/SPARK-39371 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Yuanjian Li >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39371) Review and fix issues in Scala/Java API docs of Core module
Yuanjian Li created SPARK-39371: --- Summary: Review and fix issues in Scala/Java API docs of Core module Key: SPARK-39371 URL: https://issues.apache.org/jira/browse/SPARK-39371 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 3.3.0 Reporter: Yuanjian Li -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39320) Add the MEDIAN() function
[ https://issues.apache.org/jira/browse/SPARK-39320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39320. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36714 [https://github.com/apache/spark/pull/36714] > Add the MEDIAN() function > - > > Key: SPARK-39320 > URL: https://issues.apache.org/jira/browse/SPARK-39320 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: jiaan.geng >Priority: Major > Fix For: 3.4.0 > > > Add the MEDIAN() function which can be implemented as *a specific case of > PERCENTILE_CONT where the percentile value defaults to 0.5.* -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
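The PERCENTILE_CONT relationship mentioned above can be sketched directly (a hedged illustration of the semantics, not Spark's implementation): the ANSI continuous percentile linearly interpolates between the two sorted values straddling the target position, and MEDIAN is the 0.5 case.

```python
import math

# Continuous percentile: interpolate between the neighbors of the
# fractional row position fraction * (n - 1) in the sorted input.
def percentile_cont(values, fraction):
    xs = sorted(values)
    if not xs:
        return None
    pos = fraction * (len(xs) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(xs) - 1)
    weight = pos - lo
    return xs[lo] * (1 - weight) + xs[hi] * weight

def median(values):
    # MEDIAN as a specific case of PERCENTILE_CONT with fraction 0.5.
    return percentile_cont(values, 0.5)

print(median([1, 2, 3, 4]))  # 2.5
print(median([7, 1, 3]))     # 3.0
```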
[jira] [Assigned] (SPARK-39320) Add the MEDIAN() function
[ https://issues.apache.org/jira/browse/SPARK-39320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39320: Assignee: jiaan.geng > Add the MEDIAN() function > - > > Key: SPARK-39320 > URL: https://issues.apache.org/jira/browse/SPARK-39320 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: jiaan.geng >Priority: Major > > Add the MEDIAN() function which can be implemented as *a specific case of > PERCENTILE_CONT where the percentile value defaults to 0.5.* -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39370) Inline type hints in PySpark
[ https://issues.apache.org/jira/browse/SPARK-39370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39370. -- Fix Version/s: 3.3.0 Resolution: Done > Inline type hints in PySpark > > > Key: SPARK-39370 > URL: https://issues.apache.org/jira/browse/SPARK-39370 > Project: Spark > Issue Type: Epic > Components: PySpark >Affects Versions: 3.2.1, 3.3.0 >Reporter: Hyukjin Kwon >Priority: Critical > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37396) Inline type hint files for files in python/pyspark/mllib
[ https://issues.apache.org/jira/browse/SPARK-37396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37396: - Epic Link: SPARK-39370 > Inline type hint files for files in python/pyspark/mllib > > > Key: SPARK-37396 > URL: https://issues.apache.org/jira/browse/SPARK-37396 > Project: Spark > Issue Type: Umbrella > Components: MLlib, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Currently there are type hint stub files ({{*.pyi}}) to show the expected > types for functions, but we can also take advantage of static type checking > within the functions by inlining the type hints. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37093) Inline type hints python/pyspark/streaming
[ https://issues.apache.org/jira/browse/SPARK-37093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37093: - Epic Link: SPARK-39370 > Inline type hints python/pyspark/streaming > -- > > Key: SPARK-37093 > URL: https://issues.apache.org/jira/browse/SPARK-37093 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37395) Inline type hint files for files in python/pyspark/ml
[ https://issues.apache.org/jira/browse/SPARK-37395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37395: - Epic Link: SPARK-39370 > Inline type hint files for files in python/pyspark/ml > - > > Key: SPARK-37395 > URL: https://issues.apache.org/jira/browse/SPARK-37395 > Project: Spark > Issue Type: Umbrella > Components: ML, PySpark >Affects Versions: 3.3.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.3.0 > > > Currently there are type hint stub files ({{*.pyi}}) to show the expected > types for functions, but we can also take advantage of static type checking > within the functions by inlining the type hints. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36845) Inline type hint files for files in python/pyspark/sql
[ https://issues.apache.org/jira/browse/SPARK-36845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-36845: - Epic Link: SPARK-39370 > Inline type hint files for files in python/pyspark/sql > -- > > Key: SPARK-36845 > URL: https://issues.apache.org/jira/browse/SPARK-36845 > Project: Spark > Issue Type: Umbrella > Components: PySpark, SQL >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Xinrong Meng >Priority: Major > > Currently there are type hint stub files ({{*.pyi}}) to show the expected > types for functions, but we can also take advantage of static type checking > within the functions by inlining the type hints. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37094) Inline type hints for files in python/pyspark
[ https://issues.apache.org/jira/browse/SPARK-37094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37094: - Epic Link: SPARK-39370 > Inline type hints for files in python/pyspark > - > > Key: SPARK-37094 > URL: https://issues.apache.org/jira/browse/SPARK-37094 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: dch nguyen >Assignee: dch nguyen >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39370) Inline type hints in PySpark
Hyukjin Kwon created SPARK-39370: Summary: Inline type hints in PySpark Key: SPARK-39370 URL: https://issues.apache.org/jira/browse/SPARK-39370 Project: Spark Issue Type: Epic Components: PySpark Affects Versions: 3.2.1, 3.3.0 Reporter: Hyukjin Kwon -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39366) BlockInfoManager should not release write locks on task end
[ https://issues.apache.org/jira/browse/SPARK-39366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39366: Assignee: Herman van Hövell (was: Apache Spark)
> BlockInfoManager should not release write locks on task end
> ---
>
> Key: SPARK-39366
> URL: https://issues.apache.org/jira/browse/SPARK-39366
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: Herman van Hövell
> Assignee: Herman van Hövell
> Priority: Major
>
> The BlockInfoManager releases all locks held by a task when the task is done, including its write locks. The problem is that a thread other than the main task thread might still be modifying the block; once the write lock is released the block appears readable, and a reader might observe it in a partial or non-existent state.
> Fortunately this is not as big a problem as it appears, because the BlockManager (the only place where we write blocks) is well behaved and always puts the block in a consistent state. The errors caused by this are therefore only transient.
> Given that the write code is well behaved, we don't need to release the write locks on task end. We should remove that behavior.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39366) BlockInfoManager should not release write locks on task end
[ https://issues.apache.org/jira/browse/SPARK-39366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39366: Assignee: Apache Spark (was: Herman van Hövell)
> BlockInfoManager should not release write locks on task end
> ---
>
> Key: SPARK-39366
> URL: https://issues.apache.org/jira/browse/SPARK-39366
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: Herman van Hövell
> Assignee: Apache Spark
> Priority: Major
>
> The BlockInfoManager releases all locks held by a task when the task is done, including its write locks. The problem is that a thread other than the main task thread might still be modifying the block; once the write lock is released the block appears readable, and a reader might observe it in a partial or non-existent state.
> Fortunately this is not as big a problem as it appears, because the BlockManager (the only place where we write blocks) is well behaved and always puts the block in a consistent state. The errors caused by this are therefore only transient.
> Given that the write code is well behaved, we don't need to release the write locks on task end. We should remove that behavior.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39366) BlockInfoManager should not release write locks on task end
[ https://issues.apache.org/jira/browse/SPARK-39366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545728#comment-17545728 ] Apache Spark commented on SPARK-39366: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/36751
> BlockInfoManager should not release write locks on task end
> ---
>
> Key: SPARK-39366
> URL: https://issues.apache.org/jira/browse/SPARK-39366
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.1
> Reporter: Herman van Hövell
> Assignee: Herman van Hövell
> Priority: Major
>
> The BlockInfoManager releases all locks held by a task when the task is done, including its write locks. The problem is that a thread other than the main task thread might still be modifying the block; once the write lock is released the block appears readable, and a reader might observe it in a partial or non-existent state.
> Fortunately this is not as big a problem as it appears, because the BlockManager (the only place where we write blocks) is well behaved and always puts the block in a consistent state. The errors caused by this are therefore only transient.
> Given that the write code is well behaved, we don't need to release the write locks on task end. We should remove that behavior.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
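The proposed behavior in the issue above can be sketched with a toy lock registry (invented class shape, not Spark's actual BlockInfoManager API): task-end cleanup releases read locks but deliberately leaves write locks with their owner, since another thread of the task may still be writing the block.

```python
class BlockLockTable:
    """Toy lock registry; the real BlockInfoManager tracks far more state."""

    def __init__(self):
        self.read_locks = {}   # task_id -> set of block_ids
        self.write_locks = {}  # task_id -> set of block_ids

    def lock_for_reading(self, task_id, block_id):
        self.read_locks.setdefault(task_id, set()).add(block_id)

    def lock_for_writing(self, task_id, block_id):
        self.write_locks.setdefault(task_id, set()).add(block_id)

    def release_locks_on_task_end(self, task_id):
        # Release read locks only: a write lock may protect a block that a
        # non-main thread of the task is still materializing, and releasing
        # it would let readers observe a partial or non-existent block.
        return self.read_locks.pop(task_id, set())

table = BlockLockTable()
table.lock_for_writing(1, "rdd_0_0")
table.lock_for_reading(1, "broadcast_0")
released = table.release_locks_on_task_end(1)
print(sorted(released))              # ['broadcast_0'] - read lock released
print(sorted(table.write_locks[1]))  # ['rdd_0_0'] - write lock still held
```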
[jira] [Assigned] (SPARK-39369) Increase memory in AppVeyor build
[ https://issues.apache.org/jira/browse/SPARK-39369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39369: Assignee: (was: Apache Spark)
> Increase memory in AppVeyor build
> -
>
> Key: SPARK-39369
> URL: https://issues.apache.org/jira/browse/SPARK-39369
> Project: Spark
> Issue Type: Test
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704
> The AppVeyor build is failing because of a lack of memory. We should increase it to make the build pass.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39369) Increase memory in AppVeyor build
[ https://issues.apache.org/jira/browse/SPARK-39369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545727#comment-17545727 ] Apache Spark commented on SPARK-39369: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/36756
> Increase memory in AppVeyor build
> -
>
> Key: SPARK-39369
> URL: https://issues.apache.org/jira/browse/SPARK-39369
> Project: Spark
> Issue Type: Test
> Components: Project Infra
> Affects Versions: 3.4.0
> Reporter: Hyukjin Kwon
> Priority: Major
>
> https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704
> The AppVeyor build is failing because of a lack of memory. We should increase it to make the build pass.
-- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39369) Increase memory in AppVeyor build
[ https://issues.apache.org/jira/browse/SPARK-39369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39369: Assignee: Apache Spark > Increase memory in AppVeyor build > - > > Key: SPARK-39369 > URL: https://issues.apache.org/jira/browse/SPARK-39369 > Project: Spark > Issue Type: Test > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704 > The AppVeyor build is failing because of a lack of memory. We should > increase the memory to make the build pass. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29260) Enable supported Hive metastore versions once it supports altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545726#comment-17545726 ] Yuming Wang commented on SPARK-29260: - Please backport HIVE-8472 to your Hive metastore service if you see this error message: {noformat} Hive metastore does not support altering database location {noformat} > Enable supported Hive metastore versions once it supports altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Chao Sun >Priority: Major > Fix For: 3.4.0 > > > Hive 3.x is supported currently. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29260) Enable supported Hive metastore versions once it supports altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-29260. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36750 [https://github.com/apache/spark/pull/36750] > Enable supported Hive metastore versions once it supports altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Chao Sun >Priority: Major > Fix For: 3.4.0 > > > Hive 3.x is supported currently. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29260) Enable supported Hive metastore versions once it supports altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-29260: --- Assignee: Chao Sun > Enable supported Hive metastore versions once it supports altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Chao Sun >Priority: Major > > Hive 3.x is supported currently. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39351) ShowCreateTable should redact properties
[ https://issues.apache.org/jira/browse/SPARK-39351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39351. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36736 [https://github.com/apache/spark/pull/36736] > ShowCreateTable should redact properties > > > Key: SPARK-39351 > URL: https://issues.apache.org/jira/browse/SPARK-39351 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.4.0 > > > ShowCreateTable should redact properties -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39351) ShowCreateTable should redact properties
[ https://issues.apache.org/jira/browse/SPARK-39351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39351: Assignee: angerszhu > ShowCreateTable should redact properties > > > Key: SPARK-39351 > URL: https://issues.apache.org/jira/browse/SPARK-39351 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > ShowCreateTable should redact properties -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39369) Increase memory in AppVeyor build
Hyukjin Kwon created SPARK-39369: Summary: Increase memory in AppVeyor build Key: SPARK-39369 URL: https://issues.apache.org/jira/browse/SPARK-39369 Project: Spark Issue Type: Test Components: Project Infra Affects Versions: 3.4.0 Reporter: Hyukjin Kwon https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/43740704 The AppVeyor build is failing because of a lack of memory. We should increase the memory to make the build pass. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39356) Add option to skip initial message in Pregel API
[ https://issues.apache.org/jira/browse/SPARK-39356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-39356: - Target Version/s: (was: 3.3.0) > Add option to skip initial message in Pregel API > > > Key: SPARK-39356 > URL: https://issues.apache.org/jira/browse/SPARK-39356 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 3.2.1 >Reporter: Aaron Zolnai-Lucas >Priority: Minor > Labels: graphx, pregel > > The current (3.2.1) [Pregel > API|https://github.com/apache/spark/blob/5a3ba9b0b301a3b0c43f8d0d88e2b6bdce57d0e6/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala#L117] > takes a parameter {{initialMsg: A}}, where {{A}} (with a {{scala.reflect.ClassTag}} context bound) is > the message type for the Pregel iterations. At the start of the iterative > process, the user-supplied vertex update method {{vprog}} is called with the > initial message. > However, in some cases, the start point for a message-passing scheme is best > described by starting with the {{message}} phase rather than the {{vprog}} > phase, and in many cases the first message depends on individual vertex data > (instead of a static initial message). In these cases, users are forced to > add boilerplate to their {{vprog}} function to check if the message received > is the {{initialMessage}} and ignore the message (leave the node state > unchanged) if it is. This leads to less efficient (due to the extra iteration and > check) and less readable code. > > My proposed solution is to change {{initialMsg}} to a parameter of type > {{Option[A]}} with default {{None}}, and then inside the {{Pregel.apply}} > function, set: > {code:scala} > var g = initialMsg match { > case Some(msg) => graph.mapVertices((vid, vdata) => vprog(vid, vdata, msg)) > case _ => graph > } > {code} > This way, the user chooses whether to start the iteration from the > {{message}} or {{vprog}} phase. I believe this small change could improve > user code readability and efficiency. 
> Note: The signature of {{GraphOps.pregel}} would have to be changed to match > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
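The proposal above could be sketched against the current {{Pregel.apply}} signature roughly as follows (a hedged sketch, not shipped code: only the {{initialMsg}} handling differs from the existing API, and the exact parameter order and default placement are assumptions):

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

object PregelSketch {
  // Sketch of the proposed signature: initialMsg becomes Option[A] = None.
  def apply[VD: ClassTag, ED: ClassTag, A: ClassTag](
      graph: Graph[VD, ED],
      initialMsg: Option[A] = None,
      maxIterations: Int = Int.MaxValue,
      activeDirection: EdgeDirection = EdgeDirection.Either)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A): Graph[VD, ED] = {
    // Run the initial vprog round only when an initial message was supplied;
    // with None, iteration starts directly from the sendMsg phase.
    val g = initialMsg match {
      case Some(msg) => graph.mapVertices((vid, vdata) => vprog(vid, vdata, msg))
      case None      => graph
    }
    // ... the rest of the Pregel loop would proceed unchanged from here ...
    g
  }
}
```

Because the new parameter defaults to `None`, existing callers would only need to wrap their initial message in `Some(...)`, which is the backward-compatibility cost the note above refers to.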
[jira] (SPARK-39336) Redact table/partition properties
[ https://issues.apache.org/jira/browse/SPARK-39336 ] Hyukjin Kwon deleted comment on SPARK-39336: -- was (Author: apachespark): User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/36751 > Redact table/partition properties > - > > Key: SPARK-39336 > URL: https://issues.apache.org/jira/browse/SPARK-39336 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39368) Move RewritePredicateSubquery into InjectRuntimeFilter
[ https://issues.apache.org/jira/browse/SPARK-39368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39368: Assignee: Apache Spark > Move RewritePredicateSubquery into InjectRuntimeFilter > -- > > Key: SPARK-39368 > URL: https://issues.apache.org/jira/browse/SPARK-39368 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39368) Move RewritePredicateSubquery into InjectRuntimeFilter
[ https://issues.apache.org/jira/browse/SPARK-39368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39368: Assignee: (was: Apache Spark) > Move RewritePredicateSubquery into InjectRuntimeFilter > -- > > Key: SPARK-39368 > URL: https://issues.apache.org/jira/browse/SPARK-39368 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39368) Move RewritePredicateSubquery into InjectRuntimeFilter
[ https://issues.apache.org/jira/browse/SPARK-39368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545721#comment-17545721 ] Apache Spark commented on SPARK-39368: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/36755 > Move RewritePredicateSubquery into InjectRuntimeFilter > -- > > Key: SPARK-39368 > URL: https://issues.apache.org/jira/browse/SPARK-39368 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39368) Move RewritePredicateSubquery into InjectRuntimeFilter
[ https://issues.apache.org/jira/browse/SPARK-39368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545720#comment-17545720 ] Apache Spark commented on SPARK-39368: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/36755 > Move RewritePredicateSubquery into InjectRuntimeFilter > -- > > Key: SPARK-39368 > URL: https://issues.apache.org/jira/browse/SPARK-39368 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39368) Move RewritePredicateSubquery into InjectRuntimeFilter
[ https://issues.apache.org/jira/browse/SPARK-39368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545719#comment-17545719 ] Yuming Wang commented on SPARK-39368: - https://github.com/apache/spark/pull/36755 > Move RewritePredicateSubquery into InjectRuntimeFilter > -- > > Key: SPARK-39368 > URL: https://issues.apache.org/jira/browse/SPARK-39368 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39368) Move RewritePredicateSubquery into InjectRuntimeFilter
Yuming Wang created SPARK-39368: --- Summary: Move RewritePredicateSubquery into InjectRuntimeFilter Key: SPARK-39368 URL: https://issues.apache.org/jira/browse/SPARK-39368 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37649: - Description: pandas API on Spark currently sets {{compute.default_index_type}} to {{sequence}}, which relies on sending all data to one executor, which easily causes OOM. We should switch to the {{distributed-sequence}} type that truly distributes the data. With this change, we can now leverage https://issues.apache.org/jira/browse/SPARK-36559 and https://issues.apache.org/jira/browse/SPARK-36338 by default, and end users will benefit from a significant performance improvement. was: pandas API on Spark currently sets {{compute.default_index_type}} to {{sequence}} which relies on sending all data to one executor that easily causes OOM. We should better switch to {{distributed-sequence}} type that truly distributes the data. > Switch default index to distributed-sequence by default in pandas API on Spark > -- > > Key: SPARK-37649 > URL: https://issues.apache.org/jira/browse/SPARK-37649 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Labels: release-notes > Fix For: 3.3.0 > > > pandas API on Spark currently sets {{compute.default_index_type}} to > {{sequence}}, which relies on sending all data to one executor, which easily > causes OOM. > We should switch to the {{distributed-sequence}} type that truly > distributes the data. > With this change, we can now leverage > https://issues.apache.org/jira/browse/SPARK-36559 and > https://issues.apache.org/jira/browse/SPARK-36338 by default, and end users > will benefit from a significant performance improvement. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37649) Switch default index to distributed-sequence by default in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-37649: - Priority: Critical (was: Major) > Switch default index to distributed-sequence by default in pandas API on Spark > -- > > Key: SPARK-37649 > URL: https://issues.apache.org/jira/browse/SPARK-37649 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Critical > Labels: release-notes > Fix For: 3.3.0 > > > pandas API on Spark currently sets {{compute.default_index_type}} to > {{sequence}}, which relies on sending all data to one executor, which easily > causes OOM. > We should switch to the {{distributed-sequence}} type that truly > distributes the data. > With this change, we can now leverage > https://issues.apache.org/jira/browse/SPARK-36559 and > https://issues.apache.org/jira/browse/SPARK-36338 by default, and end users > will benefit from a significant performance improvement. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39367) Review and fix issues in Scala/Java API docs of SQL module
[ https://issues.apache.org/jira/browse/SPARK-39367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39367. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 36754 [https://github.com/apache/spark/pull/36754] > Review and fix issues in Scala/Java API docs of SQL module > -- > > Key: SPARK-39367 > URL: https://issues.apache.org/jira/browse/SPARK-39367 > Project: Spark > Issue Type: Task > Components: Documentation, SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39367) Review and fix issues in Scala/Java API docs of SQL module
[ https://issues.apache.org/jira/browse/SPARK-39367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39367: Assignee: Gengliang Wang (was: Apache Spark) > Review and fix issues in Scala/Java API docs of SQL module > -- > > Key: SPARK-39367 > URL: https://issues.apache.org/jira/browse/SPARK-39367 > Project: Spark > Issue Type: Task > Components: Documentation, SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39367) Review and fix issues in Scala/Java API docs of SQL module
[ https://issues.apache.org/jira/browse/SPARK-39367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39367: Assignee: Apache Spark (was: Gengliang Wang) > Review and fix issues in Scala/Java API docs of SQL module > -- > > Key: SPARK-39367 > URL: https://issues.apache.org/jira/browse/SPARK-39367 > Project: Spark > Issue Type: Task > Components: Documentation, SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39367) Review and fix issues in Scala/Java API docs of SQL module
[ https://issues.apache.org/jira/browse/SPARK-39367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545626#comment-17545626 ] Apache Spark commented on SPARK-39367: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/36754 > Review and fix issues in Scala/Java API docs of SQL module > -- > > Key: SPARK-39367 > URL: https://issues.apache.org/jira/browse/SPARK-39367 > Project: Spark > Issue Type: Task > Components: Documentation, SQL >Affects Versions: 3.3.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39367) Review and fix issues in Scala/Java API docs of SQL module
Gengliang Wang created SPARK-39367: -- Summary: Review and fix issues in Scala/Java API docs of SQL module Key: SPARK-39367 URL: https://issues.apache.org/jira/browse/SPARK-39367 Project: Spark Issue Type: Task Components: Documentation, SQL Affects Versions: 3.3.0 Reporter: Gengliang Wang Assignee: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39259) Timestamps returned by now() and equivalent functions are not consistent in subqueries
[ https://issues.apache.org/jira/browse/SPARK-39259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545609#comment-17545609 ] Apache Spark commented on SPARK-39259: -- User 'olaky' has created a pull request for this issue: https://github.com/apache/spark/pull/36752 > Timestamps returned by now() and equivalent functions are not consistent in > subqueries > -- > > Key: SPARK-39259 > URL: https://issues.apache.org/jira/browse/SPARK-39259 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 > Environment: Reproduced in the Spark Shell on the current 3.4.0 > snapshot >Reporter: Jan-Ole Sasse >Assignee: Jan-Ole Sasse >Priority: Major > Fix For: 3.4.0 > > > Timestamp evaluation is not consistent across subqueries. As an example, in > the Spark Shell > > {code:java} > sql("SELECT * FROM (SELECT 1) WHERE now() IN (SELECT now())").collect() {code} > returns an empty result. > > The root cause is that > [ComputeCurrentTime|https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L74] > does not recurse into subqueries -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
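The root cause can be modeled in a few lines (plain Python, not Catalyst code; the classes below are a toy stand-in for an expression tree): a rule that replaces now() with one fixed literal but does not descend into subquery expressions leaves the inner now() unevaluated, so the two timestamps can disagree.

```python
# Toy expression-tree model of the ComputeCurrentTime bug: the rule must
# substitute the same literal timestamp for every now(), including those
# nested inside subqueries.

class Now:            # stands in for an unevaluated current_timestamp()
    pass

class Subquery:       # wraps a nested plan with its own expressions
    def __init__(self, children):
        self.children = children

def compute_current_time(exprs, ts, recurse_into_subqueries):
    out = []
    for e in exprs:
        if isinstance(e, Now):
            out.append(ts)    # replace with one fixed literal
        elif isinstance(e, Subquery):
            if recurse_into_subqueries:
                out.append(Subquery(compute_current_time(e.children, ts, True)))
            else:
                out.append(e)  # bug: the Now() inside the subquery survives
        else:
            out.append(e)
    return out

plan = [Now(), Subquery([Now()])]   # models: WHERE now() IN (SELECT now())

buggy = compute_current_time(plan, ts=1000, recurse_into_subqueries=False)
fixed = compute_current_time(plan, ts=1000, recurse_into_subqueries=True)
```

In the buggy variant the outer now() becomes a literal while the inner one is still an unevaluated `Now()`, which mirrors why the query above can return an empty result.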
[jira] [Commented] (SPARK-39259) Timestamps returned by now() and equivalent functions are not consistent in subqueries
[ https://issues.apache.org/jira/browse/SPARK-39259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545608#comment-17545608 ] Apache Spark commented on SPARK-39259: -- User 'olaky' has created a pull request for this issue: https://github.com/apache/spark/pull/36753 > Timestamps returned by now() and equivalent functions are not consistent in > subqueries > -- > > Key: SPARK-39259 > URL: https://issues.apache.org/jira/browse/SPARK-39259 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 > Environment: Reproduced in the Spark Shell on the current 3.4.0 > snapshot >Reporter: Jan-Ole Sasse >Assignee: Jan-Ole Sasse >Priority: Major > Fix For: 3.4.0 > > > Timestamp evaluation is not consistent across subqueries. As an example, in > the Spark Shell > > {code:java} > sql("SELECT * FROM (SELECT 1) WHERE now() IN (SELECT now())").collect() {code} > returns an empty result. > > The root cause is that > [ComputeCurrentTime|https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L74] > does not recurse into subqueries -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39295) Improve documentation of pandas API support list.
[ https://issues.apache.org/jira/browse/SPARK-39295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39295: - Fix Version/s: 3.3.0 > Improve documentation of pandas API support list. > - > > Key: SPARK-39295 > URL: https://issues.apache.org/jira/browse/SPARK-39295 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyunwoo Park >Assignee: Hyunwoo Park >Priority: Major > Fix For: 3.3.0, 3.4.0 > > > The description provided in the supported pandas API list document or the > code comment needs improvement. Also, there are cases where the link for a > function property provided in the document is broken, so it needs to > be corrected. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-39259) Timestamps returned by now() and equivalent functions are not consistent in subqueries
[ https://issues.apache.org/jira/browse/SPARK-39259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39259. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36654 [https://github.com/apache/spark/pull/36654] > Timestamps returned by now() and equivalent functions are not consistent in > subqueries > -- > > Key: SPARK-39259 > URL: https://issues.apache.org/jira/browse/SPARK-39259 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 > Environment: Reproduced in the Spark Shell on the current 3.4.0 > snapshot >Reporter: Jan-Ole Sasse >Assignee: Jan-Ole Sasse >Priority: Major > Fix For: 3.4.0 > > > Timestamp evaluation is not consistent across subqueries. As an example, in > the Spark Shell > > {code:java} > sql("SELECT * FROM (SELECT 1) WHERE now() IN (SELECT now())").collect() {code} > returns an empty result. > > The root cause is that > [ComputeCurrentTime|https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L74] > does not recurse into subqueries -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39259) Timestamps returned by now() and equivalent functions are not consistent in subqueries
[ https://issues.apache.org/jira/browse/SPARK-39259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39259: Assignee: Jan-Ole Sasse > Timestamps returned by now() and equivalent functions are not consistent in > subqueries > -- > > Key: SPARK-39259 > URL: https://issues.apache.org/jira/browse/SPARK-39259 > Project: Spark > Issue Type: Bug > Components: Optimizer >Affects Versions: 3.2.1 > Environment: Reproduced in the Spark Shell on the current 3.4.0 > snapshot >Reporter: Jan-Ole Sasse >Assignee: Jan-Ole Sasse >Priority: Major > > Timestamp evaluation is not consistent across subqueries. As an example, in > the Spark Shell > > {code:java} > sql("SELECT * FROM (SELECT 1) WHERE now() IN (SELECT now())").collect() {code} > returns an empty result. > > The root cause is that > [ComputeCurrentTime|https://github.com/apache/spark/blob/c91c2e9afec0d5d5bbbd2e155057fe409c5bb928/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala#L74] > does not recurse into subqueries -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39336) Redact table/partition properties
[ https://issues.apache.org/jira/browse/SPARK-39336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545592#comment-17545592 ] Apache Spark commented on SPARK-39336: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/36751 > Redact table/partition properties > - > > Key: SPARK-39336 > URL: https://issues.apache.org/jira/browse/SPARK-39336 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39336) Redact table/partition properties
[ https://issues.apache.org/jira/browse/SPARK-39336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39336: Assignee: Apache Spark > Redact table/partition properties > - > > Key: SPARK-39336 > URL: https://issues.apache.org/jira/browse/SPARK-39336 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39336) Redact table/partition properties
[ https://issues.apache.org/jira/browse/SPARK-39336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39336: Assignee: (was: Apache Spark) > Redact table/partition properties > - > > Key: SPARK-39336 > URL: https://issues.apache.org/jira/browse/SPARK-39336 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.3.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39283) Spark tasks stuck forever due to deadlock between TaskMemoryManager and UnsafeExternalSorter
[ https://issues.apache.org/jira/browse/SPARK-39283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-39283: --- Fix Version/s: 3.3.0 (was: 3.3.1) > Spark tasks stuck forever due to deadlock between TaskMemoryManager and > UnsafeExternalSorter > > > Key: SPARK-39283 > URL: https://issues.apache.org/jira/browse/SPARK-39283 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 3.1.2 >Reporter: Sandeep Pal >Assignee: Sandeep Pal >Priority: Critical > Labels: Deadlock, spark3.0 > Fix For: 3.0.4, 3.3.0, 3.1.4, 3.2.2 > > Attachments: DeadlockSparkTasks.png > > > We are seeing this deadlock between {{TaskMemoryManager}} and > {{UnsafeExternalSorter}} pretty often in our workload. Sometimes the retry is > successful, but sometimes we have to resort to hacky workarounds to break the > deadlock, such as shutting down the worker machines explicitly. > Below is the thread dump from the Spark UI showing the deadlock: > !DeadlockSparkTasks.png! > > I believe there was a related Jira about a similar deadlock between the same > threads, and it was resolved. > https://issues.apache.org/jira/browse/SPARK-27338 > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
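As general background on this class of bug (illustrative only — this is not the actual SPARK-39283 fix, which lives in Spark's memory-management code): a cyclic wait between two lock holders cannot form if every thread acquires the locks in one global order.

```python
import threading

# Minimal sketch of the lock-ordering discipline that prevents this kind of
# deadlock. Two threads each need both locks; because both follow the same
# global acquisition order, neither can hold the second lock while waiting
# for the first one held by the other, so no cycle can form.

lock_a = threading.Lock()   # e.g. a memory-manager lock (illustrative name)
lock_b = threading.Lock()   # e.g. a sorter lock (illustrative name)
ORDERED = [lock_a, lock_b]  # the single global acquisition order

results = []

def worker(name):
    for lock in ORDERED:      # always acquire in the global order
        lock.acquire()
    try:
        results.append(name)  # critical section needing both locks
    finally:
        for lock in reversed(ORDERED):
            lock.release()

threads = [threading.Thread(target=worker, args=(n,)) for n in ("t1", "t2")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

If the two workers instead acquired the locks in opposite orders, each could end up holding one lock while blocking on the other, which is exactly the cycle visible in the attached thread dump.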
[jira] [Resolved] (SPARK-39361) Stop using Log4J2's extended throwable logging pattern in default logging configurations
[ https://issues.apache.org/jira/browse/SPARK-39361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-39361. Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 36747 [https://github.com/apache/spark/pull/36747] > Stop using Log4J2's extended throwable logging pattern in default logging > configurations > > > Key: SPARK-39361 > URL: https://issues.apache.org/jira/browse/SPARK-39361 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Major > Fix For: 3.3.0 > > > This PR addresses a performance problem in Log4J 2 related to exception > logging: in certain scenarios I observed that Log4J2's exception stacktrace > logging can be ~10x slower than Log4J 1. > The problem stems from a new log pattern format in Log4J2 called ["extended > exception"|https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternExtendedException], > which enriches the regular stacktrace string with information on the name of > the JAR files that contained the classes in each stack frame. > Log4J queries the classloader to determine the source JAR for each class. > This isn't cheap, but this information is cached and reused in future > exception logging calls. In certain scenarios involving runtime-generated > classes, this lookup will fail and the failed lookup result will _not_ be > cached. As a result, expensive classloading operations will be performed > every time such an exception is logged. In addition to being very slow, these > operations take out a lock on the classloader and thus can cause severe lock > contention if multiple threads are logging errors. This issue is described in > more detail in a comment on a Log4J2 JIRA and in a linked blogpost: > https://issues.apache.org/jira/browse/LOG4J2-2391?focusedCommentId=16667140&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16667140 > . 
Spark frequently uses generated classes and lambdas and thus Spark > executor logs will almost always trigger this edge-case and suffer from poor > performance. > By default, if you do not specify an explicit exception format in your > logging pattern then Log4J2 will add this "extended exception" pattern (see > PatternLayout's {{alwaysWriteExceptions}} flag in Log4J's documentation, plus > [the code implementing that > flag|https://github.com/apache/logging-log4j2/blob/d6c8ab0863c551cdf0f8a5b1966ab45e3cddf572/log4j-core/src/main/java/org/apache/logging/log4j/core/pattern/PatternParser.java#L206-L209] > in Log4J2). > In this PR, I have updated Spark's default Log4J2 configurations so that each > pattern layout includes an explicit {{%ex}} so that it uses the normal > (non-extended) exception logging format. > Although it's true that any program logging exceptions at a high rate should > probably just fix the source of the exceptions, I think it's still a good > idea for us to try to fix this out-of-the-box performance difference so that > users' existing workloads do not regress when upgrading to 3.3.0. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
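The fix described above amounts to appending an explicit {{%ex}} conversion so that PatternLayout stops injecting the "extended exception" converter. A minimal illustrative sketch of what such a log4j2.properties layout might look like (property names and the date pattern are illustrative; check Spark's shipped default configuration for the exact lines):

```properties
# Appending an explicit %ex disables Log4J2's implicit "extended exception"
# conversion (the alwaysWriteExceptions behavior), so stacktraces are logged
# without the per-frame JAR lookup.
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n%ex
```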
[jira] [Commented] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545512#comment-17545512 ] Apache Spark commented on SPARK-29260: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/36750 > Enable supported Hive metastore versions once it support altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > Hive 3.x is currently supported. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-29260: Assignee: Apache Spark > Enable supported Hive metastore versions once it support altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Hive 3.x is currently supported. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-29260: Assignee: (was: Apache Spark) > Enable supported Hive metastore versions once it support altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > Hive 3.x is currently supported. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29260) Enable supported Hive metastore versions once it support altering database location
[ https://issues.apache.org/jira/browse/SPARK-29260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545511#comment-17545511 ] Apache Spark commented on SPARK-29260: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/36750 > Enable supported Hive metastore versions once it support altering database > location > --- > > Key: SPARK-29260 > URL: https://issues.apache.org/jira/browse/SPARK-29260 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > Hive 3.x is currently supported. Hive 2.2.1 and Hive 2.4.0 have not been released. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39366) BlockInfoManager should not release write locks on task end
Herman van Hövell created SPARK-39366: - Summary: BlockInfoManager should not release write locks on task end Key: SPARK-39366 URL: https://issues.apache.org/jira/browse/SPARK-39366 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: Herman van Hövell Assignee: Herman van Hövell The BlockInfoManager releases all locks held by a task when the task is done. It also releases write locks. The problem with that is that a thread (other than the main task thread) might still be modifying the block. By releasing the write lock, the block now appears readable, and a reader might observe the block in a partial or non-existent state. Fortunately, this is not as severe a problem as it appears, because the BlockManager (the only place where we write blocks) is well behaved and always puts the block in a consistent state, so the errors caused by this are only transient. Given that the write code is well behaved, we don't need to release write locks on task end. We should remove that behavior. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
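The behavior change proposed above can be sketched abstractly. The following Python model is purely illustrative (it is not Spark's BlockInfoManager, and all names are made up): on task completion only read locks are released, while write locks stay with the writer until it explicitly releases them.

```python
from collections import defaultdict

class LockRegistry:
    """Toy model of per-task block locks (illustrative, not Spark's API)."""

    def __init__(self):
        self.read_locks = defaultdict(set)   # task_id -> {block_id}
        self.write_locks = defaultdict(set)  # task_id -> {block_id}

    def acquire_read(self, task_id, block_id):
        self.read_locks[task_id].add(block_id)

    def acquire_write(self, task_id, block_id):
        self.write_locks[task_id].add(block_id)

    def release_all_for_task(self, task_id):
        # On task end, release only the read locks; a helper thread may
        # still be writing the blocks behind the task's write locks, so
        # those are intentionally left in place.
        return self.read_locks.pop(task_id, set())

reg = LockRegistry()
reg.acquire_read(1, "rdd_0_0")
reg.acquire_write(1, "rdd_0_1")
print(sorted(reg.release_all_for_task(1)))  # read lock released
print(sorted(reg.write_locks[1]))           # write lock retained
```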
[jira] [Updated] (SPARK-39365) Truncate fragment of query context if it is too long
[ https://issues.apache.org/jira/browse/SPARK-39365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-39365: --- Summary: Truncate fragment of query context if it is too long (was: Truncate fragment if it is too long) > Truncate fragment of query context if it is too long > > > Key: SPARK-39365 > URL: https://issues.apache.org/jira/browse/SPARK-39365 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39365) Truncate fragment if it is too long
Gengliang Wang created SPARK-39365: -- Summary: Truncate fragment if it is too long Key: SPARK-39365 URL: https://issues.apache.org/jira/browse/SPARK-39365 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
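The intent of the ticket above can be illustrated with a tiny, hypothetical truncation helper (Python, purely illustrative; Spark's actual limit and formatting live in its SQL query-context code):

```python
def truncate_fragment(fragment: str, max_len: int = 100) -> str:
    # Keep short fragments intact; abbreviate over-long ones so the
    # query context attached to an error stays readable.
    if len(fragment) <= max_len:
        return fragment
    return fragment[:max_len] + "..."

print(truncate_fragment("SELECT 1"))  # short fragment: unchanged
print(len(truncate_fragment("x" * 500)))  # long fragment: capped
```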
[jira] [Created] (SPARK-39364) Logs are still "in_progress" when we read HDFS files
Marek Czuma created SPARK-39364: --- Summary: Logs are still "in_progress" when we read HDFS files Key: SPARK-39364 URL: https://issues.apache.org/jira/browse/SPARK-39364 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.2 Reporter: Marek Czuma When I read a file from HDFS using the Hadoop FileSystem API after initializing the SparkSession, the logs remain "in_progress" in the event log. In the Spark History Server, logs from this application are empty (no jobs, no stages, etc.). To be clear: the job finishes successfully, but the Spark History Server shows this strange situation. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39363) fix spark.kubernetes.memoryOverheadFactor deprecation warning
[ https://issues.apache.org/jira/browse/SPARK-39363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545479#comment-17545479 ] Thomas Graves commented on SPARK-39363: --- [~Kimahriman] > fix spark.kubernetes.memoryOverheadFactor deprecation warning > - > > Key: SPARK-39363 > URL: https://issues.apache.org/jira/browse/SPARK-39363 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Thomas Graves >Priority: Major > > see [https://github.com/apache/spark/pull/36744] for details. > > It looks like we deprecated {{spark.kubernetes.memoryOverheadFactor}}, but it > has a default value, which leads to it printing warnings all the time. > {code:java} > 22/06/01 23:53:49 WARN SparkConf: The configuration key > 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 > and may be removed in the future. Please use > spark.driver.memoryOverheadFactor and > spark.executor.memoryOverheadFactor{code} > We should remove the default value if possible. It should only be used as a > fallback, but we should be able to use the default from > {{spark.driver.memoryOverheadFactor}}. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39363) fix spark.kubernetes.memoryOverheadFactor deprecation warning
[ https://issues.apache.org/jira/browse/SPARK-39363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-39363: -- Description: see [https://github.com/apache/spark/pull/36744] for details. It looks like we deprecated {{spark.kubernetes.memoryOverheadFactor}}, but it has a default value, which leads to it printing warnings all the time. {code:java} 22/06/01 23:53:49 WARN SparkConf: The configuration key 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 and may be removed in the future. Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor{code} We should remove the default value if possible. It should only be used as a fallback, but we should be able to use the default from {{spark.driver.memoryOverheadFactor}}. was: see [https://github.com/apache/spark/pull/36744] for details. It looks like we deprecated {{spark.kubernetes.memoryOverheadFactor, but it has a default value which leads it to printing warnings all the time.}} {{}} {code:java} 22/06/01 23:53:49 WARN SparkConf: The configuration key 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 and may be removed in the future. Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor{code} {{}} {{We should remove the default value if possible. It should only be used as fallback but we should be able to use the default from }} spark.driver.memoryOverheadFactor. {{{}{}}}{{{}{}}} > fix spark.kubernetes.memoryOverheadFactor deprecation warning > - > > Key: SPARK-39363 > URL: https://issues.apache.org/jira/browse/SPARK-39363 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.4.0 >Reporter: Thomas Graves >Priority: Major > > see [https://github.com/apache/spark/pull/36744] for details.
> > It looks like we deprecated {{spark.kubernetes.memoryOverheadFactor}}, but it > has a default value, which leads to it printing warnings all the time. > {code:java} > 22/06/01 23:53:49 WARN SparkConf: The configuration key > 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 > and may be removed in the future. Please use > spark.driver.memoryOverheadFactor and > spark.executor.memoryOverheadFactor{code} > We should remove the default value if possible. It should only be used as a > fallback, but we should be able to use the default from > {{spark.driver.memoryOverheadFactor}}. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39363) fix spark.kubernetes.memoryOverheadFactor deprecation warning
Thomas Graves created SPARK-39363: - Summary: fix spark.kubernetes.memoryOverheadFactor deprecation warning Key: SPARK-39363 URL: https://issues.apache.org/jira/browse/SPARK-39363 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.4.0 Reporter: Thomas Graves see [https://github.com/apache/spark/pull/36744] for details. It looks like we deprecated {{spark.kubernetes.memoryOverheadFactor}}, but it has a default value, which leads to it printing warnings all the time. {code:java} 22/06/01 23:53:49 WARN SparkConf: The configuration key 'spark.kubernetes.memoryOverheadFactor' has been deprecated as of Spark 3.3.0 and may be removed in the future. Please use spark.driver.memoryOverheadFactor and spark.executor.memoryOverheadFactor{code} We should remove the default value if possible. It should only be used as a fallback, but we should be able to use the default from {{spark.driver.memoryOverheadFactor}}. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
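As a hedged illustration of the workaround users can apply today (the factor values below are illustrative, not recommendations), setting the non-deprecated keys explicitly in spark-defaults.conf sidesteps the deprecated Kubernetes-specific key:

```properties
# Prefer the non-deprecated keys introduced in Spark 3.3:
spark.driver.memoryOverheadFactor=0.10
spark.executor.memoryOverheadFactor=0.10
# Leave spark.kubernetes.memoryOverheadFactor unset to avoid the warning;
# the ticket proposes dropping its default so it acts only as a fallback.
```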
[jira] [Assigned] (SPARK-38807) Error when starting spark shell on Windows system
[ https://issues.apache.org/jira/browse/SPARK-38807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-38807: Assignee: Ming Li > Error when starting spark shell on Windows system > - > > Key: SPARK-38807 > URL: https://issues.apache.org/jira/browse/SPARK-38807 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Ming Li >Assignee: Ming Li >Priority: Major > Fix For: 3.2.2, 3.3.1 > > > Using the release version of spark-3.2.1 and the default configuration, > starting spark shell on Windows system fails. (spark 3.1.2 doesn't show this > issue) > Here is the stack trace of the exception: > {code:java} > 22/04/06 21:47:45 ERROR SparkContext: Error initializing SparkContext. > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > ... > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.net.URISyntaxException: Illegal character in path at index > 30: spark://192.168.X.X:56964/F:\classes > at java.net.URI$Parser.fail(URI.java:2845) > at java.net.URI$Parser.checkChars(URI.java:3018) > at java.net.URI$Parser.parseHierarchical(URI.java:3102) > at java.net.URI$Parser.parse(URI.java:3050) > at java.net.URI.(URI.java:588) > at > 
org.apache.spark.repl.ExecutorClassLoader.(ExecutorClassLoader.scala:57) > ... 70 more > 22/04/06 21:47:45 ERROR Utils: Uncaught exception in thread main > java.lang.NullPointerException > ... {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-38807) Error when starting spark shell on Windows system
[ https://issues.apache.org/jira/browse/SPARK-38807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-38807. -- Fix Version/s: 3.3.1 3.2.2 Resolution: Fixed Issue resolved by pull request 36447 [https://github.com/apache/spark/pull/36447] > Error when starting spark shell on Windows system > - > > Key: SPARK-38807 > URL: https://issues.apache.org/jira/browse/SPARK-38807 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.1 >Reporter: Ming Li >Priority: Major > Fix For: 3.3.1, 3.2.2 > > > Using the release version of spark-3.2.1 and the default configuration, > starting spark shell on Windows system fails. (spark 3.1.2 doesn't show this > issue) > Here is the stack trace of the exception: > {code:java} > 22/04/06 21:47:45 ERROR SparkContext: Error initializing SparkContext. > java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > ... 
> at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.net.URISyntaxException: Illegal character in path at index > 30: spark://192.168.X.X:56964/F:\classes > at java.net.URI$Parser.fail(URI.java:2845) > at java.net.URI$Parser.checkChars(URI.java:3018) > at java.net.URI$Parser.parseHierarchical(URI.java:3102) > at java.net.URI$Parser.parse(URI.java:3050) > at java.net.URI.(URI.java:588) > at > org.apache.spark.repl.ExecutorClassLoader.(ExecutorClassLoader.scala:57) > ... 70 more > 22/04/06 21:47:45 ERROR Utils: Uncaught exception in thread main > java.lang.NullPointerException > ... {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
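The stack trace pinpoints the cause: a raw Windows path ({{F:\classes}}) is spliced into a {{spark://}} URI, and both the drive colon and the backslash are illegal as raw characters in a URI path (RFC 3986), which is exactly what java.net.URI rejects. A small illustrative Python snippet (standard library only, not Spark code) shows what a percent-encoded, URI-safe form of such a path segment would look like:

```python
from urllib.parse import quote

# ':' and '\' must be percent-encoded before being placed in a URI path;
# leaving them raw is what triggers "Illegal character in path" above.
raw = r"F:\classes"
encoded = quote(raw)
print(encoded)
```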
[jira] [Commented] (SPARK-33121) Spark Streaming 3.1.1 hangs on shutdown
[ https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545437#comment-17545437 ] Stephen Kestle commented on SPARK-33121: I seem to have encountered this problem without streaming in version 3.2.1. My large session has many complex stages and writes unique data. I've enabled re-runs, and so do an anti-join on already written data in the (jdbc) database. At this stage, it seems to be applying the anti-join right at the end after all the complex computations and reduces millions of rows to 0. Because I have many executors, I coalesce down to 30 partitions for writing to jdbc. I'm getting hundreds of {{RejectedExecutionExceptions}} - it seems to me that the coalescing starts, and simultaneously it determines 0 rows and finishes the write and exits, resulting in non-graceful shutdown. Calling {{sc.stop()}} does nothing, but {{df.cache()}} before coalescing and writing does. Should this be reported as a separate ticket? I asked on [gitter|https://gitter.im/spark-scala/Lobby?at=6298a15306a77e1e18684826] too, and thought this actually did seem similar enough to comment. > Spark Streaming 3.1.1 hangs on shutdown > --- > > Key: SPARK-33121 > URL: https://issues.apache.org/jira/browse/SPARK-33121 > Project: Spark > Issue Type: Bug > Components: DStreams >Affects Versions: 3.1.1 >Reporter: Dmitry Tverdokhleb >Priority: Major > Labels: Streaming, hang, shutdown > > Hi. I am trying to migrate from spark 2.4.5 to 3.1.1 and there is a problem > in graceful shutdown. > Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true". 
> Here is the code: > {code:java} > inputStream.foreachRDD { > rdd => > rdd.foreachPartition { _ => > Thread.sleep(5000) > } > } > {code} > I send a SIGTERM signal to stop the Spark streaming job, and after the sleep an > exception arises: > {noformat} > streaming-agg-tds-data_1 | java.util.concurrent.RejectedExecutionException: > Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from > java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, > active threads = 0, queued tasks = 0, completed tasks = 1] > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379) > streaming-agg-tds-data_1 | at > org.apache.spark.executor.Executor.launchTask(Executor.scala:270) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91) > streaming-agg-tds-data_1 | at > scala.collection.Iterator.foreach(Iterator.scala:941) > streaming-agg-tds-data_1 | at > scala.collection.Iterator.foreach$(Iterator.scala:941) > streaming-agg-tds-data_1 | at > scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > streaming-agg-tds-data_1 | at > scala.collection.IterableLike.foreach(IterableLike.scala:74) > streaming-agg-tds-data_1 | at > scala.collection.IterableLike.foreach$(IterableLike.scala:73) > streaming-agg-tds-data_1 | at > scala.collection.AbstractIterable.foreach(Iterable.scala:56) > streaming-agg-tds-data_1 | at > org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91) > streaming-agg-tds-data_1 | at >
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > streaming-agg-tds-data_1 | at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > streaming-agg-tds-data_1 | at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > streaming-agg-tds-data_1 | at java.lang.Thread.run(Thread.java:748) > streaming-agg-tds-data_1 | 2021-04-22 13:33:41 WA
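The failure mode in the trace above, offering a task to an executor pool that has already terminated, can be mimicked with a plain thread pool. This Python analogue is illustrative only (Java's ThreadPoolExecutor throws RejectedExecutionException; Python's executor raises RuntimeError):

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=1)
pool.submit(lambda: None).result()  # a task completes normally
pool.shutdown(wait=True)            # pool is now terminated

try:
    # Submitting after shutdown mirrors Spark launching a task on the
    # terminated pool during a non-graceful stop.
    pool.submit(lambda: None)
except RuntimeError as exc:
    print(f"rejected: {exc}")
```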
[jira] [Resolved] (SPARK-39354) The analysis exception is incorrect
[ https://issues.apache.org/jira/browse/SPARK-39354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-39354. -- Fix Version/s: 3.3.0 3.4.0 Resolution: Fixed Issue resolved by pull request 36746 [https://github.com/apache/spark/pull/36746] > The analysis exception is incorrect > --- > > Key: SPARK-39354 > URL: https://issues.apache.org/jira/browse/SPARK-39354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0, 3.4.0 > > > {noformat} > scala> spark.sql("create table t1(user_id int, auct_end_dt date) using > parquet;") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("select * from t1 join t2 on t1.user_id = t2.user_id where > t1.auct_end_dt >= Date_sub('2020-12-27', 90)").show > org.apache.spark.sql.AnalysisException: cannot resolve > 'date_sub('2020-12-27', 90)' due to data type mismatch: argument 1 requires > date type, however, ''2020-12-27'' is of string type.; line 1 pos 76 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.RemoveTempResolvedColumn$.$anonfun$apply$82(Analyzer.scala:4334) > at > org.apache.spark.sql.catalyst.analysis.RemoveTempResolvedColumn$.$anonfun$apply$82$adapted(Analyzer.scala:4327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:365) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:364) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:364) > {noformat} > The analysis exception should be: > {noformat} > org.apache.spark.sql.AnalysisException: Table or view not found: t2 > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39354) The analysis exception is incorrect
[ https://issues.apache.org/jira/browse/SPARK-39354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-39354: Assignee: Yang Jie > The analysis exception is incorrect > --- > > Key: SPARK-39354 > URL: https://issues.apache.org/jira/browse/SPARK-39354 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yuming Wang >Assignee: Yang Jie >Priority: Minor > > {noformat} > scala> spark.sql("create table t1(user_id int, auct_end_dt date) using > parquet;") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("select * from t1 join t2 on t1.user_id = t2.user_id where > t1.auct_end_dt >= Date_sub('2020-12-27', 90)").show > org.apache.spark.sql.AnalysisException: cannot resolve > 'date_sub('2020-12-27', 90)' due to data type mismatch: argument 1 requires > date type, however, ''2020-12-27'' is of string type.; line 1 pos 76 > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.RemoveTempResolvedColumn$.$anonfun$apply$82(Analyzer.scala:4334) > at > org.apache.spark.sql.catalyst.analysis.RemoveTempResolvedColumn$.$anonfun$apply$82$adapted(Analyzer.scala:4327) > at > org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:365) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1(TreeNode.scala:364) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$foreachUp$1$adapted(TreeNode.scala:364) > {noformat} > The analysis exception should be: > {noformat} > org.apache.spark.sql.AnalysisException: Table or view not found: t2 > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39362) Datasource v2 scan not pruned in tpcds q10
Zhen Wang created SPARK-39362: - Summary: Datasource v2 scan not pruned in tpcds q10 Key: SPARK-39362 URL: https://issues.apache.org/jira/browse/SPARK-39362 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Environment: Spark 3.2 + Iceberg 3.2 Reporter: Zhen Wang {code:java} createAndInitTable("id INT, dep STRING"); createAndInitTable( wideTableName, "id INT, c1 STRING, c2 STRING, c3 STRING, c4 STRING, c5 STRING", null); String q1 = String.format("select dep from %s t1 where id > 1 and exists " + "(select * from %s t2 where t1.id = t2.id)", tableName, wideTableName); Assert.assertEquals("should prune scan", 1, scanCols); {code} It looks like {{select *}} in subqueries is not fully projected when V2ScanRelationPushDown is applied; the projection is applied later, but the scan is not updated after that. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39346) Convert asserts/illegal state exception to internal errors on each phase
[ https://issues.apache.org/jira/browse/SPARK-39346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-39346: - Fix Version/s: 3.3.1 > Convert asserts/illegal state exception to internal errors on each phase > > > Key: SPARK-39346 > URL: https://issues.apache.org/jira/browse/SPARK-39346 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0, 3.3.1 > > > Wrap assert/illegal state exception by internal errors on each phase of query > execution. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
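The pattern the ticket above describes, catching asserts/illegal-state failures at each query-execution phase and re-raising them as internal errors, can be sketched generically. This Python sketch is hypothetical (Spark's actual change wraps Java/Scala exceptions into a SparkException with an internal-error class; the names below are made up):

```python
def run_phase(phase_name, phase_fn, *args):
    # Convert an assertion failure raised inside a query phase into a
    # generic internal error, so users see a stable error category
    # instead of a raw AssertionError.
    try:
        return phase_fn(*args)
    except AssertionError as e:
        raise RuntimeError(f"[INTERNAL_ERROR] during {phase_name}: {e}") from e

def broken_analysis(plan):
    assert False, "unresolved operator"

try:
    run_phase("analysis", broken_analysis, "plan")
except RuntimeError as e:
    print(e)
```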
[jira] [Commented] (SPARK-39295) Improve documentation of pandas API support list.
[ https://issues.apache.org/jira/browse/SPARK-39295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545361#comment-17545361 ] Apache Spark commented on SPARK-39295: -- User 'beobest2' has created a pull request for this issue: https://github.com/apache/spark/pull/36749 > Improve documentation of pandas API support list. > - > > Key: SPARK-39295 > URL: https://issues.apache.org/jira/browse/SPARK-39295 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyunwoo Park >Assignee: Hyunwoo Park >Priority: Major > Fix For: 3.4.0 > > > The descriptions provided in the supported pandas API list document and the > code comments need improvement. Also, in some cases the link for a function > property provided in the document is broken, so it needs to be corrected. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-33320) ExecutorMetrics are not written to CSV and StatsD sinks
[ https://issues.apache.org/jira/browse/SPARK-33320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545339#comment-17545339 ] oskarryn edited comment on SPARK-33320 at 6/2/22 8:48 AM: -- I'm afraid the ExecutorMetrics namespace is only available from Spark 3 onwards. On Spark 2.4.8 I'm also missing ExecutorMetrics. Looking into the code, the parameter `spark.metrics.executorMetricsSource.enabled` was added in Spark 3. Also, the class ExecutorMetrics isn't even defined on Spark's 2.4 branch. was (Author: oskarryn): Executors should forward metrics to the driver and we should see ExecutorMetrics namespace in the driver component metrics. I'm afraid this feature is available from Spark 3 only. While on Spark 2.4.8 I'm also missing ExecutorMetrics. Looking into the code, the parameter `spark.metrics.executorMetricsSource.enabled` was added from Spark 3. Also, the class ExecutorMetrics isn't even defined on Spark's 2.4 branch. > ExecutorMetrics are not written to CSV and StatsD sinks > --- > > Key: SPARK-33320 > URL: https://issues.apache.org/jira/browse/SPARK-33320 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: I was using Spark 2.4.4 on EMR with YARN.
The relevant > part of the config is below: > {noformat} > spark.metrics.executorMetricsSource.enabled=true > spark.eventLog.logStageExecutorMetrics=true > spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink > spark.metrics.conf.*.sink.servlet.class=org.apache.spark.metrics.sink.MetricsServlet > spark.metrics.conf.*.sink.servlet.path=/home/hadoop/metrics/json > spark.metrics.conf.*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink > spark.metrics.conf.*.sink.statsd.host=localhost > spark.metrics.conf.*.sink.statsd.port=8125 > spark.metrics.conf.*.sink.statsd.period=10 > spark.metrics.conf.*.sink.statsd.unit=seconds > spark.metrics.conf.*.sink.statsd.prefix=spark > master.sink.servlet.path=/home/hadoop/metrics/master/json > applications.sink.servlet.path=/home/hadoop/metrics/applications/json > {noformat} >Reporter: Peter Podlovics >Priority: Major > > Metrics from the {{ExecutorMetrics}} namespace are not written to the CSV and > StatsD sinks, even though some of them is available through the REST API > (e.g.: {{memoryMetrics.usedOnHeapStorageMemory}}). > I couldn't find the {{ExecutorMetrics}} either on the driver or the workers. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
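The `*` instance prefix in the metrics configuration above acts as a default that every metrics instance (driver, master, executor, etc.) inherits, with instance-specific keys taking precedence. The following is a simplified sketch of that resolution rule, not Spark's actual MetricsConfig implementation:

```python
# Simplified sketch of how metrics properties with the "*" instance prefix
# are resolved: "*"-prefixed entries are defaults shared by all instances,
# and entries with a concrete instance name override them.

def resolve_instance_config(props: dict, instance: str) -> dict:
    """Return the effective sink/source properties for one metrics instance."""
    resolved = {}
    for key, value in props.items():
        prefix, _, rest = key.partition(".")
        if prefix == "*":
            resolved[rest] = value              # shared default
    for key, value in props.items():
        prefix, _, rest = key.partition(".")
        if prefix == instance:
            resolved[rest] = value              # instance-specific wins
    return resolved

props = {
    "*.sink.servlet.path": "/home/hadoop/metrics/json",
    "*.sink.statsd.port": "8125",
    "master.sink.servlet.path": "/home/hadoop/metrics/master/json",
}
master_conf = resolve_instance_config(props, "master")
# master inherits sink.statsd.port from "*" but overrides sink.servlet.path
```

This matches the config above, where `master.sink.servlet.path` overrides the wildcard servlet path for the master instance only.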
[jira] [Resolved] (SPARK-38675) Race condition in BlockInfoManager during unlock
[ https://issues.apache.org/jira/browse/SPARK-38675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-38675. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 35991 [https://github.com/apache/spark/pull/35991] > Race condition in BlockInfoManager during unlock > > > Key: SPARK-38675 > URL: https://issues.apache.org/jira/browse/SPARK-38675 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Blocker > Fix For: 3.3.0 > > > There is a race condition between unlock and releaseAllLocksForTask in the > block manager. This can lead to negative reader counts (which trip an > assertion).
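The negative-reader-count scenario described above can be illustrated with a toy sequential reconstruction (hypothetical class and method names, not Spark's actual BlockInfoManager code): when a task's read lock is released both by `unlock` and by `releaseAllLocksForTask`, the reader count is decremented twice.

```python
# Toy reconstruction of the double-release scenario behind SPARK-38675
# (hypothetical names; Spark's real code is Scala and far more involved).

class BlockInfo:
    def __init__(self):
        self.reader_count = 0

class ToyBlockInfoManager:
    def __init__(self):
        self.blocks = {}
        self.locks_by_task = {}   # task_id -> block_ids holding read locks

    def lock_for_reading(self, task_id, block_id):
        info = self.blocks.setdefault(block_id, BlockInfo())
        info.reader_count += 1
        self.locks_by_task.setdefault(task_id, []).append(block_id)

    def unlock(self, task_id, block_id):
        # Bug sketch: decrements unconditionally, without checking whether
        # release_all_locks_for_task already released this task's lock.
        self.blocks[block_id].reader_count -= 1

    def release_all_locks_for_task(self, task_id):
        for block_id in self.locks_by_task.pop(task_id, []):
            self.blocks[block_id].reader_count -= 1

mgr = ToyBlockInfoManager()
mgr.lock_for_reading(task_id=1, block_id="rdd_0_0")
mgr.release_all_locks_for_task(task_id=1)  # end-of-task cleanup wins the race
mgr.unlock(task_id=1, block_id="rdd_0_0")  # late unlock decrements again
# reader_count is now -1: the condition the assertion in Spark trips on
```

In real Spark the two calls run on different threads; the sequential ordering above just makes the interleaving that loses the race deterministic.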
[jira] [Commented] (SPARK-33320) ExecutorMetrics are not written to CSV and StatsD sinks
[ https://issues.apache.org/jira/browse/SPARK-33320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545339#comment-17545339 ] oskarryn commented on SPARK-33320: -- I have the same issue on Spark 2.4.8. Supposedly executors forward metrics to the driver and we should see the ExecutorMetrics namespace in the driver component metrics, but sadly no executor metrics are available there. > ExecutorMetrics are not written to CSV and StatsD sinks > --- > > Key: SPARK-33320 > URL: https://issues.apache.org/jira/browse/SPARK-33320 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 > Environment: I was using Spark 2.4.4 on EMR with YARN. The relevant > part of the config is below: > {noformat} > spark.metrics.executorMetricsSource.enabled=true > spark.eventLog.logStageExecutorMetrics=true > spark.metrics.conf.*.sink.csv.class=org.apache.spark.metrics.sink.CsvSink > spark.metrics.conf.*.sink.servlet.class=org.apache.spark.metrics.sink.MetricsServlet > spark.metrics.conf.*.sink.servlet.path=/home/hadoop/metrics/json > spark.metrics.conf.*.sink.statsd.class=org.apache.spark.metrics.sink.StatsdSink > spark.metrics.conf.*.sink.statsd.host=localhost > spark.metrics.conf.*.sink.statsd.port=8125 > spark.metrics.conf.*.sink.statsd.period=10 > spark.metrics.conf.*.sink.statsd.unit=seconds > spark.metrics.conf.*.sink.statsd.prefix=spark > master.sink.servlet.path=/home/hadoop/metrics/master/json > applications.sink.servlet.path=/home/hadoop/metrics/applications/json > {noformat} >Reporter: Peter Podlovics >Priority: Major > > Metrics from the {{ExecutorMetrics}} namespace are not written to the CSV and > StatsD sinks, even though some of them are available through the REST API > (e.g.: {{memoryMetrics.usedOnHeapStorageMemory}}). > I couldn't find the {{ExecutorMetrics}} either on the driver or the workers.
[jira] [Commented] (SPARK-38961) Enhance to automatically generate the pandas API support list
[ https://issues.apache.org/jira/browse/SPARK-38961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17545296#comment-17545296 ] Apache Spark commented on SPARK-38961: -- User 'beobest2' has created a pull request for this issue: https://github.com/apache/spark/pull/36748 > Enhance to automatically generate the pandas API support list > - > > Key: SPARK-38961 > URL: https://issues.apache.org/jira/browse/SPARK-38961 > Project: Spark > Issue Type: Documentation > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Hyunwoo Park >Priority: Major > Fix For: 3.4.0 > > > Currently, the supported pandas API list is manually maintained, so it would > be better to generate the list automatically to reduce the maintenance cost.
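The idea behind SPARK-38961 — replacing a hand-maintained support list with an auto-generated one — can be sketched by introspection: enumerate the public methods of the reference API and check which ones the implementation actually provides. This toy sketch uses stand-in classes; the actual tooling in the linked PR differs.

```python
# Toy sketch of auto-generating an API support list by introspection
# (illustrative only; stand-in classes instead of pandas / pyspark.pandas).
import inspect

def support_list(reference_cls, impl_cls):
    """Map each public method of reference_cls to whether impl_cls provides it."""
    supported = {}
    for name, member in inspect.getmembers(reference_cls):
        if name.startswith("_") or not callable(member):
            continue  # skip private/dunder attributes and non-callables
        supported[name] = callable(getattr(impl_cls, name, None))
    return supported

# Stand-ins for, e.g., pandas.DataFrame vs. pyspark.pandas.DataFrame:
class ReferenceFrame:
    def head(self): ...
    def merge(self): ...
    def to_feather(self): ...

class ImplFrame:
    def head(self): ...
    def merge(self): ...

coverage = support_list(ReferenceFrame, ImplFrame)
# coverage == {"head": True, "merge": True, "to_feather": False}
```

A table like this, regenerated on each release, keeps the documentation in sync with the implementation instead of relying on manual updates.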