[jira] [Commented] (SPARK-44719) NoClassDefFoundError when using Hive UDF
[ https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752049#comment-17752049 ] Manu Zhang commented on SPARK-44719: Is there a 2.3.10 release? > NoClassDefFoundError when using Hive UDF > > > Key: SPARK-44719 > URL: https://issues.apache.org/jira/browse/SPARK-44719 > Project: Spark > Issue Type: Bug > Components: Build, SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > Attachments: HiveUDFs-1.0-SNAPSHOT.jar > > > How to reproduce: > {noformat} > spark-sql (default)> add jar > /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; > Time taken: 0.413 seconds > spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as > 'net.petrabarus.hiveudfs.LongToIP'; > Time taken: 0.038 seconds > spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); > 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT > long_to_ip(2130706433L) FROM range(10)] > java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory > at org.apache.hadoop.hive.ql.udf.UDFJson.(UDFJson.java:64) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:348) > ... > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
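The NoClassDefFoundError above means the UDF jar was loaded but the Jackson 1.x class that Hive's UDFJson initializer needs was not on the classpath. A minimal way to probe for it (an illustrative sketch, not Spark code; the class name is taken from the stack trace above):

```java
// Sketch: probe whether the Jackson 1.x class needed by Hive's UDFJson
// is visible on the current classpath. On a stock JVM (or a Spark build
// without Jackson 1.x on the classpath) this returns false, matching the
// NoClassDefFoundError in the report.
public class ClasspathProbe {
    static boolean isPresent(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | NoClassDefFoundError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isPresent("org.codehaus.jackson.map.type.TypeFactory"));
    }
}
```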
[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adds running executors after processing completed containers
[ https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-43510: --- Summary: Spark application hangs when YarnAllocator adds running executors after processing completed containers (was: Spark application hangs when YarnAllocator adding running executors after processing completed containers) > Spark application hangs when YarnAllocator adds running executors after > processing completed containers > --- > > Key: SPARK-43510 > URL: https://issues.apache.org/jira/browse/SPARK-43510 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.4.0 >Reporter: Manu Zhang >Priority: Major > > I see the application hang when containers are preempted immediately after > allocation as follows. > {code:java} > 23/05/14 09:11:33 INFO YarnAllocator: Launching container > container_e3812_1684033797982_57865_01_000382 on host > hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with > ID 277 for ResourceProfile Id 0 > 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: > container_e3812_1684033797982_57865_01_000382 > 23/05/14 09:11:33 INFO YarnAllocator: Completed container > container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: > -102) > 23/05/14 09:11:33 INFO YarnAllocator: Container > container_e3812_1684033797982_57865_01_000382 was preempted.{code} > Note the warning log where YarnAllocator cannot find the executorId for the > container when processing completed containers. The only plausible cause is > that YarnAllocator added the running executor after processing completed > containers. The former happens in a separate thread after executor launch. > YarnAllocator believes there are still running executors, although they are > already lost due to preemption. Hence, the application hangs without any > running executors. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
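The ordering bug described above can be sketched in a few lines (names are illustrative, not Spark's actual fields): the completed-container event arrives before the launcher thread records the executor as running, so the removal is a no-op and the later registration "resurrects" an executor that is already gone.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of the race described in SPARK-43510. The allocator
// thread processes the completion first; the launcher thread then adds
// the executor, leaving a phantom "running" executor behind.
public class AllocatorSketch {
    final Set<String> runningExecutors = new HashSet<>();

    void onExecutorLaunched(String execId) {      // launcher thread
        runningExecutors.add(execId);
    }

    void onContainerCompleted(String execId) {    // allocator thread
        // If the executor was never recorded, this is where the real
        // allocator logs "Cannot find executorId for container".
        runningExecutors.remove(execId);
    }

    public static void main(String[] args) {
        AllocatorSketch a = new AllocatorSketch();
        // Buggy interleaving: completion processed before registration.
        a.onContainerCompleted("277");
        a.onExecutorLaunched("277");
        // The allocator now believes one executor is running although its
        // container was preempted, so the application hangs.
        System.out.println(a.runningExecutors.size()); // 1, should be 0
    }
}
```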
[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adding running executors after processing completed containers
[ https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-43510: --- Description: I see application hangs when containers are preempted immediately after allocation as follows. {code:java} 23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382 23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102) 23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted.{code} Note the warning log where YarnAllocator cannot find executorId for the container when processing completed containers. The only plausible cause is YarnAllocator added the running executor after processing completed containers. The former happens in a separate thread after executor launch. YarnAllocator believes there are still running executors, although they are already lost due to preemption. Hence, the application hangs without any running executors. was: I see application hangs when containers are preempted immediately after allocation as follows. 
{code:java} 23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382 23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102) 23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted.{code} Note the warning log where YarnAllocator cannot find executorId for the container when processing completed containers. The only plausible cause is YarnAllocator processing completed container before updating internal state and adding the executorId. The latter happens in a separate thread after executor launch. YarnAllocator thought > Spark application hangs when YarnAllocator adding running executors after > processing completed containers > - > > Key: SPARK-43510 > URL: https://issues.apache.org/jira/browse/SPARK-43510 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.4.0 >Reporter: Manu Zhang >Priority: Major > > I see application hangs when containers are preempted immediately after > allocation as follows. 
> {code:java} > 23/05/14 09:11:33 INFO YarnAllocator: Launching container > container_e3812_1684033797982_57865_01_000382 on host > hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with > ID 277 for ResourceProfile Id 0 > 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: > container_e3812_1684033797982_57865_01_000382 > 23/05/14 09:11:33 INFO YarnAllocator: Completed container > container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: > -102) > 23/05/14 09:11:33 INFO YarnAllocator: Container > container_e3812_1684033797982_57865_01_000382 was preempted.{code} > Note the warning log where YarnAllocator cannot find executorId for the > container when processing completed containers. The only plausible cause is > YarnAllocator added the running executor after processing completed > containers. The former happens in a separate thread after executor launch. > YarnAllocator believes there are still running executors, although they are > already lost due to preemption. Hence, the application hangs without any > running executors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adding running executors after processing completed containers
[ https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-43510: --- Summary: Spark application hangs when YarnAllocator adding running executors after processing completed containers (was: Spark application hangs when YarnAllocator processing completed containers before updating internal state) > Spark application hangs when YarnAllocator adding running executors after > processing completed containers > - > > Key: SPARK-43510 > URL: https://issues.apache.org/jira/browse/SPARK-43510 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.4.0 >Reporter: Manu Zhang >Priority: Major > > I see application hangs when containers are preempted immediately after > allocation as follows. > {code:java} > 23/05/14 09:11:33 INFO YarnAllocator: Launching container > container_e3812_1684033797982_57865_01_000382 on host > hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with > ID 277 for ResourceProfile Id 0 > 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: > container_e3812_1684033797982_57865_01_000382 > 23/05/14 09:11:33 INFO YarnAllocator: Completed container > container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: > -102) > 23/05/14 09:11:33 INFO YarnAllocator: Container > container_e3812_1684033797982_57865_01_000382 was preempted.{code} > Note the warning log where YarnAllocator cannot find executorId for the > container when processing completed containers. The only plausible cause is > YarnAllocator processing completed container before updating internal state > and adding the executorId. The latter happens in a separate thread after > executor launch. > YarnAllocator thought -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator processing completed containers before updating internal state
[ https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-43510: --- Description: I see application hangs when containers are preempted immediately after allocation as follows. {code:java} 23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382 23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102) 23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted.{code} Note the warning log where YarnAllocator cannot find executorId for the container when processing completed containers. The only plausible cause is YarnAllocator processing completed container before updating internal state and adding the executorId. The latter happens in a separate thread after executor launch. YarnAllocator thought was: I see application hangs when containers are preempted immediately after allocation as follows. {code:java} 23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382 23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102) 23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted. 
{code} Note the warning log where YarnAllocator cannot find executorId for the container when processing completed containers. The only plausible cause is YarnAllocator processing completed container before updating internal state and adding the executorId. The latter happens in a separate thread after executor launch. > Spark application hangs when YarnAllocator processing completed containers > before updating internal state > - > > Key: SPARK-43510 > URL: https://issues.apache.org/jira/browse/SPARK-43510 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 3.4.0 >Reporter: Manu Zhang >Priority: Major > > I see application hangs when containers are preempted immediately after > allocation as follows. > {code:java} > 23/05/14 09:11:33 INFO YarnAllocator: Launching container > container_e3812_1684033797982_57865_01_000382 on host > hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with > ID 277 for ResourceProfile Id 0 > 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: > container_e3812_1684033797982_57865_01_000382 > 23/05/14 09:11:33 INFO YarnAllocator: Completed container > container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: > -102) > 23/05/14 09:11:33 INFO YarnAllocator: Container > container_e3812_1684033797982_57865_01_000382 was preempted.{code} > Note the warning log where YarnAllocator cannot find executorId for the > container when processing completed containers. The only plausible cause is > YarnAllocator processing completed container before updating internal state > and adding the executorId. The latter happens in a separate thread after > executor launch. > YarnAllocator thought -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43510) Spark application hangs when YarnAllocator processing completed containers before updating internal state
Manu Zhang created SPARK-43510: -- Summary: Spark application hangs when YarnAllocator processing completed containers before updating internal state Key: SPARK-43510 URL: https://issues.apache.org/jira/browse/SPARK-43510 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 3.4.0 Reporter: Manu Zhang I see application hangs when containers are preempted immediately after allocation as follows. {code:java} 23/05/14 09:11:33 INFO YarnAllocator: Launching container container_e3812_1684033797982_57865_01_000382 on host hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 277 for ResourceProfile Id 0 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: container_e3812_1684033797982_57865_01_000382 23/05/14 09:11:33 INFO YarnAllocator: Completed container container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: -102) 23/05/14 09:11:33 INFO YarnAllocator: Container container_e3812_1684033797982_57865_01_000382 was preempted. {code} Note the warning log where YarnAllocator cannot find executorId for the container when processing completed containers. The only plausible cause is YarnAllocator processing completed container before updating internal state and adding the executorId. The latter happens in a separate thread after executor launch. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39430) The inconsistent timezone in Spark History Server UI
[ https://issues.apache.org/jira/browse/SPARK-39430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700074#comment-17700074 ] Manu Zhang commented on SPARK-39430: We are using Spark 3.1.1. All the UI pages are in server timezone except for task LaunchTime which is in client timezone due to https://issues.apache.org/jira/browse/SPARK-32068 > The inconsistent timezone in Spark History Server UI > > > Key: SPARK-39430 > URL: https://issues.apache.org/jira/browse/SPARK-39430 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Submit, Web UI >Affects Versions: 3.2.1 >Reporter: Surbhi >Priority: Major > Attachments: Screenshot 2022-06-10 at 12.59.36 AM.png, Screenshot > 2022-06-10 at 12.59.50 AM.png > > > The spark history server is running in UTC timezone. But we are trying to > view history server in IST timezone. > The history server landing page shows time in IST but Jobs, Stages, Storage, > Environment, Executors tabs shows time in UTC. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
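The inconsistency above is the usual server-vs-client rendering split: the same epoch instant formats differently depending on which zone does the formatting. A hedged illustration with java.time (the zones are examples matching the report, not Spark code):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

// One epoch instant rendered in the history server's zone (UTC) and a
// viewer's zone (IST). Pages formatted on the server show the first;
// pages formatted in the browser show the second.
public class ZoneDemo {
    static String render(Instant t, String zone) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm")
                .withZone(ZoneId.of(zone)).format(t);
    }

    public static void main(String[] args) {
        Instant t = Instant.parse("2022-06-10T00:30:00Z");
        System.out.println(render(t, "UTC"));          // 2022-06-10 00:30
        System.out.println(render(t, "Asia/Kolkata")); // 2022-06-10 06:00
    }
}
```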
[jira] [Created] (SPARK-39672) NotExists subquery failed with conflicting attributes
Manu Zhang created SPARK-39672: -- Summary: NotExists subquery failed with conflicting attributes Key: SPARK-39672 URL: https://issues.apache.org/jira/browse/SPARK-39672 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.3 Reporter: Manu Zhang {code:sql} select * from ( select v1.a, v1.b, v2.c from v1 inner join v2 on v1.a=v2.a) t3 where not exists ( select 1 from v2 where t3.a=v2.a and t3.b=v2.b and t3.c=v2.c ){code} This query throws AnalysisException {code:java} org.apache.spark.sql.AnalysisException: Found conflicting attributes a#266 in the condition joining outer plan: Join Inner, (a#250 = a#266) :- Project [_1#243 AS a#250, _2#244 AS b#251] : +- LocalRelation [_1#243, _2#244, _3#245] +- Project [_1#259 AS a#266, _3#261 AS c#268] +- LocalRelation [_1#259, _2#260, _3#261]and subplan: Project [1 AS 1#273, _1#259 AS a#266, _2#260 AS b#267, _3#261 AS c#268#277] +- LocalRelation [_1#259, _2#260, _3#261] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
[ https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557794#comment-17557794 ] Manu Zhang commented on SPARK-30696: [~maxgekk], any update on this? I found another issue in 3.1.1: DST started at 2:00 am on March 13; however, the following query already counted the timestamp as DST. {code:java} select from_utc_timestamp(timestamp'2022-03-13 05:18:29.581', "US/Pacific") >> 2022-03-12 22:18:29.581 {code} > Wrong result of the combination of from_utc_timestamp and to_utc_timestamp > -- > > Key: SPARK-30696 > URL: https://issues.apache.org/jira/browse/SPARK-30696 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Max Gekk >Priority: Major > > Applying to_utc_timestamp() to results of from_utc_timestamp() should return > the original timestamp in the same time zone. In the range of 100 years, the > combination of functions returns wrong results 280 times out of 1753200: > {code:java} > scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100 > SECS_PER_YEAR: Long = 31557600 > scala> val SECS_PER_MINUTE = 60L > SECS_PER_MINUTE: Long = 60 > scala> val tz = "America/Los_Angeles" > tz: String = America/Los_Angeles > scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * > SECS_PER_MINUTE) > df: org.apache.spark.sql.Dataset[Long] = [id: bigint] > scala> val diff = > df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), > tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0) > warning: there was one deprecation warning; re-run with -deprecation for > details > diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint] > scala> diff.count > res14: Long = 280 > scala> df.count > res15: Long = 1753200 > {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
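The reported value can be cross-checked against java.time, which applies the pre-transition offset correctly: the 2022 DST transition in US/Pacific happened at 02:00 local time on March 13 (10:00 UTC), so 05:18 UTC is still in PST (UTC-8) and should map to 21:18 the previous day, not 22:18 as the query returned.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Cross-check of the reported from_utc_timestamp result with java.time.
// 2022-03-13 05:18 UTC is before the 10:00 UTC DST transition, so the
// correct local time uses the PST offset (-08:00).
public class DstCheck {
    public static void main(String[] args) {
        ZonedDateTime local = Instant.parse("2022-03-13T05:18:29Z")
                .atZone(ZoneId.of("US/Pacific"));
        System.out.println(local); // 2022-03-12T21:18:29-08:00[US/Pacific]
    }
}
```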
[jira] [Created] (SPARK-39344) Only disable bucketing when autoBucketedScan is enabled if bucket columns are not in scan output
Manu Zhang created SPARK-39344: -- Summary: Only disable bucketing when autoBucketedScan is enabled if bucket columns are not in scan output Key: SPARK-39344 URL: https://issues.apache.org/jira/browse/SPARK-39344 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Manu Zhang Currently, bucketing is disabled when bucket columns are not in the scan output, after https://github.com/apache/spark/pull/27924. This breaks existing applications with huge input sizes by creating too many FilePartitions and causing the driver to hang, and it cannot be switched off. This proposes merging the rule into DisableUnnecessaryBucketedScan. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility
[ https://issues.apache.org/jira/browse/SPARK-39278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-39278: --- Description: Before [https://github.com/apache/spark/pull/23698,] The precedence of configuring Hadoop Filesystems to access is {code:java} spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code} Afterwards, it's {code:java} spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> spark.yarn.access.hadoopFileSystems{code} When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes are configured with different values, the PR will break backward compatibility and cause application failure. was: Before [https://github.com/apache/spark/pull/23698,] The precedence of configuring Hadoop Filesystems to access is {code:java} spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code} Afterwards, it's {code:java} spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> spark.yarn.access.hadoopFileSystems{code} When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes are configured with different values, the PR breaks backward compatibility and cause application failure. 
> Alternative configs of Hadoop Filesystems to access break backward > compatibility > > > Key: SPARK-39278 > URL: https://issues.apache.org/jira/browse/SPARK-39278 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Manu Zhang >Priority: Minor > > Before [https://github.com/apache/spark/pull/23698,] > The precedence of configuring Hadoop Filesystems to access is > {code:java} > spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code} > Afterwards, it's > {code:java} > spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> > spark.yarn.access.hadoopFileSystems{code} > When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes > are configured with different values, the PR will break backward > compatibility and cause application failure. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility
Manu Zhang created SPARK-39278: -- Summary: Alternative configs of Hadoop Filesystems to access break backward compatibility Key: SPARK-39278 URL: https://issues.apache.org/jira/browse/SPARK-39278 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: Manu Zhang Before [https://github.com/apache/spark/pull/23698,] the precedence of configuring Hadoop Filesystems to access is {code:java} spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code} Afterwards, it's {code:java} spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> spark.yarn.access.hadoopFileSystems{code} When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes are configured with different values, the PR breaks backward compatibility and causes application failures. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
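The precedence lists above amount to "first configured key wins". A small sketch (the resolver is hypothetical, not Spark's actual config code) shows how reordering the list changes the winner when both legacy keys are set, which is the compatibility break described:

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch of "alternative config" resolution: the first key in the
// precedence list that is present wins. Reordering the list changes the
// winner when both legacy keys hold different values.
public class ConfResolver {
    static Optional<String> resolve(Map<String, String> conf, List<String> precedence) {
        return precedence.stream()
                .filter(conf::containsKey)
                .findFirst()
                .map(conf::get);
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(
                "spark.yarn.access.hadoopFileSystems", "hdfs://nn1",
                "spark.yarn.access.namenodes", "hdfs://nn2");

        // Old order: hadoopFileSystems takes precedence.
        System.out.println(resolve(conf, List.of(
                "spark.yarn.access.hadoopFileSystems",
                "spark.yarn.access.namenodes")).get());         // hdfs://nn1

        // New order: namenodes now shadows hadoopFileSystems.
        System.out.println(resolve(conf, List.of(
                "spark.kerberos.access.hadoopFileSystems",
                "spark.yarn.access.namenodes",
                "spark.yarn.access.hadoopFileSystems")).get()); // hdfs://nn2
    }
}
```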
[jira] [Updated] (SPARK-39238) Decimal precision loss from applying WidenSetOperationTypes before DecimalPrecision
[ https://issues.apache.org/jira/browse/SPARK-39238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-39238: --- Description: The result of the following SQL is 1.00 while 1. is expected {code:java} CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v) CREATE OR REPLACE TEMPORARY VIEW t4 AS SELECT CAST(v AS DECIMAL(18, 2)) AS v FROM t3; SELECT CAST(1 AS DECIMAL(28, 2)) UNION ALL SELECT v / v FROM t4; {code} was: The result of the following SQL is 1.00 while 1. is expected {code:java} CREATE OR REPLACE TEMPORARY VIEW t4 AS SELECT CAST(v AS DECIMAL(18, 2)) AS v FROM t3; SELECT v / v FROM t4; SELECT CAST(1 AS DECIMAL(28, 2)) UNION ALL SELECT v / v FROM t4; {code} > Decimal precision loss from applying WidenSetOperationTypes before > DecimalPrecision > > > Key: SPARK-39238 > URL: https://issues.apache.org/jira/browse/SPARK-39238 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Manu Zhang >Priority: Major > > The result of the following SQL is 1.00 while 1. is expected > {code:java} > CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v) > CREATE OR REPLACE TEMPORARY VIEW t4 AS SELECT CAST(v AS DECIMAL(18, 2)) AS v > FROM t3; > SELECT CAST(1 AS DECIMAL(28, 2)) > UNION ALL > SELECT v / v FROM t4; {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
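The mechanics of the precision loss can be illustrated with plain BigDecimal (this mimics the effect, not Spark's exact DecimalPrecision rules): a division result carries extra scale, and coercing it to the narrower set-operation type too early throws those fractional digits away.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Illustration of the scale loss with plain BigDecimal. v / v at the
// wider scale the divide expression should keep retains the fractional
// digits; widening the union branch to scale 2 first truncates them.
public class ScaleDemo {
    public static void main(String[] args) {
        BigDecimal v = new BigDecimal("1.00"); // a DECIMAL(18, 2) value

        // Division computed at a wider result scale.
        BigDecimal wide = v.divide(v, 6, RoundingMode.HALF_UP);
        System.out.println(wide);   // 1.000000

        // Forcing the narrower scale too early loses the precision.
        BigDecimal narrow = wide.setScale(2, RoundingMode.HALF_UP);
        System.out.println(narrow); // 1.00
    }
}
```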
[jira] [Updated] (SPARK-39238) Decimal precision loss from applying WidenSetOperationTypes before DecimalPrecision
[ https://issues.apache.org/jira/browse/SPARK-39238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-39238: --- Description: The result of following SQL is 1.00 while 1. is expected {code:java} CREATE OR REPLACE TEMPORARY VIEW t4 AS SELECT CAST(v AS DECIMAL(18, 2)) AS v FROM t3; SELECT v / v FROM t4; SELECT CAST(1 AS DECIMAL(28, 2)) UNION ALL SELECT v / v FROM t4; {code} > Decimal precision loss from applying WidenSetOperationTypes before > DecimalPrecision > > > Key: SPARK-39238 > URL: https://issues.apache.org/jira/browse/SPARK-39238 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Manu Zhang >Priority: Major > > The result of following SQL is 1.00 while 1. is expected > {code:java} > CREATE OR REPLACE TEMPORARY VIEW t4 AS SELECT CAST(v AS DECIMAL(18, 2)) AS v > FROM t3; > SELECT v / v FROM t4; > SELECT CAST(1 AS DECIMAL(28, 2)) > UNION ALL > SELECT v / v FROM t4; {code} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39238) Decimal precision loss from applying WidenSetOperationTypes before DecimalPrecision
Manu Zhang created SPARK-39238: -- Summary: Decimal precision loss from applying WidenSetOperationTypes before DecimalPrecision Key: SPARK-39238 URL: https://issues.apache.org/jira/browse/SPARK-39238 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.1 Reporter: Manu Zhang -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38211) Add SQL migration guide on restoring loose upcast from string
[ https://issues.apache.org/jira/browse/SPARK-38211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-38211: --- Summary: Add SQL migration guide on restoring loose upcast from string (was: Add SQL migration guide on preserving loose upcast ) > Add SQL migration guide on restoring loose upcast from string > - > > Key: SPARK-38211 > URL: https://issues.apache.org/jira/browse/SPARK-38211 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.2.1 >Reporter: Manu Zhang >Priority: Minor > > After SPARK-24586, loose upcasting from string to other types is not allowed > by default. Users can still set {{spark.sql.legacy.looseUpcast=true}} to > preserve the old behavior, but it's not documented in the SQL migration guide. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38211) Add SQL migration guide on preserving loose upcast
Manu Zhang created SPARK-38211: -- Summary: Add SQL migration guide on preserving loose upcast Key: SPARK-38211 URL: https://issues.apache.org/jira/browse/SPARK-38211 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.2.1 Reporter: Manu Zhang After SPARK-24586, loose upcasting from string to other types is not allowed by default. Users can still set {{spark.sql.legacy.looseUpcast=true}} to preserve the old behavior, but it's not documented in the SQL migration guide. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38207) Add bucketed scan behavior change to migration guide
Manu Zhang created SPARK-38207: -- Summary: Add bucketed scan behavior change to migration guide Key: SPARK-38207 URL: https://issues.apache.org/jira/browse/SPARK-38207 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.1.2 Reporter: Manu Zhang Default behavior of bucketed scan is changed in https://issues.apache.org/jira/browse/SPARK-32859 but it's not mentioned in SQL migration guide. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled
Manu Zhang created SPARK-37633: -- Summary: Unwrap cast should skip if downcast failed with ansi enabled Key: SPARK-37633 URL: https://issues.apache.org/jira/browse/SPARK-37633 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0, 3.1.2, 3.0.3 Reporter: Manu Zhang Currently, unwrap cast throws ArithmeticException if the downcast fails with ANSI enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we should always skip on failure regardless of the ANSI config. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
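The intended behavior (skip the rewrite instead of failing the query) can be sketched as follows; the helper is hypothetical and only illustrates the idea behind unwrapping a cast in a comparison like CAST(intCol AS BIGINT) = longLiteral:

```java
import java.util.Optional;

// Sketch: only narrow a long literal to int when it fits; on overflow,
// return empty so the optimizer skips the unwrap-cast rewrite instead of
// surfacing an ArithmeticException under ANSI mode.
public class UnwrapSketch {
    static Optional<Integer> tryNarrowToInt(long literal) {
        try {
            return Optional.of(Math.toIntExact(literal));
        } catch (ArithmeticException overflow) {
            return Optional.empty(); // skip the optimization, don't fail the query
        }
    }

    public static void main(String[] args) {
        System.out.println(tryNarrowToInt(42L));            // Optional[42]
        System.out.println(tryNarrowToInt(3_000_000_000L)); // Optional.empty
    }
}
```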
[jira] [Commented] (SPARK-34714) collect_list(struct()) fails when used with GROUP BY
[ https://issues.apache.org/jira/browse/SPARK-34714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17450236#comment-17450236 ] Manu Zhang commented on SPARK-34714: FYI, this was resolved by https://issues.apache.org/jira/browse/SPARK-34713 and https://issues.apache.org/jira/browse/SPARK-34749 > collect_list(struct()) fails when used with GROUP BY > > > Key: SPARK-34714 > URL: https://issues.apache.org/jira/browse/SPARK-34714 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 > Environment: Databricks Runtime 8.0 >Reporter: Lauri Koobas >Priority: Major > Fix For: 3.1.2 > > > The following is failing in DBR8.0 / Spark 3.1.1, but works in earlier DBR > and Spark versions: > {quote}with step_1 as ( > select 'E' as name, named_struct('subfield', 1) as field_1 > ) > select name, collect_list(struct(field_1.subfield)) > from step_1 > group by 1 > {quote} > Fails with the following error message: > {quote}AnalysisException: cannot resolve > 'struct(step_1.`field_1`.`subfield`)' due to data type mismatch: Only > foldable string expressions are allowed to appear at odd position, got: > NamePlaceholder > {quote} > If you modify the query in any of the following ways then it still works:: > * if you remove the field "name" and the "group by 1" part of the query > * if you remove the "struct()" from within the collect_list() > * if you use "named_struct()" instead of "struct()" within the collect_list() > Similarly collect_set() is broken and possibly more related functions, but I > haven't done thorough testing. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics
[ https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445558#comment-17445558 ] Manu Zhang commented on SPARK-31646: Only very briefly and by a small amount, which I think is more likely collection delay. > Remove unused registeredConnections counter from ShuffleMetrics > --- > > Key: SPARK-31646 > URL: https://issues.apache.org/jira/browse/SPARK-31646 > Project: Spark > Issue Type: Improvement > Components: Deploy, Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics
[ https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420688#comment-17420688 ] Manu Zhang commented on SPARK-31646: {quote}So I will try to derive numBackLoggedConnections outside at the metrics monitoring system. Any better suggestion? {quote} This looks good. > Remove unused registeredConnections counter from ShuffleMetrics > --- > > Key: SPARK-31646 > URL: https://issues.apache.org/jira/browse/SPARK-31646 > Project: Spark > Issue Type: Improvement > Components: Deploy, Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics
[ https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416656#comment-17416656 ] Manu Zhang commented on SPARK-31646: [~yzhangal], Please check this comment [https://github.com/apache/spark/pull/28416#discussion_r418357988] for more background. The counter reverted in this PR was simply never used; the PR just removed some dead code. I didn't mean to use registeredConnections for anything different. It's eventually registered into ShuffleMetrics here: [https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248] {code:java} blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", shuffleServer.getRegisteredConnections()); {code} As I understand it, registeredConnections (and IdleConnections) is monitored at the channel level (TransportChannelHandler) while activeConnections (blockTransferRateBytes, etc.) is monitored at the RPC level (ExternalShuffleBlockHandler). Hence, these metrics are kept in two places. You may register your backloggedConnections in ShuffleMetrics and update it with "registeredConnections - activeConnections" in ShuffleMetrics#getMetrics. Your understanding of executors registering with the Shuffle Service is correct, but I don't see how it's related to your question. > Remove unused registeredConnections counter from ShuffleMetrics > --- > > Key: SPARK-31646 > URL: https://issues.apache.org/jira/browse/SPARK-31646 > Project: Spark > Issue Type: Improvement > Components: Deploy, Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
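The derivation suggested in the comment above can be sketched as follows. This is an illustrative Python model, not Spark's actual metric classes; the class name and metric keys are assumptions (only "numRegisteredConnections" is taken from the quoted code):

```python
# Sketch of deriving a "backlogged connections" gauge from the two existing
# counters inside a ShuffleMetrics#getMetrics-style call, instead of adding
# and maintaining a third counter. Names are illustrative.

class ShuffleMetricsSketch:
    def __init__(self):
        self.registered_connections = 0  # channel level (TransportChannelHandler)
        self.active_connections = 0      # RPC level (ExternalShuffleBlockHandler)

    def get_metrics(self):
        # Computed on read, so it can never drift out of sync with the
        # counters it is derived from.
        return {
            "numRegisteredConnections": self.registered_connections,
            "numActiveConnections": self.active_connections,
            "numBackloggedConnections":
                self.registered_connections - self.active_connections,
        }

m = ShuffleMetricsSketch()
m.registered_connections = 8
m.active_connections = 5
print(m.get_metrics()["numBackloggedConnections"])  # 3
```

Deriving the gauge on read is the design choice the comment recommends: the two source counters live in different layers, so subtracting them at collection time avoids keeping a third counter consistent across both.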
[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics
[ https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416656#comment-17416656 ] Manu Zhang edited comment on SPARK-31646 at 9/17/21, 12:40 PM: --- [~yzhangal], Please check this comment [https://github.com/apache/spark/pull/28416#discussion_r418357988] for more background. The counter reverted in this PR was simply never used; the PR just removed some dead code. I didn't mean to use registeredConnections for anything different. It's eventually registered into ShuffleMetrics here: [https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248] {code:java} blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", shuffleServer.getRegisteredConnections()); {code} As I understand it, registeredConnections (and IdleConnections) is monitored at the channel level (TransportChannelHandler) while activeConnections (blockTransferRateBytes, etc.) is monitored at the RPC level (ExternalShuffleBlockHandler). Hence, these metrics are kept in two places. You may register your backloggedConnections in ShuffleMetrics and update it with "registeredConnections - activeConnections" in ShuffleMetrics#getMetrics. Your understanding of executors registering with the Shuffle Service is correct, but I don't see how it's related to your question. was (Author: mauzhang): [~yzhangal], Please check this comment [https://github.com/apache/spark/pull/28416#discussion_r418357988] for more background. The counter reverted in this PR was just never used, or this PR was simply to remove some dead codes. I didn't meant to use registeredConnections for anything different. It's eventually registered into ShuffleMetrics here.
[https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248] {code:java} blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", shuffleServer.getRegisteredConnections()); {code} As I understand it, registeredConnections (and IdleConnections) is monitored at channel level (TransportChannelHandler) while activeConnections (blockTransferRateBytes, etc) at RPC level (ExternalShuffleBlockHandler). Hence, these metrics are kept in two places. You may register your backloggedConnections in ShuffleMetrics and update it with "registeredConenctions - activeConnections" in ShuffleMetrics#getMetrics. Your understanding of executors registering with Shuffle Service is correct but I don't see how it's related to your question. > Remove unused registeredConnections counter from ShuffleMetrics > --- > > Key: SPARK-31646 > URL: https://issues.apache.org/jira/browse/SPARK-31646 > Project: Spark > Issue Type: Improvement > Components: Deploy, Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics
[ https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416147#comment-17416147 ] Manu Zhang commented on SPARK-31646: [~yzhangal], please check [https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/server/TransportChannelHandler.java#L190] > Remove unused registeredConnections counter from ShuffleMetrics > --- > > Key: SPARK-31646 > URL: https://issues.apache.org/jira/browse/SPARK-31646 > Project: Spark > Issue Type: Improvement > Components: Deploy, Shuffle, Spark Core >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35160) Spark application submitted despite failing to get Hive delegation token
[ https://issues.apache.org/jira/browse/SPARK-35160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-35160: --- Description: Currently, when running on YARN and failing to get a Hive delegation token, a Spark SQL application will still be submitted. Eventually, the application will fail on connecting to the Hive metastore without a valid delegation token. Is there any reason for this design? cc [~jerryshao] who originally implemented this in https://issues.apache.org/jira/browse/SPARK-14743 I'd propose to fail immediately like HadoopFSDelegationTokenProvider. Update: After [https://github.com/apache/spark/pull/23418], HadoopFSDelegationTokenProvider no longer fails on non-fatal exceptions. However, the author changed the behavior just to keep it consistent with other providers. was: Currently, when running on YARN and failing to get Hive delegation token, a Spark SQL application will still be submitted. Eventually, the application will fail on connecting to Hive metastore without a valid delegation token. Is there any reason for this design ? cc [~jerryshao] who originally implemented this in https://issues.apache.org/jira/browse/SPARK-14743 I'd propose to fail immediately like HadoopFSDelegationTokenProvider. > Spark application submitted despite failing to get Hive delegation token > > > Key: SPARK-35160 > URL: https://issues.apache.org/jira/browse/SPARK-35160 > Project: Spark > Issue Type: Improvement > Components: Security >Affects Versions: 3.1.1 >Reporter: Manu Zhang >Priority: Major > > Currently, when running on YARN and failing to get Hive delegation token, a > Spark SQL application will still be submitted. Eventually, the application > will fail on connecting to Hive metastore without a valid delegation token. > Is there any reason for this design?
> cc [~jerryshao] who originally implemented this in > https://issues.apache.org/jira/browse/SPARK-14743 > I'd propose to fail immediately like HadoopFSDelegationTokenProvider. > > Update: > After [https://github.com/apache/spark/pull/23418], > HadoopFSDelegationTokenProvider no longer fails on non-fatal exceptions. > However, the author changed the behavior just to keep it consistent with > other providers. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
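The fail-fast behavior proposed in this issue can be sketched as follows. This is a hedged Python model, not Spark's HadoopDelegationTokenManager; the function names and the provider are illustrative assumptions:

```python
# Sketch of the two policies discussed above: log-and-continue (current
# behavior, so the app is submitted and fails later at the metastore) versus
# fail-fast at submission time (the proposal). Names are illustrative.
import logging

def obtain_tokens(providers, fail_fast):
    tokens = {}
    for name, provider in providers.items():
        try:
            tokens[name] = provider()
        except Exception as e:
            if fail_fast:
                # Proposal: surface the failure before the app is submitted.
                raise RuntimeError(f"could not get {name} delegation token") from e
            # Current behavior: warn and submit without the token.
            logging.warning("skipping %s token: %s", name, e)
    return tokens

def hive_provider():
    raise ConnectionError("metastore unreachable")

# Current behavior: empty token set, app submitted, fails later.
print(obtain_tokens({"hive": hive_provider}, fail_fast=False))  # {}

# Proposed behavior: submission aborts immediately.
try:
    obtain_tokens({"hive": hive_provider}, fail_fast=True)
except RuntimeError as e:
    print(e)
```

The trade-off the thread debates is exactly this flag: failing fast gives a clear error at submission, while log-and-continue tolerates providers the job may never actually need.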
[jira] [Commented] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE
[ https://issues.apache.org/jira/browse/SPARK-32083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334443#comment-17334443 ] Manu Zhang commented on SPARK-32083: This issue has been fixed in https://issues.apache.org/jira/browse/SPARK-35239 > Unnecessary tasks are launched when input is empty with AQE > --- > > Key: SPARK-32083 > URL: https://issues.apache.org/jira/browse/SPARK-32083 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Wenchen Fan >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > > [https://github.com/apache/spark/pull/28226] meant to avoid launching > unnecessary tasks for 0-size partitions when AQE is enabled. However, when > all partitions are empty, the number of partitions will be > `spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of) > unnecessary tasks are launched in this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35160) Spark application submitted despite failing to get Hive delegation token
[ https://issues.apache.org/jira/browse/SPARK-35160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331114#comment-17331114 ] Manu Zhang commented on SPARK-35160: [~hyukjin.kwon], Thanks for the reminder. I've added my proposal, and I did ask about it on the mailing list. It would be great if you know the reasoning behind it, or you could forward this to someone who does. > Spark application submitted despite failing to get Hive delegation token > > > Key: SPARK-35160 > URL: https://issues.apache.org/jira/browse/SPARK-35160 > Project: Spark > Issue Type: Improvement > Components: Security >Affects Versions: 3.1.1 >Reporter: Manu Zhang >Priority: Major > > Currently, when running on YARN and failing to get Hive delegation token, a > Spark SQL application will still be submitted. Eventually, the application > will fail on connecting to Hive metastore without a valid delegation token. > Is there any reason for this design? > cc [~jerryshao] who originally implemented this in > https://issues.apache.org/jira/browse/SPARK-14743 > I'd propose to fail immediately like HadoopFSDelegationTokenProvider. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35160) Spark application submitted despite failing to get Hive delegation token
[ https://issues.apache.org/jira/browse/SPARK-35160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-35160: --- Description: Currently, when running on YARN and failing to get Hive delegation token, a Spark SQL application will still be submitted. Eventually, the application will fail on connecting to Hive metastore without a valid delegation token. Is there any reason for this design ? cc [~jerryshao] who originally implemented this in https://issues.apache.org/jira/browse/SPARK-14743 I'd propose to fail immediately like HadoopFSDelegationTokenProvider. was: Currently, when running on YARN and failing to get Hive delegation token, a Spark SQL application will still be submitted. Eventually, the application will fail on connecting to Hive metastore without a valid delegation token. Is there any reason for this design ? cc [~jerryshao] who originally implemented this in https://issues.apache.org/jira/browse/SPARK-14743 > Spark application submitted despite failing to get Hive delegation token > > > Key: SPARK-35160 > URL: https://issues.apache.org/jira/browse/SPARK-35160 > Project: Spark > Issue Type: Improvement > Components: Security >Affects Versions: 3.1.1 >Reporter: Manu Zhang >Priority: Major > > Currently, when running on YARN and failing to get Hive delegation token, a > Spark SQL application will still be submitted. Eventually, the application > will fail on connecting to Hive metastore without a valid delegation token. > Is there any reason for this design ? > cc [~jerryshao] who originally implemented this in > https://issues.apache.org/jira/browse/SPARK-14743 > I'd propose to fail immediately like HadoopFSDelegationTokenProvider. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35160) Spark application submitted despite failing to get Hive delegation token
Manu Zhang created SPARK-35160: -- Summary: Spark application submitted despite failing to get Hive delegation token Key: SPARK-35160 URL: https://issues.apache.org/jira/browse/SPARK-35160 Project: Spark Issue Type: Improvement Components: Security Affects Versions: 3.1.1 Reporter: Manu Zhang Currently, when running on YARN and failing to get Hive delegation token, a Spark SQL application will still be submitted. Eventually, the application will fail on connecting to Hive metastore without a valid delegation token. Is there any reason for this design ? cc [~jerryshao] who originally implemented this in https://issues.apache.org/jira/browse/SPARK-14743 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32698) Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE
[ https://issues.apache.org/jira/browse/SPARK-32698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang resolved SPARK-32698. Resolution: Won't Do > Do not fall back to default parallelism if the minimum number of coalesced > partitions is not set in AQE > --- > > Key: SPARK-32698 > URL: https://issues.apache.org/jira/browse/SPARK-32698 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > > Currently in AQE when coalescing shuffling partitions, > {quote}We fall back to Spark default parallelism if the minimum number of > coalesced partitions is not set, so to avoid perf regressions compared to no > coalescing. > {quote} > From our experience, this has resulted in a lot of uncertainty of the number > of tasks after coalescing especially with dynamic allocation, and also led > to many small output files. It's complex and hard to reason about. > Hence, I'm proposing not falling back to the default parallelism but > coalescing towards the target size when the minimum number of coalesced > partitions is not set. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-32932: --- Description: With AQE, local shuffle reader breaks users' repartitioning for dynamic partition overwrite as in the following case. {code:java} test("repartition with local reader") { withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> PartitionOverwriteMode.DYNAMIC.toString, SQLConf.SHUFFLE_PARTITIONS.key -> "5", SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { withTable("t") { val data = for ( i <- 1 to 10; j <- 1 to 3 ) yield (i, j) data.toDF("a", "b") .repartition($"b") .write .partitionBy("b") .mode("overwrite") .saveAsTable("t") assert(spark.read.table("t").inputFiles.length == 3) } } }{code} -Coalescing shuffle partitions could also break it.- was: With AQE, local shuffle reader breaks users' repartitioning for dynamic partition overwrite as in the following case. {code:java} test("repartition with local reader") { withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> PartitionOverwriteMode.DYNAMIC.toString, SQLConf.SHUFFLE_PARTITIONS.key -> "5", SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { withTable("t") { val data = for ( i <- 1 to 10; j <- 1 to 3 ) yield (i, j) data.toDF("a", "b") .repartition($"b") .write .partitionBy("b") .mode("overwrite") .saveAsTable("t") assert(spark.read.table("t").inputFiles.length == 3) } } }{code} Coalescing shuffle partitions could also break it. > AQE local shuffle reader breaks repartitioning for dynamic partition overwrite > -- > > Key: SPARK-32932 > URL: https://issues.apache.org/jira/browse/SPARK-32932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Major > > With AQE, local shuffle reader breaks users' repartitioning for dynamic > partition overwrite as in the following case. 
> {code:java} > test("repartition with local reader") { > withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> > PartitionOverwriteMode.DYNAMIC.toString, > SQLConf.SHUFFLE_PARTITIONS.key -> "5", > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { > withTable("t") { > val data = for ( > i <- 1 to 10; > j <- 1 to 3 > ) yield (i, j) > data.toDF("a", "b") > .repartition($"b") > .write > .partitionBy("b") > .mode("overwrite") > .saveAsTable("t") > assert(spark.read.table("t").inputFiles.length == 3) > } > } > }{code} > -Coalescing shuffle partitions could also break it.- -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
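The output-file count the test above asserts can be illustrated with a small model. This is a hedged pure-Python sketch of why the assertion fails with the local shuffle reader, not Spark code; the file-counting rules are simplified assumptions:

```python
# Model of output files for `repartition($"b").write.partitionBy("b")` with
# b in {1, 2, 3}: a writer task emits one file per b-value it holds. With a
# real shuffle, each b-value lands on one reducer (3 files). With AQE's local
# shuffle reader, each *map split* is consumed directly, so every mapper
# writes its own files and the user's repartitioning is lost.

data = [(i, j) for i in range(1, 11) for j in range(1, 4)]  # (a, b)

def files_with_shuffle(rows, num_shuffle_partitions=5):
    files = set()
    for a, b in rows:
        reducer = hash(b) % num_shuffle_partitions  # co-locates equal b
        files.add((reducer, b))
    return len(files)

def files_with_local_reader(rows, num_map_splits=2):
    files = set()
    for idx, (a, b) in enumerate(rows):
        mapper = idx % num_map_splits  # rows stay with their map split
        files.add((mapper, b))
    return len(files)

print(files_with_shuffle(data))       # 3 files: one per distinct b
print(files_with_local_reader(data))  # 6 files: 2 mappers x 3 b-values
```

Under this model the test's `inputFiles.length == 3` only holds when the hash shuffle actually runs, which is the regression the issue reports.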
[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-32932: --- Priority: Major (was: Minor) > AQE local shuffle reader breaks repartitioning for dynamic partition overwrite > -- > > Key: SPARK-32932 > URL: https://issues.apache.org/jira/browse/SPARK-32932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Major > > With AQE, local shuffle reader breaks users' repartitioning for dynamic > partition overwrite as in the following case. > {code:java} > test("repartition with local reader") { > withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> > PartitionOverwriteMode.DYNAMIC.toString, > SQLConf.SHUFFLE_PARTITIONS.key -> "5", > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { > withTable("t") { > val data = for ( > i <- 1 to 10; > j <- 1 to 3 > ) yield (i, j) > data.toDF("a", "b") > .repartition($"b") > .write > .partitionBy("b") > .mode("overwrite") > .saveAsTable("t") > assert(spark.read.table("t").inputFiles.length == 3) > } > } > }{code} > Coalescing shuffle partitions could also break it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-32932: --- Description: With AQE, local shuffle reader breaks users' repartitioning for dynamic partition overwrite as in the following case. {code:java} test("repartition with local reader") { withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> PartitionOverwriteMode.DYNAMIC.toString, SQLConf.SHUFFLE_PARTITIONS.key -> "5", SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { withTable("t") { val data = for ( i <- 1 to 10; j <- 1 to 3 ) yield (i, j) data.toDF("a", "b") .repartition($"b") .write .partitionBy("b") .mode("overwrite") .saveAsTable("t") assert(spark.read.table("t").inputFiles.length == 3) } } }{code} Coalescing shuffle partitions could also break it. was: With AQE, local reader optimizer breaks users' repartitioning for dynamic partition overwrite as in the following case. {code:java} test("repartition with local reader") { withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> PartitionOverwriteMode.DYNAMIC.toString, SQLConf.SHUFFLE_PARTITIONS.key -> "5", SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { withTable("t") { val data = for ( i <- 1 to 10; j <- 1 to 3 ) yield (i, j) data.toDF("a", "b") .repartition($"b") .write .partitionBy("b") .mode("overwrite") .saveAsTable("t") assert(spark.read.table("t").inputFiles.length == 3) } } }{code} Coalescing shuffle partitions could also break it. > AQE local shuffle reader breaks repartitioning for dynamic partition overwrite > -- > > Key: SPARK-32932 > URL: https://issues.apache.org/jira/browse/SPARK-32932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > > With AQE, local shuffle reader breaks users' repartitioning for dynamic > partition overwrite as in the following case. 
> {code:java} > test("repartition with local reader") { > withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> > PartitionOverwriteMode.DYNAMIC.toString, > SQLConf.SHUFFLE_PARTITIONS.key -> "5", > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { > withTable("t") { > val data = for ( > i <- 1 to 10; > j <- 1 to 3 > ) yield (i, j) > data.toDF("a", "b") > .repartition($"b") > .write > .partitionBy("b") > .mode("overwrite") > .saveAsTable("t") > assert(spark.read.table("t").inputFiles.length == 3) > } > } > }{code} > Coalescing shuffle partitions could also break it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
[ https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-32932: --- Summary: AQE local shuffle reader breaks repartitioning for dynamic partition overwrite (was: AQE local reader optimizer breaks repartitioning for dynamic partition overwrite) > AQE local shuffle reader breaks repartitioning for dynamic partition overwrite > -- > > Key: SPARK-32932 > URL: https://issues.apache.org/jira/browse/SPARK-32932 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > > With AQE, local reader optimizer breaks users' repartitioning for dynamic > partition overwrite as in the following case. > {code:java} > test("repartition with local reader") { > withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> > PartitionOverwriteMode.DYNAMIC.toString, > SQLConf.SHUFFLE_PARTITIONS.key -> "5", > SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") { > withTable("t") { > val data = for ( > i <- 1 to 10; > j <- 1 to 3 > ) yield (i, j) > data.toDF("a", "b") > .repartition($"b") > .write > .partitionBy("b") > .mode("overwrite") > .saveAsTable("t") > assert(spark.read.table("t").inputFiles.length == 3) > } > } > }{code} > Coalescing shuffle partitions could also break it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32932) AQE local reader optimizer breaks repartitioning for dynamic partition overwrite
Manu Zhang created SPARK-32932: -- Summary: AQE local reader optimizer breaks repartitioning for dynamic partition overwrite Key: SPARK-32932 URL: https://issues.apache.org/jira/browse/SPARK-32932 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Manu Zhang With AQE, local reader optimizer breaks users' repartitioning for dynamic partition overwrite as in the following case.
{code:java}
test("repartition with local reader") {
  withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> PartitionOverwriteMode.DYNAMIC.toString,
    SQLConf.SHUFFLE_PARTITIONS.key -> "5",
    SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
    withTable("t") {
      val data = for (
        i <- 1 to 10;
        j <- 1 to 3
      ) yield (i, j)
      data.toDF("a", "b")
        .repartition($"b")
        .write
        .partitionBy("b")
        .mode("overwrite")
        .saveAsTable("t")
      assert(spark.read.table("t").inputFiles.length == 3)
    }
  }
}{code}
Coalescing shuffle partitions could also break it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE
[ https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-32753: --- Priority: Major (was: Minor) > Deduplicating and repartitioning the same column create duplicate rows with > AQE > --- > > Key: SPARK-32753 > URL: https://issues.apache.org/jira/browse/SPARK-32753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Major > > To reproduce: > {code:java} > spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") > val df = spark.sql("select id from v1 group by id distribute by id") > println(df.collect().toArray.mkString(",")) > println(df.queryExecution.executedPlan) > // With AQE > [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] > AdaptiveSparkPlan(isFinalPlan=true) > +- CustomShuffleReader local >+- ShuffleQueryStage 0 > +- Exchange hashpartitioning(id#183L, 10), true > +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) > +- Union >:- *(1) Range (0, 10, step=1, splits=2) >+- *(2) Range (0, 10, step=1, splits=2) > // Without AQE > [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] > *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Exchange hashpartitioning(id#206L, 10), true >+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Union > :- *(1) Range (0, 10, step=1, splits=2) > +- *(2) Range (0, 10, step=1, splits=2){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE
[ https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-32753: --- Description: To reproduce: {code:java} spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") val df = spark.sql("select id from v1 group by id distribute by id") println(df.collect().toArray.mkString(",")) println(df.queryExecution.executedPlan) // With AQE [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] AdaptiveSparkPlan(isFinalPlan=true) +- CustomShuffleReader local +- ShuffleQueryStage 0 +- Exchange hashpartitioning(id#183L, 10), true +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) +- Union :- *(1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2) // Without AQE [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Exchange hashpartitioning(id#206L, 10), true +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Union :- *(1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2){code} was: To reproduce: spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") val df = spark.sql("select id from v1 group by id distribute by id") println(df.collect().toArray.mkString(",")) println(df.queryExecution.executedPlan)// With AQE[4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] AdaptiveSparkPlan(isFinalPlan=true) +- CustomShuffleReader local +- ShuffleQueryStage 0 +- Exchange hashpartitioning(id#183L, 10), true +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) +- Union :- *(1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2)// Without AQE[4],[7],[0],[6],[8],[3],[2],[5],[1],[9] *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Exchange hashpartitioning(id#206L, 10), true +- *(3) HashAggregate(keys=[id#206L], functions=[], 
output=[id#206L]) +- Union :- *(1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2) > Deduplicating and repartitioning the same column create duplicate rows with > AQE > --- > > Key: SPARK-32753 > URL: https://issues.apache.org/jira/browse/SPARK-32753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > > To reproduce: > {code:java} > spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") > val df = spark.sql("select id from v1 group by id distribute by id") > println(df.collect().toArray.mkString(",")) > println(df.queryExecution.executedPlan) > // With AQE > [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] > AdaptiveSparkPlan(isFinalPlan=true) > +- CustomShuffleReader local >+- ShuffleQueryStage 0 > +- Exchange hashpartitioning(id#183L, 10), true > +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) > +- Union >:- *(1) Range (0, 10, step=1, splits=2) >+- *(2) Range (0, 10, step=1, splits=2) > // Without AQE > [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] > *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Exchange hashpartitioning(id#206L, 10), true >+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Union > :- *(1) Range (0, 10, step=1, splits=2) > +- *(2) Range (0, 10, step=1, splits=2){code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
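The difference between the two plans above can be modeled in a few lines. This is a hedged Python sketch of the aggregation pipeline, not Spark internals; partition counts and the hash function are simplifying assumptions:

```python
# Model of `select id from v1 group by id distribute by id` over
# range(10) UNION range(10): a partial per-partition dedup, then either a
# hash shuffle that co-locates equal ids before the final dedup (correct
# plan), or the AQE plan where the local shuffle reader keeps mapper outputs
# as-is, so duplicates across the two inputs are never merged.

part1 = list(range(10))  # spark.range(10)
part2 = list(range(10))  # union with spark.range(10)

def with_final_dedup(parts, reducers=10):
    partial = [set(p) for p in parts]            # partial HashAggregate
    shuffled = [set() for _ in range(reducers)]  # Exchange hashpartitioning(id)
    for p in partial:
        for x in p:
            shuffled[x % reducers].add(x)        # equal ids co-located
    return sorted(x for r in shuffled for x in r)  # final dedup per reducer

def with_local_reader(parts):
    partial = [set(p) for p in parts]  # partial HashAggregate only
    # Local reader: no co-location, so no effective final dedup.
    return sorted(x for p in partial for x in p)

print(len(with_final_dedup([part1, part2])))   # 10 rows
print(len(with_local_reader([part1, part2])))  # 20 rows: duplicates survive
```

This matches the reported output: 20 rows with AQE's local reader versus the correct 10 without it.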
[jira] [Created] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE
Manu Zhang created SPARK-32753: -- Summary: Deduplicating and repartitioning the same column create duplicate rows with AQE Key: SPARK-32753 URL: https://issues.apache.org/jira/browse/SPARK-32753 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Manu Zhang To reproduce: spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") val df = spark.sql("select id from v1 group by id distribute by id") println(df.collect().toArray.mkString(",")) println(df.queryExecution.executedPlan)// With AQE[4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] AdaptiveSparkPlan(isFinalPlan=true) +- CustomShuffleReader local +- ShuffleQueryStage 0 +- Exchange hashpartitioning(id#183L, 10), true +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) +- Union :- *(1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2)// Without AQE[4],[7],[0],[6],[8],[3],[2],[5],[1],[9] *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Exchange hashpartitioning(id#206L, 10), true +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Union :- *(1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
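A hedged way to confirm the duplicates come from the adaptive plan (the issue's own output shows 10 rows without AQE) is to rerun the reproduction with AQE toggled off; this is a diagnostic sketch, not a confirmed workaround:

```scala
// Sketch: rerun the reproduction with AQE disabled and compare row counts.
// `spark` is an existing SparkSession (e.g. in spark-shell).
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
val df = spark.sql("select id from v1 group by id distribute by id")
// Per the issue, without AQE this yields 10 distinct ids rather than 20 rows.
println(df.collect().length)
```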
[jira] [Created] (SPARK-32698) Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE
Manu Zhang created SPARK-32698: -- Summary: Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE Key: SPARK-32698 URL: https://issues.apache.org/jira/browse/SPARK-32698 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Manu Zhang Currently in AQE, when coalescing shuffle partitions, {quote}We fall back to Spark default parallelism if the minimum number of coalesced partitions is not set, so to avoid perf regressions compared to no coalescing. {quote} From our experience, this has resulted in a lot of uncertainty in the number of tasks after coalescing, especially with dynamic allocation, and has also led to many small output files. It's complex and hard to reason about. Hence, I'm proposing not falling back to the default parallelism but coalescing towards the target size when the minimum number of coalesced partitions is not set. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
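The configs involved can be sketched as follows (names as of Spark 3.0; verify against your version). Today, leaving the minimum unset makes coalescing fall back to `spark.default.parallelism`; the proposal is to coalesce purely towards the advisory target size instead:

```scala
// Sketch only: AQE partition-coalescing configs (Spark 3.0 names, assumed).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  // Target size that coalescing aims for.
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
  // Setting this explicitly is the current way to avoid the fallback
  // to spark.default.parallelism that the issue argues against.
  .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
  .getOrCreate()
```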
[jira] [Created] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE
Manu Zhang created SPARK-32083: -- Summary: Unnecessary tasks are launched when input is empty with AQE Key: SPARK-32083 URL: https://issues.apache.org/jira/browse/SPARK-32083 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Manu Zhang [https://github.com/apache/spark/pull/28226] was meant to avoid launching unnecessary tasks for 0-size partitions when AQE is enabled. However, when all partitions are empty, the number of partitions will be `spark.sql.adaptive.coalescePartitions.initialPartitionNum`, and (a lot of) unnecessary tasks are launched in this case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
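A hedged sketch of the setup where this bites (config name from the issue; the task count is the reported behavior, not a guarantee):

```scala
// Sketch: with AQE on and a large initialPartitionNum, a shuffle whose
// input is entirely empty may still launch initialPartitionNum tasks,
// per the issue. `spark` is an existing SparkSession.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")
val df = spark.range(0).groupBy("id").count() // all shuffle partitions empty
df.collect() // inspect the stage's task count in the Spark UI
```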
[jira] [Created] (SPARK-31942) Revert SPARK-31864 Adjust AQE skew join trigger condition
Manu Zhang created SPARK-31942: -- Summary: Revert SPARK-31864 Adjust AQE skew join trigger condition Key: SPARK-31942 URL: https://issues.apache.org/jira/browse/SPARK-31942 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Manu Zhang As discussed in [https://github.com/apache/spark/pull/28669#issuecomment-641044531], revert SPARK-31864 for optimizing skew join to work for extremely clustered keys. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31870) AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test has no skew join
[ https://issues.apache.org/jira/browse/SPARK-31870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-31870: --- Summary: AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test has no skew join (was: AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test doesn't optimize skew join at all) > AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional > shuffle" test has no skew join > - > > Key: SPARK-31870 > URL: https://issues.apache.org/jira/browse/SPARK-31870 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > > Due to incorrect configurations of > - spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes > - spark.sql.adaptive.advisoryPartitionSizeInBytes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31870) AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test doesn't optimize skew join at all
Manu Zhang created SPARK-31870: -- Summary: AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test doesn't optimize skew join at all Key: SPARK-31870 URL: https://issues.apache.org/jira/browse/SPARK-31870 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.0.0 Reporter: Manu Zhang Due to incorrect configurations of - spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes - spark.sql.adaptive.advisoryPartitionSizeInBytes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30331) The final AdaptiveSparkPlan event is not marked with `isFinalPlan=true`
[ https://issues.apache.org/jira/browse/SPARK-30331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-30331: --- Parent: SPARK-31412 Issue Type: Sub-task (was: Bug) > The final AdaptiveSparkPlan event is not marked with `isFinalPlan=true` > --- > > Key: SPARK-30331 > URL: https://issues.apache.org/jira/browse/SPARK-30331 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.0.0 > > > This is because the final AdaptiveSparkPlan event is sent out before the > {{isFinalPlan}} variable is set to `true`. This would break any listener attempting > to catch the final event by pattern matching on `isFinalPlan=true` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31658) SQL UI doesn't show write commands of AQE plan
[ https://issues.apache.org/jira/browse/SPARK-31658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-31658: --- Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > SQL UI doesn't show write commands of AQE plan > -- > > Key: SPARK-31658 > URL: https://issues.apache.org/jira/browse/SPARK-31658 > Project: Spark > Issue Type: Sub-task > Components: SQL, Web UI >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Assignee: Manu Zhang >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31658) SQL UI doesn't show write commands of AQE plan
Manu Zhang created SPARK-31658: -- Summary: SQL UI doesn't show write commands of AQE plan Key: SPARK-31658 URL: https://issues.apache.org/jira/browse/SPARK-31658 Project: Spark Issue Type: Improvement Components: SQL, Web UI Affects Versions: 3.0.0 Reporter: Manu Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics
Manu Zhang created SPARK-31646: -- Summary: Remove unused registeredConnections counter from ShuffleMetrics Key: SPARK-31646 URL: https://issues.apache.org/jira/browse/SPARK-31646 Project: Spark Issue Type: Improvement Components: Deploy, Shuffle Affects Versions: 3.0.0 Reporter: Manu Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31611) Register NettyMemoryMetrics into Node Manager's metrics system
Manu Zhang created SPARK-31611: -- Summary: Register NettyMemoryMetrics into Node Manager's metrics system Key: SPARK-31611 URL: https://issues.apache.org/jira/browse/SPARK-31611 Project: Spark Issue Type: Improvement Components: Shuffle, YARN Affects Versions: 3.0.0 Reporter: Manu Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31219) YarnShuffleService doesn't close idle netty channel
[ https://issues.apache.org/jira/browse/SPARK-31219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-31219: --- Description: Recently, we find our YarnShuffleService has a lot of [half-open connections|https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html] where shuffle servers' connections are active while clients have already closed. For example, from server's `ss -nt sport = :7337` output we have {code:java} ESTAB 0 0 server:7337 client:port {code} However, on client `ss -nt dport =: 7337 | grep server` would return nothing. Looking at the code, `YarnShuffleService` creates a `TransportContext` with `closeIdleConnections` set to false. {code:java} public class YarnShuffleService extends AuxiliaryService { ... @Override protected void serviceInit(Configuration conf) throws Exception { ... transportContext = new TransportContext(transportConf, blockHandler); ... } ... } public class TransportContext implements Closeable { ... public TransportContext(TransportConf conf, RpcHandler rpcHandler) { this(conf, rpcHandler, false, false); } public TransportContext(TransportConf conf, RpcHandler rpcHandler, boolean closeIdleConnections) { this(conf, rpcHandler, closeIdleConnections, false); } ... }{code} Hence, it's possible the channel may never get closed at server side if the server misses the event that the client has closed it. I find that parameter is true for `ExternalShuffleService`. Is there any reason for the difference here ? Can we enable closeIdleConnections in YarnShuffleService or at least add a configuration to enable it ? was: Recently, we find our YarnShuffleService has a lot of [half-open connections|https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html] where shuffle servers' connections are active while clients have already closed. 
For example, from server's `ss -nt sport = :7337` output we have {code:java} ESTAB 0 0 server:7337 client:port {code} However, on client `ss -nt dport =: 7337 | grep server` would return nothing. Looking at the code, `YarnShuffleService` creates a `TransportContext` with `closeIdleConnections` set to false. {code:java} public class YarnShuffleService extends AuxiliaryService { ... @Override protected void serviceInit(Configuration conf) throws Exception { ... transportContext = new TransportContext(transportConf, blockHandler); ... } ... } public class TransportContext implements Closeable { ... public TransportContext(TransportConf conf, RpcHandler rpcHandler) { this(conf, rpcHandler, false, false); } public TransportContext(TransportConf conf, RpcHandler rpcHandler, boolean closeIdleConnections) { this(conf, rpcHandler, closeIdleConnections, false); } ... }{code} Hence, it's possible the channel may never get closed at server side if the server misses the event that the client has closed it. I find that parameter is true for `ExternalShuffleService`. Is there any reason for the difference here ? Will it be valuable to add a configuration to allow enabling closeIdleConnections ? > YarnShuffleService doesn't close idle netty channel > --- > > Key: SPARK-31219 > URL: https://issues.apache.org/jira/browse/SPARK-31219 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 2.4.5, 3.0.0 >Reporter: Manu Zhang >Priority: Major > > Recently, we find our YarnShuffleService has a lot of [half-open > connections|https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html] > where shuffle servers' connections are active while clients have already > closed. > For example, from server's `ss -nt sport = :7337` output we have > {code:java} > ESTAB 0 0 server:7337 client:port > {code} > However, on client `ss -nt dport =: 7337 | grep server` would return nothing. 
> Looking at the code, `YarnShuffleService` creates a `TransportContext` with > `closeIdleConnections` set to false. > {code:java} > public class YarnShuffleService extends AuxiliaryService { > ... > @Override protected void serviceInit(Configuration conf) throws Exception > { > ... > transportContext = new TransportContext(transportConf, blockHandler); > ... > } > ... > } > public class TransportContext implements Closeable { > ... > public TransportContext(TransportConf conf, RpcHandler rpcHandler) { > this(conf, rpcHandler, false, false); > } > public TransportContext(TransportConf conf, RpcHandler rpcHandler, boolean > closeIdleConnections) { > this(conf, rpcHandler, closeIdleConnections, false); > } > ... > }{code} > Hence, it's possible the channel may never get closed at server side if the server misses the event that the client has closed it.
[jira] [Created] (SPARK-31219) YarnShuffleService doesn't close idle netty channel
Manu Zhang created SPARK-31219: -- Summary: YarnShuffleService doesn't close idle netty channel Key: SPARK-31219 URL: https://issues.apache.org/jira/browse/SPARK-31219 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 2.4.5, 3.0.0 Reporter: Manu Zhang Recently, we find our YarnShuffleService has a lot of [half-open connections|https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html] where shuffle servers' connections are active while clients have already closed. For example, from server's `ss -nt sport = :7337` output we have {code:java} ESTAB 0 0 server:7337 client:port {code} However, on client `ss -nt dport =: 7337 | grep server` would return nothing. Looking at the code, `YarnShuffleService` creates a `TransportContext` with `closeIdleConnections` set to false. {code:java} public class YarnShuffleService extends AuxiliaryService { ... @Override protected void serviceInit(Configuration conf) throws Exception { ... transportContext = new TransportContext(transportConf, blockHandler); ... } ... } public class TransportContext implements Closeable { ... public TransportContext(TransportConf conf, RpcHandler rpcHandler) { this(conf, rpcHandler, false, false); } public TransportContext(TransportConf conf, RpcHandler rpcHandler, boolean closeIdleConnections) { this(conf, rpcHandler, closeIdleConnections, false); } ... }{code} Hence, it's possible the channel may never get closed at server side if the server misses the event that the client has closed it. I find that parameter is true for `ExternalShuffleService`. Is there any reason for the difference here ? Will it be valuable to add a configuration to allow enabling closeIdleConnections ? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
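The change being asked for can be sketched against the constructors quoted in the issue (a sketch, not the actual patch; `transportConf` and `blockHandler` are the values `YarnShuffleService.serviceInit` already builds):

```scala
// Sketch: pass closeIdleConnections = true via the three-argument
// TransportContext constructor quoted above, so idle channels get reaped
// the same way ExternalShuffleService's do.
import org.apache.spark.network.TransportContext

val transportContext =
  new TransportContext(transportConf, blockHandler, /* closeIdleConnections */ true)
```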
[jira] [Updated] (SPARK-31047) Improve file listing for ViewFileSystem
[ https://issues.apache.org/jira/browse/SPARK-31047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-31047: --- Component/s: (was: Input/Output) SQL > Improve file listing for ViewFileSystem > --- > > Key: SPARK-31047 > URL: https://issues.apache.org/jira/browse/SPARK-31047 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Manu Zhang >Priority: Minor > > https://issues.apache.org/jira/browse/SPARK-27801 has improved file listing > for DistributedFileSystem, where {{InMemoryFileIndex.listLeafFiles}} makes > a single {{listLocatedStatus}} call to the namenode. > This ticket intends to improve the case where ViewFileSystem is used to > manage multiple DistributedFileSystems. ViewFileSystem also overrides the > {{listLocatedStatus}} method, delegating to the filesystem it resolves to, > e.g. DistributedFileSystem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31047) Improve file listing for ViewFileSystem
Manu Zhang created SPARK-31047: -- Summary: Improve file listing for ViewFileSystem Key: SPARK-31047 URL: https://issues.apache.org/jira/browse/SPARK-31047 Project: Spark Issue Type: Improvement Components: Input/Output Affects Versions: 3.1.0 Reporter: Manu Zhang https://issues.apache.org/jira/browse/SPARK-27801 has improved file listing for DistributedFileSystem, where {{InMemoryFileIndex.listLeafFiles}} makes a single {{listLocatedStatus}} call to the namenode. This ticket intends to improve the case where ViewFileSystem is used to manage multiple DistributedFileSystems. ViewFileSystem also overrides the {{listLocatedStatus}} method, delegating to the filesystem it resolves to, e.g. DistributedFileSystem. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
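The delegation being described can be sketched with the Hadoop FileSystem API (the viewfs mount point below is a placeholder):

```scala
// Sketch: a single listLocatedStatus call returns LocatedFileStatus entries
// with block locations included, avoiding per-file getFileBlockLocations
// round trips; ViewFileSystem delegates it to the resolved filesystem.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val path = new Path("viewfs://cluster/tmp/data") // hypothetical mount point
val fs = path.getFileSystem(new Configuration())
val it = fs.listLocatedStatus(path)
while (it.hasNext) {
  val status = it.next()
  println(s"${status.getPath} blocks=${status.getBlockLocations.length}")
}
```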
[jira] [Created] (SPARK-30331) The final AdaptiveSparkPlan event is not marked with `isFinalPlan=true`
Manu Zhang created SPARK-30331: -- Summary: The final AdaptiveSparkPlan event is not marked with `isFinalPlan=true` Key: SPARK-30331 URL: https://issues.apache.org/jira/browse/SPARK-30331 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Manu Zhang This is because the final AdaptiveSparkPlan event is sent out before the {{isFinalPlan}} variable is set to `true`. This would break any listener attempting to catch the final event by pattern matching on `isFinalPlan=true` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
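The kind of listener that this bug breaks can be sketched as follows (event class and field names as in Spark 3.0, treated as assumptions; matching on the plan description string is one common pattern):

```scala
// Sketch: a listener that tries to catch the final adaptive plan by looking
// for "isFinalPlan=true" in the plan description -- the pattern that fails
// here because the event is emitted before the flag flips.
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate

class FinalPlanListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLAdaptiveExecutionUpdate
        if e.physicalPlanDescription.contains("isFinalPlan=true") =>
      println(s"final plan for execution ${e.executionId}")
    case _ => // ignore other events
  }
}
// Registered via spark.sparkContext.addSparkListener(new FinalPlanListener)
```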
[jira] [Created] (SPARK-30217) Improve partition pruning rule to enforce idempotence
Manu Zhang created SPARK-30217: -- Summary: Improve partition pruning rule to enforce idempotence Key: SPARK-30217 URL: https://issues.apache.org/jira/browse/SPARK-30217 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Manu Zhang The rule was added to blacklist in https://github.com/apache/spark/commit/a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-a636a87d8843eeccca90140be91d4fafR52 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29793) Display plan evolve history in AQE
[ https://issues.apache.org/jira/browse/SPARK-29793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983339#comment-16983339 ] Manu Zhang commented on SPARK-29793: [~maryannxue], Is the {{AdaptiveSparkPlan(isFinal=true)}} graph currently available on Spark UI or printed anywhere ? > Display plan evolve history in AQE > -- > > Key: SPARK-29793 > URL: https://issues.apache.org/jira/browse/SPARK-29793 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Priority: Major > > To have a way to present the entire evolve history of an AQE plan. This can > be done in two stages: > # Enable an option for printing the plan changes on the go. > # Display history plans in Spark UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE
[ https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983155#comment-16983155 ] Manu Zhang edited comment on SPARK-29792 at 11/27/19 4:05 AM: -- I find most metrics (except for data sources) are missing from SQL execution graph for AQE with subqueries. I suppose this Jira will fix it ? was (Author: mauzhang): I find most metrics are missing from SQL execution graph for AQE with subqueries. I suppose this Jira will fix it ? > SQL metrics cannot be updated to subqueries in AQE > -- > > Key: SPARK-29792 > URL: https://issues.apache.org/jira/browse/SPARK-29792 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Ke Jia >Priority: Major > > After merged [SPARK-28583|https://issues.apache.org/jira/browse/SPARK-28583], > the subqueries info can not be updated in AQE. And this Jira will fix it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE
[ https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983155#comment-16983155 ] Manu Zhang commented on SPARK-29792: I find most metrics are missing from SQL execution graph for AQE with subqueries. I suppose this Jira will fix it ? > SQL metrics cannot be updated to subqueries in AQE > -- > > Key: SPARK-29792 > URL: https://issues.apache.org/jira/browse/SPARK-29792 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Ke Jia >Priority: Major > > After merged [SPARK-28583|https://issues.apache.org/jira/browse/SPARK-28583], > the subqueries info can not be updated in AQE. And this Jira will fix it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE
[ https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-29792: --- Attachment: image-2019-11-27-11-46-01-587.png > SQL metrics cannot be updated to subqueries in AQE > -- > > Key: SPARK-29792 > URL: https://issues.apache.org/jira/browse/SPARK-29792 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Ke Jia >Priority: Major > > After merged [SPARK-28583|https://issues.apache.org/jira/browse/SPARK-28583], > the subqueries info can not be updated in AQE. And this Jira will fix it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE
[ https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manu Zhang updated SPARK-29792: --- Attachment: (was: image-2019-11-27-11-46-01-587.png) > SQL metrics cannot be updated to subqueries in AQE > -- > > Key: SPARK-29792 > URL: https://issues.apache.org/jira/browse/SPARK-29792 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Ke Jia >Priority: Major > > After merged [SPARK-28583|https://issues.apache.org/jira/browse/SPARK-28583], > the subqueries info can not be updated in AQE. And this Jira will fix it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-28531) Improve Extract Python UDFs optimizer rule to enforce idempotence
[ https://issues.apache.org/jira/browse/SPARK-28531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977109#comment-16977109 ] Manu Zhang commented on SPARK-28531: [~manifoldQAQ], I saw the PR was closed. What's the status of this issue ? Any plan to submit a new PR ? > Improve Extract Python UDFs optimizer rule to enforce idempotence > - > > Key: SPARK-28531 > URL: https://issues.apache.org/jira/browse/SPARK-28531 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yesheng Ma >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27689) Error to execute hive views with spark
[ https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897948#comment-16897948 ] Manu Zhang commented on SPARK-27689: [~lambda], [~hyukjin.kwon], [~yumwang], This issue should be fixed by [PR#24960|https://github.com/apache/spark/pull/24960] ([PR#25068|https://github.com/apache/spark/pull/25068] for branch-2.4 and [PR#25293|https://github.com/apache/spark/pull/25293] for 2.3), where analyzing a View is deferred to the optimizer. Before that, the View was analyzed first, and JOIN would try to deduplicate conflicting attributes (since you are joining on the same id_person column) by *replacing* the ids of the View's plan in the *right* branch. That's where things went wrong. Take a look at the logical plan under View {code:java} +- View (`schema_p`.`person_product_v`, [id_person#76,id_product#77,country#78,city#79,price#80,start_date#81,end_date#82]) +- !Project [cast(id_person#103 as int) AS id_person#76, cast(id_product#104 as int) AS id_product#77, cast(country#105 as string) AS country#78, cast(city#106 as string) AS city#79, cast(price#107 as decimal(38,8)) AS price#80, cast(start_date#108 as date) AS start_date#81, cast(end_date#109 as date) AS end_date#82] +- Project [cast(id_person#7 as int) AS id_person#0, cast(id_product#8 as int) AS id_product#1, cast(country#9 as string) AS country#2, cast(city#10 as string) AS city#3, cast(price#11 as decimal(38,8)) AS price#4, cast(start_date#12 as date) AS start_date#5, cast(end_date#13 as date) AS end_date#6] {code} The input of the {{!Project}} node (e.g. id_person#103) didn't match the output of its child (e.g. id_person#0). That's because id_person#103 replaced id_person#0 during deduplication, while id_person#0 itself could not be replaced since it's not an attribute (it's an Alias). 
This can be reproduced with a UT like {code} test("SparkSQL failed to resolve attributes with nested self-joins on hive view") { withTable("hive_table") { withView("hive_view", "temp_view1", "temp_view2") { sql("CREATE TABLE hive_table AS SELECT 1 AS id") sql("CREATE VIEW hive_view AS SELECT id FROM hive_table") sql("CREATE TEMPORARY VIEW temp_view1 AS SELECT id FROM hive_view") sql("CREATE TEMPORARY VIEW temp_view2 AS SELECT a.id " + "FROM temp_view1 AS a JOIN temp_view1 AS b ON a.id = b.id") val df = sql("SELECT c.id FROM temp_view1 AS c JOIN temp_view2 AS d ON c.id = d.id") checkAnswer(df, Row(1)) } } } {code} > Error to execute hive views with spark > -- > > Key: SPARK-27689 > URL: https://issues.apache.org/jira/browse/SPARK-27689 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0, 2.3.3, 2.4.3 >Reporter: Lambda >Priority: Major > > I have a python error when I execute the following code using hive views but > it works correctly when I run it with hive tables. 
> *Hive databases:* > {code:java} > CREATE DATABASE schema_p LOCATION "hdfs:///tmp/schema_p"; > {code} > *Hive tables:* > {code:java} > CREATE TABLE schema_p.product( > id_product string, > name string, > country string, > city string, > start_date string, > end_date string > ) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > LOCATION 'hdfs:///tmp/schema_p/product'; > {code} > {code:java} > CREATE TABLE schema_p.person_product( > id_person string, > id_product string, > country string, > city string, > price string, > start_date string, > end_date string > ) > ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' > LOCATION 'hdfs:///tmp/schema_p/person_product'; > {code} > *Hive views:* > {code:java} > CREATE VIEW schema_p.product_v AS SELECT CAST(id_product AS INT) AS > id_product, name AS name, country AS country, city AS city, CAST(start_date > AS DATE) AS start_date, CAST(end_date AS DATE) AS end_date FROM > schema_p.product; > > CREATE VIEW schema_p.person_product_v AS SELECT CAST(id_person AS INT) AS > id_person, CAST(id_product AS INT) AS id_product, country AS country, city AS > city, CAST(price AS DECIMAL(38,8)) AS price, CAST(start_date AS DATE) AS > start_date, CAST(end_date AS DATE) AS end_date FROM schema_p.person_product; > {code} > *Code*: > {code:java} > def read_tables(sc): > in_dict = { 'product': 'product_v', 'person_product': 'person_product_v' } > data_dict = {} > for n, d in in_dict.iteritems(): > data_dict[n] = sc.read.table(d) > return data_dict > def
[jira] [Comment Edited] (SPARK-22063) Upgrade lintr to latest commit sha1 ID
[ https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892595#comment-16892595 ] Manu Zhang edited comment on SPARK-22063 at 7/28/19 4:19 AM: - [~hyukjin.kwon], [~felixcheung], is there any update in this thread ? Which lint-r version is used on Jenkins now ? I also find installing lint-r would upgrade testthat to latest version while [SparkR requires testthat 1.0.2 |https://github.com/apache/spark/blob/master/docs/building-spark.md#running-r-tests] was (Author: mauzhang): Is there any update in this thread ? Which lint-r version is used in build now ? I also find upgrading lint-r would upgrade testthat to latest version while [SparkR requires testthat 1.0.2 |https://github.com/apache/spark/blob/master/docs/building-spark.md#running-r-tests] > Upgrade lintr to latest commit sha1 ID > -- > > Key: SPARK-22063 > URL: https://issues.apache.org/jira/browse/SPARK-22063 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Minor > > Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this > pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026]) > and SPARK-14074. > Today, I tried to upgrade the latest, > https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72 > This fixes many bugs and now finds many instances that I have observed and > thought should be caught time to time: > {code} > inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis > in a function call. > return (output) > ^ > R/column.R:241:1: style: Lines should not be more than 100 characters. > #' > \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{ > ^~~~ > R/context.R:332:1: style: Variable and function names should not be longer > than 30 characters. 
> spark.getSparkFilesRootDirectory <- function() { > ^~~~ > R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters. > #' @param j,select expression for the single Column or a list of columns to > select from the SparkDataFrame. > ^~~ > R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters. > #' @return A new SparkDataFrame containing only the rows that meet the > condition with selected columns. > ^~~ > R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a > function call. > return (joinRes) > ^ > R/DataFrame.R:2652:1: style: Variable and function names should not be longer > than 30 characters. > generateAliasesForIntersectedCols <- function (x, intersectedColNames, > suffix) { > ^ > R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a > function call. > generateAliasesForIntersectedCols <- function (x, intersectedColNames, > suffix) { > ^ > R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a > function call. > stop ("The following column name: ", newJoin, " occurs more than once > in the 'DataFrame'.", > ^ > R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters. > #' @note The statistics provided by \code{summary} were change in 2.3.0 use > \link{describe} for previous defaults. > ^~ > R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters. > #' If grouping expression is missing \code{cube} creates a single global > aggregate and is equivalent to > ^~~ > R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters. > #' If grouping expression is missing \code{rollup} creates a single global > aggregate and is equivalent to > ^ > R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a > function call. > switch (type, > ^ > R/functions.R:41:1: style: Lines should not be more than 100 characters. > #' @param x Column to compute on. 
In \code{window}, it must be a time Column > of \code{TimestampType}. >
[jira] [Commented] (SPARK-26977) Warn against subclassing scala.App doesn't work
[ https://issues.apache.org/jira/browse/SPARK-26977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775783#comment-16775783 ] Manu Zhang commented on SPARK-26977: I'd love to > Warn against subclassing scala.App doesn't work > --- > > Key: SPARK-26977 > URL: https://issues.apache.org/jira/browse/SPARK-26977 > Project: Spark > Issue Type: Bug > Components: Spark Submit >Affects Versions: 2.4.0 >Reporter: Manu Zhang >Priority: Minor > > As per the discussion in > [PR#3497|https://github.com/apache/spark/pull/3497#discussion_r258412735], > the warning against subclassing scala.App doesn't work. For example, > {code:scala} > object Test extends scala.App { >// spark code > } > {code} > Scala will compile {{object Test}} into two Java classes: {{Test}}, passed in > by the user, and {{Test$}}, which subclasses {{scala.App}}. The current code checks against > {{Test}}, and thus there will be no warning when the user's application subclasses > {{scala.App}}
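The mechanics of the bug can be illustrated without the JVM. This is a hypothetical sketch, not Spark's actual SparkSubmit code: Scala compiles `object Test extends scala.App` into `Test` (the user-facing entry point) and the singleton `Test$`, and only `Test$` extends `scala.App`, so a working check must look up the `$`-suffixed companion class. Here a plain dict stands in for JVM reflection (`Class.forName` plus `isAssignableFrom`):

```python
def subclasses_scala_app(main_class, class_hierarchy):
    """Return True if the companion object of main_class extends scala.App.

    class_hierarchy maps a class name to the set of its supertypes,
    standing in for JVM reflection. Scala mangles the singleton of
    `object Test` into the class name `Test$`, which is the one that
    actually carries the scala.App supertype.
    """
    companion = main_class + "$"  # Scala's name for the singleton class
    return "scala.App" in class_hierarchy.get(companion, set())
```

Checking `main_class` itself (as the current code does) would always come back empty, which is exactly why the warning never fires.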
[jira] [Commented] (SPARK-15041) adding mode strategy for ml.feature.Imputer for categorical features
[ https://issues.apache.org/jira/browse/SPARK-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616243#comment-16616243 ] Manu Zhang commented on SPARK-15041: Is there a plan to add such strategies as min/max? > adding mode strategy for ml.feature.Imputer for categorical features > > > Key: SPARK-15041 > URL: https://issues.apache.org/jira/browse/SPARK-15041 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: yuhao yang >Priority: Minor > > Adding mode strategy for ml.feature.Imputer for categorical features. This > needs to wait until the PR for SPARK-13568 gets merged. > https://github.com/apache/spark/pull/11601 > From comments of jkbradley and Nick Pentreath in the PR > {quote} > Investigate efficiency of approaches using DataFrame/Dataset and/or approx > approaches such as frequentItems or Count-Min Sketch (will require an update > to CMS to return "heavy-hitters"). > investigate if we can use metadata to only allow mode for categorical > features (or perhaps as an easier alternative, allow mode for only Int/Long > columns) > {quote}
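The requested strategy is simple in principle: replace missing entries with the most frequent observed value. A small pure-Python sketch of mode imputation follows — illustrative only, since Spark's Imputer operates per-column on DataFrames and the ticket discusses approximate aggregations (frequentItems, Count-Min Sketch) for scale:

```python
from collections import Counter

def impute_mode(values):
    """Replace None entries with the mode of the observed values.

    Counter plays the role of the exact frequency aggregation; the
    ticket discusses frequentItems / Count-Min Sketch as approximate
    alternatives for large columns.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no non-missing values")
    mode, _ = Counter(observed).most_common(1)[0]
    return [mode if v is None else v for v in values]
```

Because the mode is well-defined for strings and integers alike, this strategy suits categorical columns where mean/median do not apply.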
[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver
[ https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165674#comment-16165674 ] Manu Zhang commented on SPARK-12837: [~todd_leo], please check out [executor side broadcast|https://issues.apache.org/jira/browse/SPARK-17556] > Spark driver requires large memory space for serialized results even there > are no data collected to the driver > -- > > Key: SPARK-12837 > URL: https://issues.apache.org/jira/browse/SPARK-12837 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 1.5.2, 1.6.0 >Reporter: Tien-Dung LE >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.2.0 > > > Executing a SQL statement with a large number of partitions requires a large > memory space on the driver even when there are no requests to collect data back > to the driver. > Here are steps to reproduce the issue. > 1. Start spark shell with a spark.driver.maxResultSize setting > {code:java} > bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m > {code} > 2. Execute the code > {code:java} > case class Toto( a: Int, b: Int) > val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF > sqlContext.setConf( "spark.sql.shuffle.partitions", "200" ) > df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK > sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString ) > df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile( > "toto2" ) // ERROR > {code} > The error message is > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Total size of serialized results of 393 tasks (1025.9 KB) is bigger than > spark.driver.maxResultSize (1024.0 KB) > {code}
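Even without a collect(), each finished task ships a serialized result (at minimum, status and accumulator metadata) back to the driver, and the driver aborts the job once the accumulated size passes spark.driver.maxResultSize — with 1,000 shuffle partitions, a 1 MB limit is easily exceeded, as the 393-task error above shows. A rough stdlib sketch of that guard (hypothetical; not Spark's actual TaskSetManager code):

```python
def check_result_size(task_result_sizes, max_result_size):
    """Mimic the driver-side guard: abort once accumulated serialized
    task-result bytes exceed the configured maxResultSize.

    task_result_sizes: per-task serialized result sizes in bytes.
    max_result_size: limit in bytes (spark.driver.maxResultSize).
    Returns the total size if the limit is never exceeded.
    """
    total = 0
    for i, size in enumerate(task_result_sizes, start=1):
        total += size
        if total > max_result_size:
            raise RuntimeError(
                f"Total size of serialized results of {i} tasks "
                f"({total} B) is bigger than maxResultSize "
                f"({max_result_size} B)")
    return total
```

This is why bumping the partition count from 200 to 1,000 flips the job from OK to ERROR: the per-task overhead is roughly constant, so the accumulated total scales with the number of tasks.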