[jira] [Commented] (SPARK-44719) NoClassDefFoundError when using Hive UDF

2023-08-08 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17752049#comment-17752049
 ] 

Manu Zhang commented on SPARK-44719:


Is there a 2.3.10 release?

> NoClassDefFoundError when using Hive UDF
> 
>
> Key: SPARK-44719
> URL: https://issues.apache.org/jira/browse/SPARK-44719
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
> Attachments: HiveUDFs-1.0-SNAPSHOT.jar
>
>
> How to reproduce:
> {noformat}
> spark-sql (default)> add jar 
> /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar;
> Time taken: 0.413 seconds
> spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 
> 'net.petrabarus.hiveudfs.LongToIP';
> Time taken: 0.038 seconds
> spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10);
> 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT 
> long_to_ip(2130706433L) FROM range(10)]
> java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory
>   at org.apache.hadoop.hive.ql.udf.UDFJson.<clinit>(UDFJson.java:64)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:348)
> ...
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adds running executors after processing completed containers

2023-05-15 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-43510:
---
Summary: Spark application hangs when YarnAllocator adds running executors 
after processing completed containers  (was: Spark application hangs when 
YarnAllocator adding running executors after processing completed containers)

> Spark application hangs when YarnAllocator adds running executors after 
> processing completed containers
> ---
>
> Key: SPARK-43510
> URL: https://issues.apache.org/jira/browse/SPARK-43510
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.4.0
>Reporter: Manu Zhang
>Priority: Major
>
> I see application hangs when containers are preempted immediately after 
> allocation as follows.
> {code:java}
> 23/05/14 09:11:33 INFO YarnAllocator: Launching container 
> container_e3812_1684033797982_57865_01_000382 on host 
> hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with 
> ID 277 for ResourceProfile Id 0 
> 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
> container_e3812_1684033797982_57865_01_000382
> 23/05/14 09:11:33 INFO YarnAllocator: Completed container 
> container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
> -102)
> 23/05/14 09:11:33 INFO YarnAllocator: Container 
> container_e3812_1684033797982_57865_01_000382 was preempted.{code}
> Note the warning log where YarnAllocator cannot find executorId for the 
> container when processing completed containers. The only plausible cause is 
> YarnAllocator added the running executor after processing completed 
> containers. The former happens in a separate thread after executor launch.
> YarnAllocator believes there are still running executors, although they are 
> already lost due to preemption. Hence, the application hangs without any 
> running executors.
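
Below is a minimal, self-contained sketch of the suspected race (hypothetical 
names and simplified state, not Spark's actual YarnAllocator code): the executor 
is registered for its container on a separate launcher thread, so the allocator 
thread can process the container's completion first, fail to find an executorId, 
and never decrement the running-executor count.
{code:scala}
// Hypothetical sketch of the suspected race; names and state are simplified
// and do NOT mirror Spark's actual YarnAllocator implementation.
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

object AllocatorRaceSketch {
  // containerId -> executorId, analogous to the allocator's internal lookup map
  private val executorIdForContainer = new ConcurrentHashMap[String, String]()
  private val runningExecutors = new AtomicInteger(0)

  // Runs on a separate launcher thread, possibly AFTER the completion event.
  def onContainerLaunched(containerId: String, executorId: String): Unit = {
    executorIdForContainer.put(containerId, executorId)
    runningExecutors.incrementAndGet()
  }

  // Runs on the allocator thread when YARN reports the container as completed.
  def onContainerCompleted(containerId: String): Unit = {
    val executorId = executorIdForContainer.remove(containerId)
    if (executorId == null) {
      // Mirrors the WARN above: the executor is not registered yet, so the
      // running-executor count is never decremented even though the container
      // (and its executor) is already gone.
      println(s"WARN Cannot find executorId for container: $containerId")
    } else {
      runningExecutors.decrementAndGet()
    }
  }

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(2)
    // Completion is processed before the launcher thread registers the executor.
    pool.submit(new Runnable { def run(): Unit = onContainerCompleted("container_000382") })
    pool.submit(new Runnable { def run(): Unit = onContainerLaunched("container_000382", "277") })
    pool.shutdown()
    pool.awaitTermination(5, TimeUnit.SECONDS)
    println(s"runningExecutors = ${runningExecutors.get()} (stuck at 1 when the race hits)")
  }
}
{code}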



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adding running executors after processing completed containers

2023-05-15 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-43510:
---
Description: 
I see application hangs when containers are preempted immediately after 
allocation as follows.
{code:java}
23/05/14 09:11:33 INFO YarnAllocator: Launching container 
container_e3812_1684033797982_57865_01_000382 on host 
hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 
277 for ResourceProfile Id 0 
23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
container_e3812_1684033797982_57865_01_000382
23/05/14 09:11:33 INFO YarnAllocator: Completed container 
container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
-102)
23/05/14 09:11:33 INFO YarnAllocator: Container 
container_e3812_1684033797982_57865_01_000382 was preempted.{code}
Note the warning log where YarnAllocator cannot find executorId for the 
container when processing completed containers. The only plausible cause is 
YarnAllocator added the running executor after processing completed containers. 
The former happens in a separate thread after executor launch.

YarnAllocator believes there are still running executors, although they are 
already lost due to preemption. Hence, the application hangs without any 
running executors.

  was:
I see application hangs when containers are preempted immediately after 
allocation as follows.
{code:java}
23/05/14 09:11:33 INFO YarnAllocator: Launching container 
container_e3812_1684033797982_57865_01_000382 on host 
hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 
277 for ResourceProfile Id 0 
23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
container_e3812_1684033797982_57865_01_000382
23/05/14 09:11:33 INFO YarnAllocator: Completed container 
container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
-102)
23/05/14 09:11:33 INFO YarnAllocator: Container 
container_e3812_1684033797982_57865_01_000382 was preempted.{code}
Note the warning log where YarnAllocator cannot find executorId for the 
container when processing completed containers. The only plausible cause is 
YarnAllocator processing completed container before updating internal state and 
adding the executorId. The latter happens in a separate thread after executor 
launch.

YarnAllocator thought


> Spark application hangs when YarnAllocator adding running executors after 
> processing completed containers
> -
>
> Key: SPARK-43510
> URL: https://issues.apache.org/jira/browse/SPARK-43510
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.4.0
>Reporter: Manu Zhang
>Priority: Major
>
> I see application hangs when containers are preempted immediately after 
> allocation as follows.
> {code:java}
> 23/05/14 09:11:33 INFO YarnAllocator: Launching container 
> container_e3812_1684033797982_57865_01_000382 on host 
> hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with 
> ID 277 for ResourceProfile Id 0 
> 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
> container_e3812_1684033797982_57865_01_000382
> 23/05/14 09:11:33 INFO YarnAllocator: Completed container 
> container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
> -102)
> 23/05/14 09:11:33 INFO YarnAllocator: Container 
> container_e3812_1684033797982_57865_01_000382 was preempted.{code}
> Note the warning log where YarnAllocator cannot find executorId for the 
> container when processing completed containers. The only plausible cause is 
> YarnAllocator added the running executor after processing completed 
> containers. The former happens in a separate thread after executor launch.
> YarnAllocator believes there are still running executors, although they are 
> already lost due to preemption. Hence, the application hangs without any 
> running executors.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator adding running executors after processing completed containers

2023-05-15 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-43510:
---
Summary: Spark application hangs when YarnAllocator adding running 
executors after processing completed containers  (was: Spark application hangs 
when YarnAllocator processing completed containers before updating internal 
state)

> Spark application hangs when YarnAllocator adding running executors after 
> processing completed containers
> -
>
> Key: SPARK-43510
> URL: https://issues.apache.org/jira/browse/SPARK-43510
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.4.0
>Reporter: Manu Zhang
>Priority: Major
>
> I see application hangs when containers are preempted immediately after 
> allocation as follows.
> {code:java}
> 23/05/14 09:11:33 INFO YarnAllocator: Launching container 
> container_e3812_1684033797982_57865_01_000382 on host 
> hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with 
> ID 277 for ResourceProfile Id 0 
> 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
> container_e3812_1684033797982_57865_01_000382
> 23/05/14 09:11:33 INFO YarnAllocator: Completed container 
> container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
> -102)
> 23/05/14 09:11:33 INFO YarnAllocator: Container 
> container_e3812_1684033797982_57865_01_000382 was preempted.{code}
> Note the warning log where YarnAllocator cannot find executorId for the 
> container when processing completed containers. The only plausible cause is 
> YarnAllocator processing completed container before updating internal state 
> and adding the executorId. The latter happens in a separate thread after 
> executor launch.
> YarnAllocator thought



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-43510) Spark application hangs when YarnAllocator processing completed containers before updating internal state

2023-05-15 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-43510:
---
Description: 
I see application hangs when containers are preempted immediately after 
allocation as follows.
{code:java}
23/05/14 09:11:33 INFO YarnAllocator: Launching container 
container_e3812_1684033797982_57865_01_000382 on host 
hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 
277 for ResourceProfile Id 0 
23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
container_e3812_1684033797982_57865_01_000382
23/05/14 09:11:33 INFO YarnAllocator: Completed container 
container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
-102)
23/05/14 09:11:33 INFO YarnAllocator: Container 
container_e3812_1684033797982_57865_01_000382 was preempted.{code}
Note the warning log where YarnAllocator cannot find executorId for the 
container when processing completed containers. The only plausible cause is 
YarnAllocator processing completed container before updating internal state and 
adding the executorId. The latter happens in a separate thread after executor 
launch.

YarnAllocator thought

  was:
I see application hangs when containers are preempted immediately after 
allocation as follows.
{code:java}
23/05/14 09:11:33 INFO YarnAllocator: Launching container 
container_e3812_1684033797982_57865_01_000382 on host 
hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 
277 for ResourceProfile Id 0 
23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
container_e3812_1684033797982_57865_01_000382
23/05/14 09:11:33 INFO YarnAllocator: Completed container 
container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
-102)
23/05/14 09:11:33 INFO YarnAllocator: Container 
container_e3812_1684033797982_57865_01_000382 was preempted. {code}
Note the warning log where YarnAllocator cannot find executorId for the 
container when processing completed containers. The only plausible cause is 
YarnAllocator processing completed container before updating internal state and 
adding the executorId. The latter happens in a separate thread after executor 
launch.


> Spark application hangs when YarnAllocator processing completed containers 
> before updating internal state
> -
>
> Key: SPARK-43510
> URL: https://issues.apache.org/jira/browse/SPARK-43510
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 3.4.0
>Reporter: Manu Zhang
>Priority: Major
>
> I see application hangs when containers are preempted immediately after 
> allocation as follows.
> {code:java}
> 23/05/14 09:11:33 INFO YarnAllocator: Launching container 
> container_e3812_1684033797982_57865_01_000382 on host 
> hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with 
> ID 277 for ResourceProfile Id 0 
> 23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
> container_e3812_1684033797982_57865_01_000382
> 23/05/14 09:11:33 INFO YarnAllocator: Completed container 
> container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
> -102)
> 23/05/14 09:11:33 INFO YarnAllocator: Container 
> container_e3812_1684033797982_57865_01_000382 was preempted.{code}
> Note the warning log where YarnAllocator cannot find executorId for the 
> container when processing completed containers. The only plausible cause is 
> YarnAllocator processing completed container before updating internal state 
> and adding the executorId. The latter happens in a separate thread after 
> executor launch.
> YarnAllocator thought



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-43510) Spark application hangs when YarnAllocator processing completed containers before updating internal state

2023-05-15 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-43510:
--

 Summary: Spark application hangs when YarnAllocator processing 
completed containers before updating internal state
 Key: SPARK-43510
 URL: https://issues.apache.org/jira/browse/SPARK-43510
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 3.4.0
Reporter: Manu Zhang


I see application hangs when containers are preempted immediately after 
allocation as follows.
{code:java}
23/05/14 09:11:33 INFO YarnAllocator: Launching container 
container_e3812_1684033797982_57865_01_000382 on host 
hdc42-mcc10-01-0910-4207-015-tess0028.stratus.rno.ebay.com for executor with ID 
277 for ResourceProfile Id 0 
23/05/14 09:11:33 WARN YarnAllocator: Cannot find executorId for container: 
container_e3812_1684033797982_57865_01_000382
23/05/14 09:11:33 INFO YarnAllocator: Completed container 
container_e3812_1684033797982_57865_01_000382 (state: COMPLETE, exit status: 
-102)
23/05/14 09:11:33 INFO YarnAllocator: Container 
container_e3812_1684033797982_57865_01_000382 was preempted. {code}
Note the warning log where YarnAllocator cannot find executorId for the 
container when processing completed containers. The only plausible cause is 
YarnAllocator processing completed container before updating internal state and 
adding the executorId. The latter happens in a separate thread after executor 
launch.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39430) The inconsistent timezone in Spark History Server UI

2023-03-14 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17700074#comment-17700074
 ] 

Manu Zhang commented on SPARK-39430:


We are using Spark 3.1.1.

All the UI pages are in the server timezone except for task LaunchTime, which is 
in the client timezone due to https://issues.apache.org/jira/browse/SPARK-32068

> The inconsistent timezone in Spark History Server UI
> 
>
> Key: SPARK-39430
> URL: https://issues.apache.org/jira/browse/SPARK-39430
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Submit, Web UI
>Affects Versions: 3.2.1
>Reporter: Surbhi
>Priority: Major
> Attachments: Screenshot 2022-06-10 at 12.59.36 AM.png, Screenshot 
> 2022-06-10 at 12.59.50 AM.png
>
>
> The Spark history server is running in the UTC timezone, but we are trying to 
> view the history server in the IST timezone.
> The history server landing page shows time in IST, but the Jobs, Stages, 
> Storage, Environment, and Executors tabs show time in UTC.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39672) NotExists subquery failed with conflicting attributes

2022-07-04 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-39672:
--

 Summary: NotExists subquery failed with conflicting attributes
 Key: SPARK-39672
 URL: https://issues.apache.org/jira/browse/SPARK-39672
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.3
Reporter: Manu Zhang


{code:sql}
select * from
(
select v1.a, v1.b, v2.c
from v1
inner join v2
on v1.a=v2.a) t3
where not exists (
  select 1
  from v2
  where t3.a=v2.a and t3.b=v2.b and t3.c=v2.c
){code}
This query throws an AnalysisException:
{code:java}
org.apache.spark.sql.AnalysisException: Found conflicting attributes a#266 in 
the condition joining outer plan:
  Join Inner, (a#250 = a#266)
:- Project [_1#243 AS a#250, _2#244 AS b#251]
:  +- LocalRelation [_1#243, _2#244, _3#245]
+- Project [_1#259 AS a#266, _3#261 AS c#268]
   +- LocalRelation [_1#259, _2#260, _3#261]
and subplan:
  Project [1 AS 1#273, _1#259 AS a#266, _2#260 AS b#267, _3#261 AS c#268#277]
+- LocalRelation [_1#259, _2#260, _3#261] {code}
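
The views v1 and v2 are not defined in the report. A hypothetical setup (column 
names guessed from the plan above, so it may or may not hit the exact same 
conflict) that can be pasted into spark-shell on 3.1.x to try the query:
{code:scala}
// Hypothetical view definitions; only the query itself comes from the report.
import spark.implicits._

Seq((1, 10, 100), (2, 20, 200)).toDF("a", "b", "c").createOrReplaceTempView("v1")
Seq((1, 10, 100), (3, 30, 300)).toDF("a", "b", "c").createOrReplaceTempView("v2")

spark.sql("""
  SELECT * FROM (
    SELECT v1.a, v1.b, v2.c
    FROM v1 INNER JOIN v2 ON v1.a = v2.a
  ) t3
  WHERE NOT EXISTS (
    SELECT 1 FROM v2
    WHERE t3.a = v2.a AND t3.b = v2.b AND t3.c = v2.c
  )
""").show()
{code}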



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30696) Wrong result of the combination of from_utc_timestamp and to_utc_timestamp

2022-06-22 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557794#comment-17557794
 ] 

Manu Zhang commented on SPARK-30696:


[~maxgekk], any update on this?

I found another issue in 3.1.1. DST started at 2:00 am on March 13; however, the 
following query already counted the timestamp as DST.
{code:java}
select from_utc_timestamp(timestamp'2022-03-13 05:18:29.581', "US/Pacific") 
>> 2022-03-12 22:18:29.581 {code}

> Wrong result of the combination of from_utc_timestamp and to_utc_timestamp
> --
>
> Key: SPARK-30696
> URL: https://issues.apache.org/jira/browse/SPARK-30696
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Max Gekk
>Priority: Major
>
> Applying to_utc_timestamp() to results of from_utc_timestamp() should return 
> the original timestamp in the same time zone. In the range of 100 years, the 
> combination of functions returns wrong results 280 times out of 1753200:
> {code:java}
> scala> val SECS_PER_YEAR = (36525L * 24 * 60 * 60)/100
> SECS_PER_YEAR: Long = 31557600
> scala> val SECS_PER_MINUTE = 60L
> SECS_PER_MINUTE: Long = 60
> scala>  val tz = "America/Los_Angeles"
> tz: String = America/Los_Angeles
> scala> val df = spark.range(-50 * SECS_PER_YEAR, 50 * SECS_PER_YEAR, 30 * 
> SECS_PER_MINUTE)
> df: org.apache.spark.sql.Dataset[Long] = [id: bigint]
> scala> val diff = 
> df.select((to_utc_timestamp(from_utc_timestamp($"id".cast("timestamp"), tz), 
> tz).cast("long") - $"id").as("diff")).filter($"diff" !== 0)
> warning: there was one deprecation warning; re-run with -deprecation for 
> details
> diff: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [diff: bigint]
> scala> diff.count
> res14: Long = 280
> scala> df.count
> res15: Long = 1753200
> {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39344) Only disable bucketing when autoBucketedScan is enabled if bucket columns are not in scan output

2022-05-30 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-39344:
--

 Summary: Only disable bucketing when autoBucketedScan is enabled 
if bucket columns are not in scan output
 Key: SPARK-39344
 URL: https://issues.apache.org/jira/browse/SPARK-39344
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Manu Zhang


Currently, bucketing is disabled when bucket columns are not in the scan output, 
after https://github.com/apache/spark/pull/27924. This breaks existing 
applications with huge inputs by creating too many FilePartitions and causing 
the driver to hang, and it cannot be switched off. This is to propose merging 
the rule into DisableUnnecessaryBucketedScan.
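
For context, these are the existing bucketing flags involved (config names as of 
Spark 3.1+); this is not the proposed change itself, only where the rule would 
become switchable if merged into DisableUnnecessaryBucketedScan:
{code:scala}
// Shown for context only; names are the existing Spark SQL configs.
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")
// DisableUnnecessaryBucketedScan is gated by this flag, so merging the
// "bucket columns not in scan output" rule into it would let users switch
// the behavior off again:
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
{code}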



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility

2022-05-24 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-39278:
---
Description: 
Before [https://github.com/apache/spark/pull/23698,]

The precedence of configuring Hadoop Filesystems to access is
{code:java}
spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code}
Afterwards, it's
{code:java}
spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> 
spark.yarn.access.hadoopFileSystems{code}
When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes 
are configured with different values, the PR will break backward compatibility 
and cause application failure.

  was:
Before [https://github.com/apache/spark/pull/23698,]

The precedence of configuring Hadoop Filesystems to access is
{code:java}
spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code}
Afterwards, it's
{code:java}
spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> 
spark.yarn.access.hadoopFileSystems{code}
When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes 
are configured with different values, the PR breaks backward compatibility and 
cause application failure.


> Alternative configs of Hadoop Filesystems to access break backward 
> compatibility
> 
>
> Key: SPARK-39278
> URL: https://issues.apache.org/jira/browse/SPARK-39278
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Manu Zhang
>Priority: Minor
>
> Before [https://github.com/apache/spark/pull/23698,]
> The precedence of configuring Hadoop Filesystems to access is
> {code:java}
> spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code}
> Afterwards, it's
> {code:java}
> spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> 
> spark.yarn.access.hadoopFileSystems{code}
> When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes 
> are configured with different values, the PR will break backward 
> compatibility and cause application failure.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39278) Alternative configs of Hadoop Filesystems to access break backward compatibility

2022-05-24 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-39278:
--

 Summary: Alternative configs of Hadoop Filesystems to access break 
backward compatibility
 Key: SPARK-39278
 URL: https://issues.apache.org/jira/browse/SPARK-39278
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Manu Zhang


Before [https://github.com/apache/spark/pull/23698,]

The precedence of configuring Hadoop Filesystems to access is
{code:java}
spark.yarn.access.hadoopFileSystems -> spark.yarn.access.namenodes{code}
Afterwards, it's
{code:java}
spark.kerberos.access.hadoopFileSystems -> spark.yarn.access.namenodes -> 
spark.yarn.access.hadoopFileSystems{code}
When both spark.yarn.access.hadoopFileSystems and spark.yarn.access.namenodes 
are configured with different values, the PR breaks backward compatibility and 
causes application failure.
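
A hypothetical configuration illustrating the incompatibility (the URIs are made 
up): with both keys set to different values, the effective filesystems change 
after the PR because spark.yarn.access.namenodes now takes precedence over 
spark.yarn.access.hadoopFileSystems.
{code:scala}
// Hypothetical example; the URIs are made up.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Took precedence before https://github.com/apache/spark/pull/23698:
  .set("spark.yarn.access.hadoopFileSystems", "hdfs://cluster-a/")
  // Takes precedence after the PR:
  .set("spark.yarn.access.namenodes", "hdfs://cluster-b/")
{code}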



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39238) Decimal precision loss from applying WidenSetOperationTypes before DecimalPrecision

2022-05-19 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-39238:
---
Description: 
The result of following SQL is 1.00 while 1. is expected
{code:java}
CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v)
CREATE OR REPLACE TEMPORARY VIEW t4 AS SELECT CAST(v AS DECIMAL(18, 2)) AS v 
FROM t3;

SELECT CAST(1 AS DECIMAL(28, 2))
UNION ALL
SELECT v / v FROM t4; {code}

  was:
The result of following SQL is 1.00 while 1. is expected
{code:java}
CREATE OR REPLACE TEMPORARY VIEW t4 AS SELECT CAST(v AS DECIMAL(18, 2)) AS v 
FROM t3;
SELECT v / v FROM t4;

SELECT CAST(1 AS DECIMAL(28, 2))
UNION ALL
SELECT v / v FROM t4; {code}


> Decimal precision loss from applying WidenSetOperationTypes before 
> DecimalPrecision 
> 
>
> Key: SPARK-39238
> URL: https://issues.apache.org/jira/browse/SPARK-39238
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Manu Zhang
>Priority: Major
>
> The result of following SQL is 1.00 while 1. is expected
> {code:java}
> CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v)
> CREATE OR REPLACE TEMPORARY VIEW t4 AS SELECT CAST(v AS DECIMAL(18, 2)) AS v 
> FROM t3;
> SELECT CAST(1 AS DECIMAL(28, 2))
> UNION ALL
> SELECT v / v FROM t4; {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39238) Decimal precision loss from applying WidenSetOperationTypes before DecimalPrecision

2022-05-19 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-39238:
---
Description: 
The result of following SQL is 1.00 while 1. is expected
{code:java}
CREATE OR REPLACE TEMPORARY VIEW t4 AS SELECT CAST(v AS DECIMAL(18, 2)) AS v 
FROM t3;
SELECT v / v FROM t4;

SELECT CAST(1 AS DECIMAL(28, 2))
UNION ALL
SELECT v / v FROM t4; {code}

> Decimal precision loss from applying WidenSetOperationTypes before 
> DecimalPrecision 
> 
>
> Key: SPARK-39238
> URL: https://issues.apache.org/jira/browse/SPARK-39238
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Manu Zhang
>Priority: Major
>
> The result of following SQL is 1.00 while 1. is expected
> {code:java}
> CREATE OR REPLACE TEMPORARY VIEW t4 AS SELECT CAST(v AS DECIMAL(18, 2)) AS v 
> FROM t3;
> SELECT v / v FROM t4;
> SELECT CAST(1 AS DECIMAL(28, 2))
> UNION ALL
> SELECT v / v FROM t4; {code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39238) Decimal precision loss from applying WidenSetOperationTypes before DecimalPrecision

2022-05-19 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-39238:
--

 Summary: Decimal precision loss from applying 
WidenSetOperationTypes before DecimalPrecision 
 Key: SPARK-39238
 URL: https://issues.apache.org/jira/browse/SPARK-39238
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.1
Reporter: Manu Zhang






--
This message was sent by Atlassian Jira
(v8.20.7#820007)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38211) Add SQL migration guide on restoring loose upcast from string

2022-02-14 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-38211:
---
Summary: Add SQL migration guide on restoring loose upcast from string  
(was: Add SQL migration guide on preserving loose upcast )

> Add SQL migration guide on restoring loose upcast from string
> -
>
> Key: SPARK-38211
> URL: https://issues.apache.org/jira/browse/SPARK-38211
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.2.1
>Reporter: Manu Zhang
>Priority: Minor
>
> After SPARK-24586, loose upcasting from string to other types is not allowed 
> by default. Users can still set {{spark.sql.legacy.looseUpcast=true}} to 
> preserve the old behavior, but it's not documented in the SQL migration guide.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38211) Add SQL migration guide on preserving loose upcast

2022-02-14 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-38211:
--

 Summary: Add SQL migration guide on preserving loose upcast 
 Key: SPARK-38211
 URL: https://issues.apache.org/jira/browse/SPARK-38211
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.2.1
Reporter: Manu Zhang


After SPARK-24586, loose upcasting from string to other types is not allowed 
by default. Users can still set {{spark.sql.legacy.looseUpcast=true}} to 
preserve the old behavior, but it's not documented in the SQL migration guide.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38207) Add bucketed scan behavior change to migration guide

2022-02-14 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-38207:
--

 Summary: Add bucketed scan behavior change to migration guide
 Key: SPARK-38207
 URL: https://issues.apache.org/jira/browse/SPARK-38207
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 3.1.2
Reporter: Manu Zhang


The default behavior of bucketed scan was changed in 
https://issues.apache.org/jira/browse/SPARK-32859 but it's not mentioned in the 
SQL migration guide.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-37633) Unwrap cast should skip if downcast failed with ansi enabled

2021-12-13 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-37633:
--

 Summary: Unwrap cast should skip if downcast failed with ansi 
enabled
 Key: SPARK-37633
 URL: https://issues.apache.org/jira/browse/SPARK-37633
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.0, 3.1.2, 3.0.3
Reporter: Manu Zhang


Currently, unwrap cast throws an ArithmeticException if the downcast fails with 
ANSI mode enabled. Since UnwrapCastInBinaryComparison is an optimizer rule, we 
should always skip on failure regardless of the ANSI config.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34714) collect_list(struct()) fails when used with GROUP BY

2021-11-29 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17450236#comment-17450236
 ] 

Manu Zhang commented on SPARK-34714:


FYI, this was resolved by https://issues.apache.org/jira/browse/SPARK-34713 and 
https://issues.apache.org/jira/browse/SPARK-34749

 

> collect_list(struct()) fails when used with GROUP BY
> 
>
> Key: SPARK-34714
> URL: https://issues.apache.org/jira/browse/SPARK-34714
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Databricks Runtime 8.0
>Reporter: Lauri Koobas
>Priority: Major
> Fix For: 3.1.2
>
>
> The following is failing in DBR8.0 / Spark 3.1.1, but works in earlier DBR 
> and Spark versions:
> {quote}with step_1 as (
>     select 'E' as name, named_struct('subfield', 1) as field_1
> )
> select name, collect_list(struct(field_1.subfield))
> from step_1
> group by 1
> {quote}
> Fails with the following error message:
> {quote}AnalysisException: cannot resolve 
> 'struct(step_1.`field_1`.`subfield`)' due to data type mismatch: Only 
> foldable string expressions are allowed to appear at odd position, got: 
> NamePlaceholder
> {quote}
> If you modify the query in any of the following ways then it still works:
>  * if you remove the field "name" and the "group by 1" part of the query
>  * if you remove the "struct()" from within the collect_list()
>  * if you use "named_struct()" instead of "struct()" within the collect_list()
> Similarly collect_set() is broken and possibly more related functions, but I 
> haven't done thorough testing.
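
A runnable sketch of one of the workarounds listed above (named_struct() instead 
of struct() inside collect_list()), for spark-shell:
{code:scala}
// Workaround sketch based on the report above: named_struct() instead of
// struct() inside collect_list() avoids the NamePlaceholder analysis error.
spark.sql("""
  WITH step_1 AS (
    SELECT 'E' AS name, named_struct('subfield', 1) AS field_1
  )
  SELECT name, collect_list(named_struct('subfield', field_1.subfield))
  FROM step_1
  GROUP BY 1
""").show()
{code}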



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-11-17 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17445558#comment-17445558
 ] 

Manu Zhang commented on SPARK-31646:


Only very briefly and by a small amount, which I think is more likely a delay in 
metrics collection.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-27 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420688#comment-17420688
 ] 

Manu Zhang commented on SPARK-31646:


{quote}So I will try to derive numBackLoggedConnections outside at the metrics 
monitoring system. Any better suggestion?
{quote}
This looks good.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-17 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416656#comment-17416656
 ] 

Manu Zhang commented on SPARK-31646:


[~yzhangal],

Please check this comment  
[https://github.com/apache/spark/pull/28416#discussion_r418357988] for more 
background.

The counter reverted in this PR was just never used; the PR was simply removing 
some dead code.

I didn't mean to use registeredConnections for anything different. It's 
eventually registered into ShuffleMetrics here.

[https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248]
{code:java}
  blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", 
 shuffleServer.getRegisteredConnections()); {code}
 

As I understand it, registeredConnections (and IdleConnections) are monitored at 
the channel level (TransportChannelHandler) while activeConnections 
(blockTransferRateBytes, etc.) are monitored at the RPC level 
(ExternalShuffleBlockHandler). Hence, these metrics are kept in two places.

You may register your backloggedConnections in ShuffleMetrics and update it 
with "registeredConnections - activeConnections" in ShuffleMetrics#getMetrics.
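
A rough sketch of that suggestion (hypothetical class, built on the Dropwizard 
Metrics API that the shuffle service metrics use; not actual Spark code):
{code:scala}
// Hypothetical sketch, not Spark code: derive a backlogged-connections gauge
// from the two existing counters using the Dropwizard Metrics API.
import com.codahale.metrics.{Counter, Gauge, MetricRegistry}

class BackloggedConnectionsGauge(registered: Counter, active: Counter) extends Gauge[Long] {
  override def getValue: Long = registered.getCount - active.getCount
}

object BackloggedConnectionsExample {
  def main(args: Array[String]): Unit = {
    val registry = new MetricRegistry
    val registered = registry.counter("numRegisteredConnections")
    val active = registry.counter("numActiveConnections")
    registry.register("numBackLoggedConnections",
      new BackloggedConnectionsGauge(registered, active))
  }
}
{code}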

 

Your understanding of executors registering with Shuffle Service is correct but 
I don't see how it's related to your question.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-17 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416656#comment-17416656
 ] 

Manu Zhang edited comment on SPARK-31646 at 9/17/21, 12:40 PM:
---

[~yzhangal],

Please check this comment  
[https://github.com/apache/spark/pull/28416#discussion_r418357988] for more 
background.

The counter reverted in this PR was just never used; the PR was simply removing 
some dead code.

I didn't mean to use registeredConnections for anything different. It's 
eventually registered into ShuffleMetrics here.

[https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248]
{code:java}
  blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", 
 shuffleServer.getRegisteredConnections()); {code}
 

As I understand it, registeredConnections (and IdleConnections) are monitored at 
the channel level (TransportChannelHandler) while activeConnections 
(blockTransferRateBytes, etc.) are monitored at the RPC level 
(ExternalShuffleBlockHandler). Hence, these metrics are kept in two places.

You may register your backloggedConnections in ShuffleMetrics and update it 
with "registeredConnections - activeConnections" in ShuffleMetrics#getMetrics.

 

Your understanding of executors registering with Shuffle Service is correct but 
I don't see how it's related to your question.


was (Author: mauzhang):
[~yzhangal],

Please check this comment  
[https://github.com/apache/spark/pull/28416#discussion_r418357988] for more 
background.

The counter reverted in this PR was just never used, or this PR was simply to 
remove some dead codes.

I didn't meant to use registeredConnections for anything different. It's 
eventually registered into ShuffleMetrics here.

[https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248]
{code:java}
  blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", 
 shuffleServer.getRegisteredConnections()); {code}
 

As I understand it, registeredConnections (and IdleConnections) is monitored at 
channel level (TransportChannelHandler) while activeConnections 
(blockTransferRateBytes, etc) at RPC level (ExternalShuffleBlockHandler). 
Hence, these metrics are kept in two places. 

You may register your backloggedConnections in ShuffleMetrics and update it 
with "registeredConenctions - activeConnections" in 

ShuffleMetrics#getMetrics.

 

Your understanding of executors registering with Shuffle Service is correct but 
I don't see how it's related to your question.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-16 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17416147#comment-17416147
 ] 

Manu Zhang commented on SPARK-31646:


[~yzhangal], please check 
[https://github.com/apache/spark/blob/master/common/network-common/src/main/java/org/apache/spark/network/server/TransportChannelHandler.java#L190]



 

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35160) Spark application submitted despite failing to get Hive delegation token

2021-05-06 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-35160:
---
Description: 
Currently, when running on YARN and failing to get Hive delegation token, a 
Spark SQL application will still be submitted. Eventually, the application will 
fail on connecting to Hive metastore without a valid delegation token. 

Is there any reason for this design ?

cc [~jerryshao] who originally implemented this in 
https://issues.apache.org/jira/browse/SPARK-14743

I'd propose to fail immediately like HadoopFSDelegationTokenProvider.

 

Update:

After [https://github.com/apache/spark/pull/23418], 
HadoopFSDelegationTokenProvider no longer fails on non-fatal exceptions. 
However, the author changed the behavior just to keep it consistent with other 
providers. 

  was:
Currently, when running on YARN and failing to get Hive delegation token, a 
Spark SQL application will still be submitted. Eventually, the application will 
fail on connecting to Hive metastore without a valid delegation token. 

Is there any reason for this design ?

cc [~jerryshao] who originally implemented this in 
https://issues.apache.org/jira/browse/SPARK-14743

I'd propose to fail immediately like HadoopFSDelegationTokenProvider.


> Spark application submitted despite failing to get Hive delegation token
> 
>
> Key: SPARK-35160
> URL: https://issues.apache.org/jira/browse/SPARK-35160
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.1.1
>Reporter: Manu Zhang
>Priority: Major
>
> Currently, when running on YARN and failing to get Hive delegation token, a 
> Spark SQL application will still be submitted. Eventually, the application 
> will fail on connecting to Hive metastore without a valid delegation token. 
> Is there any reason for this design ?
> cc [~jerryshao] who originally implemented this in 
> https://issues.apache.org/jira/browse/SPARK-14743
> I'd propose to fail immediately like HadoopFSDelegationTokenProvider.
>  
> Update:
> After [https://github.com/apache/spark/pull/23418], 
> HadoopFSDelegationTokenProvider no longer fails on non-fatal exceptions. 
> However, the author changed the behavior just to keep it consistent with 
> other providers. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE

2021-04-27 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17334443#comment-17334443
 ] 

Manu Zhang commented on SPARK-32083:


This issue has been fixed in https://issues.apache.org/jira/browse/SPARK-35239

> Unnecessary tasks are launched when input is empty with AQE
> ---
>
> Key: SPARK-32083
> URL: https://issues.apache.org/jira/browse/SPARK-32083
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Wenchen Fan
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>
> [https://github.com/apache/spark/pull/28226] meant to avoid launching 
> unnecessary tasks for 0-size partitions when AQE is enabled. However, when 
> all partitions are empty, the number of partitions will be 
> `spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of) 
> unnecessary tasks are launched in this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35160) Spark application submitted despite failing to get Hive delegation token

2021-04-23 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17331114#comment-17331114
 ] 

Manu Zhang commented on SPARK-35160:


[~hyukjin.kwon],

Thanks for the reminder. I've added my proposal, and I did ask about it on the 
mailing list. It would be great if you know the reasoning behind it, or you may 
forward this to someone who does.

> Spark application submitted despite failing to get Hive delegation token
> 
>
> Key: SPARK-35160
> URL: https://issues.apache.org/jira/browse/SPARK-35160
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.1.1
>Reporter: Manu Zhang
>Priority: Major
>
> Currently, when running on YARN and failing to get Hive delegation token, a 
> Spark SQL application will still be submitted. Eventually, the application 
> will fail on connecting to Hive metastore without a valid delegation token. 
> Is there any reason for this design ?
> cc [~jerryshao] who originally implemented this in 
> https://issues.apache.org/jira/browse/SPARK-14743
> I'd propose to fail immediately like HadoopFSDelegationTokenProvider.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35160) Spark application submitted despite failing to get Hive delegation token

2021-04-23 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-35160:
---
Description: 
Currently, when running on YARN and failing to get Hive delegation token, a 
Spark SQL application will still be submitted. Eventually, the application will 
fail on connecting to Hive metastore without a valid delegation token. 

Is there any reason for this design ?

cc [~jerryshao] who originally implemented this in 
https://issues.apache.org/jira/browse/SPARK-14743

I'd propose to fail immediately like HadoopFSDelegationTokenProvider.

  was:
Currently, when running on YARN and failing to get Hive delegation token, a 
Spark SQL application will still be submitted. Eventually, the application will 
fail on connecting to Hive metastore without a valid delegation token. 

Is there any reason for this design ?

cc [~jerryshao] who originally implemented this in 
https://issues.apache.org/jira/browse/SPARK-14743

 


> Spark application submitted despite failing to get Hive delegation token
> 
>
> Key: SPARK-35160
> URL: https://issues.apache.org/jira/browse/SPARK-35160
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.1.1
>Reporter: Manu Zhang
>Priority: Major
>
> Currently, when running on YARN and failing to get Hive delegation token, a 
> Spark SQL application will still be submitted. Eventually, the application 
> will fail on connecting to Hive metastore without a valid delegation token. 
> Is there any reason for this design ?
> cc [~jerryshao] who originally implemented this in 
> https://issues.apache.org/jira/browse/SPARK-14743
> I'd propose to fail immediately like HadoopFSDelegationTokenProvider.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35160) Spark application submitted despite failing to get Hive delegation token

2021-04-20 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-35160:
--

 Summary: Spark application submitted despite failing to get Hive 
delegation token
 Key: SPARK-35160
 URL: https://issues.apache.org/jira/browse/SPARK-35160
 Project: Spark
  Issue Type: Improvement
  Components: Security
Affects Versions: 3.1.1
Reporter: Manu Zhang


Currently, when running on YARN and failing to get Hive delegation token, a 
Spark SQL application will still be submitted. Eventually, the application will 
fail on connecting to Hive metastore without a valid delegation token. 

Is there any reason for this design ?

cc [~jerryshao] who originally implemented this in 
https://issues.apache.org/jira/browse/SPARK-14743

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32698) Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE

2021-02-05 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang resolved SPARK-32698.

Resolution: Won't Do

> Do not fall back to default parallelism if the minimum number of coalesced 
> partitions is not set in AQE
> ---
>
> Key: SPARK-32698
> URL: https://issues.apache.org/jira/browse/SPARK-32698
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> Currently in AQE when coalescing shuffling partitions,
> {quote}We fall back to Spark default parallelism if the minimum number of 
> coalesced partitions is not set, so to avoid perf regressions compared to no 
> coalescing.
> {quote}
> From our experience, this has resulted in a lot of uncertainty in the number 
> of tasks after coalescing, especially with dynamic allocation, and has also led 
> to many small output files. It's complex and hard to reason about.
> Hence, I'm proposing not falling back to the default parallelism but 
> coalescing towards the target size when the minimum number of coalesced 
> partitions is not set.
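
For reference, a hedged sketch of the configs discussed (names as of Spark 
3.0/3.1); setting the minimum explicitly is the current way to avoid the 
fallback that this proposal wanted to remove:
{code:scala}
// Config names as of Spark 3.0/3.1; shown for reference only.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
// When this is unset, coalescing falls back to the default parallelism;
// setting it (e.g. to 1) makes coalescing aim purely at the advisory size.
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
{code}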



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite

2020-09-22 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-32932:
---
Description: 
With AQE, local shuffle reader breaks users' repartitioning for dynamic 
partition overwrite as in the following case.
{code:java}
test("repartition with local reader") {
  withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
PartitionOverwriteMode.DYNAMIC.toString,
SQLConf.SHUFFLE_PARTITIONS.key -> "5",
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
withTable("t") {
  val data = for (
i <- 1 to 10;
j <- 1 to 3
  ) yield (i, j)
  data.toDF("a", "b")
.repartition($"b")
.write
.partitionBy("b")
.mode("overwrite")
.saveAsTable("t")
  assert(spark.read.table("t").inputFiles.length == 3)
}
  }
}{code}
-Coalescing shuffle partitions could also break it.-

  was:
With AQE, local shuffle reader breaks users' repartitioning for dynamic 
partition overwrite as in the following case.
{code:java}
test("repartition with local reader") {
  withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
PartitionOverwriteMode.DYNAMIC.toString,
SQLConf.SHUFFLE_PARTITIONS.key -> "5",
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
withTable("t") {
  val data = for (
i <- 1 to 10;
j <- 1 to 3
  ) yield (i, j)
  data.toDF("a", "b")
.repartition($"b")
.write
.partitionBy("b")
.mode("overwrite")
.saveAsTable("t")
  assert(spark.read.table("t").inputFiles.length == 3)
}
  }
}{code}
Coalescing shuffle partitions could also break it.


> AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
> --
>
> Key: SPARK-32932
> URL: https://issues.apache.org/jira/browse/SPARK-32932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Major
>
> With AQE, local shuffle reader breaks users' repartitioning for dynamic 
> partition overwrite as in the following case.
> {code:java}
> test("repartition with local reader") {
>   withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
> PartitionOverwriteMode.DYNAMIC.toString,
> SQLConf.SHUFFLE_PARTITIONS.key -> "5",
> SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
> withTable("t") {
>   val data = for (
> i <- 1 to 10;
> j <- 1 to 3
>   ) yield (i, j)
>   data.toDF("a", "b")
> .repartition($"b")
> .write
> .partitionBy("b")
> .mode("overwrite")
> .saveAsTable("t")
>   assert(spark.read.table("t").inputFiles.length == 3)
> }
>   }
> }{code}
> -Coalescing shuffle partitions could also break it.-



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite

2020-09-20 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-32932:
---
Priority: Major  (was: Minor)

> AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
> --
>
> Key: SPARK-32932
> URL: https://issues.apache.org/jira/browse/SPARK-32932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Major
>
> With AQE, local shuffle reader breaks users' repartitioning for dynamic 
> partition overwrite as in the following case.
> {code:java}
> test("repartition with local reader") {
>   withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
> PartitionOverwriteMode.DYNAMIC.toString,
> SQLConf.SHUFFLE_PARTITIONS.key -> "5",
> SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
> withTable("t") {
>   val data = for (
> i <- 1 to 10;
> j <- 1 to 3
>   ) yield (i, j)
>   data.toDF("a", "b")
> .repartition($"b")
> .write
> .partitionBy("b")
> .mode("overwrite")
> .saveAsTable("t")
>   assert(spark.read.table("t").inputFiles.length == 3)
> }
>   }
> }{code}
> Coalescing shuffle partitions could also break it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite

2020-09-18 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-32932:
---
Description: 
With AQE, local shuffle reader breaks users' repartitioning for dynamic 
partition overwrite as in the following case.
{code:java}
test("repartition with local reader") {
  withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
PartitionOverwriteMode.DYNAMIC.toString,
SQLConf.SHUFFLE_PARTITIONS.key -> "5",
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
withTable("t") {
  val data = for (
i <- 1 to 10;
j <- 1 to 3
  ) yield (i, j)
  data.toDF("a", "b")
.repartition($"b")
.write
.partitionBy("b")
.mode("overwrite")
.saveAsTable("t")
  assert(spark.read.table("t").inputFiles.length == 3)
}
  }
}{code}
Coalescing shuffle partitions could also break it.

  was:
With AQE, local reader optimizer breaks users' repartitioning for dynamic 
partition overwrite as in the following case.
{code:java}
test("repartition with local reader") {
  withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
PartitionOverwriteMode.DYNAMIC.toString,
SQLConf.SHUFFLE_PARTITIONS.key -> "5",
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
withTable("t") {
  val data = for (
i <- 1 to 10;
j <- 1 to 3
  ) yield (i, j)
  data.toDF("a", "b")
.repartition($"b")
.write
.partitionBy("b")
.mode("overwrite")
.saveAsTable("t")
  assert(spark.read.table("t").inputFiles.length == 3)
}
  }
}{code}
Coalescing shuffle partitions could also break it.


> AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
> --
>
> Key: SPARK-32932
> URL: https://issues.apache.org/jira/browse/SPARK-32932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> With AQE, local shuffle reader breaks users' repartitioning for dynamic 
> partition overwrite as in the following case.
> {code:java}
> test("repartition with local reader") {
>   withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
> PartitionOverwriteMode.DYNAMIC.toString,
> SQLConf.SHUFFLE_PARTITIONS.key -> "5",
> SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
> withTable("t") {
>   val data = for (
> i <- 1 to 10;
> j <- 1 to 3
>   ) yield (i, j)
>   data.toDF("a", "b")
> .repartition($"b")
> .write
> .partitionBy("b")
> .mode("overwrite")
> .saveAsTable("t")
>   assert(spark.read.table("t").inputFiles.length == 3)
> }
>   }
> }{code}
> Coalescing shuffle partitions could also break it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32932) AQE local shuffle reader breaks repartitioning for dynamic partition overwrite

2020-09-18 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-32932:
---
Summary: AQE local shuffle reader breaks repartitioning for dynamic 
partition overwrite  (was: AQE local reader optimizer breaks repartitioning for 
dynamic partition overwrite)

> AQE local shuffle reader breaks repartitioning for dynamic partition overwrite
> --
>
> Key: SPARK-32932
> URL: https://issues.apache.org/jira/browse/SPARK-32932
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> With AQE, local reader optimizer breaks users' repartitioning for dynamic 
> partition overwrite as in the following case.
> {code:java}
> test("repartition with local reader") {
>   withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
> PartitionOverwriteMode.DYNAMIC.toString,
> SQLConf.SHUFFLE_PARTITIONS.key -> "5",
> SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
> withTable("t") {
>   val data = for (
> i <- 1 to 10;
> j <- 1 to 3
>   ) yield (i, j)
>   data.toDF("a", "b")
> .repartition($"b")
> .write
> .partitionBy("b")
> .mode("overwrite")
> .saveAsTable("t")
>   assert(spark.read.table("t").inputFiles.length == 3)
> }
>   }
> }{code}
> Coalescing shuffle partitions could also break it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32932) AQE local reader optimizer breaks repartitioning for dynamic partition overwrite

2020-09-17 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-32932:
--

 Summary: AQE local reader optimizer breaks repartitioning for 
dynamic partition overwrite
 Key: SPARK-32932
 URL: https://issues.apache.org/jira/browse/SPARK-32932
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Manu Zhang


With AQE, local reader optimizer breaks users' repartitioning for dynamic 
partition overwrite as in the following case.
{code:java}
test("repartition with local reader") {
  withSQLConf(SQLConf.PARTITION_OVERWRITE_MODE.key -> 
PartitionOverwriteMode.DYNAMIC.toString,
SQLConf.SHUFFLE_PARTITIONS.key -> "5",
SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
withTable("t") {
  val data = for (
i <- 1 to 10;
j <- 1 to 3
  ) yield (i, j)
  data.toDF("a", "b")
.repartition($"b")
.write
.partitionBy("b")
.mode("overwrite")
.saveAsTable("t")
  assert(spark.read.table("t").inputFiles.length == 3)
}
  }
}{code}
Coalescing shuffle partitions could also break it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE

2020-09-03 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-32753:
---
Priority: Major  (was: Minor)

> Deduplicating and repartitioning the same column create duplicate rows with 
> AQE
> ---
>
> Key: SPARK-32753
> URL: https://issues.apache.org/jira/browse/SPARK-32753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Major
>
> To reproduce:
> {code:java}
> spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
> val df = spark.sql("select id from v1 group by id distribute by id") 
> println(df.collect().toArray.mkString(","))
> println(df.queryExecution.executedPlan)
> // With AQE
> [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
> AdaptiveSparkPlan(isFinalPlan=true)
> +- CustomShuffleReader local
>+- ShuffleQueryStage 0
>   +- Exchange hashpartitioning(id#183L, 10), true
>  +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
> +- Union
>:- *(1) Range (0, 10, step=1, splits=2)
>+- *(2) Range (0, 10, step=1, splits=2)
> // Without AQE
> [4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
> *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
> +- Exchange hashpartitioning(id#206L, 10), true
>+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
>   +- Union
>  :- *(1) Range (0, 10, step=1, splits=2)
>  +- *(2) Range (0, 10, step=1, splits=2){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE

2020-08-31 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-32753:
---
Description: 
To reproduce:
{code:java}
spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
val df = spark.sql("select id from v1 group by id distribute by id") 
println(df.collect().toArray.mkString(","))
println(df.queryExecution.executedPlan)

// With AQE
[4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
AdaptiveSparkPlan(isFinalPlan=true)
+- CustomShuffleReader local
   +- ShuffleQueryStage 0
      +- Exchange hashpartitioning(id#183L, 10), true
         +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
            +- Union
               :- *(1) Range (0, 10, step=1, splits=2)
               +- *(2) Range (0, 10, step=1, splits=2)

// Without AQE
[4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
*(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
+- Exchange hashpartitioning(id#206L, 10), true
   +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
      +- Union
         :- *(1) Range (0, 10, step=1, splits=2)
         +- *(2) Range (0, 10, step=1, splits=2){code}

  was:
To reproduce:
spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
val df = spark.sql("select id from v1 group by id distribute by id") 
println(df.collect().toArray.mkString(","))
println(df.queryExecution.executedPlan)

// With AQE
[4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
AdaptiveSparkPlan(isFinalPlan=true)
+- CustomShuffleReader local
   +- ShuffleQueryStage 0
      +- Exchange hashpartitioning(id#183L, 10), true
         +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
            +- Union
               :- *(1) Range (0, 10, step=1, splits=2)
               +- *(2) Range (0, 10, step=1, splits=2)

// Without AQE
[4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
*(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
+- Exchange hashpartitioning(id#206L, 10), true
   +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
      +- Union
         :- *(1) Range (0, 10, step=1, splits=2)
         +- *(2) Range (0, 10, step=1, splits=2)


> Deduplicating and repartitioning the same column create duplicate rows with 
> AQE
> ---
>
> Key: SPARK-32753
> URL: https://issues.apache.org/jira/browse/SPARK-32753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> To reproduce:
> {code:java}
> spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
> val df = spark.sql("select id from v1 group by id distribute by id") 
> println(df.collect().toArray.mkString(","))
> println(df.queryExecution.executedPlan)
> // With AQE
> [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
> AdaptiveSparkPlan(isFinalPlan=true)
> +- CustomShuffleReader local
>+- ShuffleQueryStage 0
>   +- Exchange hashpartitioning(id#183L, 10), true
>  +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
> +- Union
>:- *(1) Range (0, 10, step=1, splits=2)
>+- *(2) Range (0, 10, step=1, splits=2)
> // Without AQE
> [4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
> *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
> +- Exchange hashpartitioning(id#206L, 10), true
>+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
>   +- Union
>  :- *(1) Range (0, 10, step=1, splits=2)
>  +- *(2) Range (0, 10, step=1, splits=2){code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE

2020-08-31 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-32753:
--

 Summary: Deduplicating and repartitioning the same column create 
duplicate rows with AQE
 Key: SPARK-32753
 URL: https://issues.apache.org/jira/browse/SPARK-32753
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Manu Zhang


To reproduce:
spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
val df = spark.sql("select id from v1 group by id distribute by id") 
println(df.collect().toArray.mkString(","))
println(df.queryExecution.executedPlan)

// With AQE
[4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
AdaptiveSparkPlan(isFinalPlan=true)
+- CustomShuffleReader local
   +- ShuffleQueryStage 0
      +- Exchange hashpartitioning(id#183L, 10), true
         +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
            +- Union
               :- *(1) Range (0, 10, step=1, splits=2)
               +- *(2) Range (0, 10, step=1, splits=2)

// Without AQE
[4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
*(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
+- Exchange hashpartitioning(id#206L, 10), true
   +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
      +- Union
         :- *(1) Range (0, 10, step=1, splits=2)
         +- *(2) Range (0, 10, step=1, splits=2)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32698) Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE

2020-08-25 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-32698:
--

 Summary: Do not fall back to default parallelism if the minimum 
number of coalesced partitions is not set in AQE
 Key: SPARK-32698
 URL: https://issues.apache.org/jira/browse/SPARK-32698
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Manu Zhang


Currently in AQE when coalescing shuffling partitions,
{quote}We fall back to Spark default parallelism if the minimum number of 
coalesced partitions is not set, so to avoid perf regressions compared to no 
coalescing.
{quote}
From our experience, this has resulted in a lot of uncertainty in the number 
of tasks after coalescing, especially with dynamic allocation, and has also led 
to many small output files. It's complex and hard to reason about.

Hence, I'm proposing not falling back to the default parallelism but coalescing 
towards the target size when the minimum number of coalesced partitions is not 
set.
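
For illustration, a minimal Scala sketch of the coalescing knobs involved (the 
config keys are assumed to match Spark 3.0 and the session below is 
hypothetical, not part of the original report); with the minimum explicitly 
pinned, coalescing is driven purely by the target size and no fallback to 
default parallelism is needed:
{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical session: the minimum number of coalesced partitions is set to 1,
// so AQE coalesces shuffle partitions towards the advisory size instead of
// falling back to the cluster's default parallelism.
val spark = SparkSession.builder()
  .appName("aqe-coalesce-sketch")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.minPartitionNum", "1")
  .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64m")
  .getOrCreate()
{code}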



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32083) Unnecessary tasks are launched when input is empty with AQE

2020-06-24 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-32083:
--

 Summary: Unnecessary tasks are launched when input is empty with 
AQE
 Key: SPARK-32083
 URL: https://issues.apache.org/jira/browse/SPARK-32083
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Manu Zhang


[https://github.com/apache/spark/pull/28226] meant to avoid launching 
unnecessary tasks for 0-size partitions when AQE is enabled. However, when all 
partitions are empty, the number of partitions will be 
`spark.sql.adaptive.coalescePartitions.initialPartitionNum` and (a lot of) 
unnecessary tasks are launched in this case.
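
For illustration, a minimal sketch of the situation described above (a 
hypothetical example, not taken from the report); the input is empty, so every 
shuffle partition is empty, yet the post-shuffle stage can still launch 
initialPartitionNum tasks:
{code:scala}
// Assumes an existing SparkSession `spark` with AQE enabled.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.initialPartitionNum", "1000")

// Empty input: all shuffle partitions produced by the aggregation are empty.
val empty = spark.range(0)
empty.groupBy("id").count().collect()
{code}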



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31942) Revert SPARK-31864 Adjust AQE skew join trigger condition

2020-06-09 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-31942:
--

 Summary: Revert SPARK-31864 Adjust AQE skew join trigger condition
 Key: SPARK-31942
 URL: https://issues.apache.org/jira/browse/SPARK-31942
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Manu Zhang


As discussed in 
[https://github.com/apache/spark/pull/28669#issuecomment-641044531], revert 
SPARK-31864 for optimizing skew join to work for extremely clustered keys.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31870) AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test has no skew join

2020-05-30 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-31870:
---
Summary: AdaptiveQueryExecSuite: "Do not optimize skew join if introduce 
additional shuffle" test has no skew join  (was: AdaptiveQueryExecSuite: "Do 
not optimize skew join if introduce additional shuffle" test doesn't optimize 
skew join at all)

> AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional 
> shuffle" test has no skew join
> -
>
> Key: SPARK-31870
> URL: https://issues.apache.org/jira/browse/SPARK-31870
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> Due to incorrect configurations of 
> - spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes
> - spark.sql.adaptive.advisoryPartitionSizeInBytes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31870) AdaptiveQueryExecSuite: "Do not optimize skew join if introduce additional shuffle" test doesn't optimize skew join at all

2020-05-30 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-31870:
--

 Summary: AdaptiveQueryExecSuite: "Do not optimize skew join if 
introduce additional shuffle" test doesn't optimize skew join at all
 Key: SPARK-31870
 URL: https://issues.apache.org/jira/browse/SPARK-31870
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.0.0
Reporter: Manu Zhang


Due to incorrect configurations of 

- spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes

- spark.sql.adaptive.advisoryPartitionSizeInBytes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30331) The final AdaptiveSparkPlan event is not marked with `isFinalPlan=true`

2020-05-11 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-30331:
---
Parent: SPARK-31412
Issue Type: Sub-task  (was: Bug)

> The final AdaptiveSparkPlan event is not marked with `isFinalPlan=true`
> ---
>
> Key: SPARK-30331
> URL: https://issues.apache.org/jira/browse/SPARK-30331
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>
> This is because the final AdaptiveSparkPlan event is sent out before the 
> {{isFinalPlan}} variable is set to `true`. It would fail any listener attempting 
> to catch the final event by pattern matching on `isFinalPlan=true`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31658) SQL UI doesn't show write commands of AQE plan

2020-05-11 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-31658:
---
Parent: SPARK-31412
Issue Type: Sub-task  (was: Improvement)

> SQL UI doesn't show write commands of AQE plan
> --
>
> Key: SPARK-31658
> URL: https://issues.apache.org/jira/browse/SPARK-31658
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31658) SQL UI doesn't show write commands of AQE plan

2020-05-07 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-31658:
--

 Summary: SQL UI doesn't show write commands of AQE plan
 Key: SPARK-31658
 URL: https://issues.apache.org/jira/browse/SPARK-31658
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Web UI
Affects Versions: 3.0.0
Reporter: Manu Zhang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2020-05-05 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-31646:
--

 Summary: Remove unused registeredConnections counter from 
ShuffleMetrics
 Key: SPARK-31646
 URL: https://issues.apache.org/jira/browse/SPARK-31646
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Shuffle
Affects Versions: 3.0.0
Reporter: Manu Zhang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31611) Register NettyMemoryMetrics into Node Manager's metrics system

2020-04-29 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-31611:
--

 Summary: Register NettyMemoryMetrics into Node Manager's metrics 
system
 Key: SPARK-31611
 URL: https://issues.apache.org/jira/browse/SPARK-31611
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, YARN
Affects Versions: 3.0.0
Reporter: Manu Zhang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31219) YarnShuffleService doesn't close idle netty channel

2020-03-23 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-31219:
---
Description: 
Recently, we find our YarnShuffleService has a lot of [half-open 
connections|https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html]
 where shuffle servers' connections are active while clients have already 
closed. 

For example, from server's `ss -nt sport = :7337` output we have
{code:java}
ESTAB 0 0 server:7337 client:port

{code}
However, on client `ss -nt dport =: 7337 | grep server` would return nothing.

Looking at the code,  `YarnShuffleService` creates a `TransportContext` with 
`closeIdleConnections` set to false.
{code:java}
public class YarnShuffleService extends AuxiliaryService {
  ...
  @Override  protected void serviceInit(Configuration conf) throws Exception { 
... 
transportContext = new TransportContext(transportConf, blockHandler); 
...
  }
  ...
}

public class TransportContext implements Closeable {
  ...

  public TransportContext(TransportConf conf, RpcHandler rpcHandler) {   
this(conf, rpcHandler, false, false);  
  }
  public TransportContext(TransportConf conf, RpcHandler rpcHandler, boolean 
closeIdleConnections) {
this(conf, rpcHandler, closeIdleConnections, false);  
  }
  ...
}{code}
Hence, it's possible the channel may never get closed at the server side if the 
server misses the event that the client has closed it.

I find that parameter is true for `ExternalShuffleService`.

Is there any reason for the difference here? Can we enable closeIdleConnections 
in YarnShuffleService, or at least add a configuration to enable it?

 

  was:
Recently, we find our YarnShuffleService has a lot of [half-open 
connections|https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html]
 where shuffle servers' connections are active while clients have already 
closed. 

For example, from server's `ss -nt sport = :7337` output we have
{code:java}
ESTAB 0 0 server:7337 client:port

{code}
However, on client `ss -nt dport =: 7337 | grep server` would return nothing.

Looking at the code,  `YarnShuffleService` creates a `TransportContext` with 
`closeIdleConnections` set to false.
{code:java}
public class YarnShuffleService extends AuxiliaryService {
  ...
  @Override  protected void serviceInit(Configuration conf) throws Exception { 
... 
transportContext = new TransportContext(transportConf, blockHandler); 
...
  }
  ...
}

public class TransportContext implements Closeable {
  ...

  public TransportContext(TransportConf conf, RpcHandler rpcHandler) {   
this(conf, rpcHandler, false, false);  
  }
  public TransportContext(TransportConf conf, RpcHandler rpcHandler, boolean 
closeIdleConnections) {
this(conf, rpcHandler, closeIdleConnections, false);  
  }
  ...
}{code}
Hence, it's possible the channel  may never get closed at server side if the 
server misses the event that the client has closed it.

I find that parameter is true for `ExternalShuffleService`.

Is there any reason for the difference here ? Will it be valuable to add a 
configuration to allow enabling closeIdleConnections ?

 


> YarnShuffleService doesn't close idle netty channel
> ---
>
> Key: SPARK-31219
> URL: https://issues.apache.org/jira/browse/SPARK-31219
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Manu Zhang
>Priority: Major
>
> Recently, we find our YarnShuffleService has a lot of [half-open 
> connections|https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html]
>  where shuffle servers' connections are active while clients have already 
> closed. 
> For example, from server's `ss -nt sport = :7337` output we have
> {code:java}
> ESTAB 0 0 server:7337 client:port
> {code}
> However, on client `ss -nt dport =: 7337 | grep server` would return nothing.
> Looking at the code,  `YarnShuffleService` creates a `TransportContext` with 
> `closeIdleConnections` set to false.
> {code:java}
> public class YarnShuffleService extends AuxiliaryService {
>   ...
>   @Override  protected void serviceInit(Configuration conf) throws Exception 
> { 
> ... 
> transportContext = new TransportContext(transportConf, blockHandler); 
> ...
>   }
>   ...
> }
> public class TransportContext implements Closeable {
>   ...
>   public TransportContext(TransportConf conf, RpcHandler rpcHandler) {   
> this(conf, rpcHandler, false, false);  
>   }
>   public TransportContext(TransportConf conf, RpcHandler rpcHandler, boolean 
> closeIdleConnections) {
> this(conf, rpcHandler, closeIdleConnections, false);  
>   }
>   ...
> }{code}
> Hence, it's possible the channel may never get closed at the server side if the 
> server misses the event that the client has closed it.

[jira] [Created] (SPARK-31219) YarnShuffleService doesn't close idle netty channel

2020-03-22 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-31219:
--

 Summary: YarnShuffleService doesn't close idle netty channel
 Key: SPARK-31219
 URL: https://issues.apache.org/jira/browse/SPARK-31219
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 2.4.5, 3.0.0
Reporter: Manu Zhang


Recently, we find our YarnShuffleService has a lot of [half-open 
connections|https://blog.stephencleary.com/2009/05/detection-of-half-open-dropped.html]
 where shuffle servers' connections are active while clients have already 
closed. 

For example, from server's `ss -nt sport = :7337` output we have
{code:java}
ESTAB 0 0 server:7337 client:port

{code}
However, on client `ss -nt dport =: 7337 | grep server` would return nothing.

Looking at the code,  `YarnShuffleService` creates a `TransportContext` with 
`closeIdleConnections` set to false.
{code:java}
public class YarnShuffleService extends AuxiliaryService {
  ...
  @Override  protected void serviceInit(Configuration conf) throws Exception { 
... 
transportContext = new TransportContext(transportConf, blockHandler); 
...
  }
  ...
}

public class TransportContext implements Closeable {
  ...

  public TransportContext(TransportConf conf, RpcHandler rpcHandler) {   
this(conf, rpcHandler, false, false);  
  }
  public TransportContext(TransportConf conf, RpcHandler rpcHandler, boolean 
closeIdleConnections) {
this(conf, rpcHandler, closeIdleConnections, false);  
  }
  ...
}{code}
Hence, it's possible the channel may never get closed at the server side if the 
server misses the event that the client has closed it.

I find that parameter is true for `ExternalShuffleService`.

Is there any reason for the difference here? Would it be valuable to add a 
configuration to allow enabling closeIdleConnections?
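
For illustration, a minimal Scala sketch (an assumption about one possible fix, 
not the actual YarnShuffleService code) of constructing the server-side 
TransportContext with idle-connection closing enabled, as the three-argument 
constructor shown above allows:
{code:scala}
import org.apache.spark.network.TransportContext
import org.apache.spark.network.server.RpcHandler
import org.apache.spark.network.util.TransportConf

// With closeIdleConnections = true, the server reaps channels that stay idle
// longer than the configured timeout, so half-open connections no longer
// linger when the client's close event is missed.
def newShuffleServerContext(conf: TransportConf, handler: RpcHandler): TransportContext =
  new TransportContext(conf, handler, true /* closeIdleConnections */)
{code}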

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31047) Improve file listing for ViewFileSystem

2020-03-04 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-31047:
---
Component/s: (was: Input/Output)
 SQL

> Improve file listing for ViewFileSystem
> ---
>
> Key: SPARK-31047
> URL: https://issues.apache.org/jira/browse/SPARK-31047
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Manu Zhang
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-27801 has improved file listing 
> for DistributedFileSystem, where {{InMemoryFileIndex.listLeafFiles}} makes a 
> single {{listLocatedStatus}} call to the namenode. This ticket intends to 
> improve the case where ViewFileSystem is used to manage multiple 
> DistributedFileSystems. ViewFileSystem also overrides the 
> {{listLocatedStatus}} method by delegating to the filesystem it resolves to, 
> e.g. DistributedFileSystem.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31047) Improve file listing for ViewFileSystem

2020-03-04 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-31047:
--

 Summary: Improve file listing for ViewFileSystem
 Key: SPARK-31047
 URL: https://issues.apache.org/jira/browse/SPARK-31047
 Project: Spark
  Issue Type: Improvement
  Components: Input/Output
Affects Versions: 3.1.0
Reporter: Manu Zhang


https://issues.apache.org/jira/browse/SPARK-27801 has improved file listing for 
DistributedFileSystem, where {{InMemoryFileIndex.listLeafFiles}} makes a single 
{{listLocatedStatus}} call to the namenode. This ticket intends to improve the 
case where ViewFileSystem is used to manage multiple DistributedFileSystems. 
ViewFileSystem also overrides the {{listLocatedStatus}} method by delegating to 
the filesystem it resolves to, e.g. DistributedFileSystem.
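
For illustration, a minimal sketch (a hypothetical helper, not the Spark 
implementation) of listing leaf files through a single listLocatedStatus call 
per directory, which a viewfs:// path delegates to the DistributedFileSystem it 
resolves to:
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{LocatedFileStatus, Path}
import scala.collection.mutable.ArrayBuffer

def listLeafFiles(path: Path, hadoopConf: Configuration): Seq[LocatedFileStatus] = {
  // For a viewfs:// path this returns a ViewFileSystem, which forwards
  // listLocatedStatus to the underlying DistributedFileSystem.
  val fs = path.getFileSystem(hadoopConf)
  val iter = fs.listLocatedStatus(path) // one RPC per directory instead of one per file
  val files = new ArrayBuffer[LocatedFileStatus]()
  while (iter.hasNext) {
    val status = iter.next()
    if (status.isFile) files += status
  }
  files.toSeq
}
{code}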



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30331) The final AdaptiveSparkPlan event is not marked with `isFinalPlan=true`

2019-12-23 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-30331:
--

 Summary: The final AdaptiveSparkPlan event is not marked with 
`isFinalPlan=true`
 Key: SPARK-30331
 URL: https://issues.apache.org/jira/browse/SPARK-30331
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Manu Zhang


This is because the final AdaptiveSparkPlan event is sent out before the 
{{isFinalPlan}} variable is set to `true`. It would fail any listener attempting 
to catch the final event by pattern matching on `isFinalPlan=true`.
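
For illustration, a sketch of the kind of listener the description alludes to 
(the event class and field names are assumptions about Spark 3.0's AQE UI 
events, not part of the original report):
{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate

// Tries to catch the final adaptive plan by pattern matching on the plan
// description; it never fires if the event is posted before isFinalPlan is set.
class FinalPlanListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLAdaptiveExecutionUpdate
        if e.physicalPlanDescription.contains("isFinalPlan=true") =>
      println(s"Final AQE plan for execution ${e.executionId}")
    case _ => // ignore all other events
  }
}
{code}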



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30217) Improve partition pruning rule to enforce idempotence

2019-12-11 Thread Manu Zhang (Jira)
Manu Zhang created SPARK-30217:
--

 Summary: Improve partition pruning rule to enforce idempotence
 Key: SPARK-30217
 URL: https://issues.apache.org/jira/browse/SPARK-30217
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Manu Zhang


The rule was added to blacklist in 
https://github.com/apache/spark/commit/a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-a636a87d8843eeccca90140be91d4fafR52



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29793) Display plan evolve history in AQE

2019-11-27 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983339#comment-16983339
 ] 

Manu Zhang commented on SPARK-29793:


[~maryannxue],

Is the {{AdaptiveSparkPlan(isFinal=true)}} graph currently available on Spark 
UI or printed anywhere ?

> Display plan evolve history in AQE
> --
>
> Key: SPARK-29793
> URL: https://issues.apache.org/jira/browse/SPARK-29793
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Priority: Major
>
> To have a way to present the entire evolve history of an AQE plan. This can 
> be done in two stages:
>  # Enable an option for printing the plan changes on the go.
>  # Display history plans in Spark UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE

2019-11-26 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983155#comment-16983155
 ] 

Manu Zhang edited comment on SPARK-29792 at 11/27/19 4:05 AM:
--

I find most metrics (except for data sources) are missing from SQL execution 
graph for AQE with subqueries. I suppose this Jira will fix it ?


was (Author: mauzhang):
I find most metrics are missing from SQL execution graph for AQE with 
subqueries. I suppose this Jira will fix it ?

> SQL metrics cannot be updated to subqueries in AQE
> --
>
> Key: SPARK-29792
> URL: https://issues.apache.org/jira/browse/SPARK-29792
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Ke Jia
>Priority: Major
>
> After merged [SPARK-28583|https://issues.apache.org/jira/browse/SPARK-28583], 
> the subqueries info can not be updated in AQE. And this Jira will fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE

2019-11-26 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983155#comment-16983155
 ] 

Manu Zhang commented on SPARK-29792:


I find most metrics are missing from SQL execution graph for AQE with 
subqueries. I suppose this Jira will fix it ?

> SQL metrics cannot be updated to subqueries in AQE
> --
>
> Key: SPARK-29792
> URL: https://issues.apache.org/jira/browse/SPARK-29792
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Ke Jia
>Priority: Major
>
> After merged [SPARK-28583|https://issues.apache.org/jira/browse/SPARK-28583], 
> the subqueries info can not be updated in AQE. And this Jira will fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE

2019-11-26 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-29792:
---
Attachment: image-2019-11-27-11-46-01-587.png

> SQL metrics cannot be updated to subqueries in AQE
> --
>
> Key: SPARK-29792
> URL: https://issues.apache.org/jira/browse/SPARK-29792
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Ke Jia
>Priority: Major
>
> After merged [SPARK-28583|https://issues.apache.org/jira/browse/SPARK-28583], 
> the subqueries info can not be updated in AQE. And this Jira will fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29792) SQL metrics cannot be updated to subqueries in AQE

2019-11-26 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-29792:
---
Attachment: (was: image-2019-11-27-11-46-01-587.png)

> SQL metrics cannot be updated to subqueries in AQE
> --
>
> Key: SPARK-29792
> URL: https://issues.apache.org/jira/browse/SPARK-29792
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Ke Jia
>Priority: Major
>
> After merged [SPARK-28583|https://issues.apache.org/jira/browse/SPARK-28583], 
> the subqueries info can not be updated in AQE. And this Jira will fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28531) Improve Extract Python UDFs optimizer rule to enforce idempotence

2019-11-18 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16977109#comment-16977109
 ] 

Manu Zhang commented on SPARK-28531:


[~manifoldQAQ],

I saw the PR was closed. What's the status of this issue ? Any plan to submit a 
new PR ?

> Improve Extract Python UDFs optimizer rule to enforce idempotence
> -
>
> Key: SPARK-28531
> URL: https://issues.apache.org/jira/browse/SPARK-28531
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yesheng Ma
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27689) Error to execute hive views with spark

2019-08-01 Thread Manu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897948#comment-16897948
 ] 

Manu Zhang commented on SPARK-27689:


[~lambda], [~hyukjin.kwon], [~yumwang],

This issue should be fixed by 
[PR#24960|https://github.com/apache/spark/pull/24960] 
([PR#25068|https://github.com/apache/spark/pull/25068] for branch-2.4 and 
[PR#25293|https://github.com/apache/spark/pull/25293] for 2.3) where analyzing 
View is deferred to optimizer. 

Before that, the View was analyzed first and JOIN would try to deduplicate 
conflicting attributes (since you are joining on the same id_person column) by 
*replacing* ids of the View's plan in the *right* branch. That's where things 
went wrong. Take a look at the logical plan under View:
{code:java}
+- View (`schema_p`.`person_product_v`, 
[id_person#76,id_product#77,country#78,city#79,price#80,start_date#81,end_date#82])
  +- !Project [cast(id_person#103 as int) AS id_person#76, cast(id_product#104 
as int) AS id_product#77, cast(country#105 as string) AS country#78, 
cast(city#106 as string) AS city#79, cast(price#107 as decimal(38,8)) AS 
price#80, cast(start_date#108 as date) AS start_date#81, cast(end_date#109 as 
date) AS end_date#82]
+- Project [cast(id_person#7 as int) AS id_person#0, cast(id_product#8 as 
int) AS id_product#1, cast(country#9 as string) AS country#2, cast(city#10 as 
string) AS city#3, cast(price#11 as decimal(38,8)) AS price#4, 
cast(start_date#12 as date) AS start_date#5, cast(end_date#13 as date) AS 
end_date#6]
{code}
The input of the node {{!Project}} (e.g. id_person#103) didn't match the output 
of its child (e.g. id_person#0). That's because id_person#103 was rewritten from 
id_person#0 in the deduplicating process, while id_person#0 could not be 
replaced since it's not an attribute (check out Alias).

This can be reproduced with a UT like
{code}
test("SparkSQL failed to resolve attributes with nested self-joins on hive view") {
  withTable("hive_table") {
    withView("hive_view", "temp_view1", "temp_view2") {
      sql("CREATE TABLE hive_table AS SELECT 1 AS id")
      sql("CREATE VIEW hive_view AS SELECT id FROM hive_table")
      sql("CREATE TEMPORARY VIEW temp_view1 AS SELECT id FROM hive_view")
      sql("CREATE TEMPORARY VIEW temp_view2 AS SELECT a.id " +
        "FROM temp_view1 AS a JOIN temp_view1 AS b ON a.id = b.id")
      val df = sql("SELECT c.id FROM temp_view1 AS c JOIN temp_view2 AS d ON c.id = d.id")
      checkAnswer(df, Row(1))
    }
  }
}
{code}

> Error to execute hive views with spark
> --
>
> Key: SPARK-27689
> URL: https://issues.apache.org/jira/browse/SPARK-27689
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.3, 2.4.3
>Reporter: Lambda
>Priority: Major
>
> I have a python error when I execute the following code using hive views but 
> it works correctly when I run it with hive tables.
> *Hive databases:*
> {code:java}
> CREATE DATABASE schema_p LOCATION "hdfs:///tmp/schema_p";
> {code}
> *Hive tables:*
> {code:java}
> CREATE TABLE schema_p.product(
>  id_product string,
>  name string,
>  country string,
>  city string,
>  start_date string,
>  end_date string
>  )
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/product';
> {code}
> {code:java}
> CREATE TABLE schema_p.person_product(
>  id_person string,
>  id_product string,
>  country string,
>  city string,
>  price string,
>  start_date string,
>  end_date string
>  )
>  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' 
>  STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
>  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
>  LOCATION 'hdfs:///tmp/schema_p/person_product';
> {code}
> *Hive views:*
> {code:java}
> CREATE VIEW schema_p.product_v AS SELECT CAST(id_product AS INT) AS 
> id_product, name AS name, country AS country, city AS city, CAST(start_date 
> AS DATE) AS start_date, CAST(end_date AS DATE) AS end_date FROM 
> schema_p.product;
>  
> CREATE VIEW schema_p.person_product_v AS SELECT CAST(id_person AS INT) AS 
> id_person, CAST(id_product AS INT) AS id_product, country AS country, city AS 
> city, CAST(price AS DECIMAL(38,8)) AS price, CAST(start_date AS DATE) AS 
> start_date, CAST(end_date AS DATE) AS end_date FROM schema_p.person_product;
> {code}
> *Code*:
> {code:java}
> def read_tables(sc):
>   in_dict = { 'product': 'product_v', 'person_product': 'person_product_v' }
>   data_dict = {}
>   for n, d in in_dict.iteritems():
> data_dict[n] = sc.read.table(d)
>   return data_dict
> def 

[jira] [Comment Edited] (SPARK-22063) Upgrade lintr to latest commit sha1 ID

2019-07-27 Thread Manu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892595#comment-16892595
 ] 

Manu Zhang edited comment on SPARK-22063 at 7/28/19 4:19 AM:
-

[~hyukjin.kwon], [~felixcheung], is there any update in this thread ? Which 
lint-r version is used on Jenkins now ?

I also find installing lint-r would upgrade testthat to latest version while 
[SparkR requires testthat 1.0.2 
|https://github.com/apache/spark/blob/master/docs/building-spark.md#running-r-tests]


was (Author: mauzhang):
Is there any update in this thread ? Which lint-r version is used in build now ?

I also find upgrading lint-r would upgrade testthat to latest version while 
[SparkR requires testthat 1.0.2 
|https://github.com/apache/spark/blob/master/docs/building-spark.md#running-r-tests]

> Upgrade lintr to latest commit sha1 ID
> --
>
> Key: SPARK-22063
> URL: https://issues.apache.org/jira/browse/SPARK-22063
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this 
> pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026])
>  and SPARK-14074.
> Today, I tried to upgrade the latest, 
> https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72
> This fixes many bugs and now finds many instances that I have observed and 
> thought should be caught time to time:
> {code}
> inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis 
> in a function call.
>   return (output)
>  ^
> R/column.R:241:1: style: Lines should not be more than 100 characters.
> #'
> \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{
> ^~~~
> R/context.R:332:1: style: Variable and function names should not be longer 
> than 30 characters.
> spark.getSparkFilesRootDirectory <- function() {
> ^~~~
> R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters.
> #' @param j,select expression for the single Column or a list of columns to 
> select from the SparkDataFrame.
> ^~~
> R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters.
> #' @return A new SparkDataFrame containing only the rows that meet the 
> condition with selected columns.
> ^~~
> R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a 
> function call.
>   return (joinRes)
>  ^
> R/DataFrame.R:2652:1: style: Variable and function names should not be longer 
> than 30 characters.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
> ^
> R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a 
> function call.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
>   ^
> R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a 
> function call.
> stop ("The following column name: ", newJoin, " occurs more than once 
> in the 'DataFrame'.",
>  ^
> R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters.
> #' @note The statistics provided by \code{summary} were change in 2.3.0 use 
> \link{describe} for previous defaults.
> ^~
> R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{cube} creates a single global 
> aggregate and is equivalent to
> ^~~
> R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{rollup} creates a single global 
> aggregate and is equivalent to
> ^
> R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a 
> function call.
>   switch (type,
>  ^
> R/functions.R:41:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{window}, it must be a time Column 
> of \code{TimestampType}.
> 

[jira] [Comment Edited] (SPARK-22063) Upgrade lintr to latest commit sha1 ID

2019-07-25 Thread Manu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892595#comment-16892595
 ] 

Manu Zhang edited comment on SPARK-22063 at 7/25/19 9:41 AM:
-

Is there any update in this thread ? Which lint-r version is used in build now ?

I also find upgrading lint-r would upgrade testthat to latest version while 
[SparkR requires testthat 1.0.2 
|https://github.com/apache/spark/blob/master/docs/building-spark.md#running-r-tests]


was (Author: mauzhang):
Is there any update in this thread ? Which lint-r version is used in build now ?

I also find upgrading lint-r would also upgrade testthat to latest version 
while [SparkR requires testthat 1.0.2 
|https://github.com/apache/spark/blob/master/docs/building-spark.md#running-r-tests]

> Upgrade lintr to latest commit sha1 ID
> --
>
> Key: SPARK-22063
> URL: https://issues.apache.org/jira/browse/SPARK-22063
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this 
> pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026])
>  and SPARK-14074.
> Today, I tried to upgrade the latest, 
> https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72
> This fixes many bugs and now finds many instances that I have observed and 
> thought should be caught time to time:
> {code}
> inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis 
> in a function call.
>   return (output)
>  ^
> R/column.R:241:1: style: Lines should not be more than 100 characters.
> #'
> \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{
> ^~~~
> R/context.R:332:1: style: Variable and function names should not be longer 
> than 30 characters.
> spark.getSparkFilesRootDirectory <- function() {
> ^~~~
> R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters.
> #' @param j,select expression for the single Column or a list of columns to 
> select from the SparkDataFrame.
> ^~~
> R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters.
> #' @return A new SparkDataFrame containing only the rows that meet the 
> condition with selected columns.
> ^~~
> R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a 
> function call.
>   return (joinRes)
>  ^
> R/DataFrame.R:2652:1: style: Variable and function names should not be longer 
> than 30 characters.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
> ^
> R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a 
> function call.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
>   ^
> R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a 
> function call.
> stop ("The following column name: ", newJoin, " occurs more than once 
> in the 'DataFrame'.",
>  ^
> R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters.
> #' @note The statistics provided by \code{summary} were change in 2.3.0 use 
> \link{describe} for previous defaults.
> ^~
> R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{cube} creates a single global 
> aggregate and is equivalent to
> ^~~
> R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{rollup} creates a single global 
> aggregate and is equivalent to
> ^
> R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a 
> function call.
>   switch (type,
>  ^
> R/functions.R:41:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{window}, it must be a time Column 
> of \code{TimestampType}.
> 

[jira] [Commented] (SPARK-22063) Upgrade lintr to latest commit sha1 ID

2019-07-25 Thread Manu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892595#comment-16892595
 ] 

Manu Zhang commented on SPARK-22063:


Is there any update in this thread ? Which lint-r version is used in build now ?

I also find upgrading lint-r would also upgrade testthat to latest version 
while [SparkR requires testthat 1.0.2 
|https://github.com/apache/spark/blob/master/docs/building-spark.md#running-r-tests]

> Upgrade lintr to latest commit sha1 ID
> --
>
> Key: SPARK-22063
> URL: https://issues.apache.org/jira/browse/SPARK-22063
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, we set lintr to {{jimhester/lintr@a769c0b}} (see [this 
> pr|https://github.com/apache/spark/commit/7d1175011c976756efcd4e4e4f70a8fd6f287026])
>  and SPARK-14074.
> Today, I tried to upgrade the latest, 
> https://github.com/jimhester/lintr/commit/5431140ffea65071f1327625d4a8de9688fa7e72
> This fixes many bugs and now finds many instances that I have observed and 
> thought should be caught time to time:
> {code}
> inst/worker/worker.R:71:10: style: Remove spaces before the left parenthesis 
> in a function call.
>   return (output)
>  ^
> R/column.R:241:1: style: Lines should not be more than 100 characters.
> #'
> \href{https://spark.apache.org/docs/latest/sparkr.html#data-type-mapping-between-r-and-spark}{
> ^~~~
> R/context.R:332:1: style: Variable and function names should not be longer 
> than 30 characters.
> spark.getSparkFilesRootDirectory <- function() {
> ^~~~
> R/DataFrame.R:1912:1: style: Lines should not be more than 100 characters.
> #' @param j,select expression for the single Column or a list of columns to 
> select from the SparkDataFrame.
> ^~~
> R/DataFrame.R:1918:1: style: Lines should not be more than 100 characters.
> #' @return A new SparkDataFrame containing only the rows that meet the 
> condition with selected columns.
> ^~~
> R/DataFrame.R:2597:22: style: Remove spaces before the left parenthesis in a 
> function call.
>   return (joinRes)
>  ^
> R/DataFrame.R:2652:1: style: Variable and function names should not be longer 
> than 30 characters.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
> ^
> R/DataFrame.R:2652:47: style: Remove spaces before the left parenthesis in a 
> function call.
> generateAliasesForIntersectedCols <- function (x, intersectedColNames, 
> suffix) {
>   ^
> R/DataFrame.R:2660:14: style: Remove spaces before the left parenthesis in a 
> function call.
> stop ("The following column name: ", newJoin, " occurs more than once 
> in the 'DataFrame'.",
>  ^
> R/DataFrame.R:3047:1: style: Lines should not be more than 100 characters.
> #' @note The statistics provided by \code{summary} were change in 2.3.0 use 
> \link{describe} for previous defaults.
> ^~
> R/DataFrame.R:3754:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{cube} creates a single global 
> aggregate and is equivalent to
> ^~~
> R/DataFrame.R:3789:1: style: Lines should not be more than 100 characters.
> #' If grouping expression is missing \code{rollup} creates a single global 
> aggregate and is equivalent to
> ^
> R/deserialize.R:46:10: style: Remove spaces before the left parenthesis in a 
> function call.
>   switch (type,
>  ^
> R/functions.R:41:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{window}, it must be a time Column 
> of \code{TimestampType}.
> ^
> R/functions.R:93:1: style: Lines should not be more than 100 characters.
> #' @param x Column to compute on. In \code{shiftLeft}, \code{shiftRight} and 
> \code{shiftRightUnsigned},
> ^~~
> R/functions.R:483:52: 

[jira] [Commented] (SPARK-26977) Warn against subclassing scala.App doesn't work

2019-02-22 Thread Manu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16775783#comment-16775783
 ] 

Manu Zhang commented on SPARK-26977:


I'd love to

> Warn against subclassing scala.App doesn't work
> ---
>
> Key: SPARK-26977
> URL: https://issues.apache.org/jira/browse/SPARK-26977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.0
>Reporter: Manu Zhang
>Priority: Minor
>
> As per the discussion in 
> [PR#3497|https://github.com/apache/spark/pull/3497#discussion_r258412735], 
> the warning against subclassing scala.App doesn't work. For example,
> {code:scala}
> object Test extends scala.App {
>// spark code
> }
> {code}
> Scala will compile {{object Test}} into two Java classes: {{Test}}, the class 
> passed in by the user, and {{Test$}}, which subclasses {{scala.App}}. The current 
> code checks against {{Test}}, so no warning is issued when the user's application 
> subclasses {{scala.App}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26977) Warn against subclassing scala.App doesn't work

2019-02-22 Thread Manu Zhang (JIRA)
Manu Zhang created SPARK-26977:
--

 Summary: Warn against subclassing scala.App doesn't work
 Key: SPARK-26977
 URL: https://issues.apache.org/jira/browse/SPARK-26977
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.4.0
Reporter: Manu Zhang


As per the discussion in 
[PR#3497|https://github.com/apache/spark/pull/3497#discussion_r258412735], the 
warning against subclassing scala.App doesn't work. For example,


{code:scala}
object Test extends scala.App {
   // spark code
}
{code}

Scala will compile {{object Test}} into two Java classes: {{Test}}, the class passed 
in by the user, and {{Test$}}, which subclasses {{scala.App}}. The current code checks 
against {{Test}}, so no warning is issued when the user's application subclasses 
{{scala.App}}.
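
For illustration, a check that would catch this case could inspect the "$"-suffixed 
module class in addition to the main class. This is only a minimal sketch under that 
assumption, not the actual SparkSubmit code, and the helper name is made up:

{code:scala}
// Scala compiles `object Test extends scala.App` into `Test` (static
// forwarders) and the module class `Test$`; only `Test$` actually extends
// scala.App, so the check also has to look at the "$"-suffixed class.
def extendsScalaApp(mainClassName: String): Boolean = {
  def load(name: String): Option[Class[_]] =
    try Some(Class.forName(name))
    catch { case _: ClassNotFoundException => None }

  Seq(mainClassName, mainClassName + "$")
    .flatMap(load)
    .exists(c => classOf[scala.App].isAssignableFrom(c))
}
{code}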



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15041) adding mode strategy for ml.feature.Imputer for categorical features

2018-09-15 Thread Manu Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16616243#comment-16616243
 ] 

Manu Zhang commented on SPARK-15041:


Is there a plan to add strategies such as min/max?

> adding mode strategy for ml.feature.Imputer for categorical features
> 
>
> Key: SPARK-15041
> URL: https://issues.apache.org/jira/browse/SPARK-15041
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: yuhao yang
>Priority: Minor
>
> Adding a mode strategy to ml.feature.Imputer for categorical features. This 
> needs to wait until the PR for SPARK-13568 gets merged.
> https://github.com/apache/spark/pull/11601
> From comments of jkbradley and Nick Pentreath in the PR
> {quote}
> Investigate efficiency of approaches using DataFrame/Dataset and/or approx 
> approaches such as frequentItems or Count-Min Sketch (will require an update 
> to CMS to return "heavy-hitters").
> investigate if we can use metadata to only allow mode for categorical 
> features (or perhaps as an easier alternative, allow mode for only Int/Long 
> columns)
> {quote}
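> 
> As a rough illustration of what a mode strategy boils down to, the per-column 
> mode can be computed with a plain DataFrame aggregation. This is only a sketch 
> (the helper name is made up and it ignores the frequentItems/CMS ideas above), 
> not the Imputer API:
> {code:scala}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions._
> 
> // For every requested column, pick the most frequent non-null value and
> // fill missing entries with it. Assumes each column has at least one
> // non-null value.
> def imputeWithMode(df: DataFrame, cols: Seq[String]): DataFrame = {
>   val modes = cols.map { c =>
>     val mode = df.filter(col(c).isNotNull)
>       .groupBy(col(c)).count()
>       .orderBy(desc("count"))
>       .head().get(0)
>     c -> mode
>   }.toMap
>   df.na.fill(modes)
> }
> {code}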



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-09-13 Thread Manu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16165674#comment-16165674
 ] 

Manu Zhang commented on SPARK-12837:


[~todd_leo], please check out [executor-side 
broadcast|https://issues.apache.org/jira/browse/SPARK-17556]

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.2.0
>
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory even when no data is collected back to the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark-shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto(a: Int, b: Int)
> val df = sc.parallelize(1 to 1e6.toInt).map(i => Toto(i, i)).toDF
> sqlContext.setConf("spark.sql.shuffle.partitions", "200")
> df.groupBy("a").count().saveAsParquetFile("toto1") // OK
> sqlContext.setConf("spark.sql.shuffle.partitions", 1e3.toInt.toString)
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt)
>   .saveAsParquetFile("toto2") // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}
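> 
> One way to unblock such a job, shown here only as an illustration and assuming 
> the driver actually has enough memory for the extra serialized task results, is 
> to raise the limit (or set it to 0 to disable it) when starting the shell, since 
> it cannot be changed after the context is created. The 4g/2g values are arbitrary:
> {code:java}
> bin/spark-shell --driver-memory=4g --conf spark.driver.maxResultSize=2g
> {code}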



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org