[jira] [Created] (SPARK-26461) Use ConfigEntry for hardcoded configs for dynamicAllocation category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26461:
-

 Summary: Use ConfigEntry for hardcoded configs for 
dynamicAllocation category.
 Key: SPARK-26461
 URL: https://issues.apache.org/jira/browse/SPARK-26461
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin
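
For context, this family of sub-tasks tracks replacing hardcoded, string-keyed config
lookups in Spark Core with typed ConfigEntry definitions. The sketch below shows roughly
what such a migration looks like; the entry name spark.dynamicAllocation.enabled is a real
config key, but the enclosing object and value names are illustrative only, not the actual
code added by these tickets.

{code:scala}
package org.apache.spark.internal.config

// Illustrative sketch: define a typed entry once instead of repeating the raw
// string "spark.dynamicAllocation.enabled" at every call site.
private[spark] object IllustrativeDynamicAllocationConfig {
  val DYN_ALLOCATION_ENABLED =
    ConfigBuilder("spark.dynamicAllocation.enabled")
      .doc("Whether to use dynamic resource allocation.")
      .booleanConf
      .createWithDefault(false)
}

// Call sites can then read the typed value through SparkConf, e.g.:
//   val enabled = conf.get(IllustrativeDynamicAllocationConfig.DYN_ALLOCATION_ENABLED)
{code}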









[jira] [Created] (SPARK-26468) Use ConfigEntry for hardcoded configs for task category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26468:
-

 Summary: Use ConfigEntry for hardcoded configs for task category.
 Key: SPARK-26468
 URL: https://issues.apache.org/jira/browse/SPARK-26468
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26463) Use ConfigEntry for hardcoded configs for scheduler category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26463:
-

 Summary: Use ConfigEntry for hardcoded configs for scheduler 
category.
 Key: SPARK-26463
 URL: https://issues.apache.org/jira/browse/SPARK-26463
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26478) Use ConfigEntry for hardcoded configs for rdd category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26478:
-

 Summary: Use ConfigEntry for hardcoded configs for rdd category.
 Key: SPARK-26478
 URL: https://issues.apache.org/jira/browse/SPARK-26478
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26491) Use ConfigEntry for hardcoded configs for test category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26491:
-

 Summary: Use ConfigEntry for hardcoded configs for test category.
 Key: SPARK-26491
 URL: https://issues.apache.org/jira/browse/SPARK-26491
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26489) Use ConfigEntry for hardcoded configs for python category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26489:
-

 Summary: Use ConfigEntry for hardcoded configs for python category.
 Key: SPARK-26489
 URL: https://issues.apache.org/jira/browse/SPARK-26489
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26490) Use ConfigEntry for hardcoded configs for r category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26490:
-

 Summary: Use ConfigEntry for hardcoded configs for r category.
 Key: SPARK-26490
 URL: https://issues.apache.org/jira/browse/SPARK-26490
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26488) Use ConfigEntry for hardcoded configs for modify.acl category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26488:
-

 Summary: Use ConfigEntry for hardcoded configs for modify.acl 
category.
 Key: SPARK-26488
 URL: https://issues.apache.org/jira/browse/SPARK-26488
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26483) Use ConfigEntry for hardcoded configs for ssl category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26483:
-

 Summary: Use ConfigEntry for hardcoded configs for ssl category.
 Key: SPARK-26483
 URL: https://issues.apache.org/jira/browse/SPARK-26483
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26486) Use ConfigEntry for hardcoded configs for metrics category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26486:
-

 Summary: Use ConfigEntry for hardcoded configs for metrics 
category.
 Key: SPARK-26486
 URL: https://issues.apache.org/jira/browse/SPARK-26486
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26477) Use ConfigEntry for hardcoded configs for unsafe category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26477:
-

 Summary: Use ConfigEntry for hardcoded configs for unsafe category.
 Key: SPARK-26477
 URL: https://issues.apache.org/jira/browse/SPARK-26477
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26476) Use ConfigEntry for hardcoded configs for cleaner category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26476:
-

 Summary: Use ConfigEntry for hardcoded configs for cleaner 
category.
 Key: SPARK-26476
 URL: https://issues.apache.org/jira/browse/SPARK-26476
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26481) Use ConfigEntry for hardcoded configs for reducer category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26481:
-

 Summary: Use ConfigEntry for hardcoded configs for reducer 
category.
 Key: SPARK-26481
 URL: https://issues.apache.org/jira/browse/SPARK-26481
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26479) Use ConfigEntry for hardcoded configs for locality category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26479:
-

 Summary: Use ConfigEntry for hardcoded configs for locality 
category.
 Key: SPARK-26479
 URL: https://issues.apache.org/jira/browse/SPARK-26479
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26482) Use ConfigEntry for hardcoded configs for ui category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26482:
-

 Summary: Use ConfigEntry for hardcoded configs for ui category.
 Key: SPARK-26482
 URL: https://issues.apache.org/jira/browse/SPARK-26482
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26485) Use ConfigEntry for hardcoded configs for master.rest/ui categories.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26485:
-

 Summary: Use ConfigEntry for hardcoded configs for master.rest/ui 
categories.
 Key: SPARK-26485
 URL: https://issues.apache.org/jira/browse/SPARK-26485
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26487) Use ConfigEntry for hardcoded configs for admin category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26487:
-

 Summary: Use ConfigEntry for hardcoded configs for admin category.
 Key: SPARK-26487
 URL: https://issues.apache.org/jira/browse/SPARK-26487
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26480) Use ConfigEntry for hardcoded configs for broadcast category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26480:
-

 Summary: Use ConfigEntry for hardcoded configs for broadcast 
category.
 Key: SPARK-26480
 URL: https://issues.apache.org/jira/browse/SPARK-26480
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26484) Use ConfigEntry for hardcoded configs for authenticate category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26484:
-

 Summary: Use ConfigEntry for hardcoded configs for authenticate 
category.
 Key: SPARK-26484
 URL: https://issues.apache.org/jira/browse/SPARK-26484
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26469) Use ConfigEntry for hardcoded configs for io category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26469:
-

 Summary: Use ConfigEntry for hardcoded configs for io category.
 Key: SPARK-26469
 URL: https://issues.apache.org/jira/browse/SPARK-26469
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26475) Use ConfigEntry for hardcoded configs for buffer category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26475:
-

 Summary: Use ConfigEntry for hardcoded configs for buffer category.
 Key: SPARK-26475
 URL: https://issues.apache.org/jira/browse/SPARK-26475
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26473) Use ConfigEntry for hardcoded configs for deploy category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26473:
-

 Summary: Use ConfigEntry for hardcoded configs for deploy category.
 Key: SPARK-26473
 URL: https://issues.apache.org/jira/browse/SPARK-26473
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26471) Use ConfigEntry for hardcoded configs for speculation category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26471:
-

 Summary: Use ConfigEntry for hardcoded configs for speculation 
category.
 Key: SPARK-26471
 URL: https://issues.apache.org/jira/browse/SPARK-26471
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26474) Use ConfigEntry for hardcoded configs for worker category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26474:
-

 Summary: Use ConfigEntry for hardcoded configs for worker category.
 Key: SPARK-26474
 URL: https://issues.apache.org/jira/browse/SPARK-26474
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26472) Use ConfigEntry for hardcoded configs for serializer category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26472:
-

 Summary: Use ConfigEntry for hardcoded configs for serializer 
category.
 Key: SPARK-26472
 URL: https://issues.apache.org/jira/browse/SPARK-26472
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26467) Use ConfigEntry for hardcoded configs for rpc category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26467:
-

 Summary: Use ConfigEntry for hardcoded configs for rpc category.
 Key: SPARK-26467
 URL: https://issues.apache.org/jira/browse/SPARK-26467
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26462) Use ConfigEntry for hardcoded configs for memory category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26462:
-

 Summary: Use ConfigEntry for hardcoded configs for memory category.
 Key: SPARK-26462
 URL: https://issues.apache.org/jira/browse/SPARK-26462
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26466) Use ConfigEntry for hardcoded configs for submit category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26466:
-

 Summary: Use ConfigEntry for hardcoded configs for submit category.
 Key: SPARK-26466
 URL: https://issues.apache.org/jira/browse/SPARK-26466
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26465) Use ConfigEntry for hardcoded configs for jars category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26465:
-

 Summary: Use ConfigEntry for hardcoded configs for jars category.
 Key: SPARK-26465
 URL: https://issues.apache.org/jira/browse/SPARK-26465
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26464) Use ConfigEntry for hardcoded configs for storage category.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26464:
-

 Summary: Use ConfigEntry for hardcoded configs for storage 
category.
 Key: SPARK-26464
 URL: https://issues.apache.org/jira/browse/SPARK-26464
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Created] (SPARK-26460) Use ConfigEntry for hardcoded kryo/kryoserializer configs.

2018-12-27 Thread Takuya Ueshin (JIRA)
Takuya Ueshin created SPARK-26460:
-

 Summary: Use ConfigEntry for hardcoded kryo/kryoserializer configs.
 Key: SPARK-26460
 URL: https://issues.apache.org/jira/browse/SPARK-26460
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Takuya Ueshin









[jira] [Updated] (SPARK-26460) Use ConfigEntry for hardcoded configs for kryo/kryoserializer categories.

2018-12-27 Thread Takuya Ueshin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin updated SPARK-26460:
--
Summary: Use ConfigEntry for hardcoded configs for kryo/kryoserializer 
categories.  (was: Use ConfigEntry for hardcoded kryo/kryoserializer configs.)

> Use ConfigEntry for hardcoded configs for kryo/kryoserializer categories.
> -
>
> Key: SPARK-26460
> URL: https://issues.apache.org/jira/browse/SPARK-26460
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Takuya Ueshin
>Priority: Major
>







[jira] [Commented] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS

2018-12-27 Thread Saisai Shao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730084#comment-16730084
 ] 

Saisai Shao commented on SPARK-20415:
-

Have you tried the latest version of Spark? Does this problem still exist in the 
latest version? Also, is there a way to reproduce this problem easily?

> SPARK job hangs while writing DataFrame to HDFS
> ---
>
> Key: SPARK-20415
> URL: https://issues.apache.org/jira/browse/SPARK-20415
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 2.1.0
> Environment: EMR 5.4.0
>Reporter: P K
>Priority: Major
>
> We are in a POC phase with Spark. One of the steps is reading compressed JSON 
> files that come from sources, "exploding" them into tabular format, and then 
> writing them to HDFS. This worked for about three weeks until, a few days ago, 
> for a particular dataset, the writer just started hanging. I logged in to the 
> worker machines and saw this stack trace:
> "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 
> tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000]
>java.lang.Thread.State: RUNNABLE
> at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210)
> at 
> org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111)
> at 
> org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> The last messages ever printed in stderr before the hang are:
> 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at 
> NativeMethodAccessorImpl.java:0)
> 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List()
> 17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 
> (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has 
> no missing parents
> 17/04/18 01:41:14 INFO MemoryStore: Block 

[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2018-12-27 Thread Cheng Su (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730029#comment-16730029
 ] 

Cheng Su commented on SPARK-26164:
--

[~cloud_fan], could you help take a look at 
[https://github.com/apache/spark/pull/23163]? Thanks!

> [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
> --
>
> Key: SPARK-26164
> URL: https://issues.apache.org/jira/browse/SPARK-26164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Cheng Su
>Priority: Minor
>
> Problem:
> Currently, Spark always requires a local sort on the partition/bucket columns 
> before writing to an output table [1]. The disadvantage is that the sort may 
> waste reserved CPU time on the executor due to spill. Hive does not require 
> the local sort before writing the output table [2], and we saw a performance 
> regression when migrating Hive workloads to Spark.
>  
> Proposal:
> We can avoid the local sort by keeping a mapping from file path to output 
> writer. When a row targets a new file path, we create a new output writer; 
> otherwise we reuse the already-open writer for that path (the main change 
> would be in FileFormatDataWriter.scala). This is very similar to what Hive 
> does in [2].
> Since the new behavior (avoiding the sort by keeping multiple output writers 
> open at the same time) consumes more memory on the executor than the current 
> behavior (only one output writer open), we can add a config to switch between 
> the current and new behavior.
>  
> [1]: spark FileFormatWriter.scala - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]
> [2]: hive FileSinkOperator.java - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510]
>  
>  
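
A minimal sketch of the proposed approach, using illustrative types and names rather
than Spark's actual FileFormatDataWriter internals: keep a map from output path to an
open writer, create a writer lazily on first use, and reuse it for later rows, instead
of sorting the rows so that only one writer needs to be open at a time.

{code:scala}
import scala.collection.mutable

// Illustrative stand-in for Spark's internal output writer abstraction.
trait OutputWriter {
  def write(row: String): Unit
  def close(): Unit
}

// Keeps one open writer per target path and reuses it, trading extra memory
// (several writers open at once) for skipping the local sort.
class MultiWriterTask(newWriter: String => OutputWriter) {
  private val writers = mutable.Map.empty[String, OutputWriter]

  def write(path: String, row: String): Unit =
    writers.getOrElseUpdate(path, newWriter(path)).write(row)

  // Close every writer when the task finishes.
  def commit(): Unit = writers.values.foreach(_.close())
}
{code}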






[jira] [Commented] (SPARK-26459) remove UpdateNullabilityInAttributeReferences

2018-12-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730026#comment-16730026
 ] 

Apache Spark commented on SPARK-26459:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/23390

> remove UpdateNullabilityInAttributeReferences
> -
>
> Key: SPARK-26459
> URL: https://issues.apache.org/jira/browse/SPARK-26459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Commented] (SPARK-26459) remove UpdateNullabilityInAttributeReferences

2018-12-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730027#comment-16730027
 ] 

Apache Spark commented on SPARK-26459:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/23390

> remove UpdateNullabilityInAttributeReferences
> -
>
> Key: SPARK-26459
> URL: https://issues.apache.org/jira/browse/SPARK-26459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-26459) remove UpdateNullabilityInAttributeReferences

2018-12-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26459:


Assignee: Wenchen Fan  (was: Apache Spark)

> remove UpdateNullabilityInAttributeReferences
> -
>
> Key: SPARK-26459
> URL: https://issues.apache.org/jira/browse/SPARK-26459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Assigned] (SPARK-26459) remove UpdateNullabilityInAttributeReferences

2018-12-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26459:


Assignee: Apache Spark  (was: Wenchen Fan)

> remove UpdateNullabilityInAttributeReferences
> -
>
> Key: SPARK-26459
> URL: https://issues.apache.org/jira/browse/SPARK-26459
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-26459) remove UpdateNullabilityInAttributeReferences

2018-12-27 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-26459:
---

 Summary: remove UpdateNullabilityInAttributeReferences
 Key: SPARK-26459
 URL: https://issues.apache.org/jira/browse/SPARK-26459
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Created] (SPARK-26458) OneHotEncoderModel verifies the number of category values incorrectly when it tries to transform a dataframe.

2018-12-27 Thread duruihuan (JIRA)
duruihuan created SPARK-26458:
-

 Summary: OneHotEncoderModel verifies the number of category values 
incorrectly when it tries to transform a dataframe.
 Key: SPARK-26458
 URL: https://issues.apache.org/jira/browse/SPARK-26458
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.1
Reporter: duruihuan


When handleInvalid is set to "keep", one should not compare the categorySizes 
from transformSchema with the values in the metadata of the dataframe to be 
transformed, because some columns in the dataframe may contain more than one 
invalid value, which causes an exception as described in lines 302-306 of 
OneHotEncoderEstimator.scala. In conclusion, I think the verifyNumOfValues 
check in the transformSchema method, found at line 299 of the code, should be 
removed.
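
A minimal sketch of the kind of usage in question, assuming Spark 2.3's
OneHotEncoderEstimator API in a spark-shell session; the column names and values are
illustrative, and this is not a confirmed reproduction of the reported failure. The model
is trained on a small set of categories and then asked to transform a dataframe
containing unseen values, which handleInvalid = "keep" is supposed to tolerate.

{code:scala}
import org.apache.spark.ml.feature.OneHotEncoderEstimator
import spark.implicits._

// Train on categories {0, 1}, then transform data containing unseen categories.
val train = Seq(0.0, 1.0).toDF("cat")
val test  = Seq(0.0, 2.0, 3.0).toDF("cat")

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("cat"))
  .setOutputCols(Array("cat_vec"))
  .setHandleInvalid("keep")   // unseen categories should be kept, not rejected

val model = encoder.fit(train)
// According to the report, the verifyNumOfValues check in transformSchema can
// still reject this kind of dataframe.
model.transform(test).show()
{code}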

 






[jira] [Updated] (SPARK-26446) Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager

2018-12-27 Thread Qingxin Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qingxin Wu updated SPARK-26446:
---
Description: 
Add docs to describe how the remove policy acts when considering the property 
_*{{spark.dynamicAllocation.cachedExecutorIdleTimeout}}*_ in 
ExecutorAllocationManager.

 

 

  was:
 

Add docs to describe how remove policy act while considering the property
{code:java}
spark.dynamicAllocation.cachedExecutorIdleTimeout
{code}
  in ExecutorAllocationManager.

 

 

 


> Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager
> ---
>
> Key: SPARK-26446
> URL: https://issues.apache.org/jira/browse/SPARK-26446
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Qingxin Wu
>Priority: Minor
>
> Add docs to describe how the remove policy acts when considering the property 
> _*{{spark.dynamicAllocation.cachedExecutorIdleTimeout}}*_ in 
> ExecutorAllocationManager.
>  
>  
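
For readers unfamiliar with the setting: under dynamic allocation, executors holding
cached blocks are governed by this separate idle timeout rather than the ordinary
executorIdleTimeout. A minimal sketch of configuring it follows; the timeout values and
app name are illustrative.

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative values: idle executors are removed after 60s, but executors
// holding cached data are kept for up to 30 minutes before removal.
val spark = SparkSession.builder()
  .appName("cached-executor-idle-timeout-example")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .config("spark.dynamicAllocation.cachedExecutorIdleTimeout", "1800s")
  .getOrCreate()
{code}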






[jira] [Updated] (SPARK-26446) Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager

2018-12-27 Thread Qingxin Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qingxin Wu updated SPARK-26446:
---
Description: 
 

Add docs to describe how the remove policy acts when considering the property
{code:java}
spark.dynamicAllocation.cachedExecutorIdleTimeout
{code}
  in ExecutorAllocationManager.

 

 

 

  was:
Add docs to describe how remove policy act while considering the property 
{{spark.dynamicAllocation.cachedExecutorIdleTimeout}} in 
ExecutorAllocationManager.

 

 


> Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager
> ---
>
> Key: SPARK-26446
> URL: https://issues.apache.org/jira/browse/SPARK-26446
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Qingxin Wu
>Priority: Minor
>
>  
> Add docs to describe how the remove policy acts when considering the property
> {code:java}
> spark.dynamicAllocation.cachedExecutorIdleTimeout
> {code}
>   in ExecutorAllocationManager.
>  
>  
>  






[jira] [Commented] (SPARK-22579) BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be implemented using streaming

2018-12-27 Thread Andrey Siunov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730007#comment-16730007
 ] 

Andrey Siunov commented on SPARK-22579:
---

Is it true that the issue has been resolved in the ticket 
https://issues.apache.org/jira/browse/SPARK-25905 (PR: 
https://github.com/apache/spark/pull/23058)?

> BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be 
> implemented using streaming
> --
>
> Key: SPARK-22579
> URL: https://issues.apache.org/jira/browse/SPARK-22579
> Project: Spark
>  Issue Type: Improvement
>  Components: Block Manager, Spark Core
>Affects Versions: 2.1.0
>Reporter: Eyal Farago
>Priority: Major
>
> When an RDD partition is cached on an executor but the task requiring it is 
> running on another executor (process locality ANY), the cached partition is 
> fetched via BlockManager.getRemoteValues, which delegates to 
> BlockManager.getRemoteBytes; both calls are blocking.
> In my use case I had a 700GB RDD spread over 1000 partitions on a 6-node 
> cluster, cached to disk. Rough math shows that the average partition size is 
> 700MB.
> Looking at the Spark UI it was obvious that tasks running with process 
> locality 'ANY' were much slower than local tasks (roughly 40 seconds vs. 8-10 
> minutes). I was able to capture thread dumps of executors executing remote 
> tasks and got this stack trace:
> {quote}Thread ID  Thread Name Thread StateThread Locks
> 1521  Executor task launch worker-1000WAITING 
> Lock(java.util.concurrent.ThreadPoolExecutor$Worker@196462978})
> sun.misc.Unsafe.park(Native Method)
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> scala.concurrent.Await$.result(package.scala:190)
> org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
> org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104)
> org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:582)
> org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:550)
> org.apache.spark.storage.BlockManager.get(BlockManager.scala:638)
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:690)
> org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287){quote}
> Digging into the code showed that the block manager first fetches all bytes 
> (getRemoteBytes) and then wraps them with a deserialization stream. This has 
> several drawbacks:
> 1. Blocking: the requesting executor is blocked while the remote executor is 
> serving the block.
> 2. Potentially large memory footprint on the requesting executor; in my use 
> case, 700MB of raw bytes stored in a ChunkedByteBuffer.
> 3. Inefficient: the requesting side usually doesn't need all values at once, 
> since it consumes the values via an iterator.
> 4. Potentially large memory footprint on the serving executor: if the block 
> is cached in deserialized form, the serving executor has to serialize it into 
> a ChunkedByteBuffer (BlockManager.doGetLocalBytes). This is both memory and 
> CPU intensive; the memory footprint could be reduced by using a limited 
> buffer for serialization, 'spilling' to the response stream.
> I suggest improving this by implementing either a full streaming mechanism or 
> some kind of pagination mechanism. In addition, the requesting executor should 
> be able to make progress with the data it already has, blocking only when the 
> local buffer is exhausted and the remote side has not yet delivered the next 
> chunk of the stream (or the page, in the case of pagination).
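
A conceptual sketch of the suggested direction, not Spark's actual BlockManager API (the
function name and types below are illustrative): deserialize values lazily from the
remote block's stream as data arrives, so the requester holds roughly one buffer's worth
of bytes instead of the entire block.

{code:scala}
import java.io.{BufferedInputStream, EOFException, InputStream, ObjectInputStream}

// Wrap the remote block's byte stream in a lazy iterator; values are
// deserialized one at a time as the consumer asks for them.
def remoteValuesAsIterator(openBlockStream: () => InputStream): Iterator[AnyRef] = {
  val in = new ObjectInputStream(new BufferedInputStream(openBlockStream(), 64 * 1024))
  new Iterator[AnyRef] {
    private var nextElem: Option[AnyRef] = advance()
    private def advance(): Option[AnyRef] =
      try Some(in.readObject())
      catch { case _: EOFException => in.close(); None }
    def hasNext: Boolean = nextElem.isDefined
    def next(): AnyRef = { val v = nextElem.get; nextElem = advance(); v }
  }
}
{code}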




[jira] [Updated] (SPARK-26446) Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager

2018-12-27 Thread Qingxin Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qingxin Wu updated SPARK-26446:
---
Description: 
Add docs to describe how the remove policy acts when considering the property 
{{spark.dynamicAllocation.cachedExecutorIdleTimeout}} in 
ExecutorAllocationManager.

 

 

> Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager
> ---
>
> Key: SPARK-26446
> URL: https://issues.apache.org/jira/browse/SPARK-26446
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Qingxin Wu
>Priority: Minor
>
> Add docs to describe how the remove policy acts when considering the property 
> {{spark.dynamicAllocation.cachedExecutorIdleTimeout}} in 
> ExecutorAllocationManager.
>  
>  






[jira] [Updated] (SPARK-26446) Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager

2018-12-27 Thread Qingxin Wu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Qingxin Wu updated SPARK-26446:
---
Summary: Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager  
(was: improve doc on ExecutorAllocationManager)

> Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager
> ---
>
> Key: SPARK-26446
> URL: https://issues.apache.org/jira/browse/SPARK-26446
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Qingxin Wu
>Priority: Minor
>







[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark

2018-12-27 Thread Jackey Lee (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730003#comment-16730003
 ] 

Jackey Lee commented on SPARK-24630:


[~jackylk]

I have finished a detailed design doc; we can discuss it on the [mailing 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/Support-SqlStreaming-in-spark-td24202.html]
 or in the 
[doc|https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit].

> SPIP: Support SQLStreaming in Spark
> ---
>
> Key: SPARK-24630
> URL: https://issues.apache.org/jira/browse/SPARK-24630
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.2.0, 2.2.1
>Reporter: Jackey Lee
>Priority: Minor
>  Labels: SQLStreaming
> Attachments: SQLStreaming SPIP V2.pdf
>
>
> At present, KafkaSQL, Flink SQL (which is actually based on Calcite), 
> SQLStream, and StormSQL all provide a streaming SQL interface with which 
> users who have little knowledge about streaming can easily develop a stream 
> processing model. In Spark, we can also support a SQL API based on Structured 
> Streaming.
> To support SQL Streaming, there are two key points:
> 1. The Analysis phase should be able to parse streaming-type SQL.
> 2. The Analyzer should be able to map metadata information to the 
> corresponding Relation.






[jira] [Updated] (SPARK-26437) Decimal data becomes bigint to query, unable to query

2018-12-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26437:
--
Affects Version/s: 1.6.3

> Decimal data becomes bigint to query, unable to query
> -
>
> Key: SPARK-26437
> URL: https://issues.apache.org/jira/browse/SPARK-26437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1
>Reporter: zengxl
>Priority: Major
> Fix For: 3.0.0
>
>
> this is my sql:
> create table tmp.tmp_test_6387_1224_spark  stored  as ORCFile  as select 0.00 
> as a
> select a from tmp.tmp_test_6387_1224_spark
> CREATE TABLE `tmp.tmp_test_6387_1224_spark`(
>  {color:#f79232} `a` decimal(2,2)){color}
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> When I query this table (using Hive or Spark SQL, the exception is the same), 
> the following exception is thrown:
> *Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed 
> stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 
> limit: 0*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readBigInteger(SerializationUtils.java:176)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$DecimalTreeReader.next(TreeReaderFactory.java:1264)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2004)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1039)*
>  






[jira] [Resolved] (SPARK-26437) Decimal data becomes bigint to query, unable to query

2018-12-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26437.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

> Decimal data becomes bigint to query, unable to query
> -
>
> Key: SPARK-26437
> URL: https://issues.apache.org/jira/browse/SPARK-26437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1
>Reporter: zengxl
>Priority: Major
> Fix For: 3.0.0
>
>
> this is my sql:
> create table tmp.tmp_test_6387_1224_spark  stored  as ORCFile  as select 0.00 
> as a
> select a from tmp.tmp_test_6387_1224_spark
> CREATE TABLE `tmp.tmp_test_6387_1224_spark`(
>  {color:#f79232} `a` decimal(2,2)){color}
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> When I query this table (using Hive or Spark SQL, the exception is the same), 
> the following exception is thrown:
> *Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed 
> stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 
> limit: 0*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readBigInteger(SerializationUtils.java:176)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$DecimalTreeReader.next(TreeReaderFactory.java:1264)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2004)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1039)*
>  






[jira] [Commented] (SPARK-26437) Decimal data becomes bigint to query, unable to query

2018-12-27 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729945#comment-16729945
 ] 

Dongjoon Hyun commented on SPARK-26437:
---

Hi, [~zengxl].

Thank you for reporting. This is a very old issue, present since Apache Spark 
1.x, which occurs when you use `decimal`. Please note the `CAST` and `decimal` 
in the following example. Since Spark 2.0, the `0.0` literal is interpreted as 
`Decimal`, so you are hitting this issue without casting, too. This is fixed on 
the `master` branch and will be released in Apache Spark 3.0.0.

{code}
scala> sc.version
res0: String = 1.6.3

scala> sql("drop table spark_orc")
scala> sql("create table spark_orc stored as orc as select cast(0.00 as 
decimal(2,2)) as a")
scala> sql("select * from spark_orc").show
...
Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed 
stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 
limit: 0
{code}

If you are interested, the details are as follows.

First, the underlying ORC issue (HIVE-13083) is fixed in Hive 1.3.0, but Spark 
is still using the embedded Hive 1.2.1. To work around the underlying ORC issue, 
you can use the new ORC data source (`set spark.sql.orc.impl=native`). So, in 
Spark 2.4.0, you can use the `USING` syntax to avoid this.

{code}
scala> sql("create table spark_orc using orc as select 0.00 as a")
scala> sql("select * from spark_orc").show
++
|   a|
++
|0.00|
++

scala> spark.version
res2: String = 2.4.0
{code}

Second, SPARK-22977 introduced a regression on CTAS in Spark 2.3.0, which was 
recently fixed by SPARK-25271 (Hive CTAS commands should use the data source if 
it is convertible) for Apache Spark 3.0.0. In Spark 3.0.0, you can use the 
`STORED AS ORC` syntax without this problem.
{code}
scala> sql("create table spark_orc stored as orc as select 0.00 as a")
scala> sql("select * from spark_orc").show
++
|   a|
++
|0.00|
++

scala> spark.version
res3: String = 3.0.0-SNAPSHOT
{code}

So, I'll close this issue since this is fixed in 3.0.0.

cc [~cloud_fan], [~viirya], [~smilegator], [~hyukjin.kwon]

> Decimal data becomes bigint to query, unable to query
> -
>
> Key: SPARK-26437
> URL: https://issues.apache.org/jira/browse/SPARK-26437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1
>Reporter: zengxl
>Priority: Major
>
> this is my sql:
> create table tmp.tmp_test_6387_1224_spark  stored  as ORCFile  as select 0.00 
> as a
> select a from tmp.tmp_test_6387_1224_spark
> CREATE TABLE `tmp.tmp_test_6387_1224_spark`(
>  {color:#f79232} `a` decimal(2,2)){color}
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> When I query this table (using Hive or Spark SQL, the exception is the same), 
> the following exception is thrown:
> *Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed 
> stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 
> limit: 0*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readBigInteger(SerializationUtils.java:176)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$DecimalTreeReader.next(TreeReaderFactory.java:1264)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2004)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1039)*
>  






[jira] [Updated] (SPARK-26437) Decimal data becomes bigint to query, unable to query

2018-12-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26437:
--
Affects Version/s: 2.0.2
   2.1.3
   2.2.2

> Decimal data becomes bigint to query, unable to query
> -
>
> Key: SPARK-26437
> URL: https://issues.apache.org/jira/browse/SPARK-26437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1
>Reporter: zengxl
>Priority: Major
>
> this is my sql:
> create table tmp.tmp_test_6387_1224_spark  stored  as ORCFile  as select 0.00 
> as a
> select a from tmp.tmp_test_6387_1224_spark
> CREATE TABLE `tmp.tmp_test_6387_1224_spark`(
>  {color:#f79232} `a` decimal(2,2)){color}
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> When I query this table (using Hive or Spark SQL, the exception is the same), 
> the following exception is thrown:
> *Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed 
> stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 
> limit: 0*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readBigInteger(SerializationUtils.java:176)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$DecimalTreeReader.next(TreeReaderFactory.java:1264)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2004)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1039)*
>  






[jira] [Comment Edited] (SPARK-17572) Write.df is failing on spark cluster

2018-12-27 Thread Tarun Parmar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729921#comment-16729921
 ] 

Tarun Parmar edited comment on SPARK-17572 at 12/27/18 10:51 PM:
-

I am facing a similar issue; my Spark+Hadoop version is the same as Sankar's. I 
am using Spark with RStudio, without Hadoop, to generate Parquet files and 
store them on a local/NFS mount.

What I noticed is that the _temporary directory is owned by my userid, but the 
'0' directory inside _temporary is owned by root, which is probably why it is 
failing to delete.

I already checked with RStudio; they don't think it is an issue with the 
sparklyr package.

 

 


was (Author: tarunparmar):
I am facing similar issue, my Spark+Hadoop version is same as Sankar's. I am 
using Spark with RStudio without hadoop to generate parquet files and store 
them in local/nfs mount. 

What I noticed is the _temporary directory is owned by my userid but the '0' 
directory inside _temporary is owned by root which is probably why it is 
failing to delete. 

Already checked with RStudio, they don't this it is an issue with sparklyr 
package. 

 

 

> Write.df is failing on spark cluster
> 
>
> Key: SPARK-17572
> URL: https://issues.apache.org/jira/browse/SPARK-17572
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Sankar Mittapally
>Priority: Major
>
> Hi,
>  We have a Spark cluster with four nodes; all four nodes share an NFS 
> partition (there is no HDFS), and we have the same uid on all servers. When 
> we try to write data we get the following exceptions. I am not sure whether 
> it is an error or not, and I am not sure whether I will lose data in the 
> output.
> The command which I am using to save the data.
> {code}
> saveDF(banking_l1_1,"banking_l1_v2.csv",source="csv",mode="append",schema="true")
> {code}
> {noformat}
> 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> java.io.IOException: Failed to rename 
> DeprecatedRawLocalFileStatus{path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task_201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv;
>  isDirectory=false; length=436486316; replication=1; blocksize=33554432; 
> modification_time=147409940; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false} to 
> file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:371)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
> at 

[jira] [Commented] (SPARK-26437) Decimal data becomes bigint to query, unable to query

2018-12-27 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729925#comment-16729925
 ] 

Dongjoon Hyun commented on SPARK-26437:
---

Thanks, [~mgaido].

> Decimal data becomes bigint to query, unable to query
> -
>
> Key: SPARK-26437
> URL: https://issues.apache.org/jira/browse/SPARK-26437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: zengxl
>Priority: Major
>
> This is my SQL:
> create table tmp.tmp_test_6387_1224_spark  stored  as ORCFile  as select 0.00 
> as a
> select a from tmp.tmp_test_6387_1224_spark
> CREATE TABLE `tmp.tmp_test_6387_1224_spark`(
>   `a` decimal(2,2))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> When I query this table (using Hive or Spark SQL, the exception is the same), 
> the following exception is thrown:
> *Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed 
> stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 
> limit: 0*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readBigInteger(SerializationUtils.java:176)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$DecimalTreeReader.next(TreeReaderFactory.java:1264)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2004)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1039)*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17572) Write.df is failing on spark cluster

2018-12-27 Thread Tarun Parmar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729921#comment-16729921
 ] 

Tarun Parmar commented on SPARK-17572:
--

I am facing a similar issue; my Spark+Hadoop version is the same as Sankar's. I am 
using Spark with RStudio, without Hadoop, to generate Parquet files and store 
them on a local/NFS mount. 

What I noticed is that the _temporary directory is owned by my userid, but the '0' 
directory inside _temporary is owned by root, which is probably why it is 
failing to delete. 

I already checked with RStudio; they don't think it is an issue with the 
sparklyr package. 

 

 

> Write.df is failing on spark cluster
> 
>
> Key: SPARK-17572
> URL: https://issues.apache.org/jira/browse/SPARK-17572
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.0.0
>Reporter: Sankar Mittapally
>Priority: Major
>
> Hi,
>  We have a Spark cluster with four nodes; all four nodes have an NFS partition 
> shared (there is no HDFS), and we have the same uid on all servers. When we try 
> to write data we get the following exceptions. I am not sure whether this is an 
> error, and I am not sure whether I will lose the data in the output.
> The command I am using to save the data:
> {code}
> saveDF(banking_l1_1,"banking_l1_v2.csv",source="csv",mode="append",schema="true")
> {code}
> {noformat}
> 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting job.
> java.io.IOException: Failed to rename 
> DeprecatedRawLocalFileStatus{path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task_201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv;
>  isDirectory=false; length=436486316; replication=1; blocksize=33554432; 
> modification_time=147409940; access_time=0; owner=; group=; 
> permission=rw-rw-rw-; isSymlink=false} to 
> file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:371)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
> at 
> org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
> at 
> org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86)
> at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
> at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> 

[jira] [Assigned] (SPARK-26450) Map of schema is built too frequently in some wide queries

2018-12-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26450:


Assignee: Apache Spark

> Map of schema is built too frequently in some wide queries
> --
>
> Key: SPARK-26450
> URL: https://issues.apache.org/jira/browse/SPARK-26450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Assignee: Apache Spark
>Priority: Minor
>
> When executing queries with wide projections and wide schemas, Spark rebuilds 
> an attribute map for the same schema many times.
> For example:
> {noformat}
> select * from orctbl where id1 = 1
> {noformat}
> Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above 
> query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq 
> instantiation builds a map of the entire list of 6000 attributes (but not 
> until lazy val exprIdToOrdinal is referenced).
> Whenever OrcFileFormat reads a new file, it generates a new unsafe 
> projection. That results in this 
> [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319]
>  getting called:
> {code:java}
> protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): 
> Seq[Expression] =
> in.map(BindReferences.bindReference(_, inputSchema))
> {code}
> For each column in the projection, this line calls bindReference. Each call 
> passes inputSchema, a Sequence of Attributes, to a parameter position 
> expecting an AttributeSeq. The compiler implicitly calls the constructor for 
> AttributeSeq, which (lazily) builds a map for every attribute in the schema. 
> Therefore, this function builds a map of the entire schema once for each 
> column in the projection, and it does this for each input file. For the above 
> example query, this accounts for 204K instantiations of AttributeSeq.
> Readers for CSV and JSON tables do something similar.
> In addition, ProjectExec also creates an unsafe projection for each task. As 
> a result, this 
> [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91]
>  gets called, which has the same issue:
> {code:java}
>   def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): 
> Seq[Expression] = {
> exprs.map(BindReferences.bindReference(_, inputSchema))
>   }
> {code}
> The above affects all wide queries that have a projection node, regardless of 
> the file reader. For the example query, ProjectExec accounts for the 
> additional 66K instantiations of the AttributeSeq.
> Spark can save time by pre-building the AttributeSeq right before the map 
> operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size 
> of schema, size of projection, number of input files (for Orc), number of 
> file splits (for CSV, and JSON tables), and number of tasks.
> For a 6000 column CSV table with 500K records and 34 input files, the time 
> savings is only 6%[1] because Spark doesn't create as many unsafe projections 
> as compared to Orc tables.
> On the other hand, for a 6000 column Orc table with 500K records and 34 input 
> files, the time savings is about 16%[1].
> [1] based on queries run in local mode with 8 executor threads on my laptop.
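
To make the pre-building idea concrete, here is a minimal, self-contained Scala sketch 
using toy stand-ins for Attribute and AttributeSeq (not Spark's actual catalyst 
classes); it only illustrates why hoisting the Seq-to-AttributeSeq conversion out of 
the per-column map avoids rebuilding the exprId-to-ordinal index once per bound 
expression.

{code:java}
// Toy stand-ins for catalyst's Attribute/AttributeSeq, for illustration only.
final case class Attribute(name: String, exprId: Long)

final class AttributeSeq(val attrs: Seq[Attribute]) {
  // Built lazily, but only once per AttributeSeq instance.
  lazy val exprIdToOrdinal: Map[Long, Int] =
    attrs.map(_.exprId).zipWithIndex.toMap
}

object BindSketch {
  // Anti-pattern described above: converting the schema for every column
  // rebuilds the ordinal map once per bound expression.
  def bindSlow(columns: Seq[String], schema: Seq[Attribute]): Seq[Int] =
    columns.map { _ =>
      new AttributeSeq(schema).exprIdToOrdinal.size   // map rebuilt each time
    }

  // Shape of the proposed fix: build the AttributeSeq once and reuse it.
  def bindFast(columns: Seq[String], schema: Seq[Attribute]): Seq[Int] = {
    val bound = new AttributeSeq(schema)              // built once
    columns.map(_ => bound.exprIdToOrdinal.size)
  }

  def main(args: Array[String]): Unit = {
    val schema  = (1 to 6000).map(i => Attribute(s"c$i", i.toLong))
    val columns = schema.map(_.name)
    println(bindFast(columns, schema).length)         // 6000
  }
}
{code}

The actual change would of course touch the real bind/toBoundExprs call sites rather 
than these toy types.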



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26450) Map of schema is built too frequently in some wide queries

2018-12-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26450:


Assignee: (was: Apache Spark)

> Map of schema is built too frequently in some wide queries
> --
>
> Key: SPARK-26450
> URL: https://issues.apache.org/jira/browse/SPARK-26450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> When executing queries with wide projections and wide schemas, Spark rebuilds 
> an attribute map for the same schema many times.
> For example:
> {noformat}
> select * from orctbl where id1 = 1
> {noformat}
> Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above 
> query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq 
> instantiation builds a map of the entire list of 6000 attributes (but not 
> until lazy val exprIdToOrdinal is referenced).
> Whenever OrcFileFormat reads a new file, it generates a new unsafe 
> projection. That results in this 
> [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319]
>  getting called:
> {code:java}
> protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): 
> Seq[Expression] =
> in.map(BindReferences.bindReference(_, inputSchema))
> {code}
> For each column in the projection, this line calls bindReference. Each call 
> passes inputSchema, a Sequence of Attributes, to a parameter position 
> expecting an AttributeSeq. The compiler implicitly calls the constructor for 
> AttributeSeq, which (lazily) builds a map for every attribute in the schema. 
> Therefore, this function builds a map of the entire schema once for each 
> column in the projection, and it does this for each input file. For the above 
> example query, this accounts for 204K instantiations of AttributeSeq.
> Readers for CSV and JSON tables do something similar.
> In addition, ProjectExec also creates an unsafe projection for each task. As 
> a result, this 
> [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91]
>  gets called, which has the same issue:
> {code:java}
>   def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): 
> Seq[Expression] = {
> exprs.map(BindReferences.bindReference(_, inputSchema))
>   }
> {code}
> The above affects all wide queries that have a projection node, regardless of 
> the file reader. For the example query, ProjectExec accounts for the 
> additional 66K instantiations of the AttributeSeq.
> Spark can save time by pre-building the AttributeSeq right before the map 
> operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size 
> of schema, size of projection, number of input files (for Orc), number of 
> file splits (for CSV, and JSON tables), and number of tasks.
> For a 6000 column CSV table with 500K records and 34 input files, the time 
> savings is only 6%[1] because Spark doesn't create as many unsafe projections 
> as compared to Orc tables.
> On the other hand, for a 6000 column Orc table with 500K records and 34 input 
> files, the time savings is about 16%[1].
> [1] based on queries run in local mode with 8 executor threads on my laptop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26021) -0.0 and 0.0 not treated consistently, doesn't match Hive

2018-12-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26021:
--
Fix Version/s: (was: 2.4.1)

> -0.0 and 0.0 not treated consistently, doesn't match Hive
> -
>
> Key: SPARK-26021
> URL: https://issues.apache.org/jira/browse/SPARK-26021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Alon Doron
>Priority: Critical
> Fix For: 3.0.0
>
>
> Per [~adoron] and [~mccheah] and SPARK-24834, I'm splitting this out as a new 
> issue:
> The underlying issue is how Spark and Hive treat 0.0 and -0.0, which are 
> numerically identical but not the same double value:
> In hive, 0.0 and -0.0 are equal since 
> https://issues.apache.org/jira/browse/HIVE-11174.
>  That's not the case with spark sql as "group by" (non-codegen) treats them 
> as different values. Since their hash is different they're put in different 
> buckets of UnsafeFixedWidthAggregationMap.
> In addition there's an inconsistency when using the codegen, for example the 
> following unit test:
> {code:java}
> println(Seq(0.0d, 0.0d, 
> -0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,3]
> {code:java}
> println(Seq(0.0d, -0.0d, 
> 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,1], [-0.0,2]
> {code:java}
> spark.conf.set("spark.sql.codegen.wholeStage", "false")
> println(Seq(0.0d, -0.0d, 
> 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,2], [-0.0,1]
> Note that the only difference between the first 2 lines is the order of the 
> elements in the Seq.
>  This inconsistency results from different partitioning of the Seq and the 
> use of the generated fast hash map in the first, partial aggregation.
> It looks like we need to add a specific check for -0.0 before hashing (both 
> in codegen and non-codegen modes) if we want to fix this.
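
As a rough illustration of the kind of check suggested above (a sketch only, not the 
actual Spark fix), normalizing -0.0 to +0.0 before hashing makes both values land in 
the same aggregation bucket:

{code:java}
// Sketch only: normalize -0.0 to +0.0 before hashing so grouping keys agree.
def normalizeZero(d: Double): Double =
  if (d == 0.0d) 0.0d else d            // -0.0 == 0.0 is true, so both become +0.0

def hashDouble(d: Double): Long =
  java.lang.Double.doubleToLongBits(normalizeZero(d))

// The raw bit patterns (and hence naive hashes) of 0.0 and -0.0 differ ...
assert(java.lang.Double.doubleToLongBits(0.0d) != java.lang.Double.doubleToLongBits(-0.0d))
// ... but after normalization both values hash identically.
assert(hashDouble(0.0d) == hashDouble(-0.0d))
{code}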



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26021) -0.0 and 0.0 not treated consistently, doesn't match Hive

2018-12-27 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729804#comment-16729804
 ] 

Dongjoon Hyun commented on SPARK-26021:
---

This was reverted from `branch-2.4` via 
https://github.com/apache/spark/pull/23389 .

> -0.0 and 0.0 not treated consistently, doesn't match Hive
> -
>
> Key: SPARK-26021
> URL: https://issues.apache.org/jira/browse/SPARK-26021
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Assignee: Alon Doron
>Priority: Critical
> Fix For: 3.0.0
>
>
> Per [~adoron] and [~mccheah] and SPARK-24834, I'm splitting this out as a new 
> issue:
> The underlying issue is how Spark and Hive treat 0.0 and -0.0, which are 
> numerically identical but not the same double value:
> In hive, 0.0 and -0.0 are equal since 
> https://issues.apache.org/jira/browse/HIVE-11174.
>  That's not the case with spark sql as "group by" (non-codegen) treats them 
> as different values. Since their hash is different they're put in different 
> buckets of UnsafeFixedWidthAggregationMap.
> In addition there's an inconsistency when using the codegen, for example the 
> following unit test:
> {code:java}
> println(Seq(0.0d, 0.0d, 
> -0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,3]
> {code:java}
> println(Seq(0.0d, -0.0d, 
> 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,1], [-0.0,2]
> {code:java}
> spark.conf.set("spark.sql.codegen.wholeStage", "false")
> println(Seq(0.0d, -0.0d, 
> 0.0d).toDF("i").groupBy("i").count().collect().mkString(", "))
> {code}
> [0.0,2], [-0.0,1]
> Note that the only difference between the first 2 lines is the order of the 
> elements in the Seq.
>  This inconsistency results from different partitioning of the Seq and the 
> use of the generated fast hash map in the first, partial aggregation.
> It looks like we need to add a specific check for -0.0 before hashing (both 
> in codegen and non-codegen modes) if we want to fix this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26363) Avoid duplicated KV store lookups for task table

2018-12-27 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-26363:
---
Description: 
In the method `taskList` (since https://github.com/apache/spark/pull/21688), the 
executor log value is queried in the KV store for every task (in the method 
`constructTaskData`).
We can use a hashmap to reduce duplicated KV store lookups in the method.



  was:
In https://github.com/apache/spark/pull/21688, a new filed `executorLogs` is 
added to `TaskData` in `api.scala`:
1. The field should not belong to `TaskData` (from the meaning of wording).
2. This is redundant with ExecutorSummary. 
3. For each row in the task table, the executor log value is lookup in KV store 
every time, which can be avoided for better performance in large scale.

This PR propose to reuse the executor details of request "/allexecutors" , so 
that we can have a cleaner api data structure, and redundant KV store queries 
are avoided. 




>  Avoid duplicated KV store lookups for task table
> -
>
> Key: SPARK-26363
> URL: https://issues.apache.org/jira/browse/SPARK-26363
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In the method `taskList` (since https://github.com/apache/spark/pull/21688), the 
> executor log value is queried in the KV store for every task (in the method 
> `constructTaskData`).
> We can use a hashmap to reduce duplicated KV store lookups in the method.
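
A minimal sketch of the hashmap idea, with simplified stand-in types rather than the 
real Web UI / KV store API: each executor id is looked up at most once while the task 
rows are built.

{code:java}
import scala.collection.mutable

// Simplified stand-ins for the real task / KV-store types.
final case class TaskRow(taskId: Long, executorId: String, logs: Map[String, String])

def buildTaskRows(
    tasks: Seq[(Long, String)],                        // (taskId, executorId)
    lookupExecutorLogs: String => Map[String, String]  // the expensive KV store read
  ): Seq[TaskRow] = {
  val logCache = mutable.HashMap.empty[String, Map[String, String]]
  tasks.map { case (taskId, execId) =>
    // getOrElseUpdate hits the KV store only for the first occurrence of execId.
    val logs = logCache.getOrElseUpdate(execId, lookupExecutorLogs(execId))
    TaskRow(taskId, execId, logs)
  }
}
{code}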



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab

2018-12-27 Thread deshanxiao (JIRA)
deshanxiao created SPARK-26457:
--

 Summary: Show hadoop configurations in HistoryServer environment 
tab
 Key: SPARK-26457
 URL: https://issues.apache.org/jira/browse/SPARK-26457
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Web UI
Affects Versions: 2.4.0, 2.3.2
 Environment: Maybe it would be good to show some Hadoop configurations in the 
HistoryServer environment tab for debugging Hadoop-related issues
Reporter: deshanxiao






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26451) Change lead/lag argument name from count to offset

2018-12-27 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-26451:
--
 Docs Text: The 'lag' function in PySpark accepted an argument 'count' that 
should have been called 'offset'. It has been renamed accordingly.
Labels: release-notes  (was: )
Issue Type: Bug  (was: Documentation)

> Change lead/lag argument name from count to offset
> --
>
> Key: SPARK-26451
> URL: https://issues.apache.org/jira/browse/SPARK-26451
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Deepyaman Datta
>Assignee: Deepyaman Datta
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26451) Change lead/lag argument name from count to offset

2018-12-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26451.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23357
[https://github.com/apache/spark/pull/23357]

> Change lead/lag argument name from count to offset
> --
>
> Key: SPARK-26451
> URL: https://issues.apache.org/jira/browse/SPARK-26451
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Deepyaman Datta
>Assignee: Deepyaman Datta
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26451) Change lead/lag argument name from count to offset

2018-12-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26451:


Assignee: Deepyaman Datta

> Change lead/lag argument name from count to offset
> --
>
> Key: SPARK-26451
> URL: https://issues.apache.org/jira/browse/SPARK-26451
> Project: Spark
>  Issue Type: Documentation
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Deepyaman Datta
>Assignee: Deepyaman Datta
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26450) Map of schema is built too frequently in some wide queries

2018-12-27 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729675#comment-16729675
 ] 

Marco Gaido commented on SPARK-26450:
-

Great, thanks!

> Map of schema is built too frequently in some wide queries
> --
>
> Key: SPARK-26450
> URL: https://issues.apache.org/jira/browse/SPARK-26450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> When executing queries with wide projections and wide schemas, Spark rebuilds 
> an attribute map for the same schema many times.
> For example:
> {noformat}
> select * from orctbl where id1 = 1
> {noformat}
> Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above 
> query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq 
> instantiation builds a map of the entire list of 6000 attributes (but not 
> until lazy val exprIdToOrdinal is referenced).
> Whenever OrcFileFormat reads a new file, it generates a new unsafe 
> projection. That results in this 
> [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319]
>  getting called:
> {code:java}
> protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): 
> Seq[Expression] =
> in.map(BindReferences.bindReference(_, inputSchema))
> {code}
> For each column in the projection, this line calls bindReference. Each call 
> passes inputSchema, a Sequence of Attributes, to a parameter position 
> expecting an AttributeSeq. The compiler implicitly calls the constructor for 
> AttributeSeq, which (lazily) builds a map for every attribute in the schema. 
> Therefore, this function builds a map of the entire schema once for each 
> column in the projection, and it does this for each input file. For the above 
> example query, this accounts for 204K instantiations of AttributeSeq.
> Readers for CSV and JSON tables do something similar.
> In addition, ProjectExec also creates an unsafe projection for each task. As 
> a result, this 
> [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91]
>  gets called, which has the same issue:
> {code:java}
>   def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): 
> Seq[Expression] = {
> exprs.map(BindReferences.bindReference(_, inputSchema))
>   }
> {code}
> The above affects all wide queries that have a projection node, regardless of 
> the file reader. For the example query, ProjectExec accounts for the 
> additional 66K instantiations of the AttributeSeq.
> Spark can save time by pre-building the AttributeSeq right before the map 
> operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size 
> of schema, size of projection, number of input files (for Orc), number of 
> file splits (for CSV, and JSON tables), and number of tasks.
> For a 6000 column CSV table with 500K records and 34 input files, the time 
> savings is only 6%[1] because Spark doesn't create as many unsafe projections 
> as compared to Orc tables.
> On the other hand, for a 6000 column Orc table with 500K records and 34 input 
> files, the time savings is about 16%[1].
> [1] based on queries run in local mode with 8 executor threads on my laptop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26456) Cast date/timestamp by Date/TimestampFormatter

2018-12-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26456:


Assignee: (was: Apache Spark)

> Cast date/timestamp by Date/TimestampFormatter
> --
>
> Key: SPARK-26456
> URL: https://issues.apache.org/jira/browse/SPARK-26456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Currently, dates and timestamps are cast to strings using SimpleDateFormat. 
> This ticket aims to switch the code to the new DateFormatter and 
> TimestampFormatter that are already used in the CSV and JSON datasources for 
> the same purpose.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26456) Cast date/timestamp by Date/TimestampFormatter

2018-12-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26456:


Assignee: Apache Spark

> Cast date/timestamp by Date/TimestampFormatter
> --
>
> Key: SPARK-26456
> URL: https://issues.apache.org/jira/browse/SPARK-26456
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, dates and timestamps are cast to strings using SimpleDateFormat. 
> This ticket aims to switch the code to the new DateFormatter and 
> TimestampFormatter that are already used in the CSV and JSON datasources for 
> the same purpose.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26456) Cast date/timestamp by Date/TimestampFormatter

2018-12-27 Thread Maxim Gekk (JIRA)
Maxim Gekk created SPARK-26456:
--

 Summary: Cast date/timestamp by Date/TimestampFormatter
 Key: SPARK-26456
 URL: https://issues.apache.org/jira/browse/SPARK-26456
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Maxim Gekk


Currently, dates and timestamps are cast to strings using SimpleDateFormat. 
This ticket aims to switch the code to the new DateFormatter and 
TimestampFormatter that are already used in the CSV and JSON datasources for 
the same purpose.
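
For illustration of the general direction only (thread-safe java.time-based 
formatting instead of SimpleDateFormat), and not the internal 
DateFormatter/TimestampFormatter classes themselves, a sketch could look like this, 
assuming Spark's internal representation of timestamps as microseconds and dates as 
days since the epoch:

{code:java}
import java.time.{Instant, LocalDate, ZoneId}
import java.time.format.DateTimeFormatter

// A reusable, thread-safe formatter; SimpleDateFormat is mutable and not thread-safe.
val timestampFormatter: DateTimeFormatter =
  DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneId.of("UTC"))

// Timestamps assumed to be microseconds since the epoch (Spark's internal form).
def timestampToString(micros: Long): String =
  timestampFormatter.format(Instant.ofEpochMilli(micros / 1000L))

// Dates assumed to be days since the epoch (Spark's internal form).
def dateToString(daysSinceEpoch: Int): String =
  LocalDate.ofEpochDay(daysSinceEpoch.toLong).toString   // ISO yyyy-MM-dd
{code}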



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26248) Infer date type from CSV

2018-12-27 Thread Maxim Gekk (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk resolved SPARK-26248.

Resolution: Won't Fix

> Infer date type from CSV
> 
>
> Key: SPARK-26248
> URL: https://issues.apache.org/jira/browse/SPARK-26248
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Currently, DateType cannot be inferred from CSV. To parse CSV strings, you 
> have to specify the schema explicitly if the CSV input contains dates. This 
> ticket aims to extend CSVInferSchema to support such inference.
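
As a reference for the explicit-schema workaround mentioned above, a typical read with 
the standard DataFrameReader API looks roughly like this (the path and column names 
are hypothetical):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DateType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("csv-dates").master("local[*]").getOrCreate()

// DateType is not inferred from CSV, so declare it explicitly.
val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("birthday", DateType)
))

val df = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyy-MM-dd")   // pattern used to parse the date column
  .schema(schema)
  .csv("/path/to/people.csv")           // hypothetical input path

df.printSchema()
{code}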



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26450) Map of schema is built too frequently in some wide queries

2018-12-27 Thread Bruce Robbins (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729647#comment-16729647
 ] 

Bruce Robbins commented on SPARK-26450:
---

I can attempt a patch later today.

> Map of schema is built too frequently in some wide queries
> --
>
> Key: SPARK-26450
> URL: https://issues.apache.org/jira/browse/SPARK-26450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> When executing queries with wide projections and wide schemas, Spark rebuilds 
> an attribute map for the same schema many times.
> For example:
> {noformat}
> select * from orctbl where id1 = 1
> {noformat}
> Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above 
> query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq 
> instantiation builds a map of the entire list of 6000 attributes (but not 
> until lazy val exprIdToOrdinal is referenced).
> Whenever OrcFileFormat reads a new file, it generates a new unsafe 
> projection. That results in this 
> [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319]
>  getting called:
> {code:java}
> protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): 
> Seq[Expression] =
> in.map(BindReferences.bindReference(_, inputSchema))
> {code}
> For each column in the projection, this line calls bindReference. Each call 
> passes inputSchema, a Sequence of Attributes, to a parameter position 
> expecting an AttributeSeq. The compiler implicitly calls the constructor for 
> AttributeSeq, which (lazily) builds a map for every attribute in the schema. 
> Therefore, this function builds a map of the entire schema once for each 
> column in the projection, and it does this for each input file. For the above 
> example query, this accounts for 204K instantiations of AttributeSeq.
> Readers for CSV and JSON tables do something similar.
> In addition, ProjectExec also creates an unsafe projection for each task. As 
> a result, this 
> [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91]
>  gets called, which has the same issue:
> {code:java}
>   def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): 
> Seq[Expression] = {
> exprs.map(BindReferences.bindReference(_, inputSchema))
>   }
> {code}
> The above affects all wide queries that have a projection node, regardless of 
> the file reader. For the example query, ProjectExec accounts for the 
> additional 66K instantiations of the AttributeSeq.
> Spark can save time by pre-building the AttributeSeq right before the map 
> operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size 
> of schema, size of projection, number of input files (for Orc), number of 
> file splits (for CSV, and JSON tables), and number of tasks.
> For a 6000 column CSV table with 500K records and 34 input files, the time 
> savings is only 6%[1] because Spark doesn't create as many unsafe projections 
> as compared to Orc tables.
> On the other hand, for a 6000 column Orc table with 500K records and 34 input 
> files, the time savings is about 16%[1].
> [1] based on queries run in local mode with 8 executor threads on my laptop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25892) AttributeReference.withMetadata method should have return type AttributeReference

2018-12-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-25892:


Assignee: kevin yu

> AttributeReference.withMetadata method should have return type 
> AttributeReference
> -
>
> Key: SPARK-25892
> URL: https://issues.apache.org/jira/browse/SPARK-25892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jari Kujansuu
>Assignee: kevin yu
>Priority: Trivial
> Fix For: 3.0.0
>
>
> AttributeReference.withMetadata method should have return type 
> AttributeReference instead of Attribute.
> AttributeReference overrides the withMetadata method defined in the Attribute 
> superclass and returns an AttributeReference instance, but the method's return 
> type is Attribute, unlike the other with... methods overridden by 
> AttributeReference.
> In some cases you have to cast the return value back to AttributeReference.
> For example, if you want to modify the metadata of an AttributeReference in a 
> LogicalRelation, you have to cast the return value of withMetadata back to 
> AttributeReference, because LogicalRelation takes Seq[AttributeReference] as an 
> argument.
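
A small illustrative fragment of the cast described above (assuming an attrRef: 
AttributeReference and a newMetadata: Metadata already in scope):

{code:java}
// Before the change: withMetadata is declared to return Attribute, so callers
// that need Seq[AttributeReference] (e.g. for LogicalRelation) must cast.
val updated: AttributeReference =
  attrRef.withMetadata(newMetadata).asInstanceOf[AttributeReference]

// After the change the cast becomes unnecessary:
// val updated: AttributeReference = attrRef.withMetadata(newMetadata)
{code}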



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25892) AttributeReference.withMetadata method should have return type AttributeReference

2018-12-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25892.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 22918
[https://github.com/apache/spark/pull/22918]

> AttributeReference.withMetadata method should have return type 
> AttributeReference
> -
>
> Key: SPARK-25892
> URL: https://issues.apache.org/jira/browse/SPARK-25892
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Jari Kujansuu
>Assignee: kevin yu
>Priority: Trivial
> Fix For: 3.0.0
>
>
> AttributeReference.withMetadata method should have return type 
> AttributeReference instead of Attribute.
> AttributeReference overrides the withMetadata method defined in the Attribute 
> superclass and returns an AttributeReference instance, but the method's return 
> type is Attribute, unlike the other with... methods overridden by 
> AttributeReference.
> In some cases you have to cast the return value back to AttributeReference.
> For example, if you want to modify the metadata of an AttributeReference in a 
> LogicalRelation, you have to cast the return value of withMetadata back to 
> AttributeReference, because LogicalRelation takes Seq[AttributeReference] as an 
> argument.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26455) Spark Kinesis Integration with no SSL

2018-12-27 Thread Shashikant Bangera (JIRA)
Shashikant Bangera created SPARK-26455:
--

 Summary: Spark Kinesis Integration with no SSL
 Key: SPARK-26455
 URL: https://issues.apache.org/jira/browse/SPARK-26455
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.3.0
Reporter: Shashikant Bangera


Hi, 

We are trying to access the endpoint through the library mentioned below and we 
get the SSL error; I think it internally uses the KCL library. Looking at the 
error: if I have to skip certificate verification, is that possible through a KCL 
utils call? I do not find any provision to set SSL to false within the Spark 
Streaming Kinesis library like we do with KCL. Can you please help me with this?

compile("org.apache.spark:spark-streaming-kinesis-asl_2.11:2.3.0") {
 exclude group: 'org.apache.spark', module: 'spark-streaming_2.11'
}

Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for 
<kinesis-endpoint> doesn't match any of the subject alternative names: 
[kinesis-fips.us-east-1.amazonaws.com, *.kinesis.us-east-1.vpce.amazonaws.com, 
kinesis.us-east-1.amazonaws.com]
 at 
org.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:467)
 at 
org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:397)
 at 
org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355)
 at 
shade.com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:132)
 at 
org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
 at 
org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373)
 at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
shade.com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
 at shade.com.amazonaws.http.conn.$Proxy18.connect(Unknown Source)
 at 
org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381)
 at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237)
 at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
 at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
 at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
 at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
 at 
shade.com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
 at 
shade.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1238)
 at 
shade.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058)
 ... 20 more



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26454) IllegalArgument Exception is Thrown while creating new UDF with JAR

2018-12-27 Thread Udbhav Agrawal (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729628#comment-16729628
 ] 

Udbhav Agrawal commented on SPARK-26454:


[~dongjoon] [~cloud_fan]

When the UDF is created for the first time using the JAR option, the HDFS path 
is converted to a local path and the JAR is added to that path.

Creating the function a second time will check whether that JAR is present and 
associated with any other path; it will throw an IllegalArgumentException and 
continue creating the function.

> IllegalArgument Exception is Thrown while creating new UDF with JAR
> -
>
> Key: SPARK-26454
> URL: https://issues.apache.org/jira/browse/SPARK-26454
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 2.3.2
>Reporter: Udbhav Agrawal
>Priority: Major
>
> 【Test step】:
> 1.launch spark-shell
> 2. set role admin;
> 3. create new function
>   CREATE FUNCTION Func AS 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
> 'hdfs:///tmp/super_udf/two_udfs.jar'
> 4. Do select on the function
> sql("select Func('2018-03-09')").show()
> 5.Create new UDF with same JAR
>    sql("CREATE FUNCTION newFunc AS 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
> 'hdfs:///tmp/super_udf/two_udfs.jar'")
> 6. Do select on the new function created.
> sql("select newFunc ('2018-03-09')").show()
> 【Output】:
> The function is created, but an IllegalArgumentException is thrown; the select 
> returns a result, but with the IllegalArgumentException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26454) IllegalArgument Exception is Thrown while creating new UDF with JAR

2018-12-27 Thread Udbhav Agrawal (JIRA)
Udbhav Agrawal created SPARK-26454:
--

 Summary: IllegalArgument Exception is Thrown while creating new 
UDF with JAR
 Key: SPARK-26454
 URL: https://issues.apache.org/jira/browse/SPARK-26454
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell
Affects Versions: 2.3.2
Reporter: Udbhav Agrawal


【Test step】:
1.launch spark-shell
2. set role admin;
3. create new function
  CREATE FUNCTION Func AS 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
'hdfs:///tmp/super_udf/two_udfs.jar'
4. Do select on the function
sql("select Func('2018-03-09')").show()
5.Create new UDF with same JAR
   sql("CREATE FUNCTION newFunc AS 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
'hdfs:///tmp/super_udf/two_udfs.jar'")

6. Do select on the new function created.

sql("select newFunc ('2018-03-09')").show()

【Output】:

The function is created, but an IllegalArgumentException is thrown; the select 
returns a result, but with the IllegalArgumentException.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26450) Map of schema is built too frequently in some wide queries

2018-12-27 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729621#comment-16729621
 ] 

Marco Gaido commented on SPARK-26450:
-

Thanks for this JIRA [~bersprockets]. This makes sense to me. Do you want to 
submit a patch for this? Otherwise I can take it over. Thanks.

> Map of schema is built too frequently in some wide queries
> --
>
> Key: SPARK-26450
> URL: https://issues.apache.org/jira/browse/SPARK-26450
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> When executing queries with wide projections and wide schemas, Spark rebuilds 
> an attribute map for the same schema many times.
> For example:
> {noformat}
> select * from orctbl where id1 = 1
> {noformat}
> Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above 
> query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq 
> instantiation builds a map of the entire list of 6000 attributes (but not 
> until lazy val exprIdToOrdinal is referenced).
> Whenever OrcFileFormat reads a new file, it generates a new unsafe 
> projection. That results in this 
> [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319]
>  getting called:
> {code:java}
> protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): 
> Seq[Expression] =
> in.map(BindReferences.bindReference(_, inputSchema))
> {code}
> For each column in the projection, this line calls bindReference. Each call 
> passes inputSchema, a Sequence of Attributes, to a parameter position 
> expecting an AttributeSeq. The compiler implicitly calls the constructor for 
> AttributeSeq, which (lazily) builds a map for every attribute in the schema. 
> Therefore, this function builds a map of the entire schema once for each 
> column in the projection, and it does this for each input file. For the above 
> example query, this accounts for 204K instantiations of AttributeSeq.
> Readers for CSV and JSON tables do something similar.
> In addition, ProjectExec also creates an unsafe projection for each task. As 
> a result, this 
> [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91]
>  gets called, which has the same issue:
> {code:java}
>   def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): 
> Seq[Expression] = {
> exprs.map(BindReferences.bindReference(_, inputSchema))
>   }
> {code}
> The above affects all wide queries that have a projection node, regardless of 
> the file reader. For the example query, ProjectExec accounts for the 
> additional 66K instantiations of the AttributeSeq.
> Spark can save time by pre-building the AttributeSeq right before the map 
> operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size 
> of schema, size of projection, number of input files (for Orc), number of 
> file splits (for CSV, and JSON tables), and number of tasks.
> For a 6000 column CSV table with 500K records and 34 input files, the time 
> savings is only 6%[1] because Spark doesn't create as many unsafe projections 
> as compared to Orc tables.
> On the other hand, for a 6000 column Orc table with 500K records and 34 input 
> files, the time savings is about 16%[1].
> [1] based on queries run in local mode with 8 executor threads on my laptop.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25910) accumulator updates from previous stage attempt should not fail

2018-12-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-25910.
--
Resolution: Duplicate

> accumulator updates from previous stage attempt should not fail
> ---
>
> Key: SPARK-25910
> URL: https://issues.apache.org/jira/browse/SPARK-25910
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26437) Decimal data becomes bigint to query, unable to query

2018-12-27 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729526#comment-16729526
 ] 

Marco Gaido commented on SPARK-26437:
-

cc [~dongjoon]

> Decimal data becomes bigint to query, unable to query
> -
>
> Key: SPARK-26437
> URL: https://issues.apache.org/jira/browse/SPARK-26437
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: zengxl
>Priority: Major
>
> This is my SQL:
> create table tmp.tmp_test_6387_1224_spark  stored  as ORCFile  as select 0.00 
> as a
> select a from tmp.tmp_test_6387_1224_spark
> CREATE TABLE `tmp.tmp_test_6387_1224_spark`(
>   `a` decimal(2,2))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
> When I query this table (using Hive or Spark SQL, the exception is the same), 
> the following exception is thrown:
> *Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed 
> stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 
> limit: 0*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readBigInteger(SerializationUtils.java:176)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$DecimalTreeReader.next(TreeReaderFactory.java:1264)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2004)*
>     *at 
> org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1039)*
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26453) running spark sql cli is looking for wrong path of hive.metastore.warehouse.dir

2018-12-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26453.
--
Resolution: Invalid

This looks like a question. Questions should go to the mailing list before being 
filed as an issue; you will likely get a better answer there.

> running  spark sql cli is looking for wrong path of 
> hive.metastore.warehouse.dir
> 
>
> Key: SPARK-26453
> URL: https://issues.apache.org/jira/browse/SPARK-26453
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: anubhav tarar
>Priority: Major
>
> I started the Spark SQL CLI and ran the following SQL:
> spark-sql> create table cars(make varchar(10));
> It gives me the error below:
> 2018-12-27 14:49:39 ERROR RetryingHMSHandler:159 - 
> MetaException(message:file:*/user/hive/warehouse/*cars is not a directory or 
> unable to create one)
> Note: I have not specified hive.metastore.warehouse.dir anywhere; I just 
> downloaded the latest Spark version from the official site and tried to execute 
> the SQL.
>  
> Furthermore, the metastore info log prints the right location, but looking at 
> the above error it seems that *hive.metastore.warehouse.dir* is not pointing to 
> that location:
>  
> *2018-12-27 14:49:36 INFO metastore:291 - Mestastore configuration 
> hive.metastore.warehouse.dir changed from /user/hive/warehouse to 
> file:/home/anubhav/Downloads/spark-2.4.0-bin-hadoop2.7/bin/spark-warehouse*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26453) running spark sql cli is looking for wrong path of hive.metastore.warehouse.dir

2018-12-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26453:
-
Target Version/s:   (was: 2.4.0)

> running  spark sql cli is looking for wrong path of 
> hive.metastore.warehouse.dir
> 
>
> Key: SPARK-26453
> URL: https://issues.apache.org/jira/browse/SPARK-26453
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: anubhav tarar
>Priority: Major
>
> I started the Spark SQL CLI and ran the following SQL:
> spark-sql> create table cars(make varchar(10));
> It gives me the error below:
> 2018-12-27 14:49:39 ERROR RetryingHMSHandler:159 - 
> MetaException(message:file:*/user/hive/warehouse/*cars is not a directory or 
> unable to create one)
> Note: I have not specified hive.metastore.warehouse.dir anywhere; I just 
> downloaded the latest Spark version from the official site and tried to execute 
> the SQL.
>  
> Furthermore, the metastore info log prints the right location, but looking at 
> the above error it seems that *hive.metastore.warehouse.dir* is not pointing to 
> that location:
>  
> *2018-12-27 14:49:36 INFO metastore:291 - Mestastore configuration 
> hive.metastore.warehouse.dir changed from /user/hive/warehouse to 
> file:/home/anubhav/Downloads/spark-2.4.0-bin-hadoop2.7/bin/spark-warehouse*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26191) Control number of truncated fields

2018-12-27 Thread Herman van Hovell (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-26191.
---
   Resolution: Fixed
 Assignee: Maxim Gekk
Fix Version/s: 3.0.0

> Control number of truncated fields
> --
>
> Key: SPARK-26191
> URL: https://issues.apache.org/jira/browse/SPARK-26191
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently, the threshold at which fields are truncated when converted to a string 
> can only be controlled via a global SQL config. We need to add a maxFields parameter 
> to all functions/methods that could potentially produce a truncated string from a 
> sequence of fields.
> One of the use cases is toFile. This method aims to output plans without truncation; 
> for now, users have to set the global config to dump whole plans.
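> As a rough illustration (not part of the original ticket), the sketch below shows the 
> global knob this issue wants to move away from; the config key 
> spark.sql.debug.maxToStringFields is an assumption based on recent Spark versions and 
> may differ in 2.4.
> {code}
> import org.apache.spark.sql.SparkSession
> 
> // Minimal sketch, assuming a local session; the config key is an assumption.
> val spark = SparkSession.builder()
>   .master("local[*]")
>   .config("spark.sql.debug.maxToStringFields", "1000")
>   .getOrCreate()
> 
> import spark.implicits._
> val df = Seq((1, "a")).toDF("id", "part")
> 
> // Plan strings are truncated according to the session-wide threshold above;
> // the proposal is to let callers such as toFile pass maxFields per call instead.
> df.explain(true)
> {code}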



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-25299) Use remote storage for persisting shuffle data

2018-12-27 Thread Peiyu Zhuang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729474#comment-16729474
 ] 

Peiyu Zhuang edited comment on SPARK-25299 at 12/27/18 9:32 AM:


Sure.  I just created a [SPIP in a Google 
doc|https://docs.google.com/document/d/1TA-gDw3ophy-gSu2IAW_5IMbRK_8pWBeXJwngN9YB80/edit?usp=sharing].
  Here is our [design 
document|https://docs.google.com/document/d/1kSpbBB-sDk41LeORm3-Hfr-up98Ozm5wskvB49tUhSs/edit?usp=sharing].

 


was (Author: jealous):
Sure.  I just create a [SPIP in google 
doc|[https://docs.google.com/document/d/1TA-gDw3ophy-gSu2IAW_5IMbRK_8pWBeXJwngN9YB80/edit?usp=sharing|https://docs.google.com/document/d/1TA-gDw3ophy-gSu2IAW_5IMbRK_8pWBeXJwngN9YB80/edit?usp=sharing].]].
  Here is our [design 
document|https://docs.google.com/document/d/1kSpbBB-sDk41LeORm3-Hfr-up98Ozm5wskvB49tUhSs/edit?usp=sharing].

 

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26453) running spark sql cli is looking for wrong path of hive.metastore.warehouse.dir

2018-12-27 Thread anubhav tarar (JIRA)
anubhav tarar created SPARK-26453:
-

 Summary: running  spark sql cli is looking for wrong path of 
hive.metastore.warehouse.dir
 Key: SPARK-26453
 URL: https://issues.apache.org/jira/browse/SPARK-26453
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.4.0
Reporter: anubhav tarar


I started the Spark SQL CLI and ran the following SQL:

spark-sql> create table cars(make varchar(10));

It gives me the error below:

2018-12-27 14:49:39 ERROR RetryingHMSHandler:159 - 
MetaException(message:file:*/user/hive/warehouse/*cars is not a directory or 
unable to create one)

Note: I have not specified hive.metastore.warehouse.dir anywhere; I just downloaded 
the latest Spark version from the official site and tried to execute the SQL.

 

Furthermore, the metastore INFO log prints the right location, but judging from 
the error above it seems that *hive.metastore.warehouse.dir* is not pointing to 
that location:

 

*2018-12-27 14:49:36 INFO metastore:291 - Mestastore configuration 
hive.metastore.warehouse.dir changed from /user/hive/warehouse to 
file:/home/anubhav/Downloads/spark-2.4.0-bin-hadoop2.7/bin/spark-warehouse*
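
As an illustrative note (not part of the original report), the warehouse location can be 
pinned explicitly so that Spark and the metastore agree on a single path; 
spark.sql.warehouse.dir is the Spark-side setting (it can also be passed to spark-sql via 
--conf), and the path below is only an example.

{code}
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming Hive support is available on the classpath;
// the warehouse path is an example, not a recommendation.
val spark = SparkSession.builder()
  .appName("warehouse-demo")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .enableHiveSupport()
  .getOrCreate()

// Tables created without an explicit LOCATION should now land under the
// configured warehouse directory.
spark.sql("create table cars(make varchar(10))")
{code}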



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2018-12-27 Thread Peiyu Zhuang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729474#comment-16729474
 ] 

Peiyu Zhuang commented on SPARK-25299:
--

Sure.  I just created a [SPIP in a Google 
doc|https://docs.google.com/document/d/1TA-gDw3ophy-gSu2IAW_5IMbRK_8pWBeXJwngN9YB80/edit?usp=sharing].
  Here is our [design 
document|https://docs.google.com/document/d/1kSpbBB-sDk41LeORm3-Hfr-up98Ozm5wskvB49tUhSs/edit?usp=sharing].

 

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26435) Support creating partitioned table using Hive CTAS by specifying partition column names

2018-12-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26435:
---

Assignee: Liang-Chi Hsieh

> Support creating partitioned table using Hive CTAS by specifying partition 
> column names
> ---
>
> Key: SPARK-26435
> URL: https://issues.apache.org/jira/browse/SPARK-26435
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> Spark SQL doesn't support creating a partitioned table using Hive CTAS in SQL 
> syntax. However, it is supported via the DataFrameWriter API.
> {code}
> val df = Seq(("a", 1)).toDF("part", "id")
> df.write.format("hive").partitionBy("part").saveAsTable("t")
> {code}
> Hive supports this in newer versions 
> (https://issues.apache.org/jira/browse/HIVE-20241):
> {code}
> CREATE TABLE t PARTITIONED BY (part) AS SELECT 1 as id, "a" as part
> {code}
> To match the DataFrameWriter API, we should add this support to SQL syntax.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26435) Support creating partitioned table using Hive CTAS by specifying partition column names

2018-12-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26435.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23376
[https://github.com/apache/spark/pull/23376]

> Support creating partitioned table using Hive CTAS by specifying partition 
> column names
> ---
>
> Key: SPARK-26435
> URL: https://issues.apache.org/jira/browse/SPARK-26435
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> Spark SQL doesn't support creating a partitioned table using Hive CTAS in SQL 
> syntax. However, it is supported via the DataFrameWriter API.
> {code}
> val df = Seq(("a", 1)).toDF("part", "id")
> df.write.format("hive").partitionBy("part").saveAsTable("t")
> {code}
> Hive supports this in newer versions 
> (https://issues.apache.org/jira/browse/HIVE-20241):
> {code}
> CREATE TABLE t PARTITIONED BY (part) AS SELECT 1 as id, "a" as part
> {code}
> To match the DataFrameWriter API, we should add this support to SQL syntax.
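> As a sketch only (the exact grammar Spark accepts is defined by the resolving pull 
> request, not by this note), the requested usage mirrors the Hive example above when 
> issued through spark.sql; the table name t2 is made up for illustration.
> {code}
> // Sketch of the requested SQL form, mirroring the Hive example above.
> spark.sql("""CREATE TABLE t2 PARTITIONED BY (part) AS SELECT 1 as id, "a" as part""")
> {code}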



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org