[jira] [Commented] (SPARK-25449) Don't send zero accumulators in heartbeats

2019-03-12 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791321#comment-16791321
 ] 

Xiao Li commented on SPARK-25449:
-

This changed the unit of the conf. 

> Don't send zero accumulators in heartbeats
> --
>
> Key: SPARK-25449
> URL: https://issues.apache.org/jira/browse/SPARK-25449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics 
> and are generally on the order of a few KBs. However, for large jobs with 
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks 
> to die with heartbeat failures. We can mitigate this by not sending zero 
> metrics to the driver.
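
For illustration, a minimal Scala sketch (not the actual patch) of the mitigation described above: drop accumulator updates that are still zero before they are packed into the heartbeat payload. AccumulatorV2.isZero is the public API used to detect an untouched accumulator.

{code}
// Minimal sketch, assuming the heartbeat builder has the task's accumulator
// updates in hand: keep only accumulators that carry a non-zero value, so
// idle metrics are not shipped to the driver every 10 seconds.
import org.apache.spark.util.AccumulatorV2

def nonZeroUpdates(updates: Seq[AccumulatorV2[_, _]]): Seq[AccumulatorV2[_, _]] =
  updates.filterNot(_.isZero)
{code}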






[jira] [Updated] (SPARK-25449) Don't send zero accumulators in heartbeats

2019-03-12 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25449:

Labels: release-notes  (was: )

> Don't send zero accumulators in heartbeats
> --
>
> Key: SPARK-25449
> URL: https://issues.apache.org/jira/browse/SPARK-25449
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Mukul Murthy
>Assignee: Mukul Murthy
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> Heartbeats sent from executors to the driver every 10 seconds contain metrics 
> and are generally on the order of a few KBs. However, for large jobs with 
> lots of tasks, heartbeats can be on the order of tens of MBs, causing tasks 
> to die with heartbeat failures. We can mitigate this by not sending zero 
> metrics to the driver.






[jira] [Commented] (SPARK-27141) Use ConfigEntry for hardcoded configs Yarn

2019-03-12 Thread Sandeep Katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791284#comment-16791284
 ] 

Sandeep Katta commented on SPARK-27141:
---

I will be working on this PR

> Use ConfigEntry for hardcoded configs Yarn
> --
>
> Key: SPARK-27141
> URL: https://issues.apache.org/jira/browse/SPARK-27141
> Project: Spark
>  Issue Type: Sub-task
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: wangjiaochun
>Priority: Major
> Fix For: 3.0.0
>
>
> Some of the following YARN-related files use hardcoded config strings instead 
> of ConfigEntry values; try to replace them. 
> ApplicationMaster
> YarnAllocatorSuite
> ApplicationMasterSuite
> BaseYarnClusterSuite
> YarnClusterSuite






[jira] [Assigned] (SPARK-27142) Provide REST API for SQL level information

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27142:


Assignee: Apache Spark

> Provide REST API for SQL level information
> --
>
> Key: SPARK-27142
> URL: https://issues.apache.org/jira/browse/SPARK-27142
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, SQL-level information for monitoring a Spark application is not 
> available from the REST API, only via the UI. The REST API exposes only 
> applications, jobs, stages, and environment. This Jira targets providing a 
> REST API so that SQL-level information can be retrieved as well.






[jira] [Commented] (SPARK-27143) Provide REST API for JDBC/ODBC level information

2019-03-12 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791272#comment-16791272
 ] 

Ajith S commented on SPARK-27143:
-

ping [~srowen] [~cloud_fan] [~dongjoon] 

Please suggest if this sounds reasonable

> Provide REST API for JDBC/ODBC level information
> 
>
> Key: SPARK-27143
> URL: https://issues.apache.org/jira/browse/SPARK-27143
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Priority: Minor
>
> Currently, JDBC/ODBC information for monitoring a Spark application is not 
> available from the REST API, only via the UI. The REST API exposes only 
> applications, jobs, stages, and environment. This Jira targets providing a 
> REST API so that JDBC/ODBC-level information, such as session statistics and 
> SQL statistics, can be provided.






[jira] [Commented] (SPARK-27143) Provide REST API for JDBC/ODBC level information

2019-03-12 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791270#comment-16791270
 ] 

Ajith S commented on SPARK-27143:
-

I will be working on this

> Provide REST API for JDBC/ODBC level information
> 
>
> Key: SPARK-27143
> URL: https://issues.apache.org/jira/browse/SPARK-27143
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Priority: Minor
>
> Currently, JDBC/ODBC information for monitoring a Spark application is not 
> available from the REST API, only via the UI. The REST API exposes only 
> applications, jobs, stages, and environment. This Jira targets providing a 
> REST API so that JDBC/ODBC-level information, such as session statistics and 
> SQL statistics, can be provided.






[jira] [Created] (SPARK-27143) Provide REST API for JDBC/ODBC level information

2019-03-12 Thread Ajith S (JIRA)
Ajith S created SPARK-27143:
---

 Summary: Provide REST API for JDBC/ODBC level information
 Key: SPARK-27143
 URL: https://issues.apache.org/jira/browse/SPARK-27143
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Ajith S


Currently, JDBC/ODBC information for monitoring a Spark application is not 
available from the REST API, only via the UI. The REST API exposes only 
applications, jobs, stages, and environment. This Jira targets providing a 
REST API so that JDBC/ODBC-level information, such as session statistics and 
SQL statistics, can be provided.






[jira] [Commented] (SPARK-27142) Provide REST API for SQL level information

2019-03-12 Thread Ajith S (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791269#comment-16791269
 ] 

Ajith S commented on SPARK-27142:
-

I will be working on this

> Provide REST API for SQL level information
> --
>
> Key: SPARK-27142
> URL: https://issues.apache.org/jira/browse/SPARK-27142
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Priority: Minor
>
> Currently, SQL-level information for monitoring a Spark application is not 
> available from the REST API, only via the UI. The REST API exposes only 
> applications, jobs, stages, and environment. This Jira targets providing a 
> REST API so that SQL-level information can be retrieved as well.






[jira] [Assigned] (SPARK-27142) Provide REST API for SQL level information

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27142:


Assignee: (was: Apache Spark)

> Provide REST API for SQL level information
> --
>
> Key: SPARK-27142
> URL: https://issues.apache.org/jira/browse/SPARK-27142
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ajith S
>Priority: Minor
>
> Currently, SQL-level information for monitoring a Spark application is not 
> available from the REST API, only via the UI. The REST API exposes only 
> applications, jobs, stages, and environment. This Jira targets providing a 
> REST API so that SQL-level information can be retrieved as well.






[jira] [Created] (SPARK-27142) Provide REST API for SQL level information

2019-03-12 Thread Ajith S (JIRA)
Ajith S created SPARK-27142:
---

 Summary: Provide REST API for SQL level information
 Key: SPARK-27142
 URL: https://issues.apache.org/jira/browse/SPARK-27142
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ajith S


Currently, SQL-level information for monitoring a Spark application is not 
available from the REST API, only via the UI. The REST API exposes only 
applications, jobs, stages, and environment. This Jira targets providing a 
REST API so that SQL-level information can be retrieved as well.
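
For context, a hedged Scala sketch of how such an endpoint could be consumed alongside the monitoring endpoints that already exist; the /sql path and the application id below are hypothetical, since this ticket only proposes the API.

{code}
import scala.io.Source

// Placeholder application id; in practice it comes from /api/v1/applications.
val appId = "app-20190312120000-0001"
val base  = s"http://localhost:4040/api/v1/applications/$appId"

// Existing monitoring endpoints (applications, jobs, stages, environment):
val jobsJson = Source.fromURL(s"$base/jobs").mkString

// Proposed, hypothetical endpoint for SQL-level information:
val sqlJson = Source.fromURL(s"$base/sql").mkString
println(sqlJson)
{code}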






[jira] [Created] (SPARK-27141) Use ConfigEntry for hardcoded configs Yarn

2019-03-12 Thread wangjiaochun (JIRA)
wangjiaochun created SPARK-27141:


 Summary: Use ConfigEntry for hardcoded configs Yarn
 Key: SPARK-27141
 URL: https://issues.apache.org/jira/browse/SPARK-27141
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Affects Versions: 3.0.0
Reporter: wangjiaochun
 Fix For: 3.0.0


Some of the following YARN-related files use hardcoded config strings instead 
of ConfigEntry values; try to replace them (see the sketch after this list). 

ApplicationMaster

YarnAllocatorSuite

ApplicationMasterSuite

BaseYarnClusterSuite

YarnClusterSuite
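
As a reference for the change, a sketch of the ConfigEntry pattern used inside Spark; the key name below is hypothetical, the real entries belong in the YARN module's internal config object, and ConfigBuilder is Spark-internal API (it only compiles inside the Spark codebase).

{code}
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical key, shown only to illustrate replacing a hardcoded string
// such as "spark.yarn.exampleStagingDir" with a typed ConfigEntry.
val EXAMPLE_STAGING_DIR = ConfigBuilder("spark.yarn.exampleStagingDir")
  .doc("Illustrative entry: a staging directory read through a ConfigEntry.")
  .stringConf
  .createWithDefault("/tmp/staging")

// Inside Spark, the value is then read as sparkConf.get(EXAMPLE_STAGING_DIR)
// instead of sparkConf.get("spark.yarn.exampleStagingDir", "/tmp/staging").
{code}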






[jira] [Updated] (SPARK-26936) On yarn-client mode, insert overwrite local directory can not create temporary path in local staging directory

2019-03-12 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-26936:
---
Summary: On yarn-client mode, insert overwrite local directory can not 
create temporary path in local staging directory  (was: insert overwrite local 
directory can not create temporary path in local staging directory)

> On yarn-client mode, insert overwrite local directory can not create 
> temporary path in local staging directory
> --
>
> Key: SPARK-26936
> URL: https://issues.apache.org/jira/browse/SPARK-26936
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> Let me introduce a bug in 'insert overwrite local directory'.
> If I execute such a SQL statement, a HiveException appears as follows:
> {code:java}
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> at scala.Option.foreach(Option.scala:257)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2037)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
> ... 36 more
> Caused by: org.apache.spark.SparkException: Task failed while writing rows.
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: Mkdirs failed to create 
> file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-1/_temporary/0/_temporary/attempt_20190219173233_0002_m_00_3
>  (exists=false, 
> cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_11)
> at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
> at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
> at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
> at 
> 

[jira] [Updated] (SPARK-26936) insert overwrite local directory can not create temporary path in local staging directory

2019-03-12 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-26936:
---
Description: 
Let me introduce a bug in 'insert overwrite local directory'.

If I execute such a SQL statement, a HiveException appears as follows:
{code:java}
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1599)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1587)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1586)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2037)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:194)
... 36 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
java.io.IOException: Mkdirs failed to create 
file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-1/_temporary/0/_temporary/attempt_20190219173233_0002_m_00_3
 (exists=false, 
cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_11)
at 
org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
at 
org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
... 8 more
Caused by: java.io.IOException: Mkdirs failed to create 
file:/home/xitong/hive/stagingdir_hive_2019-02-19_17-31-00_678_1816816774691551856-1/-ext-1/_temporary/0/_temporary/attempt_20190219173233_0002_m_00_3
 (exists=false, 
cwd=file:/data10/yarn/nm-local-dir/usercache/xitong/appcache/application_1543893582405_6126857/container_e124_1543893582405_6126857_01_11)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:449)
at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:435)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:928)
at 

[jira] [Assigned] (SPARK-27140) The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment.

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27140:


Assignee: (was: Apache Spark)

> The feature is 'insert overwrite local directory' has an inconsistent 
> behavior in different environment.
> 
>
> Key: SPARK-27140
> URL: https://issues.apache.org/jira/browse/SPARK-27140
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> In local[*] mode, maropu gives a test case as follows:
> {code:java}
> $ls /tmp/noexistdir
> ls: /tmp/noexistdir: No such file or directory
> scala> sql("""create table t(c0 int, c1 int)""")
> scala> spark.table("t").explain
> == Physical Plan ==
> Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]
> scala> sql("""insert into t values(1, 1)""")
> scala> sql("""select * from t""").show
> +---+---+
> | c0| c1|
> +---+---+
> |  1|  1|
> +---+---+
> scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
> from t""")
> $ls /tmp/noexistdir/t/
> _SUCCESS  part-0-bbea4213-071a-49b4-aac8-8510e7263d45-c000
> {code}
> This test case proves Spark will create the non-existent path and move the 
> intermediate result from the local temporary path to the created path. This 
> test is based on the newest master.
> I followed the test case provided by maropu, but found another behavior.
>  I ran the SQL maropu provided in local[*] deploy mode based on 2.3.0.
>  Inconsistent behavior appears as follows:
> {code:java}
> ls /tmp/noexistdir
> ls: cannot access /tmp/noexistdir: No such file or directory
> scala> sql("""create table t(c0 int, c1 int)""")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.table("t").explain
> == Physical Plan ==
> HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]
> scala> sql("""insert into t values(1, 1)""")
> scala> sql("""select * from t""").show
> +---+---+ 
>   
> | c0| c1|
> +---+---+
> |  1|  1|
> +---+---+
> scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
> from t""")
> res1: org.apache.spark.sql.DataFrame = [] 
> ls /tmp/noexistdir/t/
> /tmp/noexistdir/t
> vi /tmp/noexistdir/t
>   1 
> {code}
> Then I pulled the master branch, compiled it, and deployed it on my Hadoop 
> cluster. I got the inconsistent behavior again. The Spark version under test 
> is 3.0.0.
> {code:java}
> ls /tmp/noexistdir
> ls: cannot access /tmp/noexistdir: No such file or directory
> Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector 
> with the Serial old collector is deprecated and will likely be removed in a 
> future release
> Spark context Web UI available at http://10.198.66.204:55326
> Spark context available as 'sc' (master = local[*], app id = 
> local-1551259036573).
> Spark session available as 'spark'.
> Welcome to spark version 3.0.0-SNAPSHOT
> Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("""select * from t""").show
> +---+---+ 
>   
> | c0| c1|
> +---+---+
> |  1|  1|
> +---+---+
> scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
> from t""")
> res1: org.apache.spark.sql.DataFrame = [] 
>   
> scala> 
> ll /tmp/noexistdir/t
> -rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t
> vi /tmp/noexistdir/t
>   1
> {code}
> The path /tmp/noexistdir/t is a file as well.
> I created a PR (https://github.com/apache/spark/pull/23950) to test the 
> behavior with a UT.
> The UT results are the same as those of maropu's test, but different from mine.
>  






[jira] [Assigned] (SPARK-27140) The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment.

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27140:


Assignee: Apache Spark

> The feature is 'insert overwrite local directory' has an inconsistent 
> behavior in different environment.
> 
>
> Key: SPARK-27140
> URL: https://issues.apache.org/jira/browse/SPARK-27140
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> In local[*] mode, maropu gives a test case as follows:
> {code:java}
> $ls /tmp/noexistdir
> ls: /tmp/noexistdir: No such file or directory
> scala> sql("""create table t(c0 int, c1 int)""")
> scala> spark.table("t").explain
> == Physical Plan ==
> Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]
> scala> sql("""insert into t values(1, 1)""")
> scala> sql("""select * from t""").show
> +---+---+
> | c0| c1|
> +---+---+
> |  1|  1|
> +---+---+
> scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
> from t""")
> $ls /tmp/noexistdir/t/
> _SUCCESS  part-0-bbea4213-071a-49b4-aac8-8510e7263d45-c000
> {code}
> This test case proves Spark will create the non-existent path and move the 
> intermediate result from the local temporary path to the created path. This 
> test is based on the newest master.
> I followed the test case provided by maropu, but found another behavior.
>  I ran the SQL maropu provided in local[*] deploy mode based on 2.3.0.
>  Inconsistent behavior appears as follows:
> {code:java}
> ls /tmp/noexistdir
> ls: cannot access /tmp/noexistdir: No such file or directory
> scala> sql("""create table t(c0 int, c1 int)""")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.table("t").explain
> == Physical Plan ==
> HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]
> scala> sql("""insert into t values(1, 1)""")
> scala> sql("""select * from t""").show
> +---+---+ 
>   
> | c0| c1|
> +---+---+
> |  1|  1|
> +---+---+
> scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
> from t""")
> res1: org.apache.spark.sql.DataFrame = [] 
> ls /tmp/noexistdir/t/
> /tmp/noexistdir/t
> vi /tmp/noexistdir/t
>   1 
> {code}
> Then I pulled the master branch, compiled it, and deployed it on my Hadoop 
> cluster. I got the inconsistent behavior again. The Spark version under test 
> is 3.0.0.
> {code:java}
> ls /tmp/noexistdir
> ls: cannot access /tmp/noexistdir: No such file or directory
> Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector 
> with the Serial old collector is deprecated and will likely be removed in a 
> future release
> Spark context Web UI available at http://10.198.66.204:55326
> Spark context available as 'sc' (master = local[*], app id = 
> local-1551259036573).
> Spark session available as 'spark'.
> Welcome to spark version 3.0.0-SNAPSHOT
> Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("""select * from t""").show
> +---+---+ 
>   
> | c0| c1|
> +---+---+
> |  1|  1|
> +---+---+
> scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
> from t""")
> res1: org.apache.spark.sql.DataFrame = [] 
>   
> scala> 
> ll /tmp/noexistdir/t
> -rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t
> vi /tmp/noexistdir/t
>   1
> {code}
> The path /tmp/noexistdir/t is a file as well.
> I created a PR (https://github.com/apache/spark/pull/23950) to test the 
> behavior with a UT.
> The UT results are the same as those of maropu's test, but different from mine.
>  






[jira] [Commented] (SPARK-27140) The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment.

2019-03-12 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791236#comment-16791236
 ] 

Apache Spark commented on SPARK-27140:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/23950

> The feature is 'insert overwrite local directory' has an inconsistent 
> behavior in different environment.
> 
>
> Key: SPARK-27140
> URL: https://issues.apache.org/jira/browse/SPARK-27140
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0, 3.0.0
>Reporter: jiaan.geng
>Priority: Major
>
> In local[*] mode, maropu gives a test case as follows:
> {code:java}
> $ls /tmp/noexistdir
> ls: /tmp/noexistdir: No such file or directory
> scala> sql("""create table t(c0 int, c1 int)""")
> scala> spark.table("t").explain
> == Physical Plan ==
> Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]
> scala> sql("""insert into t values(1, 1)""")
> scala> sql("""select * from t""").show
> +---+---+
> | c0| c1|
> +---+---+
> |  1|  1|
> +---+---+
> scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
> from t""")
> $ls /tmp/noexistdir/t/
> _SUCCESS  part-0-bbea4213-071a-49b4-aac8-8510e7263d45-c000
> {code}
> This test case proves Spark will create the non-existent path and move the 
> intermediate result from the local temporary path to the created path. This 
> test is based on the newest master.
> I followed the test case provided by maropu, but found another behavior.
>  I ran the SQL maropu provided in local[*] deploy mode based on 2.3.0.
>  Inconsistent behavior appears as follows:
> {code:java}
> ls /tmp/noexistdir
> ls: cannot access /tmp/noexistdir: No such file or directory
> scala> sql("""create table t(c0 int, c1 int)""")
> res0: org.apache.spark.sql.DataFrame = []
> scala> spark.table("t").explain
> == Physical Plan ==
> HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]
> scala> sql("""insert into t values(1, 1)""")
> scala> sql("""select * from t""").show
> +---+---+ 
>   
> | c0| c1|
> +---+---+
> |  1|  1|
> +---+---+
> scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
> from t""")
> res1: org.apache.spark.sql.DataFrame = [] 
> ls /tmp/noexistdir/t/
> /tmp/noexistdir/t
> vi /tmp/noexistdir/t
>   1 
> {code}
> Then I pulled the master branch, compiled it, and deployed it on my Hadoop 
> cluster. I got the inconsistent behavior again. The Spark version under test 
> is 3.0.0.
> {code:java}
> ls /tmp/noexistdir
> ls: cannot access /tmp/noexistdir: No such file or directory
> Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector 
> with the Serial old collector is deprecated and will likely be removed in a 
> future release
> Spark context Web UI available at http://10.198.66.204:55326
> Spark context available as 'sc' (master = local[*], app id = 
> local-1551259036573).
> Spark session available as 'spark'.
> Welcome to spark version 3.0.0-SNAPSHOT
> Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("""select * from t""").show
> +---+---+ 
>   
> | c0| c1|
> +---+---+
> |  1|  1|
> +---+---+
> scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
> from t""")
> res1: org.apache.spark.sql.DataFrame = [] 
>   
> scala> 
> ll /tmp/noexistdir/t
> -rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t
> vi /tmp/noexistdir/t
>   1
> {code}
> The path /tmp/noexistdir/t is a file as well.
> I created a PR (https://github.com/apache/spark/pull/23950) to test the 
> behavior with a UT.
> The UT results are the same as those of maropu's test, but different from mine.
>  






[jira] [Updated] (SPARK-27140) The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment.

2019-03-12 Thread jiaan.geng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-27140:
---
Description: 
In local[*] mode, maropu gives a test case as follows:
{code:java}
$ls /tmp/noexistdir
ls: /tmp/noexistdir: No such file or directory

scala> sql("""create table t(c0 int, c1 int)""")
scala> spark.table("t").explain
== Physical Plan ==
Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]

scala> sql("""insert into t values(1, 1)""")
scala> sql("""select * from t""").show
+---+---+
| c0| c1|
+---+---+
|  1|  1|
+---+---+

scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
from t""")

$ls /tmp/noexistdir/t/
_SUCCESS  part-0-bbea4213-071a-49b4-aac8-8510e7263d45-c000
{code}
This test case proves Spark will create the non-existent path and move the 
intermediate result from the local temporary path to the created path. This 
test is based on the newest master.

I followed the test case provided by maropu, but found another behavior.
 I ran the SQL maropu provided in local[*] deploy mode based on 2.3.0.
 Inconsistent behavior appears as follows:
{code:java}
ls /tmp/noexistdir
ls: cannot access /tmp/noexistdir: No such file or directory

scala> sql("""create table t(c0 int, c1 int)""")
res0: org.apache.spark.sql.DataFrame = []
scala> spark.table("t").explain
== Physical Plan ==
HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]

scala> sql("""insert into t values(1, 1)""")
scala> sql("""select * from t""").show
+---+---+   
| c0| c1|
+---+---+
|  1|  1|
+---+---+

scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
from t""")
res1: org.apache.spark.sql.DataFrame = [] 

ls /tmp/noexistdir/t/
/tmp/noexistdir/t

vi /tmp/noexistdir/t
  1 
{code}
Then I pulled the master branch, compiled it, and deployed it on my Hadoop 
cluster. I got the inconsistent behavior again. The Spark version under test 
is 3.0.0.
{code:java}
ls /tmp/noexistdir
ls: cannot access /tmp/noexistdir: No such file or directory
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector 
with the Serial old collector is deprecated and will likely be removed in a 
future release
Spark context Web UI available at http://10.198.66.204:55326
Spark context available as 'sc' (master = local[*], app id = 
local-1551259036573).
Spark session available as 'spark'.
Welcome to spark version 3.0.0-SNAPSHOT
Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("""select * from t""").show
+---+---+   
| c0| c1|
+---+---+
|  1|  1|
+---+---+


scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
from t""")
res1: org.apache.spark.sql.DataFrame = []   

scala> 
ll /tmp/noexistdir/t
-rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t
vi /tmp/noexistdir/t
  1
{code}
The path /tmp/noexistdir/t is a file as well.

I created a PR (https://github.com/apache/spark/pull/23950) to test the 
behavior with a UT.

The UT results are the same as those of maropu's test, but different from mine.

 

  was:
In local[*] mode, maropu gives a test case as follows:
{code:java}
$ls /tmp/noexistdir
ls: /tmp/noexistdir: No such file or directory

scala> sql("""create table t(c0 int, c1 int)""")
scala> spark.table("t").explain
== Physical Plan ==
Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]

scala> sql("""insert into t values(1, 1)""")
scala> sql("""select * from t""").show
+---+---+
| c0| c1|
+---+---+
|  1|  1|
+---+---+

scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
from t""")

$ls /tmp/noexistdir/t/
_SUCCESS  part-0-bbea4213-071a-49b4-aac8-8510e7263d45-c000
{code}
This test case proves Spark will create the non-existent path and move the 
intermediate result from the local temporary path to the created path. This 
test is based on the newest master.

I followed the test case provided by maropu, but found another behavior.
I ran the SQL maropu provided in local[*] deploy mode based on 2.3.0.
Inconsistent behavior appears as follows:
{code:java}
ls /tmp/noexistdir
ls: cannot access /tmp/noexistdir: No such file or directory

scala> sql("""create table t(c0 int, c1 int)""")
res0: org.apache.spark.sql.DataFrame = []
scala> spark.table("t").explain
== Physical Plan ==
HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]

scala> sql("""insert into t values(1, 1)""")
scala> sql("""select * from t""").show
+---+---+   

[jira] [Resolved] (SPARK-26976) Forbid reserved keywords as identifiers when ANSI mode is on

2019-03-12 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-26976.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Resolved by https://github.com/apache/spark/pull/23880

> Forbid reserved keywords as identifiers when ANSI mode is on
> 
>
> Key: SPARK-26976
> URL: https://issues.apache.org/jira/browse/SPARK-26976
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.0.0
>
>
> We need to throw an exception to forbid reserved keywords as identifiers when 
> ANSI mode is on.
> This is a follow-up of SPARK-26215.
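
For illustration, a hedged spark-shell sketch of the intended behavior; the config key below is an assumption for this point in time (SPARK-26215 introduced an ANSI parser flag that was later renamed), so treat the exact name as illustrative rather than authoritative.

{code}
// Sketch of the expected behavior with ANSI mode on (config key assumed;
// around this time it was spark.sql.parser.ansi.enabled, renamed later):
spark.conf.set("spark.sql.parser.ansi.enabled", "true")

// Using the reserved keyword SELECT as a column name should now be rejected
// by the parser with a ParseException instead of being silently accepted.
spark.sql("CREATE TABLE demo_t (select INT) USING parquet")
{code}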






[jira] [Created] (SPARK-27140) The feature is 'insert overwrite local directory' has an inconsistent behavior in different environment.

2019-03-12 Thread jiaan.geng (JIRA)
jiaan.geng created SPARK-27140:
--

 Summary: The feature is 'insert overwrite local directory' has an 
inconsistent behavior in different environment.
 Key: SPARK-27140
 URL: https://issues.apache.org/jira/browse/SPARK-27140
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.0, 3.0.0
Reporter: jiaan.geng


In local[*] mode, maropu gives a test case as follows:
{code:java}
$ls /tmp/noexistdir
ls: /tmp/noexistdir: No such file or directory

scala> sql("""create table t(c0 int, c1 int)""")
scala> spark.table("t").explain
== Physical Plan ==
Scan hive default.t [c0#5, c1#6], HiveTableRelation `default`.`t`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]

scala> sql("""insert into t values(1, 1)""")
scala> sql("""select * from t""").show
+---+---+
| c0| c1|
+---+---+
|  1|  1|
+---+---+

scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
from t""")

$ls /tmp/noexistdir/t/
_SUCCESS  part-0-bbea4213-071a-49b4-aac8-8510e7263d45-c000
{code}
This test case proves Spark will create the non-existent path and move the 
intermediate result from the local temporary path to the created path. This 
test is based on the newest master.

I followed the test case provided by maropu, but found another behavior.
I ran the SQL maropu provided in local[*] deploy mode based on 2.3.0.
Inconsistent behavior appears as follows:
{code:java}
ls /tmp/noexistdir
ls: cannot access /tmp/noexistdir: No such file or directory

scala> sql("""create table t(c0 int, c1 int)""")
res0: org.apache.spark.sql.DataFrame = []
scala> spark.table("t").explain
== Physical Plan ==
HiveTableScan [c0#5, c1#6], HiveTableRelation `default`.`t`, 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c0#5, c1#6]

scala> sql("""insert into t values(1, 1)""")
scala> sql("""select * from t""").show
+---+---+   
| c0| c1|
+---+---+
|  1|  1|
+---+---+

scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
from t""")
res1: org.apache.spark.sql.DataFrame = [] 

ls /tmp/noexistdir/t/
/tmp/noexistdir/t

vi /tmp/noexistdir/t
  1 
{code}
Then I pulled the master branch, compiled it, and deployed it on my Hadoop 
cluster. I got the inconsistent behavior again. The Spark version under test 
is 3.0.0.
{code:java}
ls /tmp/noexistdir
ls: cannot access /tmp/noexistdir: No such file or directory
Java HotSpot(TM) 64-Bit Server VM warning: Using the ParNew young collector 
with the Serial old collector is deprecated and will likely be removed in a 
future release
Spark context Web UI available at http://10.198.66.204:55326
Spark context available as 'sc' (master = local[*], app id = 
local-1551259036573).
Spark session available as 'spark'.
Welcome to spark version 3.0.0-SNAPSHOT
Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_131)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("""select * from t""").show
+---+---+   
| c0| c1|
+---+---+
|  1|  1|
+---+---+


scala> sql("""insert overwrite local directory '/tmp/noexistdir/t' select * 
from t""")
res1: org.apache.spark.sql.DataFrame = []   

scala> 
ll /tmp/noexistdir/t
-rw-r--r-- 1 xitong xitong 0 Feb 27 17:19 /tmp/noexistdir/t
vi /tmp/noexistdir/t
  1
{code}
The path /tmp/noexistdir/t is a file as well.

So






[jira] [Resolved] (SPARK-27137) Spark captured variable is null if the code is pasted via :paste

2019-03-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27137.
--
Resolution: Cannot Reproduce

This does not seem to fail in the current master:

{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)

val foo = "foo"

def f(arg: Any): Unit = {
  Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo"))
}

sc.parallelize(Seq(1, 2), 2).foreach(f)

// Exiting paste mode, now interpreting.

foo: String = foo
f: (arg: Any)Unit
{code}

It would be great if we could identify the JIRA that fixed this and backport it if applicable.

> Spark captured variable is null if the code is pasted via :paste
> 
>
> Key: SPARK-27137
> URL: https://issues.apache.org/jira/browse/SPARK-27137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Osira Ben
>Priority: Major
>
> If I execute this piece of code
> {code:java}
> val foo = "foo"
> def f(arg: Any): Unit = {
>   Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo"))
> }
> sc.parallelize(Seq(1, 2), 2).foreach(f)
> {code}
> In spark2-shell via :paste, it throws:
> {code:java}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val foo = "foo"
> def f(arg: Any): Unit = {
>   Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo"))
> }
> sc.parallelize(Seq(1, 2), 2).foreach(f)
> // Exiting paste mode, now interpreting.
> 19/03/11 15:02:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 
> (TID 2, hadoop.company.com, executor 1): java.lang.NullPointerException: foo
> at java.util.Objects.requireNonNull(Objects.java:228)
> {code}
> However, if I paste the code without :paste or run it via spark2-shell -i, it 
> doesn't.






[jira] [Commented] (SPARK-27131) Merge function in QuantileSummaries

2019-03-12 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791217#comment-16791217
 ] 

Hyukjin Kwon commented on SPARK-27131:
--

Let's ask questions like this on the mailing list; you could get a better and 
quicker answer (see https://spark.apache.org/community.html).

> Merge function in QuantileSummaries
> ---
>
> Key: SPARK-27131
> URL: https://issues.apache.org/jira/browse/SPARK-27131
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mingchao Tan
>Priority: Minor
>
> In the QuantileSummaries.scala file, at line 167:
> this function merges two QuantileSummaries into one. You merge the two sampled 
> arrays and then compress the result. My question is: when compressing the 
> merged array, why do you use count instead of the sum of count and other.count?






[jira] [Resolved] (SPARK-27131) Merge function in QuantileSummaries

2019-03-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27131.
--
Resolution: Invalid

> Merge function in QuantileSummaries
> ---
>
> Key: SPARK-27131
> URL: https://issues.apache.org/jira/browse/SPARK-27131
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Mingchao Tan
>Priority: Minor
>
> In the QuantileSummaries.scala file, at line 167:
> this function merges two QuantileSummaries into one. You merge the two sampled 
> arrays and then compress the result. My question is: when compressing the 
> merged array, why do you use count instead of the sum of count and other.count?






[jira] [Resolved] (SPARK-27045) SQL tab in UI shows actual SQL instead of callsite

2019-03-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27045.
---
   Resolution: Fixed
 Assignee: Ajith S
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23958

> SQL tab in UI shows actual SQL instead of callsite
> --
>
> Key: SPARK-27045
> URL: https://issues.apache.org/jira/browse/SPARK-27045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.2, 2.3.3, 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-03-04-18-24-27-469.png, 
> image-2019-03-04-18-24-54-053.png
>
>
> When we run SQL in Spark (for example via the Thrift server), the Spark UI SQL 
> tab should show the SQL instead of the stacktrace, which is more useful to the 
> end user. Instead, the description column currently shows the callsite short 
> form, which is less useful.
>  Actual:
> !image-2019-03-04-18-24-27-469.png!
>  
> Expected:
> !image-2019-03-04-18-24-54-053.png!






[jira] [Updated] (SPARK-27045) SQL tab in UI shows actual SQL instead of callsite in case of SparkSQLDriver

2019-03-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27045:
--
Description: 
When we run SQL in Spark via SparkSQLDriver (Thrift server, spark-sql), the SQL 
string is set via {{setJobDescription}}. The Spark UI SQL tab should show the 
SQL instead of the stacktrace when {{setJobDescription}} is set, which is more 
useful to the end user. Instead, the description column currently shows the 
callsite short form, which is less useful.

 Actual:

!image-2019-03-04-18-24-27-469.png!

 

Expected:

!image-2019-03-04-18-24-54-053.png!

  was:
When we run SQL in Spark (for example via the Thrift server), the Spark UI SQL 
tab should show the SQL instead of the stacktrace, which is more useful to the 
end user. Instead, the description column currently shows the callsite short 
form, which is less useful.

 Actual:

!image-2019-03-04-18-24-27-469.png!

 

Expected:

!image-2019-03-04-18-24-54-053.png!


> SQL tab in UI shows actual SQL instead of callsite in case of SparkSQLDriver
> 
>
> Key: SPARK-27045
> URL: https://issues.apache.org/jira/browse/SPARK-27045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.2, 2.3.3, 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-03-04-18-24-27-469.png, 
> image-2019-03-04-18-24-54-053.png
>
>
> When we run SQL in Spark via SparkSQLDriver (Thrift server, spark-sql), the SQL 
> string is set via {{setJobDescription}}. The Spark UI SQL tab should show the 
> SQL instead of the stacktrace when {{setJobDescription}} is set, which is more 
> useful to the end user. Instead, the description column currently shows the 
> callsite short form, which is less useful.
>  Actual:
> !image-2019-03-04-18-24-27-469.png!
>  
> Expected:
> !image-2019-03-04-18-24-54-053.png!
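
For illustration, a small spark-shell sketch of the mechanism described above: the string passed to {{setJobDescription}} (a placeholder query and table name here) is what the UI picks up as the description, which this change surfaces in the SQL tab instead of the callsite short form.

{code}
// Placeholder query/table; the point is only that the description string set
// here is what the UI shows instead of the callsite short form.
sc.setJobDescription("SELECT count(*) FROM t")
spark.sql("SELECT count(*) FROM t").collect()
{code}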






[jira] [Updated] (SPARK-27045) SQL tab in UI shows actual SQL instead of callsite in case of SparkSQLDriver

2019-03-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27045:
--
Summary: SQL tab in UI shows actual SQL instead of callsite in case of 
SparkSQLDriver  (was: SQL tab in UI shows actual SQL instead of callsite)

> SQL tab in UI shows actual SQL instead of callsite in case of SparkSQLDriver
> 
>
> Key: SPARK-27045
> URL: https://issues.apache.org/jira/browse/SPARK-27045
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 2.3.2, 2.3.3, 3.0.0
>Reporter: Ajith S
>Assignee: Ajith S
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-03-04-18-24-27-469.png, 
> image-2019-03-04-18-24-54-053.png
>
>
> When we run SQL in Spark (for example via the Thrift server), the Spark UI SQL 
> tab should show the SQL instead of the stacktrace, which is more useful to the 
> end user. Instead, the description column currently shows the callsite short 
> form, which is less useful.
>  Actual:
> !image-2019-03-04-18-24-27-469.png!
>  
> Expected:
> !image-2019-03-04-18-24-54-053.png!






[jira] [Resolved] (SPARK-27130) Automatically select profile when executing sbt-checkstyle

2019-03-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-27130.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Fixed in https://github.com/apache/spark/pull/24065

> Automatically select profile when executing sbt-checkstyle
> --
>
> Key: SPARK-27130
> URL: https://issues.apache.org/jira/browse/SPARK-27130
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
> Fix For: 3.0.0
>
>







[jira] [Assigned] (SPARK-27130) Automatically select profile when executing sbt-checkstyle

2019-03-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27130:


Assignee: Hyukjin Kwon

> Automatically select profile when executing sbt-checkstyle
> --
>
> Key: SPARK-27130
> URL: https://issues.apache.org/jira/browse/SPARK-27130
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Hyukjin Kwon
>Priority: Minor
>







[jira] [Assigned] (SPARK-27130) Automatically select profile when executing sbt-checkstyle

2019-03-12 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-27130:


Assignee: Yuming Wang  (was: Hyukjin Kwon)

> Automatically select profile when executing sbt-checkstyle
> --
>
> Key: SPARK-27130
> URL: https://issues.apache.org/jira/browse/SPARK-27130
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Minor
>







[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-12 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791075#comment-16791075
 ] 

Dongjoon Hyun commented on SPARK-27107:
---

Thank you for the confirmation, [~Dhruve Ashar].

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> 
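
A possible mitigation while waiting for a fixed ORC release is to disable ORC 
filter pushdown, since the overflow is hit while serializing the pushed-down 
filters into the SearchArgument. A minimal sketch, assuming losing predicate 
pushdown is acceptable and a spark-shell session (the path is hypothetical):

{code:scala}
// Turning off ORC predicate pushdown avoids building/serializing a
// SearchArgument for the scan, at the cost of losing filter pushdown.
spark.conf.set("spark.sql.orc.filterPushdown", "false")

// Hypothetical read; the filter is now evaluated in Spark instead of ORC.
val df = spark.read.orc("/path/to/orc/data").filter("id > 100")
df.count()
{code}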

[jira] [Resolved] (SPARK-27034) Nested schema pruning for ORC

2019-03-12 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-27034.
---
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23943

> Nested schema pruning for ORC
> -
>
> Key: SPARK-27034
> URL: https://issues.apache.org/jira/browse/SPARK-27034
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 3.0.0
>
>
> We only support nested schema pruning for Parquet currently. This is opened 
> to propose to support nested schema pruning for ORC too.
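
For reference, a minimal sketch of what this looks like from the user side, 
assuming the existing Parquet flag (spark.sql.optimizer.nestedSchemaPruning.enabled) 
also governs ORC once this lands; run in a spark-shell session with a 
hypothetical path and struct column:

{code:scala}
spark.conf.set("spark.sql.optimizer.nestedSchemaPruning.enabled", "true")

// With pruning, selecting only `contact.name` should read just that nested
// field instead of the whole `contact` struct (check ReadSchema in the plan).
val contacts = spark.read.orc("/path/to/contacts")
contacts.select("contact.name").explain()
{code}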



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26176) Verify column name when creating table via `STORED AS`

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26176:


Assignee: Apache Spark

> Verify column name when creating table via `STORED AS`
> --
>
> Key: SPARK-26176
> URL: https://issues.apache.org/jira/browse/SPARK-26176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>  Labels: starter
>
> We can issue a reasonable exception when we creating Parquet native tables, 
> {code:java}
> CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> {code}
> However, the error messages are misleading when we create a table using the 
> Hive serde "STORED AS"
> {code:java}
> CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST 
> stored as parquet AS SELECT COUNT(col1) FROM TAB1]
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
>   at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:201)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 
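
Until the Hive serde path validates column names up front, a simple workaround 
is to alias the aggregate so the generated column name contains no invalid 
characters. A minimal sketch (spark-shell session, using the TAB1 table from 
the report):

{code:scala}
// Aliasing COUNT(ID) avoids the invalid "count(ID)" column name entirely.
spark.sql(
  """CREATE TABLE TAB1TEST STORED AS PARQUET AS
    |SELECT COUNT(ID) AS id_count FROM TAB1""".stripMargin)
{code}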

[jira] [Assigned] (SPARK-26176) Verify column name when creating table via `STORED AS`

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26176:


Assignee: (was: Apache Spark)

> Verify column name when creating table via `STORED AS`
> --
>
> Key: SPARK-26176
> URL: https://issues.apache.org/jira/browse/SPARK-26176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> We can issue a reasonable exception when we creating Parquet native tables, 
> {code:java}
> CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> {code}
> However, the error messages are misleading when we create a table using the 
> Hive serde "STORED AS"
> {code:java}
> CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST 
> stored as parquet AS SELECT COUNT(col1) FROM TAB1]
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
>   at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:201)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0 in stage 3.0 failed 1 times, most 

[jira] [Commented] (SPARK-26176) Verify column name when creating table via `STORED AS`

2019-03-12 Thread Sujith Chacko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16791060#comment-16791060
 ] 

Sujith Chacko commented on SPARK-26176:
---

The issue is still happening with the latest Spark 2.4 version. I have fixed it and raised a PR.

> Verify column name when creating table via `STORED AS`
> --
>
> Key: SPARK-26176
> URL: https://issues.apache.org/jira/browse/SPARK-26176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>  Labels: starter
>
> We can issue a reasonable exception when we creating Parquet native tables, 
> {code:java}
> CREATE TABLE TAB1TEST USING PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: Attribute name "count(ID)" contains 
> invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
> {code}
> However, the error messages are misleading when we create a table using the 
> Hive serde "STORED AS"
> {code:java}
> CREATE TABLE TAB1TEST STORED AS PARQUET AS SELECT COUNT(ID) FROM TAB1;
> {code}
> {code:java}
> 18/11/26 09:04:44 ERROR SparkSQLDriver: Failed in [CREATE TABLE TAB2TEST 
> stored as parquet AS SELECT COUNT(col1) FROM TAB1]
> org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile(SaveAsHiveFile.scala:97)
>   at 
> org.apache.spark.sql.hive.execution.SaveAsHiveFile.saveAsHiveFile$(SaveAsHiveFile.scala:48)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.saveAsHiveFile(InsertIntoHiveTable.scala:66)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:201)
>   at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:99)
>   at 
> org.apache.spark.sql.hive.execution.CreateHiveTableAsSelectCommand.run(CreateHiveTableAsSelectCommand.scala:86)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>   at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:113)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:201)
>   at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3270)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3266)
>   at org.apache.spark.sql.Dataset.(Dataset.scala:201)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:655)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:685)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:62)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:371)
>   at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:274)
>   at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>   at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:852)
>   at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
>   at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
>   at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>   at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:927)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:936)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: org.apache.spark.SparkException: 

[jira] [Assigned] (SPARK-27134) array_distinct function does not work correctly with columns containing array of array

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27134:


Assignee: Apache Spark

> array_distinct function does not work correctly with columns containing array 
> of array
> --
>
> Key: SPARK-27134
> URL: https://issues.apache.org/jira/browse/SPARK-27134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4, scala 2.11.11
>Reporter: Mike Trenaman
>Assignee: Apache Spark
>Priority: Major
>
> The array_distinct function introduced in spark 2.4 is producing strange 
> results when used on an array column which contains a nested array. The 
> resulting output can still contain duplicate values, and furthermore, 
> previously distinct values may be removed.
> This is easily repeatable, e.g. with this code:
> val df = Seq(
>  Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))
>  ).toDF("Number_Combinations")
> val dfWithDistinct = df.withColumn("distinct_combinations",
>  array_distinct(col("Number_Combinations")))
>  
> The initial 'df' DataFrame contains one row, where column 
> 'Number_Combinations' contains the following values:
> [[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]
>  
> The array_distinct function run on this column produces a new column 
> containing the following values:
> [[1, 2], [1, 2], [1, 2]]
>  
> As you can see, this contains three occurrences of the same value (1, 2), and 
> furthermore, the distinct values (3, 4), (4, 5) have been removed.
>  
>  
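
Until the underlying fix lands, one hedged workaround is to deduplicate the 
nested arrays with a plain Scala UDF instead of array_distinct. A minimal 
sketch (spark-shell session, so spark.implicits._ is in scope for toDF):

{code:scala}
import org.apache.spark.sql.functions.{col, udf}

// Seq.distinct compares the nested sequences structurally, so the result
// keeps [1,2], [3,4] and [4,5] exactly once each.
val distinctNested = udf((arr: Seq[Seq[Int]]) => arr.distinct)

val df = Seq(
  Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))
).toDF("Number_Combinations")

val dfWithDistinct = df.withColumn("distinct_combinations",
  distinctNested(col("Number_Combinations")))
{code}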



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27123) Improve CollapseProject to handle projects cross limit/repartition/sample

2019-03-12 Thread DB Tsai (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-27123.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24049
[https://github.com/apache/spark/pull/24049]

> Improve CollapseProject to handle projects cross limit/repartition/sample
> -
>
> Key: SPARK-27123
> URL: https://issues.apache.org/jira/browse/SPARK-27123
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> `CollapseProject` optimizer simplifies the plan by merging the adjacent 
> projects and performing alias substitution.
> {code:java}
> scala> sql("SELECT b c FROM (SELECT a b FROM t)").explain
> == Physical Plan ==
> *(1) Project [a#5 AS c#1]
> +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5]
> {code}
> We can do that more complex cases like the following.
> *BEFORE*
> {code:java}
> scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM 
> t)").explain
> == Physical Plan ==
> *(2) Project [b#0 AS c#1]
> +- Exchange RoundRobinPartitioning(1)
>+- *(1) Project [a#5 AS b#0]
>   +- Scan hive default.t [a#5], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#5]
> {code}
> *AFTER*
> {code:java}
> scala> sql("SELECT b c FROM (SELECT /*+ REPARTITION(1) */ a b FROM 
> t)").explain
> == Physical Plan ==
> Exchange RoundRobinPartitioning(1)
> +- *(1) Project [a#11 AS c#7]
>+- Scan hive default.t [a#11], HiveTableRelation `default`.`t`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#11]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27134) array_distinct function does not work correctly with columns containing array of array

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27134:


Assignee: (was: Apache Spark)

> array_distinct function does not work correctly with columns containing array 
> of array
> --
>
> Key: SPARK-27134
> URL: https://issues.apache.org/jira/browse/SPARK-27134
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
> Environment: Spark 2.4, scala 2.11.11
>Reporter: Mike Trenaman
>Priority: Major
>
> The array_distinct function introduced in spark 2.4 is producing strange 
> results when used on an array column which contains a nested array. The 
> resulting output can still contain duplicate values, and furthermore, 
> previously distinct values may be removed.
> This is easily repeatable, e.g. with this code:
> val df = Seq(
>  Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))
>  ).toDF("Number_Combinations")
> val dfWithDistinct = df.withColumn("distinct_combinations",
>  array_distinct(col("Number_Combinations")))
>  
> The initial 'df' DataFrame contains one row, where column 
> 'Number_Combinations' contains the following values:
> [[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]
>  
> The array_distinct function run on this column produces a new column 
> containing the following values:
> [[1, 2], [1, 2], [1, 2]]
>  
> As you can see, this contains three occurrences of the same value (1, 2), and 
> furthermore, the distinct values (3, 4), (4, 5) have been removed.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27139) NettyBlockTransferService does not abide by spark.blockManager.port config option

2019-03-12 Thread Bolke de Bruin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bolke de Bruin resolved SPARK-27139.

Resolution: Invalid

And proper casing does the trick

> NettyBlockTransferService does not abide by spark.blockManager.port config 
> option
> -
>
> Key: SPARK-27139
> URL: https://issues.apache.org/jira/browse/SPARK-27139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Bolke de Bruin
>Priority: Blocker
>
> This is a regression from a fix in SPARK-4837
> The NettyBlockTransferService always binds to a random port, and does not use 
> the spark.blockManager.port config as specified.
> this is a blocker for tightly controlled environments where random ports are 
> not allowed to pass firewalls.
> neither `spark.driver.blockmanager.port` nor `spark.blockmanager.port` works 
>  
>  
>  
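
Since the resolution above comes down to key casing, a minimal sketch of the 
camel-cased settings (the port numbers are arbitrary examples):

{code:scala}
import org.apache.spark.SparkConf

// Spark config keys are case-sensitive: "spark.blockmanager.port" is not
// recognised, while the camel-cased keys below are honoured.
val conf = new SparkConf()
  .set("spark.blockManager.port", "40000")        // block manager port on executors
  .set("spark.driver.blockManager.port", "40001") // optional driver-side override
{code}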



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27139) NettyBlockTransferService does not abide by spark.blockManager.port config option

2019-03-12 Thread Bolke de Bruin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bolke de Bruin updated SPARK-27139:
---
Description: 
This is a regression from a fix in SPARK-4837

The NettyBlockTransferService always binds to a random port, and does not use 
the spark.blockManager.port config as specified.

this is a blocker for tightly controlled environments where random ports are 
not allowed to pass firewalls.

neither `spark.driver.blockmanager.port` nor `spark.blockmanager.port` works 

 

 

 

  was:
This is a regression from a fix in SPARK-4837

The NettyBlockTransferService always binds to a random port, and does not use 
the spark.blockManager.port config as specified.

this is a blocker for tightly controlled environments where random ports are 
not allowed to pass firewalls.

 


> NettyBlockTransferService does not abide by spark.blockManager.port config 
> option
> -
>
> Key: SPARK-27139
> URL: https://issues.apache.org/jira/browse/SPARK-27139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Bolke de Bruin
>Priority: Blocker
>
> This is a regression from a fix in SPARK-4837
> The NettyBlockTransferService always binds to a random port, and does not use 
> the spark.blockManager.port config as specified.
> this is a blocker for tightly controlled environments where random ports are 
> not allowed to pass firewalls.
> neither `spark.driver.blockmanager.port` nor `spark.blockmanager.port` works 
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26927) Race condition may cause dynamic allocation not working

2019-03-12 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-26927:
--

Assignee: liupengcheng

> Race condition may cause dynamic allocation not working
> ---
>
> Key: SPARK-26927
> URL: https://issues.apache.org/jira/browse/SPARK-26927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Assignee: liupengcheng
>Priority: Major
> Attachments: Selection_042.jpg, Selection_043.jpg, Selection_044.jpg, 
> Selection_045.jpg, Selection_046.jpg
>
>
> Recently, we caught a bug that caused our production Spark Thrift Server to hang:
> There is a race condition in the ExecutorAllocationManager: the 
> `SparkListenerExecutorRemoved` event can be posted before the 
> `SparkListenerTaskStart` event, which leads to an incorrect `executorIds` set. 
> Then, when some executor idles, real executors are removed even though the 
> executor count equals `minNumExecutors`, because `newExecutorTotal` is computed 
> incorrectly (it may be greater than `minNumExecutors`). This can finally leave 
> zero available executors while a wrong set of executorIds is kept in memory.
> What's more, even the `SparkListenerTaskEnd` event cannot release the fake 
> `executorIds`, because later idle events for the fake executors cannot trigger 
> their real removal: they have already been removed and no longer exist in the 
> `executorDataMap` of `CoarseGrainedSchedulerBackend`.
> Logs:
> !Selection_042.jpg!
> !Selection_043.jpg!
> !Selection_044.jpg!
> !Selection_045.jpg!
> !Selection_046.jpg!  
> EventLogs(DisOrder of events):
> {code:java}
> {"Event":"SparkListenerExecutorRemoved","Timestamp":1549936077543,"Executor 
> ID":"131","Removed Reason":"Container 
> container_e28_1547530852233_236191_02_000180 exited from explicit termination 
> request."}
> {"Event":"SparkListenerTaskStart","Stage ID":136689,"Stage Attempt 
> ID":0,"Task Info":{"Task ID":448048,"Index":2,"Attempt":0,"Launch 
> Time":1549936032872,"Executor 
> ID":"131","Host":"mb2-hadoop-prc-st474.awsind","Locality":"RACK_LOCAL", 
> "Speculative":false,"Getting Result Time":0,"Finish 
> Time":1549936032906,"Failed":false,"Killed":false,"Accumulables":[{"ID":12923945,"Name":"internal.metrics.executorDeserializeTime","Update":10,"Value":13,"Internal":true,"Count
>  Faile d 
> Values":true},{"ID":12923946,"Name":"internal.metrics.executorDeserializeCpuTime","Update":2244016,"Value":4286494,"Internal":true,"Count
>  Failed 
> Values":true},{"ID":12923947,"Name":"internal.metrics.executorRunTime","Update":20,"Val
>  ue":39,"Internal":true,"Count Failed 
> Values":true},{"ID":12923948,"Name":"internal.metrics.executorCpuTime","Update":13412614,"Value":26759061,"Internal":true,"Count
>  Failed Values":true},{"ID":12923949,"Name":"internal.metrics.resultS 
> ize","Update":3578,"Value":7156,"Internal":true,"Count Failed 
> Values":true},{"ID":12923954,"Name":"internal.metrics.peakExecutionMemory","Update":33816576,"Value":67633152,"Internal":true,"Count
>  Failed Values":true},{"ID":12923962,"Na 
> me":"internal.metrics.shuffle.write.bytesWritten","Update":1367,"Value":2774,"Internal":true,"Count
>  Failed 
> Values":true},{"ID":12923963,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":23,"Value":45,"Internal":true,"Cou
>  nt Failed 
> Values":true},{"ID":12923964,"Name":"internal.metrics.shuffle.write.writeTime","Update":3259051,"Value":6858121,"Internal":true,"Count
>  Failed Values":true},{"ID":12921550,"Name":"number of output 
> rows","Update":"158","Value" :"289","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921546,"Name":"number of output 
> rows","Update":"23","Value":"45","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921547,"Name":"peak memo ry total 
> (min, med, 
> max)","Update":"33816575","Value":"67633149","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921541,"Name":"data size total (min, 
> med, max)","Update":"551","Value":"1077","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"}]}}
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26927) Race condition may cause dynamic allocation not working

2019-03-12 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-26927.

   Resolution: Fixed
Fix Version/s: 2.3.4
   2.4.2
   3.0.0

Issue resolved by pull request 23842
[https://github.com/apache/spark/pull/23842]

> Race condition may cause dynamic allocation not working
> ---
>
> Key: SPARK-26927
> URL: https://issues.apache.org/jira/browse/SPARK-26927
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.4.0
>Reporter: liupengcheng
>Assignee: liupengcheng
>Priority: Major
> Fix For: 3.0.0, 2.4.2, 2.3.4
>
> Attachments: Selection_042.jpg, Selection_043.jpg, Selection_044.jpg, 
> Selection_045.jpg, Selection_046.jpg
>
>
> Recently, we caught a bug that caused our production Spark Thrift Server to hang:
> There is a race condition in the ExecutorAllocationManager: the 
> `SparkListenerExecutorRemoved` event can be posted before the 
> `SparkListenerTaskStart` event, which leads to an incorrect `executorIds` set. 
> Then, when some executor idles, real executors are removed even though the 
> executor count equals `minNumExecutors`, because `newExecutorTotal` is computed 
> incorrectly (it may be greater than `minNumExecutors`). This can finally leave 
> zero available executors while a wrong set of executorIds is kept in memory.
> What's more, even the `SparkListenerTaskEnd` event cannot release the fake 
> `executorIds`, because later idle events for the fake executors cannot trigger 
> their real removal: they have already been removed and no longer exist in the 
> `executorDataMap` of `CoarseGrainedSchedulerBackend`.
> Logs:
> !Selection_042.jpg!
> !Selection_043.jpg!
> !Selection_044.jpg!
> !Selection_045.jpg!
> !Selection_046.jpg!  
> EventLogs(DisOrder of events):
> {code:java}
> {"Event":"SparkListenerExecutorRemoved","Timestamp":1549936077543,"Executor 
> ID":"131","Removed Reason":"Container 
> container_e28_1547530852233_236191_02_000180 exited from explicit termination 
> request."}
> {"Event":"SparkListenerTaskStart","Stage ID":136689,"Stage Attempt 
> ID":0,"Task Info":{"Task ID":448048,"Index":2,"Attempt":0,"Launch 
> Time":1549936032872,"Executor 
> ID":"131","Host":"mb2-hadoop-prc-st474.awsind","Locality":"RACK_LOCAL", 
> "Speculative":false,"Getting Result Time":0,"Finish 
> Time":1549936032906,"Failed":false,"Killed":false,"Accumulables":[{"ID":12923945,"Name":"internal.metrics.executorDeserializeTime","Update":10,"Value":13,"Internal":true,"Count
>  Faile d 
> Values":true},{"ID":12923946,"Name":"internal.metrics.executorDeserializeCpuTime","Update":2244016,"Value":4286494,"Internal":true,"Count
>  Failed 
> Values":true},{"ID":12923947,"Name":"internal.metrics.executorRunTime","Update":20,"Val
>  ue":39,"Internal":true,"Count Failed 
> Values":true},{"ID":12923948,"Name":"internal.metrics.executorCpuTime","Update":13412614,"Value":26759061,"Internal":true,"Count
>  Failed Values":true},{"ID":12923949,"Name":"internal.metrics.resultS 
> ize","Update":3578,"Value":7156,"Internal":true,"Count Failed 
> Values":true},{"ID":12923954,"Name":"internal.metrics.peakExecutionMemory","Update":33816576,"Value":67633152,"Internal":true,"Count
>  Failed Values":true},{"ID":12923962,"Na 
> me":"internal.metrics.shuffle.write.bytesWritten","Update":1367,"Value":2774,"Internal":true,"Count
>  Failed 
> Values":true},{"ID":12923963,"Name":"internal.metrics.shuffle.write.recordsWritten","Update":23,"Value":45,"Internal":true,"Cou
>  nt Failed 
> Values":true},{"ID":12923964,"Name":"internal.metrics.shuffle.write.writeTime","Update":3259051,"Value":6858121,"Internal":true,"Count
>  Failed Values":true},{"ID":12921550,"Name":"number of output 
> rows","Update":"158","Value" :"289","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921546,"Name":"number of output 
> rows","Update":"23","Value":"45","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921547,"Name":"peak memo ry total 
> (min, med, 
> max)","Update":"33816575","Value":"67633149","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"},{"ID":12921541,"Name":"data size total (min, 
> med, max)","Update":"551","Value":"1077","Internal":true,"Count Failed 
> Values":true,"Metadata":"sql"}]}}
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27139) NettyBlockTransferService does not abide by spark.blockManager.port config option

2019-03-12 Thread Bolke de Bruin (JIRA)
Bolke de Bruin created SPARK-27139:
--

 Summary: NettyBlockTransferService does not abide by 
spark.blockManager.port config option
 Key: SPARK-27139
 URL: https://issues.apache.org/jira/browse/SPARK-27139
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Bolke de Bruin


This is a regression from a fix in SPARK-4837

The NettyBlockTransferService always binds to a random port, and does not use 
the spark.blockManager.port config as specified.

this is a blocker for tightly controlled environments where random ports are 
not allowed to pass firewalls.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-12 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790911#comment-16790911
 ] 

Dhruve Ashar commented on SPARK-27107:
--

I verified the changes and we are no longer seeing the issue. Thanks for 
testing and voting on the ORC RC. I don't think I am on the ORC mailing list, so 
I might have missed the vote. 

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> 

[jira] [Commented] (SPARK-27087) Inability to access to column alias in pyspark

2019-03-12 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790929#comment-16790929
 ] 

Thincrs commented on SPARK-27087:
-

A user of thincrs has selected this issue. Deadline: Tue, Mar 19, 2019 7:41 PM

> Inability to access to column alias in pyspark
> --
>
> Key: SPARK-27087
> URL: https://issues.apache.org/jira/browse/SPARK-27087
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Vincent
>Priority: Minor
>
> In pyspark I have the following:
> {code:java}
> import pyspark.sql.functions as F
> cc = F.lit(1).alias("A")
> print(cc)
> print(cc._jc.toString())
> {code}
> I get :
> {noformat}
> Column
> 1 AS `A`
> {noformat}
> Is there any way for me to just print "A" from cc? It seems I'm unable to 
> extract the alias programmatically from the column object.
> Also, I think that in Spark SQL in Scala, if I print "cc" it would just print 
> "A" instead, so this seems like a bug or a missing feature to me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27138) Remove AdminUtils calls

2019-03-12 Thread Dylan Guedes (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dylan Guedes updated SPARK-27138:
-
Description: KafkaTestUtils (from kafka010) currently uses AdminUtils to 
create and delete topics for test suites (what is currently deprecated). Since 
it will stop to work at some point, I think that it is a good opportunity to 
change the API calls.  (was: KafkaTestUtils (from kafka010) currently uses 
AdminUtils to create and delete topics for test suites (what is currently 
deprecated). Since it will stop to work at some point, I think that it is a 
good opportunity.)

> Remove AdminUtils calls
> ---
>
> Key: SPARK-27138
> URL: https://issues.apache.org/jira/browse/SPARK-27138
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Dylan Guedes
>Priority: Minor
>
> KafkaTestUtils (from kafka010) currently uses AdminUtils to create and delete 
> topics for test suites (which is currently deprecated). Since it will stop 
> working at some point, I think this is a good opportunity to change the API 
> calls.
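
A hedged sketch of what the replacement could look like with the newer Kafka 
AdminClient API (broker address, topic name and partition/replication counts 
are placeholders; the exact client version depends on what kafka010 pulls in):

{code:scala}
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}

val props = new Properties()
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

val admin = AdminClient.create(props)
try {
  // Create a test topic, run the suite against it, then delete it.
  admin.createTopics(Collections.singleton(new NewTopic("test-topic", 1, 1.toShort))).all().get()
  // ... test body ...
  admin.deleteTopics(Collections.singleton("test-topic")).all().get()
} finally {
  admin.close()
}
{code}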



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26251) isnan function not picking non-numeric values

2019-03-12 Thread Thincrs (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790914#comment-16790914
 ] 

Thincrs commented on SPARK-26251:
-

A user of thincrs has selected this issue. Deadline: Tue, Mar 19, 2019 7:38 PM

> isnan function not picking non-numeric values
> -
>
> Key: SPARK-26251
> URL: https://issues.apache.org/jira/browse/SPARK-26251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kunal Rao
>Priority: Minor
>
> import org.apache.spark.sql.functions._
> List("po box 7896", "8907", 
> "435435").toDF("rgid").filter(isnan(col("rgid"))).show
>  
> should pick "po box 7896"
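
Note that isnan only matches the floating-point NaN value, not strings that 
fail to parse as numbers. A minimal sketch of one way to flag non-numeric 
strings instead, using a cast-to-double null check (spark-shell session, so 
spark.implicits._ is in scope for toDF):

{code:scala}
import org.apache.spark.sql.functions.col

// "po box 7896" cannot be cast to a double, so the cast yields null and the
// null check picks it out.
val df = List("po box 7896", "8907", "435435").toDF("rgid")
df.filter(col("rgid").cast("double").isNull).show()
{code}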



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26089) Handle large corrupt shuffle blocks

2019-03-12 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-26089:


Assignee: Ankur Gupta

> Handle large corrupt shuffle blocks
> ---
>
> Key: SPARK-26089
> URL: https://issues.apache.org/jira/browse/SPARK-26089
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Assignee: Ankur Gupta
>Priority: Major
> Fix For: 3.0.0
>
>
> We've seen a bad disk lead to corruption in a shuffle block, which led to 
> tasks repeatedly failing with an IOException after fetching the data.  The 
> tasks get retried, but the same corrupt data gets fetched again, and the 
> tasks keep failing.  As there isn't a fetch failure, the jobs eventually 
> fail and Spark never tries to regenerate the shuffle data.
> This is the same as SPARK-4105, but that fix only covered small blocks.  
> There was some discussion during that change about this limitation 
> (https://github.com/apache/spark/pull/15923#discussion_r88756017) and 
> followups to cover larger blocks (which would involve spilling to disk to 
> avoid OOM), but it looks like that never happened.
> I can think of a few approaches to this:
> 1) wrap the shuffle block input stream with another input stream, that 
> converts all exceptions into FetchFailures.  This is similar to the fix of 
> SPARK-4105, but that reads the entire input stream up-front, and instead I'm 
> proposing to do it within the InputStream itself so it is streaming and does 
> not have a large memory overhead.
> 2) Add checksums to shuffle blocks.  This was proposed 
> [here|https://github.com/apache/spark/pull/15894] and abandoned as being too 
> complex.
> 3) Try to tackle this with blacklisting instead: when there is any failure in 
> a task that is reading shuffle data, assign some "blame" to the source of the 
> shuffle data, and eventually blacklist the source.  It seems really tricky to 
> get sensible heuristics for this, though.
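
A minimal sketch of approach 1, with a hypothetical wrapper and exception type 
standing in for whatever fetch-failure plumbing the real change would hook into:

{code:scala}
import java.io.{FilterInputStream, IOException, InputStream}

// Hypothetical exception type; the real change would surface a FetchFailure.
class CorruptShuffleBlockException(cause: IOException) extends RuntimeException(cause)

// Wrap the block stream and convert IO errors as bytes are consumed, without
// buffering the whole block in memory.
class FetchFailureReportingStream(in: InputStream) extends FilterInputStream(in) {
  override def read(): Int =
    try super.read()
    catch { case e: IOException => throw new CorruptShuffleBlockException(e) }

  override def read(b: Array[Byte], off: Int, len: Int): Int =
    try super.read(b, off, len)
    catch { case e: IOException => throw new CorruptShuffleBlockException(e) }
}
{code}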



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26089) Handle large corrupt shuffle blocks

2019-03-12 Thread Imran Rashid (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-26089.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23453
[https://github.com/apache/spark/pull/23453]

> Handle large corrupt shuffle blocks
> ---
>
> Key: SPARK-26089
> URL: https://issues.apache.org/jira/browse/SPARK-26089
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Shuffle, Spark Core
>Affects Versions: 2.4.0
>Reporter: Imran Rashid
>Priority: Major
> Fix For: 3.0.0
>
>
> We've seen a bad disk lead to corruption in a shuffle block, which led to 
> tasks repeatedly failing with an IOException after fetching the data.  The 
> tasks get retried, but the same corrupt data gets fetched again, and the 
> tasks keep failing.  As there isn't a fetch failure, the jobs eventually 
> fail and Spark never tries to regenerate the shuffle data.
> This is the same as SPARK-4105, but that fix only covered small blocks.  
> There was some discussion during that change about this limitation 
> (https://github.com/apache/spark/pull/15923#discussion_r88756017) and 
> followups to cover larger blocks (which would involve spilling to disk to 
> avoid OOM), but it looks like that never happened.
> I can think of a few approaches to this:
> 1) wrap the shuffle block input stream with another input stream, that 
> converts all exceptions into FetchFailures.  This is similar to the fix of 
> SPARK-4105, but that reads the entire input stream up-front, and instead I'm 
> proposing to do it within the InputStream itself so it is streaming and does 
> not have a large memory overhead.
> 2) Add checksums to shuffle blocks.  This was proposed 
> [here|https://github.com/apache/spark/pull/15894] and abandoned as being too 
> complex.
> 3) Try to tackle this with blacklisting instead: when there is any failure in 
> a task that is reading shuffle data, assign some "blame" to the source of the 
> shuffle data, and eventually blacklist the source.  It seems really tricky to 
> get sensible heuristics for this, though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27138) Remove AdminUtils calls

2019-03-12 Thread Dylan Guedes (JIRA)
Dylan Guedes created SPARK-27138:


 Summary: Remove AdminUtils calls
 Key: SPARK-27138
 URL: https://issues.apache.org/jira/browse/SPARK-27138
 Project: Spark
  Issue Type: Task
  Components: Tests
Affects Versions: 2.4.0
Reporter: Dylan Guedes


KafkaTestUtils (from kafka010) currently uses AdminUtils to create and delete 
topics for test suites (which is currently deprecated). Since it will stop 
working at some point, I think this is a good opportunity to change the API calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27138) Remove AdminUtils calls

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27138:


Assignee: (was: Apache Spark)

> Remove AdminUtils calls
> ---
>
> Key: SPARK-27138
> URL: https://issues.apache.org/jira/browse/SPARK-27138
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Dylan Guedes
>Priority: Minor
>
> KafkaTestUtils (from kafka010) currently uses AdminUtils to create and delete 
> topics for test suites (which is currently deprecated). Since it will stop 
> working at some point, I think this is a good opportunity to change the API calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27138) Remove AdminUtils calls

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27138:


Assignee: Apache Spark

> Remove AdminUtils calls
> ---
>
> Key: SPARK-27138
> URL: https://issues.apache.org/jira/browse/SPARK-27138
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Dylan Guedes
>Assignee: Apache Spark
>Priority: Minor
>
> KafkaTestUtils (from kafka010) currently uses AdminUtils to create and delete 
> topics for test suites (which is currently deprecated). Since it will stop 
> working at some point, I think this is a good opportunity to change the API calls.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27010) find out the actual port number when hive.server2.thrift.port=0

2019-03-12 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27010:
-

Assignee: zuotingbing

> find out the actual port number when hive.server2.thrift.port=0
> ---
>
> Key: SPARK-27010
> URL: https://issues.apache.org/jira/browse/SPARK-27010
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: zuotingbing
>Assignee: zuotingbing
>Priority: Minor
> Attachments: 2019-02-28_170844.png, 2019-02-28_170904.png, 
> 2019-02-28_170942.png, 2019-03-01_092511.png
>
>
> Currently, if we set *hive.server2.thrift.port=0*, it is hard to find out the 
> actual port number we should use when connecting with beeline.
> before:
> !2019-02-28_170942.png!
> after:
> !2019-02-28_170904.png!
> use beeline to connect success:
> !2019-02-28_170844.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27010) find out the actual port number when hive.server2.thrift.port=0

2019-03-12 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27010.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23917
[https://github.com/apache/spark/pull/23917]

> find out the actual port number when hive.server2.thrift.port=0
> ---
>
> Key: SPARK-27010
> URL: https://issues.apache.org/jira/browse/SPARK-27010
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: zuotingbing
>Assignee: zuotingbing
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: 2019-02-28_170844.png, 2019-02-28_170904.png, 
> 2019-02-28_170942.png, 2019-03-01_092511.png
>
>
> Currently, if we set *hive.server2.thrift.port=0*, it is hard to find out the 
> actual port number we should use when connecting with beeline.
> before:
> !2019-02-28_170942.png!
> after:
> !2019-02-28_170904.png!
> use beeline to connect success:
> !2019-02-28_170844.png!
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27090) Removing old LEGACY_DRIVER_IDENTIFIER ("")

2019-03-12 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27090.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24026
[https://github.com/apache/spark/pull/24026]

> Removing old LEGACY_DRIVER_IDENTIFIER ("")
> --
>
> Key: SPARK-27090
> URL: https://issues.apache.org/jira/browse/SPARK-27090
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Shivu Sondur
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> For legacy reasons, LEGACY_DRIVER_IDENTIFIER was checked in a few places 
> along with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver 
> or an executor is running.
> The new DRIVER_IDENTIFIER ("driver") was introduced in Spark 1.4, so I think 
> we now have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER.
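
For context, a minimal sketch of the kind of check the ticket simplifies, using 
a string literal because the identifier constants are private[spark]:

{code:scala}
// "driver" mirrors SparkContext.DRIVER_IDENTIFIER; with the legacy "<driver>"
// value gone, a single comparison is enough.
def isDriver(executorId: String): Boolean = executorId == "driver"
{code}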



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27090) Removing old LEGACY_DRIVER_IDENTIFIER ("<driver>")

2019-03-12 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27090:
-

Assignee: Shivu Sondur

> Removing old LEGACY_DRIVER_IDENTIFIER ("<driver>")
> --
>
> Key: SPARK-27090
> URL: https://issues.apache.org/jira/browse/SPARK-27090
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Attila Zsolt Piros
>Assignee: Shivu Sondur
>Priority: Minor
>  Labels: release-notes
>
> For legacy reasons LEGACY_DRIVER_IDENTIFIER was checked in a few places 
> along with the new DRIVER_IDENTIFIER ("driver") to decide whether a driver 
> or an executor is running.
> The new DRIVER_IDENTIFIER ("driver") was introduced in Spark version 1.4, so 
> I think we have a chance to get rid of the LEGACY_DRIVER_IDENTIFIER.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23961) pyspark toLocalIterator throws an exception

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23961:


Assignee: (was: Apache Spark)

> pyspark toLocalIterator throws an exception
> ---
>
> Key: SPARK-23961
> URL: https://issues.apache.org/jira/browse/SPARK-23961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: DataFrame, pyspark
>
> Given a dataframe, use toLocalIterator on it. If we do not consume all 
> records, it throws: 
> {quote}ERROR PythonRDD: Error while sending iterator
>  java.net.SocketException: Connection reset by peer: socket write error
>  at java.net.SocketOutputStream.socketWrite0(Native Method)
>  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>  at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>  at java.io.DataOutputStream.write(DataOutputStream.java:107)
>  at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>  at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
>  at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
> {quote}
>  
> To reproduce, here is a simple pyspark shell script that shows the error:
> {quote}import itertools
>  df = spark.read.parquet("large parquet folder").cache()
> print(df.count())
>  b = df.toLocalIterator()
>  print(len(list(itertools.islice(b, 20))))
>  b = None # Make the iterator go out of scope.  Throws here.
> {quote}
>  
> Observations:
>  * Consuming all records does not throw.  Taking only a subset of the 
> partitions creates the error.
>  * In another experiment, doing the same on a regular RDD works if we 
> cache/materialize it. If we do not cache the RDD, it throws similarly.
>  * It works in the Scala shell
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23961) pyspark toLocalIterator throws an exception

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23961:


Assignee: Apache Spark

> pyspark toLocalIterator throws an exception
> ---
>
> Key: SPARK-23961
> URL: https://issues.apache.org/jira/browse/SPARK-23961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Michel Lemay
>Assignee: Apache Spark
>Priority: Minor
>  Labels: DataFrame, pyspark
>
> Given a dataframe, use toLocalIterator on it. If we do not consume all 
> records, it throws: 
> {quote}ERROR PythonRDD: Error while sending iterator
>  java.net.SocketException: Connection reset by peer: socket write error
>  at java.net.SocketOutputStream.socketWrite0(Native Method)
>  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>  at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>  at java.io.DataOutputStream.write(DataOutputStream.java:107)
>  at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>  at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
>  at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
> {quote}
>  
> To reproduce, here is a simple pyspark shell script that shows the error:
> {quote}import itertools
>  df = spark.read.parquet("large parquet folder").cache()
> print(df.count())
>  b = df.toLocalIterator()
>  print(len(list(itertools.islice(b, 20))))
>  b = None # Make the iterator go out of scope.  Throws here.
> {quote}
>  
> Observations:
>  * Consuming all records does not throw.  Taking only a subset of the 
> partitions creates the error.
>  * In another experiment, doing the same on a regular RDD works if we 
> cache/materialize it. If we do not cache the RDD, it throws similarly.
>  * It works in the Scala shell
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27137) Spark captured variable is null if the code is pasted via :paste

2019-03-12 Thread Osira Ben (JIRA)
Osira Ben created SPARK-27137:
-

 Summary: Spark captured variable is null if the code is pasted via 
:paste
 Key: SPARK-27137
 URL: https://issues.apache.org/jira/browse/SPARK-27137
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Osira Ben


If I execute this piece of code
{code:java}
val foo = "foo"

def f(arg: Any): Unit = {
  Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo"))
}

sc.parallelize(Seq(1, 2), 2).foreach(f)
{code}
{{in spark2-shell via :paste it throws}}{{}}
{code:java}
scala> :paste
// Entering paste mode (ctrl-D to finish)

val foo = "foo"

def f(arg: Any): Unit = {
  Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo"))
}

sc.parallelize(Seq(1, 2), 2).foreach(f)

// Exiting paste mode, now interpreting.

19/03/11 15:02:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 
(TID 2, hadoop.company.com, executor 1): java.lang.NullPointerException: foo
at java.util.Objects.requireNonNull(Objects.java:228)
{code}
However if I execute it pasting without :paste or via spark2-shell -i it 
doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27137) Spark captured variable is null if the code is pasted via :paste

2019-03-12 Thread Osira Ben (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Osira Ben updated SPARK-27137:
--
Description: 
If I execute this piece of code
{code:java}
val foo = "foo"

def f(arg: Any): Unit = {
  Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo"))
}

sc.parallelize(Seq(1, 2), 2).foreach(f)
{code}
{{in spark2-shell via :paste it throws}}
{code:java}
scala> :paste
// Entering paste mode (ctrl-D to finish)

val foo = "foo"

def f(arg: Any): Unit = {
  Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo"))
}

sc.parallelize(Seq(1, 2), 2).foreach(f)

// Exiting paste mode, now interpreting.

19/03/11 15:02:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 
(TID 2, hadoop.company.com, executor 1): java.lang.NullPointerException: foo
at java.util.Objects.requireNonNull(Objects.java:228)
{code}
However if I execute it pasting without :paste or via spark2-shell -i it 
doesn't.

  was:
If I execute this piece of code
{code:java}
val foo = "foo"

def f(arg: Any): Unit = {
  Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo"))
}

sc.parallelize(Seq(1, 2), 2).foreach(f)
{code}
{{in spark2-shell via :paste it throws}}{{}}
{code:java}
scala> :paste
// Entering paste mode (ctrl-D to finish)

val foo = "foo"

def f(arg: Any): Unit = {
  Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo"))
}

sc.parallelize(Seq(1, 2), 2).foreach(f)

// Exiting paste mode, now interpreting.

19/03/11 15:02:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 
(TID 2, hadoop.company.com, executor 1): java.lang.NullPointerException: foo
at java.util.Objects.requireNonNull(Objects.java:228)
{code}
However if I execute it pasting without :paste or via spark2-shell -i it 
doesn't.


> Spark captured variable is null if the code is pasted via :paste
> 
>
> Key: SPARK-27137
> URL: https://issues.apache.org/jira/browse/SPARK-27137
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Osira Ben
>Priority: Major
>
> If I execute this piece of code
> {code:java}
> val foo = "foo"
> def f(arg: Any): Unit = {
>   Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo"))
> }
> sc.parallelize(Seq(1, 2), 2).foreach(f)
> {code}
> {{in spark2-shell via :paste it throws}}
> {code:java}
> scala> :paste
> // Entering paste mode (ctrl-D to finish)
> val foo = "foo"
> def f(arg: Any): Unit = {
>   Option(42).foreach(_ => java.util.Objects.requireNonNull(foo, "foo"))
> }
> sc.parallelize(Seq(1, 2), 2).foreach(f)
> // Exiting paste mode, now interpreting.
> 19/03/11 15:02:06 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 
> (TID 2, hadoop.company.com, executor 1): java.lang.NullPointerException: foo
> at java.util.Objects.requireNonNull(Objects.java:228)
> {code}
> However if I execute it pasting without :paste or via spark2-shell -i it 
> doesn't.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27112) Spark Scheduler encounters two independent Deadlocks when trying to kill executors either due to dynamic allocation or blacklisting

2019-03-12 Thread Parth Gandhi (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Parth Gandhi updated SPARK-27112:
-
Description: 
Recently, a few Spark users in the organization have reported that their jobs 
were getting stuck. On further analysis, it was found that there exist two 
independent deadlocks, each of which occurs under different circumstances. 
The screenshots for these two deadlocks are attached here. 

We were able to reproduce the deadlocks with the following piece of code:

 
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

import org.apache.spark._
import org.apache.spark.TaskContext

// Simple example of Word Count in Scala
object ScalaWordCount {
def main(args: Array[String]) {

if (args.length < 2) {
System.err.println("Usage: ScalaWordCount  ")
System.exit(1)
}

val conf = new SparkConf().setAppName("Scala Word Count")
val sc = new SparkContext(conf)

// get the input file uri
val inputFilesUri = args(0)

// get the output file uri
val outputFilesUri = args(1)

while (true) {
val textFile = sc.textFile(inputFilesUri)
val counts = textFile.flatMap(line => line.split(" "))
.map(word => {if (TaskContext.get.partitionId == 5 && 
TaskContext.get.attemptNumber == 0) throw new Exception("Fail for 
blacklisting") else (word, 1)})
.reduceByKey(_ + _)
counts.saveAsTextFile(outputFilesUri)
val conf: Configuration = new Configuration()
val path: Path = new Path(outputFilesUri)
val hdfs: FileSystem = FileSystem.get(conf)
hdfs.delete(path, true)
}

sc.stop()
}
}
{code}
 

Additionally, to ensure that the deadlock surfaces soon enough, I also added 
a small delay in the Spark code here:

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256]

 
{code:java}
executorIdToFailureList.remove(exec)
updateNextExpiryTime()
Thread.sleep(2000)
killBlacklistedExecutor(exec)
{code}
 

Also make sure that the following configs are set when launching the above 
spark job:
*spark.blacklist.enabled=true*
*spark.blacklist.killBlacklistedExecutors=true*
*spark.blacklist.application.maxFailedTasksPerExecutor=1*

  was:
Recently, a few spark users in the organization have reported that their jobs 
were getting stuck. On further analysis, it was found out that there exist two 
independent deadlocks and either of them occur under different circumstances. 
The screenshots for these two deadlocks are attached here. 

We were able to reproduce the deadlocks with the following piece of code:

 
{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

import org.apache.spark._
import org.apache.spark.TaskContext

// Simple example of Word Count in Scala
object ScalaWordCount {
def main(args: Array[String]) {

if (args.length < 2) {
System.err.println("Usage: ScalaWordCount  ")
System.exit(1)
}

val conf = new SparkConf().setAppName("Scala Word Count")
val sc = new SparkContext(conf)

// get the input file uri
val inputFilesUri = args(0)

// get the output file uri
val outputFilesUri = args(1)

while (true) {
val textFile = sc.textFile(inputFilesUri)
val counts = textFile.flatMap(line => line.split(" "))
.map(word => {if (TaskContext.get.partitionId == 5 && 
TaskContext.get.attemptNumber == 0) throw new Exception("Fail for 
blacklisting") else (word, 1)})
.reduceByKey(_ + _)
counts.saveAsTextFile(outputFilesUri)
val conf: Configuration = new Configuration()
val path: Path = new Path(outputFilesUri)
val hdfs: FileSystem = FileSystem.get(conf)
hdfs.delete(path, true)
}

sc.stop()
}
}
{code}
 

Additionally, to ensure that the deadlock surfaces up soon enough, I also added 
a small delay in the Spark code here:

[https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala#L256]

 
{code:java}
executorIdToFailureList.remove(exec)
updateNextExpiryTime()
Thread.sleep(2000)
killBlacklistedExecutor(exec)
{code}


> Spark Scheduler encounters two independent Deadlocks when trying to kill 
> executors either due to dynamic allocation or blacklisting 
> 
>
> Key: SPARK-27112
> URL: https://issues.apache.org/jira/browse/SPARK-27112
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Parth Gandhi
>Priority: Major
> Attachments: Screen Shot 2019-02-26 at 4.10.26 PM.png, Screen Shot 
> 2019-02-26 at 4.10.48 PM.png, Screen Shot 2019-02-26 at 4.11.11 PM.png, 
> Screen Shot 2019-02-26 at 4.11.26 PM.png
>
>
> Recently, a few spark users in the organization have reported that their jobs 
> were getting stuck. On further analysis, it was found out 

[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-12 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790739#comment-16790739
 ] 

Dongjoon Hyun commented on SPARK-27107:
---

[~Dhruve Ashar]. Please vote on ORC RC1 after testing your environment. :)

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> 

[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-12 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790672#comment-16790672
 ] 

Dongjoon Hyun commented on SPARK-27107:
---

Yep. [~Dhruve Ashar]. I already tested and voted that. I'll do if the vote 
passes and the artifacts are uploaded. And, we need a test case in Spark side. 
Without a test case, it's just a dependency upgrade. Also, users can simply 
replace their ORC jar files. Apache Spark is not a fat assembly jar file for 
this reason.

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> 

[jira] [Comment Edited] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-12 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790672#comment-16790672
 ] 

Dongjoon Hyun edited comment on SPARK-27107 at 3/12/19 3:47 PM:


Yep. [~Dhruve Ashar]. I already tested and voted for that ORC RC vote 
yesterday. I'll do if the vote passes and the artifacts are uploaded. And, we 
need a test case in Spark side. Without a test case, it's just a dependency 
upgrade. Also, users can simply replace their ORC jar files without waiting 
Spark releases. Apache Spark is not a fat assembly jar file for this reason.


was (Author: dongjoon):
Yep. [~Dhruve Ashar]. I already tested and voted for that ORC RC vote. I'll do 
if the vote passes and the artifacts are uploaded. And, we need a test case in 
Spark side. Without a test case, it's just a dependency upgrade. Also, users 
can simply replace their ORC jar files. Apache Spark is not a fat assembly jar 
file for this reason.

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> 

[jira] [Comment Edited] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-12 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790672#comment-16790672
 ] 

Dongjoon Hyun edited comment on SPARK-27107 at 3/12/19 3:46 PM:


Yep. [~Dhruve Ashar]. I already tested and voted for that ORC RC vote. I'll do 
if the vote passes and the artifacts are uploaded. And, we need a test case in 
Spark side. Without a test case, it's just a dependency upgrade. Also, users 
can simply replace their ORC jar files. Apache Spark is not a fat assembly jar 
file for this reason.


was (Author: dongjoon):
Yep. [~Dhruve Ashar]. I already tested and voted that. I'll do if the vote 
passes and the artifacts are uploaded. And, we need a test case in Spark side. 
Without a test case, it's just a dependency upgrade. Also, users can simply 
replace their ORC jar files. Apache Spark is not a fat assembly jar file for 
this reason.

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> 

[jira] [Assigned] (SPARK-27041) large partition data cause pyspark with python2.x oom

2019-03-12 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27041:
-

Assignee: David Yang

> large partition data cause pyspark with python2.x oom
> -
>
> Key: SPARK-27041
> URL: https://issues.apache.org/jira/browse/SPARK-27041
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: David Yang
>Assignee: David Yang
>Priority: Major
>
> With a large partition, pyspark may exceed the executor memory limit and 
> trigger out-of-memory errors for Python 2.7.
> This is because map() is used. Unlike in Python 3.x, Python 2.7's map() 
> generates a list and needs to read all data into memory.
> The proposed fix will use imap in Python 2.7 and it has been verified.
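
For illustration, here is a minimal sketch of the difference the fix relies on, 
assuming Python 2.7 semantics; the function names and the placeholder transform 
below are made up for this example and are not taken from the actual PySpark 
patch.

{code:python}
# Minimal sketch, assuming Python 2.7 (itertools.imap does not exist in Python 3).
import itertools

def transform(record):
    # Placeholder per-record transformation, purely illustrative.
    return record

def process_partition_eagerly(iterator):
    # Python 2's built-in map() materializes the whole partition as a list
    # before returning, so a very large partition sits in memory at once.
    return map(transform, iterator)

def process_partition_lazily(iterator):
    # itertools.imap() yields one record at a time (like Python 3's map()),
    # keeping memory bounded no matter how large the partition is.
    return itertools.imap(transform, iterator)
{code}

Swapping the eager map() call for the lazy itertools.imap() call is the essence 
of the change described above.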



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27041) large partition data cause pyspark with python2.x oom

2019-03-12 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27041.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23954
[https://github.com/apache/spark/pull/23954]

> large partition data cause pyspark with python2.x oom
> -
>
> Key: SPARK-27041
> URL: https://issues.apache.org/jira/browse/SPARK-27041
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: David Yang
>Assignee: David Yang
>Priority: Major
> Fix For: 3.0.0
>
>
> With a large partition, pyspark may exceed the executor memory limit and 
> trigger out-of-memory errors for Python 2.7.
> This is because map() is used. Unlike in Python 3.x, Python 2.7's map() 
> generates a list and needs to read all data into memory.
> The proposed fix will use imap in Python 2.7 and it has been verified.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27125) Add test suite for sql execution page

2019-03-12 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27125.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24052
[https://github.com/apache/spark/pull/24052]

> Add test suite for sql execution page
> -
>
> Key: SPARK-27125
> URL: https://issues.apache.org/jira/browse/SPARK-27125
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add test suite for sql execution page



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27125) Add test suite for sql execution page

2019-03-12 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27125:
-

Assignee: shahid

> Add test suite for sql execution page
> -
>
> Key: SPARK-27125
> URL: https://issues.apache.org/jira/browse/SPARK-27125
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: shahid
>Assignee: shahid
>Priority: Minor
>
> Add test suite for sql execution page



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27107) Spark SQL Job failing because of Kryo buffer overflow with ORC

2019-03-12 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790610#comment-16790610
 ] 

Dhruve Ashar commented on SPARK-27107:
--

Update: The PR was merged into the ORC repository. My understanding is that we 
should update our pom once a new ORC release is cut.

> Spark SQL Job failing because of Kryo buffer overflow with ORC
> --
>
> Key: SPARK-27107
> URL: https://issues.apache.org/jira/browse/SPARK-27107
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Dhruve Ashar
>Priority: Major
>
> The issue occurs while trying to read ORC data and setting the SearchArgument.
> {code:java}
>  Caused by: com.esotericsoftware.kryo.KryoException: Buffer overflow. 
> Available: 0, required: 9
> Serialization trace:
> literalList 
> (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl)
> leaves (org.apache.orc.storage.ql.io.sarg.SearchArgumentImpl)
>   at com.esotericsoftware.kryo.io.Output.require(Output.java:163)
>   at com.esotericsoftware.kryo.io.Output.writeVarLong(Output.java:614)
>   at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:538)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:147)
>   at 
> com.esotericsoftware.kryo.serializers.DefaultSerializers$LongSerializer.write(DefaultSerializers.java:141)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:628)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:100)
>   at 
> com.esotericsoftware.kryo.serializers.CollectionSerializer.write(CollectionSerializer.java:40)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:552)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:80)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:518)
>   at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:534)
>   at 
> org.apache.orc.mapred.OrcInputFormat.setSearchArgument(OrcInputFormat.java:96)
>   at 
> org.apache.orc.mapreduce.OrcInputFormat.setSearchArgument(OrcInputFormat.java:57)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:159)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$1.apply(OrcFileFormat.scala:156)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.buildReaderWithPartitionValues(OrcFileFormat.scala:156)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:297)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:295)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDDs(DataSourceScanExec.scala:315)
>   at 
> org.apache.spark.sql.execution.FilterExec.inputRDDs(basicPhysicalOperators.scala:121)
>   at 
> org.apache.spark.sql.execution.ProjectExec.inputRDDs(basicPhysicalOperators.scala:41)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.python.EvalPythonExec.doExecute(EvalPythonExec.scala:89)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>  

[jira] [Commented] (SPARK-23098) Migrate Kafka batch source to v2

2019-03-12 Thread Dylan Guedes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790585#comment-16790585
 ] 

Dylan Guedes commented on SPARK-23098:
--

Hi,
I would like to work on this one. [~joseph.torres], do you mind helping me 
with a few suggestions if I get really stuck? Also, is this one similar to the 
CSVReader/JSONReader?

> Migrate Kafka batch source to v2
> 
>
> Key: SPARK-23098
> URL: https://issues.apache.org/jira/browse/SPARK-23098
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23450) jars option in spark submit is documented in misleading way

2019-03-12 Thread Sujith Chacko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790558#comment-16790558
 ] 

Sujith Chacko edited comment on SPARK-23450 at 3/12/19 1:43 PM:


As per my understanding, the jar will be distributed to the worker nodes. I 
already tested a UDF scenario where a custom jar is added to the nodes via 
--jars and executed the UDF query, and it's working fine.


was (Author: s71955):
As pr my understanding the jar will be distributed to the worker nodes. Already 
tested a UDF scenario where a custom  jar is added to the nodes via --jars and 
executed the UDF query and its working fine

> jars option in spark submit is documented in misleading way
> ---
>
> Key: SPARK-23450
> URL: https://issues.apache.org/jira/browse/SPARK-23450
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.1
>Reporter: Gregory Reshetniak
>Priority: Major
>
> I am wondering if the {{--jars}} option on spark submit is actually meant for 
> distributing the dependency jars onto the nodes in cluster?
>  
> In my case I can see it working as a "symlink" of sorts. But the 
> documentation is written in the way that suggests otherwise. Please help me 
> figure out if this is a bug or just my reading of the docs. Thanks!
> _
>  
>  
>  
>  
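
As a rough illustration of the scenario described in the comment above: the jar 
path, application file, UDF name, and Java class in the sketch below are 
assumptions for this example (and spark.udf.registerJavaFunction assumes the 
Spark 2.3+ PySpark API); none of these details are taken from the ticket itself.

{code:python}
# Sketch only: the job is assumed to be launched with something like
#   spark-submit --jars /path/to/custom-udf.jar my_app.py
# so that the extra jar is shipped to the driver and to every executor.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jars-option-sketch").getOrCreate()

# Register a Java UDF class that lives only inside the --jars jar; the query
# below can only succeed if that class is visible on the driver (to register
# it) and on the executors (to evaluate it), which is what --jars provides.
spark.udf.registerJavaFunction("my_udf", "com.example.MyUdf")
spark.sql("SELECT my_udf('check') AS result").show()
{code}

If the jar were only placed on the driver and not distributed, the executors 
would fail with a ClassNotFoundException while evaluating the UDF, so a run 
like this is a reasonable way to check the distribution behaviour the docs are 
being asked about.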



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23450) jars option in spark submit is documented in misleading way

2019-03-12 Thread Sujith Chacko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790558#comment-16790558
 ] 

Sujith Chacko commented on SPARK-23450:
---

As pr my understanding the jar will be distributed to the worker nodes. Already 
tested a UDF scenario where a custom  jar is added to the nodes via --jars and 
executed the UDF query and its working fine

> jars option in spark submit is documented in misleading way
> ---
>
> Key: SPARK-23450
> URL: https://issues.apache.org/jira/browse/SPARK-23450
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.2.1
>Reporter: Gregory Reshetniak
>Priority: Major
>
> I am wondering if the {{--jars}} option on spark submit is actually meant for 
> distributing the dependency jars onto the nodes in cluster?
>  
> In my case I can see it working as a "symlink" of sorts. But the 
> documentation is written in the way that suggests otherwise. Please help me 
> figure out if this is a bug or just my reading of the docs. Thanks!
> _
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27136) Remove data source option check_files_exist

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27136:


Assignee: Apache Spark

> Remove data source option check_files_exist
> ---
>
> Key: SPARK-27136
> URL: https://issues.apache.org/jira/browse/SPARK-27136
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> The data source option check_files_exist was introduced in #23383 when the 
> file source V2 framework was implemented. In the PR, FileIndex was created as 
> a member of FileTable, so that we could implement partition pruning like 
> 0f9fcab in the future. For file writes, we needed the option to decide 
> whether to check file existence.
> After https://github.com/apache/spark/pull/23774, the option is not needed 
> anymore.  This PR is to clean the option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27136) Remove data source option check_files_exist

2019-03-12 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-27136:
---
Description: 
The data source option check_files_exist was introduced in 
https://github.com/apache/spark/pull/23383 when the file source V2 framework was 
implemented. In the PR, FileIndex was created as a member of FileTable, so that 
we could implement partition pruning like 0f9fcab in the future. At that time 
FileIndexes will always be created for file writes, so we needed the option to 
decide whether to check file existence.



After https://github.com/apache/spark/pull/23774, the option is not needed 
anymore.  This PR is to clean the option.

  was:
The data source option check_files_exist is introduced in In 
https://github.com/apache/spark/pull/23383 when the file source V2 framework is 
implemented. In the PR, FileIndex was created as a member of FileTable, so that 
we could implement partition pruning like 0f9fcab in the future. For file 
writes, we needed the option to decide whether to check file existence.

After https://github.com/apache/spark/pull/23774, the option is not needed 
anymore.  This PR is to clean the option.


> Remove data source option check_files_exist
> ---
>
> Key: SPARK-27136
> URL: https://issues.apache.org/jira/browse/SPARK-27136
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> The data source option check_files_exist was introduced in 
> https://github.com/apache/spark/pull/23383 when the file source V2 framework 
> was implemented. In the PR, FileIndex was created as a member of FileTable, so 
> that we could implement partition pruning like 0f9fcab in the future. At that 
> time FileIndexes will always be created for file writes, so we needed the 
> option to decide whether to check file existence.
> After https://github.com/apache/spark/pull/23774, the option is not needed 
> anymore.  This PR is to clean the option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27136) Remove data source option check_files_exist

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27136:


Assignee: (was: Apache Spark)

> Remove data source option check_files_exist
> ---
>
> Key: SPARK-27136
> URL: https://issues.apache.org/jira/browse/SPARK-27136
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> The data source option check_files_exist was introduced in #23383 when the 
> file source V2 framework was implemented. In the PR, FileIndex was created as 
> a member of FileTable, so that we could implement partition pruning like 
> 0f9fcab in the future. For file writes, we needed the option to decide 
> whether to check file existence.
> After https://github.com/apache/spark/pull/23774, the option is not needed 
> anymore.  This PR is to clean the option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27136) Remove data source option check_files_exist

2019-03-12 Thread Gengliang Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-27136:
---
Description: 
The data source option check_files_exist was introduced in 
https://github.com/apache/spark/pull/23383 when the file source V2 framework was 
implemented. In the PR, FileIndex was created as a member of FileTable, so that 
we could implement partition pruning like 0f9fcab in the future. For file 
writes, we needed the option to decide whether to check file existence.

After https://github.com/apache/spark/pull/23774, the option is not needed 
anymore.  This PR is to clean the option.

  was:
The data source option check_files_exist is introduced in In #23383 when the 
file source V2 framework is implemented. In the PR, FileIndex was created as a 
member of FileTable, so that we could implement partition pruning like 0f9fcab 
in the future. For file writes, we needed the option to decide whether to check 
file existence.

After https://github.com/apache/spark/pull/23774, the option is not needed 
anymore.  This PR is to clean the option.


> Remove data source option check_files_exist
> ---
>
> Key: SPARK-27136
> URL: https://issues.apache.org/jira/browse/SPARK-27136
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> The data source option check_files_exist was introduced in 
> https://github.com/apache/spark/pull/23383 when the file source V2 framework 
> was implemented. In the PR, FileIndex was created as a member of FileTable, so 
> that we could implement partition pruning like 0f9fcab in the future. For 
> file writes, we needed the option to decide whether to check file existence.
> After https://github.com/apache/spark/pull/23774, the option is not needed 
> anymore.  This PR is to clean the option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27136) Remove data source option check_files_exist

2019-03-12 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-27136:
--

 Summary: Remove data source option check_files_exist
 Key: SPARK-27136
 URL: https://issues.apache.org/jira/browse/SPARK-27136
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


The data source option check_files_exist was introduced in #23383 when the 
file source V2 framework was implemented. In the PR, FileIndex was created as a 
member of FileTable, so that we could implement partition pruning like 0f9fcab 
in the future. For file writes, we needed the option to decide whether to check 
file existence.

After https://github.com/apache/spark/pull/23774, the option is not needed 
anymore.  This PR is to clean the option.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27135) Description column under jobs does not show complete text if there is overflow

2019-03-12 Thread Sandeep Katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Katta updated SPARK-27135:
--
Summary: Description column under jobs does not show complete text if there 
is overflow  (was: Description column under jobs does not complete text if 
there is overflow)

> Description column under jobs does not show complete text if there is overflow
> --
>
> Key: SPARK-27135
> URL: https://issues.apache.org/jira/browse/SPARK-27135
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sandeep Katta
>Priority: Minor
> Attachments: UIIssue.PNG
>
>
> In the Spark Web UI, if the query is long, the user cannot see the complete 
> query details, even on mouse over



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27105) Prevent exponential complexity in ORC `createFilter`

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27105:


Assignee: Apache Spark

> Prevent exponential complexity in ORC `createFilter`  
> --
>
> Key: SPARK-27105
> URL: https://issues.apache.org/jira/browse/SPARK-27105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ivan Vergiliev
>Assignee: Apache Spark
>Priority: Major
>  Labels: performance
>
> `OrcFilters.createFilters` currently has complexity that's exponential in the 
> height of the filter tree. There are multiple places in Spark that try to 
> prevent the generation of skewed trees so as to not trigger this behaviour, 
> for example:
> - `org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary` 
> combines a number of binary logical expressions into a balanced tree.
> - https://github.com/apache/spark/pull/22313 introduced a change to 
> `OrcFilters` to create a balanced tree instead of a skewed tree.
> However, the underlying exponential behaviour can still be triggered by code 
> paths that don't go through any of the tree balancing methods. For example, 
> if one generates a tree of `Column`s directly in user code, there's nothing 
> in Spark that automatically balances that tree and, hence, skewed trees hit 
> the exponential behaviour. We have hit this in production with jobs 
> mysteriously taking hours on the Spark driver with no worker activity, with 
> as few as ~30 OR filters.
> I have a fix locally that makes the underlying logic have linear complexity 
> instead of exponential complexity. With this fix, the code can handle 
> thousands of filters in milliseconds. I'll send a PR with the fix soon.
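
For illustration, user code that hits this today can avoid the skew by combining 
predicates pairwise into a balanced tree instead of folding them left to right. 
This is only a sketch of that idea; `balancedOr` is a hypothetical helper and is 
not part of Spark or of the proposed fix.

{code:scala}
import org.apache.spark.sql.Column

// Build an OR tree of depth O(log n) instead of the depth-n tree
// produced by preds.reduceLeft(_ || _).
def balancedOr(preds: Seq[Column]): Column = {
  require(preds.nonEmpty, "need at least one predicate")
  if (preds.length == 1) {
    preds.head
  } else {
    val (left, right) = preds.splitAt(preds.length / 2)
    balancedOr(left) || balancedOr(right)
  }
}

// Skewed:   preds.reduceLeft(_ || _)
// Balanced: balancedOr(preds)
{code}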



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27105) Prevent exponential complexity in ORC `createFilter`

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27105:


Assignee: (was: Apache Spark)

> Prevent exponential complexity in ORC `createFilter`  
> --
>
> Key: SPARK-27105
> URL: https://issues.apache.org/jira/browse/SPARK-27105
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Ivan Vergiliev
>Priority: Major
>  Labels: performance
>
> `OrcFilters.createFilters` currently has complexity that's exponential in the 
> height of the filter tree. There are multiple places in Spark that try to 
> prevent the generation of skewed trees so as to not trigger this behaviour, 
> for example:
> - `org.apache.spark.sql.catalyst.parser.AstBuilder.visitLogicalBinary` 
> combines a number of binary logical expressions into a balanced tree.
> - https://github.com/apache/spark/pull/22313 introduced a change to 
> `OrcFilters` to create a balanced tree instead of a skewed tree.
> However, the underlying exponential behaviour can still be triggered by code 
> paths that don't go through any of the tree balancing methods. For example, 
> if one generates a tree of `Column`s directly in user code, there's nothing 
> in Spark that automatically balances that tree and, hence, skewed trees hit 
> the exponential behaviour. We have hit this in production with jobs 
> mysteriously taking hours on the Spark driver with no worker activity, with 
> as few as ~30 OR filters.
> I have a fix locally that makes the underlying logic have linear complexity 
> instead of exponential complexity. With this fix, the code can handle 
> thousands of filters in milliseconds. I'll send a PR with the fix soon.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27135) Description column under jobs does not complete text if there is overflow

2019-03-12 Thread Sandeep Katta (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790440#comment-16790440
 ] 

Sandeep Katta commented on SPARK-27135:
---

On hover, the complete query should be shown

> Description column under jobs does not complete text if there is overflow
> -
>
> Key: SPARK-27135
> URL: https://issues.apache.org/jira/browse/SPARK-27135
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sandeep Katta
>Priority: Minor
> Attachments: UIIssue.PNG
>
>
> In the Spark Web UI, if the query is long, the user cannot see the complete 
> query details, even on mouse over



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27135) Description column under jobs does not complete text if there is overflow

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27135:


Assignee: Apache Spark

> Description column under jobs does not complete text if there is overflow
> -
>
> Key: SPARK-27135
> URL: https://issues.apache.org/jira/browse/SPARK-27135
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sandeep Katta
>Assignee: Apache Spark
>Priority: Minor
> Attachments: UIIssue.PNG
>
>
> In the Spark Web UI, if the query is long, the user cannot see the complete 
> query details, even on mouse over



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27135) Description column under jobs does not complete text if there is overflow

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27135:


Assignee: (was: Apache Spark)

> Description column under jobs does not complete text if there is overflow
> -
>
> Key: SPARK-27135
> URL: https://issues.apache.org/jira/browse/SPARK-27135
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sandeep Katta
>Priority: Minor
> Attachments: UIIssue.PNG
>
>
> In the Spark Web UI, if the query is long, the user cannot see the complete 
> query details, even on mouse over



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27135) Description column under jobs does not complete text if there is overflow

2019-03-12 Thread Sandeep Katta (JIRA)
Sandeep Katta created SPARK-27135:
-

 Summary: Description column under jobs does not complete text if 
there is overflow
 Key: SPARK-27135
 URL: https://issues.apache.org/jira/browse/SPARK-27135
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.4.0, 2.3.2
Reporter: Sandeep Katta
 Attachments: UIIssue.PNG

In the Spark Web UI, if the query is long, the user cannot see the complete query 
details, even on mouse over



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27135) Description column under jobs does not complete text if there is overflow

2019-03-12 Thread Sandeep Katta (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sandeep Katta updated SPARK-27135:
--
Attachment: UIIssue.PNG

> Description column under jobs does not complete text if there is overflow
> -
>
> Key: SPARK-27135
> URL: https://issues.apache.org/jira/browse/SPARK-27135
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Sandeep Katta
>Priority: Minor
> Attachments: UIIssue.PNG
>
>
> In the Spark Web UI, if the query is long, the user cannot see the complete 
> query details, even on mouse over



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24486) Slow performance reading ArrayType columns

2019-03-12 Thread Luca Canali (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luca Canali resolved SPARK-24486.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Slow performance reading ArrayType columns
> --
>
> Key: SPARK-24486
> URL: https://issues.apache.org/jira/browse/SPARK-24486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Luca Canali
>Priority: Minor
> Fix For: 3.0.0
>
>
> We have found an issue of slow performance in one of our applications when 
> running on Spark 2.3.0 (the same workload does not have a performance issue 
> on Spark 2.2.1). We suspect a regression in the area of handling columns of 
> ArrayType. I have built a simplified test case showing a manifestation of the 
> issue to help with troubleshooting:
>  
>  
> {code:java}
> // prepare test data
> val stringListValues=Range(1,3).mkString(",")
> sql(s"select 1 as myid, Array($stringListValues) as myarray from 
> range(2)").repartition(1).write.parquet("file:///tmp/deleteme1")
> // run test
> spark.read.parquet("file:///tmp/deleteme1").limit(1).show(){code}
> Performance measurements:
>  
> On a desktop-size test system, the test runs in about 2 sec using Spark 2.2.1 
> (runtime goes down to subsecond in subsequent runs) and takes close to 20 sec 
> on Spark 2.3.0
>  
> Additional drill-down using Spark task metrics data shows that in Spark 2.2.1 
> only 2 records are read by this workload, while on Spark 2.3.0 all rows in 
> the file are read, which appears anomalous.
> Example:
> {code:java}
> bin/spark-shell --master local[*] --driver-memory 2g --packages 
> ch.cern.sparkmeasure:spark-measure_2.11:0.11
> val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark) 
> stageMetrics.runAndMeasure(spark.read.parquet("file:///tmp/deleteme1").limit(1).show())
> {code}
>  
>  
> Selected metrics from Spark 2.3.0 run:
>  
> {noformat}
> elapsedTime => 17849 (18 s)
> sum(numTasks) => 11
> sum(recordsRead) => 2
> sum(bytesRead) => 1136448171 (1083.0 MB){noformat}
>  
>  
> From Spark 2.2.1 run:
>  
> {noformat}
> elapsedTime => 1329 (1 s)
> sum(numTasks) => 2
> sum(recordsRead) => 2
> sum(bytesRead) => 269162610 (256.0 MB)
> {noformat}
>  
> Note: Using Spark built from master (as I write this, June 7th 2018) shows 
> the same behavior as found in Spark 2.3.0
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21385) hive-thriftserver register too many listener in listenerbus

2019-03-12 Thread Ekaterina Shurgalina (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790434#comment-16790434
 ] 

Ekaterina Shurgalina commented on SPARK-21385:
--

Hi [~honestman] !

Is this issue still there? I would like to work on it.

Please, assign it to me.

> hive-thriftserver register too many listener in listenerbus
> ---
>
> Key: SPARK-21385
> URL: https://issues.apache.org/jira/browse/SPARK-21385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.1.0
>Reporter: honestman
>Priority: Minor
>  Labels: easyfix
>
> When spark.sql.hive.thriftServer.singleSession is set to true, 
> SparkSQLSessionManager creates a new session for each connection. While creating 
> the new SqlContext it also creates a new SparkListener and registers it with the 
> listener bus, but the listener is not removed when the Hive session is closed. 
> This causes too many listeners to be registered with the listener bus, which may 
> lead to dropped events.
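
As a rough sketch of the fix pattern only (not the actual SparkSQLSessionManager 
code; `SessionListenerRegistry` and `sessionId` are hypothetical), the listener 
created for a session should be tracked and unregistered when that session closes:

{code:scala}
import scala.collection.mutable

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.SparkListener

object SessionListenerRegistry {
  // Remember which listener belongs to which session so it can be removed later.
  private val sessionListeners = mutable.Map.empty[String, SparkListener]

  def onSessionOpened(sc: SparkContext, sessionId: String): Unit = {
    val listener = new SparkListener {}  // stand-in for the per-session listener
    sc.addSparkListener(listener)        // registers it with the listener bus
    sessionListeners.put(sessionId, listener)
  }

  def onSessionClosed(sc: SparkContext, sessionId: String): Unit = {
    // Without this step, listeners accumulate on the bus for every connection.
    sessionListeners.remove(sessionId).foreach(sc.removeSparkListener)
  }
}
{code}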



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10746) count ( distinct columnref) over () returns wrong result set

2019-03-12 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790428#comment-16790428
 ] 

Izek Greenfield edited comment on SPARK-10746 at 3/12/19 10:41 AM:
---

you can implement that by using: 
{code:scala}
import org.apache.spark.sql.functions._
size(collect_set(column).over(window))
{code}
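
A minimal end-to-end sketch of that workaround, assuming a DataFrame `df` with a 
column named `c1` (the names are illustrative only):

{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, collect_set, size}

// An empty partitionBy mirrors the SQL "over ()": one global window.
val w = Window.partitionBy()

// Adds the number of distinct c1 values over the whole result set to every row.
val withDistinctCount =
  df.withColumn("distinct_c1_count", size(collect_set(col("c1")).over(w)))
{code}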



was (Author: igreenfi):
you can implement that by using: 
{code:java}
// Some comments here
size(collect_set(column).over(window))
{code}


> count ( distinct columnref) over () returns wrong result set
> 
>
> Key: SPARK-10746
> URL: https://issues.apache.org/jira/browse/SPARK-10746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>Priority: Major
>
> Same issue as reported against Hive (HIVE-9534).
> The result set was expected to contain 5 rows instead of 1 row, as other vendors 
> (Oracle, Netezza, etc.) return.
> select count( distinct column) over () from t1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10746) count ( distinct columnref) over () returns wrong result set

2019-03-12 Thread Izek Greenfield (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790428#comment-16790428
 ] 

Izek Greenfield commented on SPARK-10746:
-

you can implement that by using: 
{code:java}
// Some comments here
size(collect_set(column).over(window))
{code}


> count ( distinct columnref) over () returns wrong result set
> 
>
> Key: SPARK-10746
> URL: https://issues.apache.org/jira/browse/SPARK-10746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>Priority: Major
>
> Same issue as reported against Hive (HIVE-9534).
> The result set was expected to contain 5 rows instead of 1 row, as other vendors 
> (Oracle, Netezza, etc.) return.
> select count( distinct column) over () from t1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27134) array_distinct function does not work correctly with columns containing array of array

2019-03-12 Thread Mike Trenaman (JIRA)
Mike Trenaman created SPARK-27134:
-

 Summary: array_distinct function does not work correctly with 
columns containing array of array
 Key: SPARK-27134
 URL: https://issues.apache.org/jira/browse/SPARK-27134
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
 Environment: Spark 2.4, scala 2.11.11
Reporter: Mike Trenaman


The array_distinct function introduced in Spark 2.4 produces strange results when 
used on an array column that contains nested arrays. The resulting output can still 
contain duplicate values, and furthermore, previously distinct values may be removed.

This is easily repeatable, e.g. with this code:

import org.apache.spark.sql.functions.{array_distinct, col}
import spark.implicits._  // for toDF; assumes an active SparkSession named `spark`

val df = Seq(
  Seq(Seq(1, 2), Seq(1, 2), Seq(1, 2), Seq(3, 4), Seq(4, 5))
).toDF("Number_Combinations")

val dfWithDistinct = df.withColumn("distinct_combinations",
  array_distinct(col("Number_Combinations")))

 

The initial 'df' DataFrame contains one row, where column 'Number_Combinations' 
contains the following values:

[[1, 2], [1, 2], [1, 2], [3, 4], [4, 5]]

The array_distinct function run on this column produces a new column containing 
the following values:

[[1, 2], [1, 2], [1, 2]]

As you can see, this contains three occurrences of the same value (1, 2), and 
furthermore, the distinct values (3, 4), (4, 5) have been removed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27132) Improve file source V2 framework

2019-03-12 Thread Gengliang Wang (JIRA)
Gengliang Wang created SPARK-27132:
--

 Summary: Improve file source V2 framework
 Key: SPARK-27132
 URL: https://issues.apache.org/jira/browse/SPARK-27132
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Gengliang Wang


During the migration of CSV to V2, I found that we can improve the file source V2 
framework by:
1. Checking for duplicated column names in both read and write.
2. Removing `SupportsPushDownFilters` from FileScanBuilder, since not all file 
sources support filter push down.
3. Adding a new member `options` to FileScan, since the method `isSplitable` might 
require data source options.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27133) Refactor the REST based spark app management API to follow the new interface

2019-03-12 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27133:

Summary: Refactor the REST based spark app management API to follow the new 
interface  (was: Refactor the spark app management API based on REST to follow 
the new interface)

> Refactor the REST based spark app management API to follow the new interface
> 
>
> Key: SPARK-27133
> URL: https://issues.apache.org/jira/browse/SPARK-27133
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Based on the discussion 
> [here|https://github.com/apache/spark/pull/23599/files#r254864701], we have 
> introduced in that PR a new interface to manage the `kill` and `list` ops for 
> spark apps specifically for k8s. We should refactor the REST based one for 
> mesos and standalone too, to use that interface. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27133) Refactor the spark app management API based on REST to follow the new interface

2019-03-12 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27133:

Description: Based on the discussion 
[here|https://github.com/apache/spark/pull/23599/files#r254864701], we have 
introduced in that PR a new interface to manage the `kill` and `list` ops for 
spark apps specifically for k8s. We should refactor the REST based one for 
mesos and standalone too, to use that interface.   (was: Based on the 
discussion [here|https://github.com/apache/spark/pull/23599/files#r254864701], 
we have introduced in that PR a new interface to manage the `kill` and `list` 
ops for spark apps specifically for k8s. We should refactor the rest based one 
for mesos and standalone too to use that one. )

> Refactor the spark app management API based on REST to follow the new 
> interface
> ---
>
> Key: SPARK-27133
> URL: https://issues.apache.org/jira/browse/SPARK-27133
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Based on the discussion 
> [here|https://github.com/apache/spark/pull/23599/files#r254864701], we have 
> introduced in that PR a new interface to manage the `kill` and `list` ops for 
> spark apps specifically for k8s. We should refactor the REST based one for 
> mesos and standalone too, to use that interface. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27133) Refactor the spark app management API based on REST to follow the new interface

2019-03-12 Thread Stavros Kontopoulos (JIRA)
Stavros Kontopoulos created SPARK-27133:
---

 Summary: Refactor the spark app management API based on REST to 
follow the new interface
 Key: SPARK-27133
 URL: https://issues.apache.org/jira/browse/SPARK-27133
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Stavros Kontopoulos


Based on the discussion 
[here|https://github.com/apache/spark/pull/23599/files#r254864701] we have 
introduced in that PR a new interface to manage the `kill` and `list` for spark 
apps specifically for k8s. We should refactor the rest based one for mesos and 
standalone too to use that one. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27133) Refactor the spark app management API based on REST to follow the new interface

2019-03-12 Thread Stavros Kontopoulos (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stavros Kontopoulos updated SPARK-27133:

Description: Based on the discussion 
[here|https://github.com/apache/spark/pull/23599/files#r254864701], we have 
introduced in that PR a new interface to manage the `kill` and `list` ops for 
spark apps specifically for k8s. We should refactor the rest based one for 
mesos and standalone too to use that one.   (was: Based on the discussion 
[here|https://github.com/apache/spark/pull/23599/files#r254864701] we have 
introduced in that PR a new interface to manage the `kill` and `list` for spark 
apps specifically for k8s. We should refactor the rest based one for mesos and 
standalone too to use that one. )

> Refactor the spark app management API based on REST to follow the new 
> interface
> ---
>
> Key: SPARK-27133
> URL: https://issues.apache.org/jira/browse/SPARK-27133
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Stavros Kontopoulos
>Priority: Major
>
> Based on the discussion 
> [here|https://github.com/apache/spark/pull/23599/files#r254864701], we have 
> introduced in that PR a new interface to manage the `kill` and `list` ops for 
> spark apps specifically for k8s. We should refactor the rest based one for 
> mesos and standalone too to use that one. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27132) Improve file source V2 framework

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27132:


Assignee: (was: Apache Spark)

> Improve file source V2 framework
> 
>
> Key: SPARK-27132
> URL: https://issues.apache.org/jira/browse/SPARK-27132
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> During the migration of CSV to V2, I found that we can improve the file source 
> V2 framework by:
> 1. Checking for duplicated column names in both read and write.
> 2. Removing `SupportsPushDownFilters` from FileScanBuilder, since not all file 
> sources support filter push down.
> 3. Adding a new member `options` to FileScan, since the method `isSplitable` 
> might require data source options.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27132) Improve file source V2 framework

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27132:


Assignee: Apache Spark

> Improve file source V2 framework
> 
>
> Key: SPARK-27132
> URL: https://issues.apache.org/jira/browse/SPARK-27132
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> During the migration of CSV to V2, I found that we can improve the file source 
> V2 framework by:
> 1. Checking for duplicated column names in both read and write.
> 2. Removing `SupportsPushDownFilters` from FileScanBuilder, since not all file 
> sources support filter push down.
> 3. Adding a new member `options` to FileScan, since the method `isSplitable` 
> might require data source options.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24486) Slow performance reading ArrayType columns

2019-03-12 Thread Luca Canali (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16790409#comment-16790409
 ] 

Luca Canali commented on SPARK-24486:
-

Thanks [~yumwang] for looking at this. Indeed, I confirm that using collect 
instead of show is faster. In addition, testing on Spark master (March 12, 2019), 
I see that show works fast there too (I have not yet looked at which PR fixed 
this).
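
For reference, a sketch of the two variants being compared, reusing the test data 
path from the reproduction quoted below:

{code:scala}
// Variant from the original report, slow on Spark 2.3.0:
spark.read.parquet("file:///tmp/deleteme1").limit(1).show()

// Variant confirmed faster in this comment:
spark.read.parquet("file:///tmp/deleteme1").limit(1).collect()
{code}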

> Slow performance reading ArrayType columns
> --
>
> Key: SPARK-24486
> URL: https://issues.apache.org/jira/browse/SPARK-24486
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 2.3.0
>Reporter: Luca Canali
>Priority: Minor
>
> We have found an issue of slow performance in one of our applications when 
> running on Spark 2.3.0 (the same workload does not have a performance issue 
> on Spark 2.2.1). We suspect a regression in the area of handling columns of 
> ArrayType. I have built a simplified test case showing a manifestation of the 
> issue to help with troubleshooting:
>  
>  
> {code:java}
> // prepare test data
> val stringListValues=Range(1,3).mkString(",")
> sql(s"select 1 as myid, Array($stringListValues) as myarray from 
> range(2)").repartition(1).write.parquet("file:///tmp/deleteme1")
> // run test
> spark.read.parquet("file:///tmp/deleteme1").limit(1).show(){code}
> Performance measurements:
>  
> On a desktop-size test system, the test runs in about 2 sec using Spark 2.2.1 
> (runtime goes down to subsecond in subsequent runs) and takes close to 20 sec 
> on Spark 2.3.0
>  
> Additional drill-down using Spark task metrics data shows that in Spark 2.2.1 
> only 2 records are read by this workload, while on Spark 2.3.0 all rows in 
> the file are read, which appears anomalous.
> Example:
> {code:java}
> bin/spark-shell --master local[*] --driver-memory 2g --packages 
> ch.cern.sparkmeasure:spark-measure_2.11:0.11
> val stageMetrics = ch.cern.sparkmeasure.StageMetrics(spark) 
> stageMetrics.runAndMeasure(spark.read.parquet("file:///tmp/deleteme1").limit(1).show())
> {code}
>  
>  
> Selected metrics from Spark 2.3.0 run:
>  
> {noformat}
> elapsedTime => 17849 (18 s)
> sum(numTasks) => 11
> sum(recordsRead) => 2
> sum(bytesRead) => 1136448171 (1083.0 MB){noformat}
>  
>  
> From Spark 2.2.1 run:
>  
> {noformat}
> elapsedTime => 1329 (1 s)
> sum(numTasks) => 2
> sum(recordsRead) => 2
> sum(bytesRead) => 269162610 (256.0 MB)
> {noformat}
>  
> Note: Using Spark built from master (as I write this, June 7th 2018) shows 
> the same behavior as found in Spark 2.3.0
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27131) Merge function in QuantileSummaries

2019-03-12 Thread Mingchao Tan (JIRA)
Mingchao Tan created SPARK-27131:


 Summary: Merge function in QuantileSummaries
 Key: SPARK-27131
 URL: https://issues.apache.org/jira/browse/SPARK-27131
 Project: Spark
  Issue Type: Question
  Components: SQL
Affects Versions: 2.4.0
Reporter: Mingchao Tan


In the QuantileSummaries.scala file, at line 167:

This function merges two QuantileSummaries into one: you merge the two sampled 
arrays and then compress the result. My question is: when compressing the merged 
array, why do you use count instead of the sum of count and other.count?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27130) Automatically select profile when executing sbt-checkstyle

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27130:


Assignee: Apache Spark

> Automatically select profile when executing sbt-checkstyle
> --
>
> Key: SPARK-27130
> URL: https://issues.apache.org/jira/browse/SPARK-27130
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27130) Automatically select profile when executing sbt-checkstyle

2019-03-12 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27130:


Assignee: (was: Apache Spark)

> Automatically select profile when executing sbt-checkstyle
> --
>
> Key: SPARK-27130
> URL: https://issues.apache.org/jira/browse/SPARK-27130
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


