[jira] [Assigned] (SPARK-23662) Support selective tests in SQLQueryTestSuite

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23662:


Assignee: Apache Spark

> Support selective tests in SQLQueryTestSuite
> 
>
> Key: SPARK-23662
> URL: https://issues.apache.org/jira/browse/SPARK-23662
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> This ticket aims to support running selective tests in `SQLQueryTestSuite`, e.g.,
> {code}
> SPARK_SQL_QUERY_TEST_FILTER=limit.sql,random.sql build/sbt "sql/test-only 
> *SQLQueryTestSuite"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23662) Support selective tests in SQLQueryTestSuite

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23662:


Assignee: (was: Apache Spark)

> Support selective tests in SQLQueryTestSuite
> 
>
> Key: SPARK-23662
> URL: https://issues.apache.org/jira/browse/SPARK-23662
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket aims to support running selective tests in `SQLQueryTestSuite`, e.g.,
> {code}
> SPARK_SQL_QUERY_TEST_FILTER=limit.sql,random.sql build/sbt "sql/test-only 
> *SQLQueryTestSuite"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23662) Support selective tests in SQLQueryTestSuite

2018-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396562#comment-16396562
 ] 

Apache Spark commented on SPARK-23662:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/20808

> Support selective tests in SQLQueryTestSuite
> 
>
> Key: SPARK-23662
> URL: https://issues.apache.org/jira/browse/SPARK-23662
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket aims to support running selective tests in `SQLQueryTestSuite`, e.g.,
> {code}
> SPARK_SQL_QUERY_TEST_FILTER=limit.sql,random.sql build/sbt "sql/test-only 
> *SQLQueryTestSuite"
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23662) Support selective tests in SQLQueryTestSuite

2018-03-12 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-23662:


 Summary: Support selective tests in SQLQueryTestSuite
 Key: SPARK-23662
 URL: https://issues.apache.org/jira/browse/SPARK-23662
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Takeshi Yamamuro


This ticket aims to support running selective tests in `SQLQueryTestSuite`, e.g.,
{code}
SPARK_SQL_QUERY_TEST_FILTER=limit.sql,random.sql build/sbt "sql/test-only 
*SQLQueryTestSuite"
{code}
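
For illustration, a minimal sketch of one way such an environment-variable filter could work, using the variable name from the example above; the input directory path and the runTest stub are placeholders rather than the suite's actual code:
{code:scala}
import java.io.File

// Parse the comma-separated filter, if any, into a whitelist of file names.
val filter: Option[Set[String]] =
  sys.env.get("SPARK_SQL_QUERY_TEST_FILTER")
    .map(_.split(",").map(_.trim).filter(_.nonEmpty).toSet)

// No filter set means "run everything".
def shouldRun(name: String): Boolean = filter.forall(_.contains(name))

// Placeholder for the suite's real per-file test runner.
def runTest(f: File): Unit = println(s"would run ${f.getName}")

// Placeholder input directory (the suite keeps its .sql inputs under test resources).
val inputDir = new File("sql/core/src/test/resources/sql-tests/inputs")
Option(inputDir.listFiles()).getOrElse(Array.empty[File])
  .filter(f => f.getName.endsWith(".sql") && shouldRun(f.getName))
  .foreach(runTest)
{code}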




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23659) Spark Job gets stuck during shuffle

2018-03-12 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396553#comment-16396553
 ] 

Yuming Wang commented on SPARK-23659:
-

This may be a duplicate of 
[SPARK-18971|https://issues.apache.org/jira/browse/SPARK-18971]. Please check 
your netty-all version.
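
For reference, a small sketch of how to see which Netty build is actually loaded, assuming a Netty 4 artifact (netty-all or netty-common) is on the classpath; if only a 3.x jar (org.jboss.netty) is present, the io.netty.util.Version class will not resolve at all, which is itself a useful signal:
{code:scala}
import scala.collection.JavaConverters._

// io.netty.util.Version ships with netty-common 4.x and reports the Netty
// artifacts and versions found on the classpath.
io.netty.util.Version.identify().asScala.foreach { case (artifact, version) =>
  println(s"$artifact -> ${version.artifactVersion()}")
}

// Where the Netty classes were loaded from (can be null for some classloaders).
println(classOf[io.netty.buffer.ByteBuf].getProtectionDomain.getCodeSource)
{code}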

> Spark Job gets stuck during shuffle 
> 
>
> Key: SPARK-23659
> URL: https://issues.apache.org/jira/browse/SPARK-23659
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Mohamed Elagamy
>Priority: Major
>
> Hello Team, 
>  
>    I am running a standalone Spark cluster with 42 nodes, each one acting as an 
> executor with 100 GB of memory, and an application on that cluster doing some 
> aggregation. The application gets stuck randomly; below is the thread dump. Any 
> guidelines would be highly appreciated. Thanks, community.
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode):
> "Attach Listener" #464538 daemon prio=9 os_prio=0 tid=0x7f2948002800 
> nid=0x63b8 waiting on condition [0x]
>  java.lang.Thread.State: RUNNABLE
> Locked ownable synchronizers:
>  - None
> "block-manager-slave-async-thread-pool-6374" #464493 daemon prio=5 os_prio=0 
> tid=0x7f28005e6000 nid=0x62ea waiting on condition [0x7f2810de8000]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f2af1ceb2b8> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
>  at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
>  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Locked ownable synchronizers:
>  - None
> "block-manager-slave-async-thread-pool-6373" #464492 daemon prio=5 os_prio=0 
> tid=0x7f2800023000 nid=0x62e9 waiting on condition [0x7f2812efd000]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f2af1ceb2b8> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
>  at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
>  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Locked ownable synchronizers:
>  - None
> "block-manager-slave-async-thread-pool-6372" #464491 daemon prio=5 os_prio=0 
> tid=0x7f28005e7800 nid=0x62e8 waiting on condition [0x7f27c19f]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f2af1ceb2b8> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
>  at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
>  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Locked ownable synchronizers:
>  - None
> "block-manager-slave-async-thread-pool-6371" #464490 daemon prio=5 os_prio=0 
> tid=0x7f28fc007800 nid=0x62e7 waiting on condition [0x7f27c67fc000]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f2af1ceb2b8> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 

[jira] [Updated] (SPARK-23523) Incorrect result caused by the rule OptimizeMetadataOnlyQuery

2018-03-12 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23523:

Fix Version/s: 2.3.1

> Incorrect result caused by the rule OptimizeMetadataOnlyQuery
> -
>
> Key: SPARK-23523
> URL: https://issues.apache.org/jira/browse/SPARK-23523
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> {code:scala}
>  val tablePath = new File(s"${path.getCanonicalPath}/cOl3=c/cOl1=a/cOl5=e")
>  Seq(("a", "b", "c", "d", "e")).toDF("cOl1", "cOl2", "cOl3", "cOl4", "cOl5")
>  .write.json(tablePath.getCanonicalPath)
>  val df = spark.read.json(path.getCanonicalPath).select("CoL1", "CoL5", 
> "CoL3").distinct()
>  df.show()
> {code}
> This returns a wrong result 
> {{[c,e,a]}}
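
A possible stop-gap on affected versions, sketched below on the assumption that the spark.sql.optimizer.metadataOnly SQLConf flag of the 2.x line is available ({{path}} refers to the temporary directory used in the snippet above):
{code:scala}
// Disable the metadata-only optimization so the distinct is computed from the
// data files rather than from the (case-mangled) partition metadata.
spark.conf.set("spark.sql.optimizer.metadataOnly", "false")

val df = spark.read.json(path.getCanonicalPath)
  .select("CoL1", "CoL5", "CoL3").distinct()
df.show()
{code}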



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23660) Yarn throws exception in cluster mode when the application is small

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23660:


Assignee: (was: Apache Spark)

> Yarn throws exception in cluster mode when the application is small
> ---
>
> Key: SPARK-23660
> URL: https://issues.apache.org/jira/browse/SPARK-23660
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> Yarn throws the following exception in cluster mode when the application is 
> really small:
> {code:java}
> 18/03/07 23:34:22 WARN netty.NettyRpcEnv: Ignored failure: 
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7c974942 
> rejected from 
> java.util.concurrent.ScheduledThreadPoolExecutor@1eea9d2d[Terminated, pool 
> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 18/03/07 23:34:22 ERROR yarn.ApplicationMaster: Uncaught exception: 
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.deploy.yarn.YarnAllocator.<init>(YarnAllocator.scala:102)
>   at 
> org.apache.spark.deploy.yarn.YarnRMClient.register(YarnRMClient.scala:77)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:450)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:493)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:810)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:809)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:834)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 17 more
> 18/03/07 23:34:22 INFO yarn.ApplicationMaster: Final app status: FAILED, 
> exitCode: 13, (reason: Uncaught exception: org.apache.spark.SparkException: 
> Exception thrown in awaitResult: )
> {code}
> Example application:
> {code:java}
> object ExampleApp {
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("ExampleApp")
> val sc = new SparkContext(conf)
> try {
>   // Do nothing
> } finally {
>   sc.stop()
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23660) Yarn throws exception in cluster mode when the application is small

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23660:


Assignee: Apache Spark

> Yarn throws exception in cluster mode when the application is small
> ---
>
> Key: SPARK-23660
> URL: https://issues.apache.org/jira/browse/SPARK-23660
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Assignee: Apache Spark
>Priority: Minor
>
> Yarn throws the following exception in cluster mode when the application is 
> really small:
> {code:java}
> 18/03/07 23:34:22 WARN netty.NettyRpcEnv: Ignored failure: 
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7c974942 
> rejected from 
> java.util.concurrent.ScheduledThreadPoolExecutor@1eea9d2d[Terminated, pool 
> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 18/03/07 23:34:22 ERROR yarn.ApplicationMaster: Uncaught exception: 
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.deploy.yarn.YarnAllocator.<init>(YarnAllocator.scala:102)
>   at 
> org.apache.spark.deploy.yarn.YarnRMClient.register(YarnRMClient.scala:77)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:450)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:493)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:810)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:809)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:834)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 17 more
> 18/03/07 23:34:22 INFO yarn.ApplicationMaster: Final app status: FAILED, 
> exitCode: 13, (reason: Uncaught exception: org.apache.spark.SparkException: 
> Exception thrown in awaitResult: )
> {code}
> Example application:
> {code:java}
> object ExampleApp {
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("ExampleApp")
> val sc = new SparkContext(conf)
> try {
>   // Do nothing
> } finally {
>   sc.stop()
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23660) Yarn throws exception in cluster mode when the application is small

2018-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396541#comment-16396541
 ] 

Apache Spark commented on SPARK-23660:
--

User 'gaborgsomogyi' has created a pull request for this issue:
https://github.com/apache/spark/pull/20807

> Yarn throws exception in cluster mode when the application is small
> ---
>
> Key: SPARK-23660
> URL: https://issues.apache.org/jira/browse/SPARK-23660
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> Yarn throws the following exception in cluster mode when the application is 
> really small:
> {code:java}
> 18/03/07 23:34:22 WARN netty.NettyRpcEnv: Ignored failure: 
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7c974942 
> rejected from 
> java.util.concurrent.ScheduledThreadPoolExecutor@1eea9d2d[Terminated, pool 
> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 18/03/07 23:34:22 ERROR yarn.ApplicationMaster: Uncaught exception: 
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.deploy.yarn.YarnAllocator.<init>(YarnAllocator.scala:102)
>   at 
> org.apache.spark.deploy.yarn.YarnRMClient.register(YarnRMClient.scala:77)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:450)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:493)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:810)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:809)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:834)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 17 more
> 18/03/07 23:34:22 INFO yarn.ApplicationMaster: Final app status: FAILED, 
> exitCode: 13, (reason: Uncaught exception: org.apache.spark.SparkException: 
> Exception thrown in awaitResult: )
> {code}
> Example application:
> {code:java}
> object ExampleApp {
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("ExampleApp")
> val sc = new SparkContext(conf)
> try {
>   // Do nothing
> } finally {
>   sc.stop()
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23661) Implement treeAggregate on Dataset API

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23661:


Assignee: (was: Apache Spark)

> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-23661
> URL: https://issues.apache.org/jira/browse/SPARK-23661
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Many algorithms in MLlib have still not migrated their internal computing 
> workload from {{RDD}} to {{DataFrame}}. {{treeAggregate}} is one of the 
> obstacles we need to address in order to complete the migration.
> This ticket is opened to provide {{treeAggregate}} on the Dataset API. For now 
> it should be a private API used by the ML component.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23661) Implement treeAggregate on Dataset API

2018-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396538#comment-16396538
 ] 

Apache Spark commented on SPARK-23661:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/20806

> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-23661
> URL: https://issues.apache.org/jira/browse/SPARK-23661
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Priority: Major
>
> Many algorithms in MLlib have still not migrated their internal computing 
> workload from {{RDD}} to {{DataFrame}}. {{treeAggregate}} is one of the 
> obstacles we need to address in order to complete the migration.
> This ticket is opened to provide {{treeAggregate}} on the Dataset API. For now 
> it should be a private API used by the ML component.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23661) Implement treeAggregate on Dataset API

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23661:


Assignee: Apache Spark

> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-23661
> URL: https://issues.apache.org/jira/browse/SPARK-23661
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Many algorithms in MLlib have still not migrated their internal computing 
> workload from {{RDD}} to {{DataFrame}}. {{treeAggregate}} is one of the 
> obstacles we need to address in order to complete the migration.
> This ticket is opened to provide {{treeAggregate}} on the Dataset API. For now 
> it should be a private API used by the ML component.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23655) Add support for type aclitem (PostgresDialect)

2018-03-12 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396537#comment-16396537
 ] 

Takeshi Yamamuro commented on SPARK-23655:
--

Is it not feasible to cast `aclitem` to a string (or another supported type) on the 
pg side, and then load it into Spark? I feel there is little value in supporting 
this type natively in Spark.
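
A sketch of that suggested workaround for the JDBC source; the connection options are placeholders, and datacl is the aclitem[] column of pg_database:
{code:scala}
// Cast the aclitem[] column to text inside a pushed-down subquery so the JDBC
// source only ever sees types it already supports.
val acl = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/postgres")
  .option("user", "postgres")
  .option("dbtable", "(SELECT datname, datacl::text AS datacl FROM pg_database) AS t")
  .load()

acl.printSchema()  // datacl now arrives as a plain string column
{code}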

> Add support for type aclitem (PostgresDialect)
> --
>
> Key: SPARK-23655
> URL: https://issues.apache.org/jira/browse/SPARK-23655
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Diego da Silva Colombo
>Priority: Major
>
> When I try to load the data of pg_database, an exception occurs:
> `java.lang.RuntimeException: java.sql.SQLException: Unsupported type 2003`
> It happens because the typeName of the column is *aclitem*, and there is no 
> match case for this type in toCatalystType.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23659) Spark Job gets stuck during shuffle

2018-03-12 Thread Mohamed Elagamy (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396502#comment-16396502
 ] 

Mohamed Elagamy commented on SPARK-23659:
-

That's what's on my classpath: netty-3.10.6.Final.jar. But I thought that Spark 
stopped using Netty and switched to its own RPC layer starting from 1.6.

> Spark Job gets stuck during shuffle 
> 
>
> Key: SPARK-23659
> URL: https://issues.apache.org/jira/browse/SPARK-23659
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Mohamed Elagamy
>Priority: Major
>
> Hello Team, 
>  
>    I am running a standalone Spark cluster with 42 nodes, each one acting as an 
> executor with 100 GB of memory, and an application on that cluster doing some 
> aggregation. The application gets stuck randomly; below is the thread dump. Any 
> guidelines would be highly appreciated. Thanks, community.
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode):
> "Attach Listener" #464538 daemon prio=9 os_prio=0 tid=0x7f2948002800 
> nid=0x63b8 waiting on condition [0x]
>  java.lang.Thread.State: RUNNABLE
> Locked ownable synchronizers:
>  - None
> "block-manager-slave-async-thread-pool-6374" #464493 daemon prio=5 os_prio=0 
> tid=0x7f28005e6000 nid=0x62ea waiting on condition [0x7f2810de8000]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f2af1ceb2b8> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
>  at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
>  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Locked ownable synchronizers:
>  - None
> "block-manager-slave-async-thread-pool-6373" #464492 daemon prio=5 os_prio=0 
> tid=0x7f2800023000 nid=0x62e9 waiting on condition [0x7f2812efd000]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f2af1ceb2b8> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
>  at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
>  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Locked ownable synchronizers:
>  - None
> "block-manager-slave-async-thread-pool-6372" #464491 daemon prio=5 os_prio=0 
> tid=0x7f28005e7800 nid=0x62e8 waiting on condition [0x7f27c19f]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f2af1ceb2b8> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
>  at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
>  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Locked ownable synchronizers:
>  - None
> "block-manager-slave-async-thread-pool-6371" #464490 daemon prio=5 os_prio=0 
> tid=0x7f28fc007800 nid=0x62e7 waiting on condition [0x7f27c67fc000]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f2af1ceb2b8> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at 

[jira] [Created] (SPARK-23661) Implement treeAggregate on Dataset API

2018-03-12 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-23661:
---

 Summary: Implement treeAggregate on Dataset API
 Key: SPARK-23661
 URL: https://issues.apache.org/jira/browse/SPARK-23661
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.0
Reporter: Liang-Chi Hsieh


Many algorithms in MLlib have still not migrated their internal computing 
workload from {{RDD}} to {{DataFrame}}. {{treeAggregate}} is one of the 
obstacles we need to address in order to complete the migration.

This ticket is opened to provide {{treeAggregate}} on the Dataset API. For now it 
should be a private API used by the ML component.
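
For context, a minimal example of the existing RDD-side API that MLlib relies on today and that a Dataset-side {{treeAggregate}} would have to mirror:
{code:scala}
// Partial aggregates are merged in `depth` rounds of a tree instead of being
// sent straight to the driver, which avoids a driver-side bottleneck when
// there are many partitions.
val rdd = spark.sparkContext.parallelize(1L to 1000000L, numSlices = 100)

val sum = rdd.treeAggregate(0L)(
  (acc, x) => acc + x,  // seqOp: fold values within a partition
  (a, b) => a + b,      // combOp: merge partial results tree-wise
  depth = 2)
{code}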



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23660) Yarn throws exception in cluster mode when the application is small

2018-03-12 Thread Gabor Somogyi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396388#comment-16396388
 ] 

Gabor Somogyi commented on SPARK-23660:
---

I'm working on it.

> Yarn throws exception in cluster mode when the application is small
> ---
>
> Key: SPARK-23660
> URL: https://issues.apache.org/jira/browse/SPARK-23660
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Gabor Somogyi
>Priority: Minor
>
> Yarn throws the following exception in cluster mode when the application is 
> really small:
> {code:java}
> 18/03/07 23:34:22 WARN netty.NettyRpcEnv: Ignored failure: 
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7c974942 
> rejected from 
> java.util.concurrent.ScheduledThreadPoolExecutor@1eea9d2d[Terminated, pool 
> size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
> 18/03/07 23:34:22 ERROR yarn.ApplicationMaster: Uncaught exception: 
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
>   at 
> org.apache.spark.deploy.yarn.YarnAllocator.<init>(YarnAllocator.scala:102)
>   at 
> org.apache.spark.deploy.yarn.YarnRMClient.register(YarnRMClient.scala:77)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:450)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:493)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:810)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:809)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:834)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already 
> stopped.
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
>   at 
> org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>   at 
> org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>   at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
>   ... 17 more
> 18/03/07 23:34:22 INFO yarn.ApplicationMaster: Final app status: FAILED, 
> exitCode: 13, (reason: Uncaught exception: org.apache.spark.SparkException: 
> Exception thrown in awaitResult: )
> {code}
> Example application:
> {code:java}
> object ExampleApp {
>   def main(args: Array[String]): Unit = {
> val conf = new SparkConf().setAppName("ExampleApp")
> val sc = new SparkContext(conf)
> try {
>   // Do nothing
> } finally {
>   sc.stop()
> }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23660) Yarn throws exception in cluster mode when the application is small

2018-03-12 Thread Gabor Somogyi (JIRA)
Gabor Somogyi created SPARK-23660:
-

 Summary: Yarn throws exception in cluster mode when the 
application is small
 Key: SPARK-23660
 URL: https://issues.apache.org/jira/browse/SPARK-23660
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 2.3.0
Reporter: Gabor Somogyi


Yarn throws the following exception in cluster mode when the application is 
really small:
{code:java}
18/03/07 23:34:22 WARN netty.NettyRpcEnv: Ignored failure: 
java.util.concurrent.RejectedExecutionException: Task 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7c974942 
rejected from 
java.util.concurrent.ScheduledThreadPoolExecutor@1eea9d2d[Terminated, pool size 
= 0, active threads = 0, queued tasks = 0, completed tasks = 0]
18/03/07 23:34:22 ERROR yarn.ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Exception thrown in awaitResult: 
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:76)
at 
org.apache.spark.deploy.yarn.YarnAllocator.<init>(YarnAllocator.scala:102)
at 
org.apache.spark.deploy.yarn.YarnRMClient.register(YarnRMClient.scala:77)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:450)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:493)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:345)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply$mcV$sp(ApplicationMaster.scala:260)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$2.apply(ApplicationMaster.scala:260)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$5.run(ApplicationMaster.scala:810)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:809)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:259)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:834)
at 
org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
at 
org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:158)
at 
org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
at 
org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:91)
... 17 more
18/03/07 23:34:22 INFO yarn.ApplicationMaster: Final app status: FAILED, 
exitCode: 13, (reason: Uncaught exception: org.apache.spark.SparkException: 
Exception thrown in awaitResult: )
{code}
Example application:
{code:java}
object ExampleApp {
  def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("ExampleApp")
val sc = new SparkContext(conf)
try {
  // Do nothing
} finally {
  sc.stop()
}
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20525) ClassCast exception when interpreting UDFs from a String in spark-shell

2018-03-12 Thread UFO (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396372#comment-16396372
 ] 

UFO commented on SPARK-20525:
-

I have also run into the same problem. Have you solved it?

> ClassCast exception when interpreting UDFs from a String in spark-shell
> ---
>
> Key: SPARK-20525
> URL: https://issues.apache.org/jira/browse/SPARK-20525
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Spark Shell
>Affects Versions: 2.1.0
> Environment: OS X 10.11.6, spark-2.1.0-bin-hadoop2.7, Scala version 
> 2.11.8 (bundled w/ Spark), Java 1.8.0_121
>Reporter: Dave Knoester
>Priority: Major
> Attachments: UdfTest.scala
>
>
> I'm trying to interpret a string containing Scala code from inside a Spark 
> session. Everything is working fine, except for User Defined Function-like 
> things (UDFs, map, flatMap, etc).  This is a blocker for production launch of 
> a large number of Spark jobs.
> I've been able to boil the problem down to a number of spark-shell examples, 
> shown below.  Because it's reproducible in the spark-shell, these related 
> issues **don't apply**:
> https://issues.apache.org/jira/browse/SPARK-9219
> https://issues.apache.org/jira/browse/SPARK-18075
> https://issues.apache.org/jira/browse/SPARK-19938
> http://apache-spark-developers-list.1001551.n3.nabble.com/This-Exception-has-been-really-hard-to-trace-td19362.html
> https://community.mapr.com/thread/21488-spark-error-scalacollectionseq-in-instance-of-orgapachesparkrddmappartitionsrdd
> https://github.com/scala/bug/issues/9237
> Any help is appreciated!
> 
> Repro: 
> Run each of the below from a spark-shell.  
> Preamble:
> import scala.tools.nsc.GenericRunnerSettings
> import scala.tools.nsc.interpreter.IMain
> val settings = new GenericRunnerSettings( println _ )
> settings.usejavacp.value = true
> val interpreter = new IMain(settings, new java.io.PrintWriter(System.out))
> interpreter.bind("spark", spark);
> These work:
> // works:
> interpreter.interpret("val x = 5")
> // works:
> interpreter.interpret("import spark.implicits._\nval df = 
> spark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.show")
> These do not work:
> // doesn't work, fails with seq/RDD serialization error:
> interpreter.interpret("import org.apache.spark.sql.functions._\nimport 
> spark.implicits._\nval upper: String => String = _.toUpperCase\nval upperUDF 
> = 
> udf(upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\",
>  upperUDF($\"value\")).show")
> // doesn't work, fails with seq/RDD serialization error:
> interpreter.interpret("import org.apache.spark.sql.functions._\nimport 
> spark.implicits._\nval upper: String => String = 
> _.toUpperCase\nspark.udf.register(\"myUpper\", 
> upper)\nspark.sparkContext.parallelize(Seq(\"foo\",\"bar\")).toDF.withColumn(\"UPPER\",
>  callUDF(\"myUpper\", ($\"value\"))).show")
> The not-working ones fail with this exception:
> Caused by: java.lang.ClassCastException: cannot assign instance of 
> scala.collection.immutable.List$SerializationProxy to field 
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type 
> scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
>   at 
> java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
>   at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2237)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2231)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2155)
>   at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2013)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
>   at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
>   at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
>   at org.apache.spark.scheduler.Task.run(Task.scala:99)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)



--
This message was sent by 

[jira] [Commented] (SPARK-23659) Spark Job gets stuck during shuffle

2018-03-12 Thread Yuming Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396332#comment-16396332
 ] 

Yuming Wang commented on SPARK-23659:
-

Can you please check your {{netty-all}} version?

> Spark Job gets stuck during shuffle 
> 
>
> Key: SPARK-23659
> URL: https://issues.apache.org/jira/browse/SPARK-23659
> Project: Spark
>  Issue Type: Question
>  Components: Input/Output
>Affects Versions: 2.2.0
>Reporter: Mohamed Elagamy
>Priority: Major
>
> Hello Team, 
>  
>    I am running a standalone Spark cluster with 42 nodes, each one acting as an 
> executor with 100 GB of memory, and an application on that cluster doing some 
> aggregation. The application gets stuck randomly; below is the thread dump. Any 
> guidelines would be highly appreciated. Thanks, community.
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode):
> "Attach Listener" #464538 daemon prio=9 os_prio=0 tid=0x7f2948002800 
> nid=0x63b8 waiting on condition [0x]
>  java.lang.Thread.State: RUNNABLE
> Locked ownable synchronizers:
>  - None
> "block-manager-slave-async-thread-pool-6374" #464493 daemon prio=5 os_prio=0 
> tid=0x7f28005e6000 nid=0x62ea waiting on condition [0x7f2810de8000]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f2af1ceb2b8> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
>  at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
>  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Locked ownable synchronizers:
>  - None
> "block-manager-slave-async-thread-pool-6373" #464492 daemon prio=5 os_prio=0 
> tid=0x7f2800023000 nid=0x62e9 waiting on condition [0x7f2812efd000]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f2af1ceb2b8> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
>  at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
>  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Locked ownable synchronizers:
>  - None
> "block-manager-slave-async-thread-pool-6372" #464491 daemon prio=5 os_prio=0 
> tid=0x7f28005e7800 nid=0x62e8 waiting on condition [0x7f27c19f]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f2af1ceb2b8> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
>  at 
> java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
>  at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
>  at 
> java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Locked ownable synchronizers:
>  - None
> "block-manager-slave-async-thread-pool-6371" #464490 daemon prio=5 os_prio=0 
> tid=0x7f28fc007800 nid=0x62e7 waiting on condition [0x7f27c67fc000]
>  java.lang.Thread.State: TIMED_WAITING (parking)
>  at sun.misc.Unsafe.park(Native Method)
>  - parking to wait for <0x7f2af1ceb2b8> (a 
> java.util.concurrent.SynchronousQueue$TransferStack)
>  at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>  at 
> 

[jira] [Commented] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2018-03-12 Thread Gesly George (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396281#comment-16396281
 ] 

Gesly George commented on SPARK-4502:
-

What are the chances that this will make it into 2.4.0? For many of our use cases 
involving nested Parquet, this would be a huge improvement.

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 
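
A possible mitigation until nested-field pruning is implemented, sketched on the assumption that supplying an explicitly pruned schema narrows the Parquet requested schema on your version (worth verifying against the bytes-read metric); the path and field type below are illustrative:
{code:scala}
import org.apache.spark.sql.types._

// Keep only the single nested leaf we actually need.
val prunedSchema = StructType(Seq(
  StructField("User", StructType(Seq(
    StructField("contributors_enabled", BooleanType))))))

val tweets = spark.read.schema(prunedSchema).parquet("/path/to/tweets")
tweets.select("User.contributors_enabled").show()
{code}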



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23659) Spark Job gets stuck during shuffle

2018-03-12 Thread Mohamed Elagamy (JIRA)
Mohamed Elagamy created SPARK-23659:
---

 Summary: Spark Job gets stuck during shuffle 
 Key: SPARK-23659
 URL: https://issues.apache.org/jira/browse/SPARK-23659
 Project: Spark
  Issue Type: Question
  Components: Input/Output
Affects Versions: 2.2.0
Reporter: Mohamed Elagamy


Hello Team, 

   I am running a standalone Spark cluster with 42 nodes, each one acting as an 
executor with 100 GB of memory, and an application on that cluster doing some 
aggregation. The application gets stuck randomly; below is the thread dump. Any 
guidelines would be highly appreciated. Thanks, community.



Full thread dump Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode):

"Attach Listener" #464538 daemon prio=9 os_prio=0 tid=0x7f2948002800 
nid=0x63b8 waiting on condition [0x]
 java.lang.Thread.State: RUNNABLE

Locked ownable synchronizers:
 - None

"block-manager-slave-async-thread-pool-6374" #464493 daemon prio=5 os_prio=0 
tid=0x7f28005e6000 nid=0x62ea waiting on condition [0x7f2810de8000]
 java.lang.Thread.State: TIMED_WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x7f2af1ceb2b8> (a 
java.util.concurrent.SynchronousQueue$TransferStack)
 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 at 
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
 at 
java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
 at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
 at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

Locked ownable synchronizers:
 - None

"block-manager-slave-async-thread-pool-6373" #464492 daemon prio=5 os_prio=0 
tid=0x7f2800023000 nid=0x62e9 waiting on condition [0x7f2812efd000]
 java.lang.Thread.State: TIMED_WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x7f2af1ceb2b8> (a 
java.util.concurrent.SynchronousQueue$TransferStack)
 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 at 
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
 at 
java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
 at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
 at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

Locked ownable synchronizers:
 - None

"block-manager-slave-async-thread-pool-6372" #464491 daemon prio=5 os_prio=0 
tid=0x7f28005e7800 nid=0x62e8 waiting on condition [0x7f27c19f]
 java.lang.Thread.State: TIMED_WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x7f2af1ceb2b8> (a 
java.util.concurrent.SynchronousQueue$TransferStack)
 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 at 
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
 at 
java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
 at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
 at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

Locked ownable synchronizers:
 - None

"block-manager-slave-async-thread-pool-6371" #464490 daemon prio=5 os_prio=0 
tid=0x7f28fc007800 nid=0x62e7 waiting on condition [0x7f27c67fc000]
 java.lang.Thread.State: TIMED_WAITING (parking)
 at sun.misc.Unsafe.park(Native Method)
 - parking to wait for <0x7f2af1ceb2b8> (a 
java.util.concurrent.SynchronousQueue$TransferStack)
 at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
 at 
java.util.concurrent.SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:460)
 at 
java.util.concurrent.SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:362)
 at java.util.concurrent.SynchronousQueue.poll(SynchronousQueue.java:941)
 at 
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1066)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
 at 

[jira] [Assigned] (SPARK-21479) Outer join filter pushdown in null supplying table when condition is on one of the joined columns

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21479:


Assignee: (was: Apache Spark)

> Outer join filter pushdown in null supplying table when condition is on one 
> of the joined columns
> -
>
> Key: SPARK-21479
> URL: https://issues.apache.org/jira/browse/SPARK-21479
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Abhijit Bhole
>Priority: Major
>
> Here are two different query plans - 
> {code:java}
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 
> 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("b = 2").explain()
> == Physical Plan ==
> *Project [a#16299L, b#16295L, c#16300L]
> +- *SortMergeJoin [a#16294L], [a#16299L], Inner
>:- *Sort [a#16294L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(a#16294L, 4)
>: +- *Filter ((isnotnull(b#16295L) && (b#16295L = 2)) && 
> isnotnull(a#16294L))
>:+- Scan ExistingRDD[a#16294L,b#16295L]
>+- *Sort [a#16299L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(a#16299L, 4)
>  +- *Filter isnotnull(a#16299L)
> +- Scan ExistingRDD[a#16299L,c#16300L]
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 
> 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("a = 1").explain()
> == Physical Plan ==
> *Project [a#16314L, b#16310L, c#16315L]
> +- SortMergeJoin [a#16309L], [a#16314L], RightOuter
>:- *Sort [a#16309L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(a#16309L, 4)
>: +- Scan ExistingRDD[a#16309L,b#16310L]
>+- *Sort [a#16314L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(a#16314L, 4)
>  +- *Filter (isnotnull(a#16314L) && (a#16314L = 1))
> +- Scan ExistingRDD[a#16314L,c#16315L]
> {code}
> If the condition on b can be pushed down to df1, why can't the condition on a?
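
As a hedged illustration of the point above (a sketch, not the ticket's proposed fix): manually pre-filtering the null-supplying side is semantically safe for this particular query and gets the predicate down to that scan. The sketch assumes the same data as the example, rewritten with the Scala API in a spark-shell session.
{code}
// Sketch only: `spark` is the spark-shell SparkSession; data mirrors the example above.
import org.apache.spark.sql.functions.col
import spark.implicits._

val df1 = Seq((1L, 2L), (3L, 4L)).toDF("a", "b")
val df2 = Seq((1L, 5L), (3L, 6L), (5L, 8L)).toDF("a", "c")

// Push the filter on `a` to df1 by hand before the right outer join. Rows of df1 with
// a != 1 can never survive the final WHERE, so the result is unchanged.
val joined = df1.filter(col("a") === 1)
  .join(df2, Seq("a"), "right_outer")
  .where("a = 1")
joined.explain()   // the Filter now appears on the df1 side as well
{code}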



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21479) Outer join filter pushdown in null supplying table when condition is on one of the joined columns

2018-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396235#comment-16396235
 ] 

Apache Spark commented on SPARK-21479:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/20805

> Outer join filter pushdown in null supplying table when condition is on one 
> of the joined columns
> -
>
> Key: SPARK-21479
> URL: https://issues.apache.org/jira/browse/SPARK-21479
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Abhijit Bhole
>Priority: Major
>
> Here are two different query plans - 
> {code:java}
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 
> 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("b = 2").explain()
> == Physical Plan ==
> *Project [a#16299L, b#16295L, c#16300L]
> +- *SortMergeJoin [a#16294L], [a#16299L], Inner
>:- *Sort [a#16294L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(a#16294L, 4)
>: +- *Filter ((isnotnull(b#16295L) && (b#16295L = 2)) && 
> isnotnull(a#16294L))
>:+- Scan ExistingRDD[a#16294L,b#16295L]
>+- *Sort [a#16299L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(a#16299L, 4)
>  +- *Filter isnotnull(a#16299L)
> +- Scan ExistingRDD[a#16299L,c#16300L]
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 
> 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("a = 1").explain()
> == Physical Plan ==
> *Project [a#16314L, b#16310L, c#16315L]
> +- SortMergeJoin [a#16309L], [a#16314L], RightOuter
>:- *Sort [a#16309L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(a#16309L, 4)
>: +- Scan ExistingRDD[a#16309L,b#16310L]
>+- *Sort [a#16314L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(a#16314L, 4)
>  +- *Filter (isnotnull(a#16314L) && (a#16314L = 1))
> +- Scan ExistingRDD[a#16314L,c#16315L]
> {code}
> If the condition on b can be pushed down to df1, why can't the condition on a?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21479) Outer join filter pushdown in null supplying table when condition is on one of the joined columns

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21479:


Assignee: Apache Spark

> Outer join filter pushdown in null supplying table when condition is on one 
> of the joined columns
> -
>
> Key: SPARK-21479
> URL: https://issues.apache.org/jira/browse/SPARK-21479
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.0, 2.1.1, 2.2.0
>Reporter: Abhijit Bhole
>Assignee: Apache Spark
>Priority: Major
>
> Here are two different query plans - 
> {code:java}
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 
> 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("b = 2").explain()
> == Physical Plan ==
> *Project [a#16299L, b#16295L, c#16300L]
> +- *SortMergeJoin [a#16294L], [a#16299L], Inner
>:- *Sort [a#16294L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(a#16294L, 4)
>: +- *Filter ((isnotnull(b#16295L) && (b#16295L = 2)) && 
> isnotnull(a#16294L))
>:+- Scan ExistingRDD[a#16294L,b#16295L]
>+- *Sort [a#16299L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(a#16299L, 4)
>  +- *Filter isnotnull(a#16299L)
> +- Scan ExistingRDD[a#16299L,c#16300L]
> df1 = spark.createDataFrame([{ "a": 1, "b" : 2}, { "a": 3, "b" : 4}])
> df2 = spark.createDataFrame([{ "a": 1, "c" : 5}, { "a": 3, "c" : 6}, { "a": 
> 5, "c" : 8}])
> df1.join(df2, ['a'], 'right_outer').where("a = 1").explain()
> == Physical Plan ==
> *Project [a#16314L, b#16310L, c#16315L]
> +- SortMergeJoin [a#16309L], [a#16314L], RightOuter
>:- *Sort [a#16309L ASC NULLS FIRST], false, 0
>:  +- Exchange hashpartitioning(a#16309L, 4)
>: +- Scan ExistingRDD[a#16309L,b#16310L]
>+- *Sort [a#16314L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(a#16314L, 4)
>  +- *Filter (isnotnull(a#16314L) && (a#16314L = 1))
> +- Scan ExistingRDD[a#16314L,c#16315L]
> {code}
> If the condition on b can be pushed down to df1, why can't the condition on a?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23607) Use HDFS extended attributes to store application summary to improve the Spark History Server performance

2018-03-12 Thread Ye Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396193#comment-16396193
 ] 

Ye Zhou commented on SPARK-23607:
-

[~vanzin] Cool. I will post a PR soon. Thanks.

> Use HDFS extended attributes to store application summary to improve the 
> Spark History Server performance
> -
>
> Key: SPARK-23607
> URL: https://issues.apache.org/jira/browse/SPARK-23607
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 2.3.0
>Reporter: Ye Zhou
>Priority: Minor
>
> Currently in the Spark History Server, the checkForLogs thread creates replay 
> tasks for log files whose size has changed. The replay task filters out most 
> of the log file content and keeps only the application summary, including 
> applicationId, user, attemptACL, start time, and end time. The application 
> summary is then written into listing.ldb and serves the application list on 
> the SHS home page. For a long-running application, the log file whose name 
> ends with "inprogress" is replayed multiple times to obtain this summary. 
> This wastes compute and data-reading resources on the SHS and delays 
> applications showing up on the home page. Internally we have a patch that 
> uses HDFS extended attributes to speed up retrieval of the application 
> summary in the SHS. With this patch, the driver writes the application 
> summary into extended attributes as key/value pairs. The SHS first tries to 
> read from the extended attributes; if that fails, it falls back to reading 
> the log file content as usual. The feature can be enabled/disabled through 
> configuration.
> It has been running fine internally for 4 months with this patch, and the 
> last-updated timestamp on the SHS stays within 1 minute, matching the 
> configured interval of 1 minute. Originally the delay could be as long as 30 
> minutes at our scale, where we have a large number of Spark applications 
> running per day.
> We want to see whether this kind of approach is acceptable to the community. 
> Please comment; if so, I will post a pull request for the changes. Thanks.
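
For readers unfamiliar with the mechanism, a minimal sketch of the xattr round trip described above (the attribute name and JSON payload are illustrative assumptions, not the patch's actual format):
{code}
// Hedged sketch only; real Hadoop FileSystem APIs, invented attribute name and payload.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs  = FileSystem.get(new Configuration())
val log = new Path("/spark-history/application_1520812800000_0001.inprogress")

// Driver side: write the summary once as a key/value extended attribute.
fs.setXAttr(log, "user.spark.appSummary",
  """{"appId":"application_1520812800000_0001","user":"alice","startTime":1520812800000}"""
    .getBytes("UTF-8"))

// History server side: try the cheap xattr read first, fall back to replaying the log.
val summary: Option[String] =
  try Some(new String(fs.getXAttr(log, "user.spark.appSummary"), "UTF-8"))
  catch { case _: Exception => None }  // attribute absent => replay the event log as before
{code}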



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23658) InProcessAppHandle uses the wrong class in getLogger

2018-03-12 Thread Sahil Takiar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated SPARK-23658:
-
Description: {{InProcessAppHandle}} uses {{ChildProcAppHandle}} as the 
class in {{getLogger}}, it should just use its own name.  (was: 
{{AbstractAppHandle}} and {{InProcessAppHandle}} use {{ChildProcAppHandle}} as 
the class in {{getLogger}}, they should just use their own names.)

> InProcessAppHandle uses the wrong class in getLogger
> 
>
> Key: SPARK-23658
> URL: https://issues.apache.org/jira/browse/SPARK-23658
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Minor
>
> {{InProcessAppHandle}} uses {{ChildProcAppHandle}} as the class in 
> {{getLogger}}, it should just use its own name.
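
A minimal sketch of the shape of the fix (the actual launcher code is Java; java.util.logging is assumed here purely for illustration):
{code}
// Hedged sketch: each handle should obtain its logger under its own class name.
import java.util.logging.Logger

class InProcessAppHandle {
  // previously this was effectively Logger.getLogger(classOf[ChildProcAppHandle].getName)
  private val LOG = Logger.getLogger(classOf[InProcessAppHandle].getName)
}
{code}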



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23658) InProcessAppHandle uses the wrong class in getLogger

2018-03-12 Thread Sahil Takiar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sahil Takiar updated SPARK-23658:
-
Summary: InProcessAppHandle uses the wrong class in getLogger  (was: 
InProcessAppHandle and AbstractAppHandle use wrong class in getLogger)

> InProcessAppHandle uses the wrong class in getLogger
> 
>
> Key: SPARK-23658
> URL: https://issues.apache.org/jira/browse/SPARK-23658
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Minor
>
> {{AbstractAppHandle}} and {{InProcessAppHandle}} use {{ChildProcAppHandle}} 
> as the class in {{getLogger}}, they should just use their own names.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23658) InProcessAppHandle and AbstractAppHandle use wrong class in getLogger

2018-03-12 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396164#comment-16396164
 ] 

Marcelo Vanzin commented on SPARK-23658:


I already fixed {{AbstractAppHandle}} but missed {{InProcessAppHandle}}.

> InProcessAppHandle and AbstractAppHandle use wrong class in getLogger
> -
>
> Key: SPARK-23658
> URL: https://issues.apache.org/jira/browse/SPARK-23658
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0
>Reporter: Sahil Takiar
>Priority: Minor
>
> {{AbstractAppHandle}} and {{InProcessAppHandle}} use {{ChildProcAppHandle}} 
> as the class in {{getLogger}}, they should just use their own names.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23658) InProcessAppHandle and AbstractAppHandle use wrong class in getLogger

2018-03-12 Thread Sahil Takiar (JIRA)
Sahil Takiar created SPARK-23658:


 Summary: InProcessAppHandle and AbstractAppHandle use wrong class 
in getLogger
 Key: SPARK-23658
 URL: https://issues.apache.org/jira/browse/SPARK-23658
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.3.0
Reporter: Sahil Takiar


{{AbstractAppHandle}} and {{InProcessAppHandle}} use {{ChildProcAppHandle}} as 
the class in {{getLogger}}, they should just use their own names.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22366) Support ignoreMissingFiles flag parallel to ignoreCorruptFiles

2018-03-12 Thread Shridhar Ramachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396102#comment-16396102
 ] 

Shridhar Ramachandran commented on SPARK-22366:
---

Thanks for this, it's a useful feature. I don't see any documentation for it 
in the public docs (I checked the configuration page as well as the SQL page). 
Could you please update that?

> Support ignoreMissingFiles flag parallel to ignoreCorruptFiles
> --
>
> Key: SPARK-22366
> URL: https://issues.apache.org/jira/browse/SPARK-22366
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Jose Torres
>Assignee: Jose Torres
>Priority: Minor
> Fix For: 2.3.0
>
>
> There's an existing flag "spark.sql.files.ignoreCorruptFiles" that will 
> quietly ignore attempted reads from files that have been corrupted, but it 
> still allows the query to fail on missing files. Being able to ignore missing 
> files too is useful in some replication scenarios.
> We should add a "spark.sql.files.ignoreMissingFiles" flag to fill out the 
> functionality.
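
A hedged usage sketch of the two flags discussed above (Scala, assuming a spark-shell session where {{spark}} is the SparkSession):
{code}
// Both settings default to false; enabling them trades completeness for robustness.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")  // the flag added by this ticket

// Files that disappeared (or were corrupted) between listing and reading are now
// skipped instead of failing the query.
val df = spark.read.parquet("/data/events/")
df.count()
{code}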



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23550) Cleanup unused / redundant methods in Utils object

2018-03-12 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23550.

   Resolution: Fixed
 Assignee: Marcelo Vanzin
Fix Version/s: 2.4.0

> Cleanup unused / redundant methods in Utils object
> --
>
> Key: SPARK-23550
> URL: https://issues.apache.org/jira/browse/SPARK-23550
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Trivial
> Fix For: 2.4.0
>
>
> While looking at some code in {{Utils}} for a different purpose, I noticed a 
> bunch of code there that can be removed or otherwise cleaned up.
> I'll send a PR after I run unit tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23576) SparkSQL - Decimal data missing decimal point

2018-03-12 Thread Henry Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395803#comment-16395803
 ] 

Henry Robinson commented on SPARK-23576:


Do you have a smaller repro, or does it only reproduce if you create all three 
tables? 

> SparkSQL - Decimal data missing decimal point
> -
>
> Key: SPARK-23576
> URL: https://issues.apache.org/jira/browse/SPARK-23576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
> Environment: spark 2.3.0
> linux
>Reporter: R
>Priority: Major
>
> Integers like 3 stored as a decimal display in Spark SQL as 300 with 
> no decimal point, but Hive displays them correctly as 3.
> Repro steps:
>  # Create a .csv with the value 3
>  # Use Spark to read the csv, cast it as decimal(31,8) and output to an ORC 
> file
>  # Use Spark to read the ORC, infer the schema (it will infer 38,18 
> precision) and output to a Parquet file
>  # Create an external Hive table to read the Parquet (define the Hive type as 
> decimal(31,8))
>  # Use spark-sql to select from the external Hive table.
>  # Notice how Spark SQL shows 300!!!
>  
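
A hedged sketch of those repro steps in a spark-shell (paths are illustrative; the external Hive table of steps 4-6 is left as a comment):
{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Steps 1-2: read the one-value CSV and write it out as decimal(31,8) ORC.
val csv = spark.read.csv("/tmp/three.csv")                 // single column _c0 = "3"
csv.select(col("_c0").cast(DecimalType(31, 8)).as("v"))
   .write.mode("overwrite").orc("/tmp/three_orc")

// Step 3: read the ORC back (the report says the schema comes out as decimal(38,18))
// and write it as Parquet.
spark.read.orc("/tmp/three_orc")
     .write.mode("overwrite").parquet("/tmp/three_parquet")

// Steps 4-6: create an external Hive table over /tmp/three_parquet declared as
// decimal(31,8), then SELECT it from spark-sql and compare the output with Hive's.
{code}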



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23412) Add cosine distance measure to BisectingKMeans

2018-03-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-23412:
-

Assignee: Marco Gaido

> Add cosine distance measure to BisectingKMeans
> --
>
> Key: SPARK-23412
> URL: https://issues.apache.org/jira/browse/SPARK-23412
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-22119 introduced cosine distance for KMeans.
> This ticket is to support the cosine distance measure on BisectingKMeans too.
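
A hedged usage sketch of the new option (the setter follows the KMeans precedent from SPARK-22119; verify the exact API against the 2.4.0 release):
{code}
import org.apache.spark.ml.clustering.BisectingKMeans
import org.apache.spark.ml.linalg.Vectors

// Toy data with the default "features" vector column (spark-shell assumed).
val dataset = spark.createDataFrame(Seq(
  (Vectors.dense(1.0, 0.0), 1), (Vectors.dense(0.9, 0.1), 2),
  (Vectors.dense(0.0, 1.0), 3), (Vectors.dense(0.1, 0.9), 4)
)).toDF("features", "id")

val bkm = new BisectingKMeans()
  .setK(2)
  .setDistanceMeasure("cosine")   // previously only Euclidean distance was supported
val model = bkm.fit(dataset)
model.clusterCenters.foreach(println)
{code}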



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23412) Add cosine distance measure to BisectingKMeans

2018-03-12 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23412.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20600
[https://github.com/apache/spark/pull/20600]

> Add cosine distance measure to BisectingKMeans
> --
>
> Key: SPARK-23412
> URL: https://issues.apache.org/jira/browse/SPARK-23412
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> SPARK-22119 introduced cosine distance for KMeans.
> This ticket is to support the cosine distance measure on BisectingKMeans too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23656) Assertion in XXH64Suite.testKnownByteArrayInputs() is not performed on big endian platform

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23656:


Assignee: Apache Spark

> Assertion in XXH64Suite.testKnownByteArrayInputs() is not performed on big 
> endian platform
> --
>
> Key: SPARK-23656
> URL: https://issues.apache.org/jira/browse/SPARK-23656
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Assignee: Apache Spark
>Priority: Minor
>
> {{XXH64Suite.testKnownByteArrayInputs()}} performs assertions only on 
> little-endian platforms; it does not perform them on big-endian platforms.
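
For context, a hedged sketch of the pattern at issue, a byte-order guard around the assertions (the real suite is Java and its details may differ):
{code}
import java.nio.ByteOrder

val computed = 42L  // stand-in for an XXH64 hash value computed by the test

// The guard means the assertion is silently skipped on big-endian machines, so the
// test "passes" there without actually checking anything.
if (ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN) {
  assert(computed == 42L)
}
{code}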



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23656) Assertion in XXH64Suite.testKnownByteArrayInputs() is not performed on big endian platform

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23656:


Assignee: (was: Apache Spark)

> Assertion in XXH64Suite.testKnownByteArrayInputs() is not performed on big 
> endian platform
> --
>
> Key: SPARK-23656
> URL: https://issues.apache.org/jira/browse/SPARK-23656
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> {{XXH64Suite.testKnownByteArrayInputs()}} performs assertions only on 
> little-endian platforms; it does not perform them on big-endian platforms.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23656) Assertion in XXH64Suite.testKnownByteArrayInputs() is not performed on big endian platform

2018-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395783#comment-16395783
 ] 

Apache Spark commented on SPARK-23656:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/20804

> Assertion in XXH64Suite.testKnownByteArrayInputs() is not performed on big 
> endian platform
> --
>
> Key: SPARK-23656
> URL: https://issues.apache.org/jira/browse/SPARK-23656
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> {{XXH64Suite.testKnownByteArrayInputs()}} performs assertions only on 
> little-endian platforms; it does not perform them on big-endian platforms.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.

2018-03-12 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395752#comment-16395752
 ] 

Ryan Blue commented on SPARK-23325:
---

I created SPARK-23657 to track this in parallel. Feel free to comment on what 
needs to be done there.

> DataSourceV2 readers should always produce InternalRow.
> ---
>
> Key: SPARK-23325
> URL: https://issues.apache.org/jira/browse/SPARK-23325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 row-oriented implementations are limited to producing either 
> {{Row}} instances or {{UnsafeRow}} instances by implementing 
> {{SupportsScanUnsafeRow}}. Instead, I think that implementations should 
> always produce {{InternalRow}}.
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither 
> one is appropriate for implementers.
> File formats don't produce {{Row}} instances or the data values used by 
> {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation 
> that uses {{Row}} instances must produce data that is immediately translated 
> from the representation that was just produced by Spark. In my experience, it 
> made little sense to translate a timestamp in microseconds to a 
> (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass 
> that instance to Spark for immediate translation back.
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is 
> already held in memory. Even the Parquet support built into Spark 
> deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce 
> unsafe rows. When I went to build an implementation that deserializes Parquet 
> or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be 
> done without first deserializing into memory because the size of an array 
> must be known before any values are written.
> I ended up deciding to deserialize to {{InternalRow}} and use 
> {{UnsafeProjection}} to convert to unsafe. There are two problems with this: 
> first, this is Scala and was difficult to call from Java (it required 
> reflection), and second, this causes double projection in the physical plan 
> (a copy for unsafe to unsafe) if there is a projection that wasn't fully 
> pushed to the data source.
> I think the solution is to have a single interface for readers that expects 
> {{InternalRow}}. Then, a projection should be added in the Spark plan to 
> convert to unsafe and avoid projection in the plan and in the data source. If 
> the data source already produces unsafe rows by deserializing directly, this 
> still minimizes the number of copies because the unsafe projection will check 
> whether the incoming data is already {{UnsafeRow}}.
> Using {{InternalRow}} would also match the interface on the write side.
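
A hedged sketch of the pattern argued for above, producing an {{InternalRow}} and converting with {{UnsafeProjection}} (these are internal, unstable APIs; see SPARK-23657):
{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String

val schema = StructType(Seq(StructField("id", LongType), StructField("name", StringType)))

// A reader emits rows like this directly, with no java.sql.* conversions.
val row: InternalRow = InternalRow(1L, UTF8String.fromString("spark"))

// If the engine needs UnsafeRow, a single projection handles the conversion.
val toUnsafe = UnsafeProjection.create(schema)
val unsafeRow = toUnsafe(row)
{code}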



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23657) Document InternalRow and expose it as a stable interface

2018-03-12 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-23657:
-

 Summary: Document InternalRow and expose it as a stable interface
 Key: SPARK-23657
 URL: https://issues.apache.org/jira/browse/SPARK-23657
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Ryan Blue
 Fix For: 2.4.0


The new DataSourceV2 API needs to stabilize the {{InternalRow}} interface so 
that it can be used by new data source implementations. It already exposes 
{{UnsafeRow}} for reads and {{InternalRow}} for writes, and the representations 
are unlikely to change so this is primarily documentation work.

For more discussion, see SPARK-23325.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23656) Assertion in XXH64Suite is not performed on big endian platform

2018-03-12 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-23656:
-
Description: {{XXH64Suite.testKnownByteArrayInputs()}} performs assertions 
only on little endian platform, it did not perform them on big endian platform. 
 (was: {{XXH64Suite.testKnownByteArrayInputs()} performs assertions only on 
little endian platform, it did not perform them on big endian platform.)

> Assertion in XXH64Suite is not performed on big endian platform
> ---
>
> Key: SPARK-23656
> URL: https://issues.apache.org/jira/browse/SPARK-23656
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> {{XXH64Suite.testKnownByteArrayInputs()}} performs assertions only on 
> little-endian platforms; it does not perform them on big-endian platforms.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23656) Assertion in XXH64Suite.testKnownByteArrayInputs() is not performed on big endian platform

2018-03-12 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-23656:
-
Summary: Assertion in XXH64Suite.testKnownByteArrayInputs() is not 
performed on big endian platform  (was: Assertion in XXH64Suite is not 
performed on big endian platform)

> Assertion in XXH64Suite.testKnownByteArrayInputs() is not performed on big 
> endian platform
> --
>
> Key: SPARK-23656
> URL: https://issues.apache.org/jira/browse/SPARK-23656
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: Kazuaki Ishizaki
>Priority: Minor
>
> {{XXH64Suite.testKnownByteArrayInputs()}} performs assertions only on 
> little-endian platforms; it does not perform them on big-endian platforms.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23656) Assertion in XXH64Suite is not performed on big endian platform

2018-03-12 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-23656:


 Summary: Assertion in XXH64Suite is not performed on big endian 
platform
 Key: SPARK-23656
 URL: https://issues.apache.org/jira/browse/SPARK-23656
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 2.3.0
Reporter: Kazuaki Ishizaki


{{XXH64Suite.testKnownByteArrayInputs()}} performs assertions only on 
little-endian platforms; it does not perform them on big-endian platforms.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23618) docker-image-tool.sh Fails While Building Image

2018-03-12 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395686#comment-16395686
 ] 

Anirudh Ramanathan commented on SPARK-23618:


[~felixcheung], just merged this PR. But I'm unable to add an assignee to the 
JIRA. Am I missing some permissions?

> docker-image-tool.sh Fails While Building Image
> ---
>
> Key: SPARK-23618
> URL: https://issues.apache.org/jira/browse/SPARK-23618
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ninad Ingole
>Priority: Major
>
> I am trying to build the Kubernetes image for version 2.3.0 using 
> {code:java}
> ./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build
> {code}
> which gives me a docker build error:
> {code:java}
> "docker build" requires exactly 1 argument.
> See 'docker build --help'.
> Usage: docker build [OPTIONS] PATH | URL | - [flags]
> Build an image from a Dockerfile
> {code}
>  
> I am executing the command from within the Spark distribution directory. Please 
> let me know what the issue is.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23618) docker-image-tool.sh Fails While Building Image

2018-03-12 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan resolved SPARK-23618.

Resolution: Fixed

> docker-image-tool.sh Fails While Building Image
> ---
>
> Key: SPARK-23618
> URL: https://issues.apache.org/jira/browse/SPARK-23618
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Ninad Ingole
>Priority: Major
>
> I am trying to build the Kubernetes image for version 2.3.0 using 
> {code:java}
> ./bin/docker-image-tool.sh -r ninadingole/spark-docker -t v2.3.0 build
> {code}
> which gives me a docker build error:
> {code:java}
> "docker build" requires exactly 1 argument.
> See 'docker build --help'.
> Usage: docker build [OPTIONS] PATH | URL | - [flags]
> Build an image from a Dockerfile
> {code}
>  
> I am executing the command from within the Spark distribution directory. Please 
> let me know what the issue is.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23325) DataSourceV2 readers should always produce InternalRow.

2018-03-12 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395655#comment-16395655
 ] 

Ryan Blue commented on SPARK-23325:
---

I agree that the binary format is more work and probably out of scope – that's 
more reason to document `InternalRow`.

> DataSourceV2 readers should always produce InternalRow.
> ---
>
> Key: SPARK-23325
> URL: https://issues.apache.org/jira/browse/SPARK-23325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceV2 row-oriented implementations are limited to producing either 
> {{Row}} instances or {{UnsafeRow}} instances by implementing 
> {{SupportsScanUnsafeRow}}. Instead, I think that implementations should 
> always produce {{InternalRow}}.
> The problem with the choice between {{Row}} and {{UnsafeRow}} is that neither 
> one is appropriate for implementers.
> File formats don't produce {{Row}} instances or the data values used by 
> {{Row}}, like {{java.sql.Timestamp}} and {{java.sql.Date}}. An implementation 
> that uses {{Row}} instances must produce data that is immediately translated 
> from the representation that was just produced by Spark. In my experience, it 
> made little sense to translate a timestamp in microseconds to a 
> (milliseconds, nanoseconds) pair, create a {{Timestamp}} instance, and pass 
> that instance to Spark for immediate translation back.
> On the other hand, {{UnsafeRow}} is very difficult to produce unless data is 
> already held in memory. Even the Parquet support built into Spark 
> deserializes to {{InternalRow}} and then uses {{UnsafeProjection}} to produce 
> unsafe rows. When I went to build an implementation that deserializes Parquet 
> or Avro directly to {{UnsafeRow}} (I tried both), I found that it couldn't be 
> done without first deserializing into memory because the size of an array 
> must be known before any values are written.
> I ended up deciding to deserialize to {{InternalRow}} and use 
> {{UnsafeProjection}} to convert to unsafe. There are two problems with this: 
> first, this is Scala and was difficult to call from Java (it required 
> reflection), and second, this causes double projection in the physical plan 
> (a copy for unsafe to unsafe) if there is a projection that wasn't fully 
> pushed to the data source.
> I think the solution is to have a single interface for readers that expects 
> {{InternalRow}}. Then, a projection should be added in the Spark plan to 
> convert to unsafe and avoid projection in the plan and in the data source. If 
> the data source already produces unsafe rows by deserializing directly, this 
> still minimizes the number of copies because the unsafe projection will check 
> whether the incoming data is already {{UnsafeRow}}.
> Using {{InternalRow}} would also match the interface on the write side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23655) Add support for type aclitem (PostgresDialect)

2018-03-12 Thread Diego da Silva Colombo (JIRA)
Diego da Silva Colombo created SPARK-23655:
--

 Summary: Add support for type aclitem (PostgresDialect)
 Key: SPARK-23655
 URL: https://issues.apache.org/jira/browse/SPARK-23655
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Diego da Silva Colombo


When I try to load the data of pg_database, an exception occurs:

`java.lang.RuntimeException: java.sql.SQLException: Unsupported type 2003`

This happens because the typeName of the column is *aclitem*, and there is no 
match case for this type in toCatalystType.
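
Until a proper mapping exists, a hedged sketch of a client-side workaround: register a custom dialect that maps the unrecognised type name to a Catalyst string (type names taken from the report; verify against your driver, and note that the ACL semantics are lost):
{code}
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}
import org.apache.spark.sql.types.{DataType, MetadataBuilder, StringType}

object PgAclItemDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:postgresql")

  // Consulted before the built-in PostgresDialect; fall through (None) for other types.
  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] =
    if (typeName == "aclitem" || typeName == "_aclitem") Some(StringType) else None
}

JdbcDialects.registerDialect(PgAclItemDialect)
{code}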



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23655) Add support for type aclitem (PostgresDialect)

2018-03-12 Thread Diego da Silva Colombo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Diego da Silva Colombo updated SPARK-23655:
---
Description: 
When I try to load the data of pg_database, an exception occurs:

`java.lang.RuntimeException: java.sql.SQLException: Unsupported type 2003`

It's happens because the typeName of the column is *aclitem,* and there is no 
match case for thist type on toCatalystType

  was:
When I try to load the data of pg_database, an exception occurs:

`java.lang.RuntimeException: java.sql.SQLException: Unsupported type 2003`

It's happen because the typeName of the column is *aclitem,* and there is no 
match case for thist type on toCatalystType


> Add support for type aclitem (PostgresDialect)
> --
>
> Key: SPARK-23655
> URL: https://issues.apache.org/jira/browse/SPARK-23655
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Diego da Silva Colombo
>Priority: Major
>
> When I try to load the data of pg_database, an exception occurs:
> `java.lang.RuntimeException: java.sql.SQLException: Unsupported type 2003`
> This happens because the typeName of the column is *aclitem*, and there is no 
> match case for this type in toCatalystType.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23654) cut jets3t as a dependency of spark-core; exclude it from hadoop-cloud module as incompatible

2018-03-12 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-23654:
---
Summary: cut jets3t as a dependency of spark-core; exclude it from 
hadoop-cloud module as incompatible  (was: cut jets3t as a dependency of 
spark-core)

> cut jets3t as a dependency of spark-core; exclude it from hadoop-cloud module 
> as incompatible
> -
>
> Key: SPARK-23654
> URL: https://issues.apache.org/jira/browse/SPARK-23654
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Spark core declares a dependency on JetS3t, which pulls in other cruft.
> # The hadoop-cloud module pulls in the hadoop-aws module with the 
> jets3t-compatible connectors and the relevant dependencies: the spark-core 
> dependency is incomplete if that module isn't built, and superfluous or 
> inconsistent if it is.
> # We've cut out s3n/s3 and all dependencies on jets3t entirely from Hadoop 
> 3.x in favour of connectors we're willing to maintain.
> JetS3t was wonderful when it came out, but now the Amazon SDKs massively 
> exceed it in functionality, albeit at the expense of week-to-week stability 
> and JAR binary compatibility.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23654) cut jets3t as a dependency of spark-core

2018-03-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395451#comment-16395451
 ] 

Steve Loughran commented on SPARK-23654:


SPARK-22634 highlights that the spark-hadoop-cloud module can't include jets3t 
and expect Spark to be consistent w.r.t. Bouncy Castle.

Since you can't get s3n/s3 to work right now anyway, excluding jets3t from the 
hadoop-cloud module will at least stop people from trying to use it.

> cut jets3t as a dependency of spark-core
> 
>
> Key: SPARK-23654
> URL: https://issues.apache.org/jira/browse/SPARK-23654
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Steve Loughran
>Priority: Minor
>
> Spark core declares a dependency on JetS3t, which pulls in other cruft.
> # The hadoop-cloud module pulls in the hadoop-aws module with the 
> jets3t-compatible connectors and the relevant dependencies: the spark-core 
> dependency is incomplete if that module isn't built, and superfluous or 
> inconsistent if it is.
> # We've cut out s3n/s3 and all dependencies on jets3t entirely from Hadoop 
> 3.x in favour of connectors we're willing to maintain.
> JetS3t was wonderful when it came out, but now the Amazon SDKs massively 
> exceed it in functionality, albeit at the expense of week-to-week stability 
> and JAR binary compatibility.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23652) Verify error when using ASF s3:// connector. & Jetty 0.9.4

2018-03-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395449#comment-16395449
 ] 

Steve Loughran commented on SPARK-23652:


This stack trace is just HADOOP-11086, so tagging as a duplicate. It comes 
from updating jets3t, which was done in SPARK-22634. Filed SPARK-23654 to make 
this problem go away completely.
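
As a hedged aside for anyone hitting this: the snippet in the report sets {{fs.s3a.*}} keys but then loads an {{s3://}} URL, which goes through the jets3t-backed connector. A sketch of the S3A route in the same spark-shell session (requires hadoop-aws and a matching AWS SDK on the classpath):
{code}
sc.hadoopConfiguration.set("fs.s3a.access.key", "<access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<secret-key>")

val weekly = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("delimiter", ",")
  .load("s3a://usr_bucket/data/file.csv")   // s3a://, not s3://, so S3A is actually used
weekly.show()
{code}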

> Verify error when using ASF s3:// connector. & Jetty 0.9.4
> --
>
> Key: SPARK-23652
> URL: https://issues.apache.org/jira/browse/SPARK-23652
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell, Spark Submit
>Affects Versions: 1.6.0
>Reporter: Abhishek Shrivastava
>Priority: Minor
>
> In below spark-shell I am trying to connect to S3 and load file to create 
> dataframe:
>  
> {{spark-shell --packages com.databricks:spark-csv_2.10:1.5.0 scala> val 
> sqlContext = new org.apache.spark.sql.SQLContext(sc) scala> 
> sc.hadoopConfiguration.set("fs.s3a.access.key", "") scala> 
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "") scala> val weekly = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("delimiter", ",").load("s3://usr_bucket/data/file.csv") scala> 
> print(weekly) scala> weekly.show()}}
>  
>  
> {{Error:}}
> {{java.lang.VerifyError: Bad type on operand stack Exception Details: 
> Location: 
> org/apache/hadoop/fs/s3/Jets3tFileSystemStore.initialize(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)V
>  @43: invokespecial Reason: Type 'org/jets3t/service/security/AWSCredentials' 
> (current frame, stack[3]) is not assignable to 
> 'org/jets3t/service/security/ProviderCredentials' Current Frame: bci: @43 
> flags: \{ } locals: \{ 'org/apache/hadoop/fs/s3/Jets3tFileSystemStore', 
> 'java/net/URI', 'org/apache/hadoop/conf/Configuration', 
> 'org/apache/hadoop/fs/s3/S3Credentials', 
> 'org/jets3t/service/security/AWSCredentials' } stack: \{ 
> 'org/apache/hadoop/fs/s3/Jets3tFileSystemStore', uninitialized 37, 
> uninitialized 37, 'org/jets3t/service/security/AWSCredentials' } Bytecode: 
> 000: 2a2c b500 02bb 0003 59b7 0004 4e2d 2b2c 010: b600 05bb 0006 592d 
> b600 072d b600 08b7 020: 0009 3a04 2abb 000a 5919 04b7 000b b500 030: 
> 0ca7 0023 3a04 1904 b600 0ec1 000f 9900 040: 0c19 04b6 000e c000 0fbf 
> bb00 1059 1904 050: b700 11bf 2abb 0012 592b b600 13b7 0014 060: b500 
> 152a 2c12 1611 1000 b600 17b5 0018 070: b1 Exception Handler Table: bci 
> [19, 49] => handler: 52 Stackmap Table: 
> full_frame(@52,\{Object[#194],Object[#195],Object[#196],Object[#197]},\{Object[#198]})
>  append_frame(@74,Object[#198]) chop_frame(@84,1) at 
> org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:119)
>  at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:109) at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2816) at 
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98) at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2853) at 
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2835) at 
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387) at 
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
>  at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229) 
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315) 
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1307) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at 
> 

[jira] [Updated] (SPARK-23652) Verify error when using ASF s3:// connector. & Jetty 0.9.4

2018-03-12 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-23652:
---
Priority: Minor  (was: Critical)

> Verify error when using ASF s3:// connector. & Jetty 0.9.4
> --
>
> Key: SPARK-23652
> URL: https://issues.apache.org/jira/browse/SPARK-23652
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell, Spark Submit
>Affects Versions: 1.6.0
>Reporter: Abhishek Shrivastava
>Priority: Minor
>
> In below spark-shell I am trying to connect to S3 and load file to create 
> dataframe:
>  
> {{spark-shell --packages com.databricks:spark-csv_2.10:1.5.0 scala> val 
> sqlContext = new org.apache.spark.sql.SQLContext(sc) scala> 
> sc.hadoopConfiguration.set("fs.s3a.access.key", "") scala> 
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "") scala> val weekly = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("delimiter", ",").load("s3://usr_bucket/data/file.csv") scala> 
> print(weekly) scala> weekly.show()}}
>  
>  
> {{Error:}}
> {{java.lang.VerifyError: Bad type on operand stack Exception Details: 
> Location: 
> org/apache/hadoop/fs/s3/Jets3tFileSystemStore.initialize(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)V
>  @43: invokespecial Reason: Type 'org/jets3t/service/security/AWSCredentials' 
> (current frame, stack[3]) is not assignable to 
> 'org/jets3t/service/security/ProviderCredentials' Current Frame: bci: @43 
> flags: \{ } locals: \{ 'org/apache/hadoop/fs/s3/Jets3tFileSystemStore', 
> 'java/net/URI', 'org/apache/hadoop/conf/Configuration', 
> 'org/apache/hadoop/fs/s3/S3Credentials', 
> 'org/jets3t/service/security/AWSCredentials' } stack: \{ 
> 'org/apache/hadoop/fs/s3/Jets3tFileSystemStore', uninitialized 37, 
> uninitialized 37, 'org/jets3t/service/security/AWSCredentials' } Bytecode: 
> 000: 2a2c b500 02bb 0003 59b7 0004 4e2d 2b2c 010: b600 05bb 0006 592d 
> b600 072d b600 08b7 020: 0009 3a04 2abb 000a 5919 04b7 000b b500 030: 
> 0ca7 0023 3a04 1904 b600 0ec1 000f 9900 040: 0c19 04b6 000e c000 0fbf 
> bb00 1059 1904 050: b700 11bf 2abb 0012 592b b600 13b7 0014 060: b500 
> 152a 2c12 1611 1000 b600 17b5 0018 070: b1 Exception Handler Table: bci 
> [19, 49] => handler: 52 Stackmap Table: 
> full_frame(@52,\{Object[#194],Object[#195],Object[#196],Object[#197]},\{Object[#198]})
>  append_frame(@74,Object[#198]) chop_frame(@84,1) at 
> org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:119)
>  at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:109) at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2816) at 
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98) at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2853) at 
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2835) at 
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387) at 
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
>  at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229) 
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315) 
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1307) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at 
> org.apache.spark.rdd.RDD.take(RDD.scala:1302) at 
> org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1342) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> 

[jira] [Updated] (SPARK-23652) Verify error when using ASF s3:// connector. & Jetty 0.9.4

2018-03-12 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-23652:
---
Summary: Verify error when using ASF s3:// connector. & Jetty 0.9.4  (was: 
Verify error when using ASF s3:// connector.)

> Verify error when using ASF s3:// connector. & Jetty 0.9.4
> --
>
> Key: SPARK-23652
> URL: https://issues.apache.org/jira/browse/SPARK-23652
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell, Spark Submit
>Affects Versions: 1.6.0
>Reporter: Abhishek Shrivastava
>Priority: Critical
>
> In below spark-shell I am trying to connect to S3 and load file to create 
> dataframe:
>  
> {{spark-shell --packages com.databricks:spark-csv_2.10:1.5.0 scala> val 
> sqlContext = new org.apache.spark.sql.SQLContext(sc) scala> 
> sc.hadoopConfiguration.set("fs.s3a.access.key", "") scala> 
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "") scala> val weekly = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("delimiter", ",").load("s3://usr_bucket/data/file.csv") scala> 
> print(weekly) scala> weekly.show()}}
>  
>  
> {{Error:}}
> {{java.lang.VerifyError: Bad type on operand stack Exception Details: 
> Location: 
> org/apache/hadoop/fs/s3/Jets3tFileSystemStore.initialize(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)V
>  @43: invokespecial Reason: Type 'org/jets3t/service/security/AWSCredentials' 
> (current frame, stack[3]) is not assignable to 
> 'org/jets3t/service/security/ProviderCredentials' Current Frame: bci: @43 
> flags: \{ } locals: \{ 'org/apache/hadoop/fs/s3/Jets3tFileSystemStore', 
> 'java/net/URI', 'org/apache/hadoop/conf/Configuration', 
> 'org/apache/hadoop/fs/s3/S3Credentials', 
> 'org/jets3t/service/security/AWSCredentials' } stack: \{ 
> 'org/apache/hadoop/fs/s3/Jets3tFileSystemStore', uninitialized 37, 
> uninitialized 37, 'org/jets3t/service/security/AWSCredentials' } Bytecode: 
> 000: 2a2c b500 02bb 0003 59b7 0004 4e2d 2b2c 010: b600 05bb 0006 592d 
> b600 072d b600 08b7 020: 0009 3a04 2abb 000a 5919 04b7 000b b500 030: 
> 0ca7 0023 3a04 1904 b600 0ec1 000f 9900 040: 0c19 04b6 000e c000 0fbf 
> bb00 1059 1904 050: b700 11bf 2abb 0012 592b b600 13b7 0014 060: b500 
> 152a 2c12 1611 1000 b600 17b5 0018 070: b1 Exception Handler Table: bci 
> [19, 49] => handler: 52 Stackmap Table: 
> full_frame(@52,\{Object[#194],Object[#195],Object[#196],Object[#197]},\{Object[#198]})
>  append_frame(@74,Object[#198]) chop_frame(@84,1) at 
> org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:119)
>  at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:109) at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2816) at 
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98) at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2853) at 
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2835) at 
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387) at 
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
>  at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229) 
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315) 
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1307) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at 
> org.apache.spark.rdd.RDD.take(RDD.scala:1302) at 
> 

[jira] [Updated] (SPARK-23652) Verify error when using ASF s3:// connector.

2018-03-12 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-23652:
---
Summary: Verify error when using ASF s3:// connector.  (was: Spark 
Connection with S3)

> Verify error when using ASF s3:// connector.
> 
>
> Key: SPARK-23652
> URL: https://issues.apache.org/jira/browse/SPARK-23652
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell, Spark Submit
>Affects Versions: 1.6.0
>Reporter: Abhishek Shrivastava
>Priority: Critical
>
> In below spark-shell I am trying to connect to S3 and load file to create 
> dataframe:
>  
> {{spark-shell --packages com.databricks:spark-csv_2.10:1.5.0 scala> val 
> sqlContext = new org.apache.spark.sql.SQLContext(sc) scala> 
> sc.hadoopConfiguration.set("fs.s3a.access.key", "") scala> 
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "") scala> val weekly = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("delimiter", ",").load("s3://usr_bucket/data/file.csv") scala> 
> print(weekly) scala> weekly.show()}}
>  
>  
> {{Error:}}
> {{java.lang.VerifyError: Bad type on operand stack Exception Details: 
> Location: 
> org/apache/hadoop/fs/s3/Jets3tFileSystemStore.initialize(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)V
>  @43: invokespecial Reason: Type 'org/jets3t/service/security/AWSCredentials' 
> (current frame, stack[3]) is not assignable to 
> 'org/jets3t/service/security/ProviderCredentials' Current Frame: bci: @43 
> flags: \{ } locals: \{ 'org/apache/hadoop/fs/s3/Jets3tFileSystemStore', 
> 'java/net/URI', 'org/apache/hadoop/conf/Configuration', 
> 'org/apache/hadoop/fs/s3/S3Credentials', 
> 'org/jets3t/service/security/AWSCredentials' } stack: \{ 
> 'org/apache/hadoop/fs/s3/Jets3tFileSystemStore', uninitialized 37, 
> uninitialized 37, 'org/jets3t/service/security/AWSCredentials' } Bytecode: 
> 000: 2a2c b500 02bb 0003 59b7 0004 4e2d 2b2c 010: b600 05bb 0006 592d 
> b600 072d b600 08b7 020: 0009 3a04 2abb 000a 5919 04b7 000b b500 030: 
> 0ca7 0023 3a04 1904 b600 0ec1 000f 9900 040: 0c19 04b6 000e c000 0fbf 
> bb00 1059 1904 050: b700 11bf 2abb 0012 592b b600 13b7 0014 060: b500 
> 152a 2c12 1611 1000 b600 17b5 0018 070: b1 Exception Handler Table: bci 
> [19, 49] => handler: 52 Stackmap Table: 
> full_frame(@52,\{Object[#194],Object[#195],Object[#196],Object[#197]},\{Object[#198]})
>  append_frame(@74,Object[#198]) chop_frame(@84,1) at 
> org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:119)
>  at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:109) at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2816) at 
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98) at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2853) at 
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2835) at 
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387) at 
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
>  at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229) 
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315) 
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1307) at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at 
> org.apache.spark.rdd.RDD.take(RDD.scala:1302) at 
> org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1342) at 
> 

[jira] [Created] (SPARK-23654) cut jets3t as a dependency of spark-core

2018-03-12 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-23654:
--

 Summary: cut jets3t as a dependency of spark-core
 Key: SPARK-23654
 URL: https://issues.apache.org/jira/browse/SPARK-23654
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: Steve Loughran


Spark core declares a dependency on JetS3t, which pulls in other cruft.

# the hadoop-cloud module pulls in the hadoop-aws module with the 
jets3t-compatible connectors, and the relevant dependencies: the spark-core 
dependency is incomplete if that module isn't built, and superfluous or 
inconsistent if it is.
# We've cut out s3n/s3 and all dependencies on jets3t entirely from hadoop 3.x, 
in favour of the s3a connector we're willing to maintain.

JetS3t was wonderful when it came out, but now the Amazon SDKs massively exceed 
it in functionality, albeit at the expense of week-to-week stability and JAR 
binary compatibility.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23652) Spark Connection with S3

2018-03-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395428#comment-16395428
 ] 

Steve Loughran commented on SPARK-23652:


Don't use the s3:// connector that ships with the ASF (as opposed to the EMR) 
Spark releases; use the "s3a" connector, which (a) interoperates with other 
clients of the S3 bucket, (b) is maintained, and (c) doesn't use jets3t. Your 
stack trace will go away.

This isn't actually a Spark problem except at the general classpath level; it's 
happening in the Hadoop lib. But were you to file it there, it'd be closed as 
WONTFIX, since s3 and s3n have both been cut from the codebase, leaving only the 
s3a connector whose authentication parameters you've actually been setting.
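
A minimal PySpark sketch of that switch, for illustration only (it assumes a 
Spark 2.x shell with the hadoop-aws and matching AWS SDK jars on the classpath; 
the bucket, path, and keys are placeholders taken from the report):

{code:python}
# Illustrative sketch, not a drop-in fix: keep the fs.s3a.* credentials the
# reporter already sets, but read through an s3a:// URL instead of s3://.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<ACCESS_KEY>")
hadoop_conf.set("fs.s3a.secret.key", "<SECRET_KEY>")

weekly = (spark.read
          .option("header", "true")
          .option("delimiter", ",")
          .csv("s3a://usr_bucket/data/file.csv"))   # s3a://, so the fs.s3a.* settings apply
weekly.show()
{code}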



> Spark Connection with S3
> 
>
> Key: SPARK-23652
> URL: https://issues.apache.org/jira/browse/SPARK-23652
> Project: Spark
>  Issue Type: Question
>  Components: Spark Shell, Spark Submit
>Affects Versions: 1.6.0
>Reporter: Abhishek Shrivastava
>Priority: Critical
>
> In below spark-shell I am trying to connect to S3 and load file to create 
> dataframe:
>  
> {{spark-shell --packages com.databricks:spark-csv_2.10:1.5.0 scala> val 
> sqlContext = new org.apache.spark.sql.SQLContext(sc) scala> 
> sc.hadoopConfiguration.set("fs.s3a.access.key", "") scala> 
> sc.hadoopConfiguration.set("fs.s3a.secret.key", "") scala> val weekly = 
> sqlContext.read.format("com.databricks.spark.csv").option("header", 
> "true").option("delimiter", ",").load("s3://usr_bucket/data/file.csv") scala> 
> print(weekly) scala> weekly.show()}}
>  
>  
> {{Error:}}
> {{java.lang.VerifyError: Bad type on operand stack Exception Details: 
> Location: 
> org/apache/hadoop/fs/s3/Jets3tFileSystemStore.initialize(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)V
>  @43: invokespecial Reason: Type 'org/jets3t/service/security/AWSCredentials' 
> (current frame, stack[3]) is not assignable to 
> 'org/jets3t/service/security/ProviderCredentials' Current Frame: bci: @43 
> flags: \{ } locals: \{ 'org/apache/hadoop/fs/s3/Jets3tFileSystemStore', 
> 'java/net/URI', 'org/apache/hadoop/conf/Configuration', 
> 'org/apache/hadoop/fs/s3/S3Credentials', 
> 'org/jets3t/service/security/AWSCredentials' } stack: \{ 
> 'org/apache/hadoop/fs/s3/Jets3tFileSystemStore', uninitialized 37, 
> uninitialized 37, 'org/jets3t/service/security/AWSCredentials' } Bytecode: 
> 000: 2a2c b500 02bb 0003 59b7 0004 4e2d 2b2c 010: b600 05bb 0006 592d 
> b600 072d b600 08b7 020: 0009 3a04 2abb 000a 5919 04b7 000b b500 030: 
> 0ca7 0023 3a04 1904 b600 0ec1 000f 9900 040: 0c19 04b6 000e c000 0fbf 
> bb00 1059 1904 050: b700 11bf 2abb 0012 592b b600 13b7 0014 060: b500 
> 152a 2c12 1611 1000 b600 17b5 0018 070: b1 Exception Handler Table: bci 
> [19, 49] => handler: 52 Stackmap Table: 
> full_frame(@52,\{Object[#194],Object[#195],Object[#196],Object[#197]},\{Object[#198]})
>  append_frame(@74,Object[#198]) chop_frame(@84,1) at 
> org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:119)
>  at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:109) at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2816) at 
> org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98) at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2853) at 
> org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2835) at 
> org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387) at 
> org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at 
> org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
>  at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229) 
> at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315) 
> at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
> scala.Option.getOrElse(Option.scala:120) at 
> 

[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2018-03-12 Thread Josiah Berkebile (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395408#comment-16395408
 ] 

Josiah Berkebile commented on SPARK-5997:
-

Maybe it's worth mentioning that I've had to do this for several NLP-related 
Spark jobs because of the amount of RAM it takes to build out the parsing tree.  
Increasing the partition count in proportion to the total row count, so that 
each partition held roughly 'X' rows, both increased parallelism across the 
cluster and reduced pressure on RAM/heap utilization.  In these sorts of 
scenarios exact balance across the partitions isn't of critical importance, so 
performing a shuffle just to maintain balance is detrimental to the overall job 
performance.
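
For reference, a small PySpark sketch of the current behaviour this ticket wants 
to change (partition counts are illustrative only):

{code:python}
# repartition() to a larger count always shuffles, while coalesce(shuffle=False)
# can only shrink the partition count; it cannot grow it.
from pyspark.sql import SparkSession

sc = SparkSession.builder.getOrCreate().sparkContext
rdd = sc.parallelize(range(1000), 8)

print(rdd.repartition(64).getNumPartitions())              # 64, but a full shuffle ran
print(rdd.coalesce(64, shuffle=False).getNumPartitions())  # still 8
{code}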

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5997) Increase partition count without performing a shuffle

2018-03-12 Thread Josiah Berkebile (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josiah Berkebile updated SPARK-5997:

Comment: was deleted

(was: This functionality can be useful when repartitioning output for a storage 
system like HDFS that doesn't handle large numbers of small files well.  In a 
few Spark applications I've written, I've had to fan out the number of 
partitions to reduce stress on the cluster during heavy calculations, and then 
bring that number back down to one partition per executor to limit the number 
of output files written to HDFS and keep the block count down.

However, I can't think of a scenario where it would make sense to reduce the 
number of partitions below the number of executors.  If someone needs to do 
that, I think it would be more appropriate to schedule a job that sweeps up 
after Spark and concatenates its output into a single file.)

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5997) Increase partition count without performing a shuffle

2018-03-12 Thread Josiah Berkebile (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395405#comment-16395405
 ] 

Josiah Berkebile commented on SPARK-5997:
-

This functionality can be useful when repartitioning output for a storage 
system like HDFS that doesn't handle large numbers of small files well.  In a 
few Spark applications I've written, I've had to fan out the number of 
partitions to reduce stress on the cluster during heavy calculations, and then 
bring that number back down to one partition per executor to limit the number 
of output files written to HDFS and keep the block count down.

However, I can't think of a scenario where it would make sense to reduce the 
number of partitions below the number of executors.  If someone needs to do 
that, I think it would be more appropriate to schedule a job that sweeps up 
after Spark and concatenates its output into a single file.

> Increase partition count without performing a shuffle
> -
>
> Key: SPARK-5997
> URL: https://issues.apache.org/jira/browse/SPARK-5997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Andrew Ash
>Priority: Major
>
> When decreasing partition count with rdd.repartition() or rdd.coalesce(), the 
> user has the ability to choose whether or not to perform a shuffle.  However 
> when increasing partition count there is no option of whether to perform a 
> shuffle or not -- a shuffle always occurs.
> This Jira is to create a {{rdd.repartition(largeNum, shuffle=false)}} call 
> that performs a repartition to a higher partition count without a shuffle.
> The motivating use case is to decrease the size of an individual partition 
> enough that the .toLocalIterator has significantly reduced memory pressure on 
> the driver, as it loads a partition at a time into the driver.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-12 Thread Stu (Michael Stewart) (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395390#comment-16395390
 ] 

Stu (Michael Stewart) edited comment on SPARK-23645 at 3/12/18 3:28 PM:


[~hyukjin.kwon] thanks for the thoughts. It actually turned out to be easier 
than I'd expected to get most of the way there. The issue, as usual, is 
Python 2. I failed the existing unit tests on attempts to call 
`inspect.getargspec` on a callable class and on a partial function; in Python 
these two concepts are oddly differentiated from functions. In Python 3 it is 
handled seamlessly by `inspect.getfullargspec`. Of course our friend getargspec 
has been deprecated since 3.0, but there is really no alternative for py2.

One middle ground that might be acceptable is to raise an error in Python 2 if 
a user passes keyword args to a partial fn object/callable object, but allow 
usage on plain functions. I suspect the vast majority of use cases of UDFs in 
Python rely on actual plain-old functions. This would be a clear functionality 
improvement over the present behaviour for relatively few lines of code.

That is:

py2 - raise the error mentioned above, otherwise handle functions with kwargs 
normally

py3 - everything just works

[https://github.com/apache/spark/pull/20798]


was (Author: mstewart141):
[~hyukjin.kwon] thanks for the thoughts. it actually turned out to be easier 
than i'd expected to get most of the way there. the issue, as usual, is 
python2. i failed the existing unit tests on attempts to call 
`inspect.getargspec` on a callable class and on a partial function. in python 
these two concepts are oddly differentiated from functions. in python 3 it is 
handled seamlessly by `inspect.getfullargspec`. of course our friend getargspec 
is deprecated since 3.0 but there is really no alternative for py2. 

 

one middle ground that might be acceptable is to raise an error in python2 if a 
user passed keyword args to a partial fn object/callable object, but allow 
usage on functions. i suspect the vast majority of usecases of UDF in python 
rely on actual plain-old functions. 

 

that is:

py2 - raise error as mentioned above, otherwise handle functions with kwargs 
normally

py3 - everything just works

 

https://github.com/apache/spark/pull/20798

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Priority: Minor
>
> pandas_udf (all python udfs(?)) do not accept keyword arguments because 
> `pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also 
> wrapper utility methods, that only accept args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
> judf = self._judf
> sc = SparkContext._active_spark_context
> return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
> # This function is for improving the online help system in the interactive 
> interpreter.
> # For example, the built-in help / pydoc.help. It wraps the UDF with the 
> docstring and
> # argument annotation. (See: SPARK-19161)
> def _wrapped(self):
> """
> Wrap this udf with a function and attach docstring from func
> """
> # It is possible for a callable instance without __name__ attribute or/and
> # __module__ attribute to be wrapped here. For example, 
> functools.partial. In this case,
> # we should avoid wrapping the attributes from the wrapped function to 
> the wrapper
> # function. So, we take out these attribute names from the default names 
> to set and
> # then manually assign it after being wrapped.
> assignments = tuple(
> a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
> '__module__')
> @functools.wraps(self.func, assigned=assignments)
> def wrapper(*args):
> return self(*args)
> ...{code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  
> # no problems
> df.withColumn('ok', pandas_udf(f=ok, 
> returnType='bigint')(a='id',b='b')).show()  # fail with ~no stacktrace thanks 
> to wrapper helper
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', 

[jira] [Commented] (SPARK-23645) pandas_udf can not be called with keyword arguments

2018-03-12 Thread Stu (Michael Stewart) (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395390#comment-16395390
 ] 

Stu (Michael Stewart) commented on SPARK-23645:
---

[~hyukjin.kwon] thanks for the thoughts. It actually turned out to be easier 
than I'd expected to get most of the way there. The issue, as usual, is 
Python 2. I failed the existing unit tests on attempts to call 
`inspect.getargspec` on a callable class and on a partial function; in Python 
these two concepts are oddly differentiated from functions. In Python 3 it is 
handled seamlessly by `inspect.getfullargspec`. Of course our friend getargspec 
has been deprecated since 3.0, but there is really no alternative for py2.

One middle ground that might be acceptable is to raise an error in Python 2 if 
a user passes keyword args to a partial fn object/callable object, but allow 
usage on plain functions. I suspect the vast majority of use cases of UDFs in 
Python rely on actual plain-old functions.

That is:

py2 - raise the error mentioned above, otherwise handle functions with kwargs 
normally

py3 - everything just works

https://github.com/apache/spark/pull/20798
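
As a rough illustration of the reordering idea, a hypothetical sketch only 
(the {{reorder_kwargs}} helper is a made-up name, not pyspark API, and it is 
Python 3 only because it relies on {{inspect.getfullargspec}}):

{code:python}
# Hypothetical sketch: rebuild the positional column list that
# UserDefinedFunction.__call__ expects from the wrapped function's signature.
import inspect


def reorder_kwargs(func, udf_call):
    """Wrap udf_call so it accepts kwargs and forwards columns in func's arg order."""
    argnames = inspect.getfullargspec(func).args

    def wrapper(*args, **kwargs):
        # positional args first, then the kwargs in the order func declares them
        ordered = list(args) + [kwargs[name] for name in argnames[len(args):]]
        return udf_call(*ordered)

    return wrapper
{code}

With the example from the ticket, {{reorder_kwargs(ok, pandas_udf(f=ok, returnType='bigint'))(a='id', b='b')}} 
would forward ('id', 'b') positionally, so the columns reach the UDF in the 
order {{ok}} declares them.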

> pandas_udf can not be called with keyword arguments
> ---
>
> Key: SPARK-23645
> URL: https://issues.apache.org/jira/browse/SPARK-23645
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
> Environment: python 3.6 | pyspark 2.3.0 | Using Scala version 2.11.8, 
> OpenJDK 64-Bit Server VM, 1.8.0_141
>Reporter: Stu (Michael Stewart)
>Priority: Minor
>
> pandas_udf (all python udfs(?)) do not accept keyword arguments because 
> `pyspark/sql/udf.py` class `UserDefinedFunction` has __call__, and also 
> wrapper utility methods, that only accept args and not kwargs:
> @ line 168:
> {code:java}
> ...
> def __call__(self, *cols):
> judf = self._judf
> sc = SparkContext._active_spark_context
> return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
> # This function is for improving the online help system in the interactive 
> interpreter.
> # For example, the built-in help / pydoc.help. It wraps the UDF with the 
> docstring and
> # argument annotation. (See: SPARK-19161)
> def _wrapped(self):
> """
> Wrap this udf with a function and attach docstring from func
> """
> # It is possible for a callable instance without __name__ attribute or/and
> # __module__ attribute to be wrapped here. For example, 
> functools.partial. In this case,
> # we should avoid wrapping the attributes from the wrapped function to 
> the wrapper
> # function. So, we take out these attribute names from the default names 
> to set and
> # then manually assign it after being wrapped.
> assignments = tuple(
> a for a in functools.WRAPPER_ASSIGNMENTS if a != '__name__' and a != 
> '__module__')
> @functools.wraps(self.func, assigned=assignments)
> def wrapper(*args):
> return self(*args)
> ...{code}
> as seen in:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import pandas_udf, PandasUDFType, col, lit
> spark = SparkSession.builder.getOrCreate()
> df = spark.range(12).withColumn('b', col('id') * 2)
> def ok(a,b): return a*b
> df.withColumn('ok', pandas_udf(f=ok, returnType='bigint')('id','b')).show()  
> # no problems
> df.withColumn('ok', pandas_udf(f=ok, 
> returnType='bigint')(a='id',b='b')).show()  # fail with ~no stacktrace thanks 
> to wrapper helper
> ---
> TypeError Traceback (most recent call last)
>  in ()
> > 1 df.withColumn('ok', pandas_udf(f=ok, 
> returnType='bigint')(a='id',b='b')).show()
> TypeError: wrapper() got an unexpected keyword argument 'a'{code}
>  
>  
> *discourse*: it isn't difficult to swap back in the kwargs, allowing the UDF 
> to be called as such, but the cols tuple that gets passed in the call method:
> {code:java}
> _to_seq(sc, cols, _to_java_column{code}
>  has to be in the right order based on the functions defined argument inputs, 
> or the function will return incorrect results. so, the challenge here is to:
> (a) make sure to reconstruct the proper order of the full args/kwargs
> --> args first, and then kwargs (not in the order passed but in the order 
> requested by the fn)
> (b) handle python2 and python3 `inspect` module inconsistencies 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-03-12 Thread Julien Cuquemelle (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395364#comment-16395364
 ] 

Julien Cuquemelle commented on SPARK-22683:
---

PR updated, including [~xuefuz]'s proposal

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependant on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention 
> points for other exes trying to remotely access blocks in the old exes (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881
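
For reference, the existing dynamic-allocation knobs discussed above can be set 
as follows (a hedged sketch with illustrative values; the tasksPerExecutorSlot 
parameter proposed in this ticket is not an existing Spark config and only 
appears in the linked PR):

{code:python}
# Illustrative values only. spark.dynamicAllocation.schedulerBacklogTimeout is
# the setting referred to as "backlogTimeout" in the description above.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.shuffle.service.enabled", "true")   # required for dynamic allocation on YARN
         .config("spark.executor.cores", "5")
         .config("spark.dynamicAllocation.executorIdleTimeout", "30s")
         .config("spark.dynamicAllocation.maxExecutors", "200")
         .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
         .getOrCreate())
{code}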



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23653) Show sql statement in spark SQL UI

2018-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395259#comment-16395259
 ] 

Apache Spark commented on SPARK-23653:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20803

> Show sql statement in spark SQL UI
> --
>
> Key: SPARK-23653
> URL: https://issues.apache.org/jira/browse/SPARK-23653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2018-03-12 at 14.25.51.png, Screen Shot 
> 2018-03-12 at 20.16.14.png, Screen Shot 2018-03-12 at 20.16.23.png
>
>
> [SPARK-4871|https://issues.apache.org/jira/browse/SPARK-4871] has already 
> added the sql statement in job description for using spark-sql. But it has 
> some problems:
> # long sql statement cannot be displayed in description column.
>  !Screen Shot 2018-03-12 at 14.25.51.png! 
> # sql statement submitted in spark-shell or spark-submit cannot be covered.
>  !Screen Shot 2018-03-12 at 20.16.23.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23653) Show sql statement in spark SQL UI

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23653:


Assignee: Apache Spark

> Show sql statement in spark SQL UI
> --
>
> Key: SPARK-23653
> URL: https://issues.apache.org/jira/browse/SPARK-23653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Major
> Attachments: Screen Shot 2018-03-12 at 14.25.51.png, Screen Shot 
> 2018-03-12 at 20.16.14.png, Screen Shot 2018-03-12 at 20.16.23.png
>
>
> [SPARK-4871|https://issues.apache.org/jira/browse/SPARK-4871] has already 
> added the sql statement in job description for using spark-sql. But it has 
> some problems:
> # long sql statement cannot be displayed in description column.
>  !Screen Shot 2018-03-12 at 14.25.51.png! 
> # sql statement submitted in spark-shell or spark-submit cannot be covered.
>  !Screen Shot 2018-03-12 at 20.16.23.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23653) Show sql statement in spark SQL UI

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23653:


Assignee: (was: Apache Spark)

> Show sql statement in spark SQL UI
> --
>
> Key: SPARK-23653
> URL: https://issues.apache.org/jira/browse/SPARK-23653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2018-03-12 at 14.25.51.png, Screen Shot 
> 2018-03-12 at 20.16.14.png, Screen Shot 2018-03-12 at 20.16.23.png
>
>
> [SPARK-4871|https://issues.apache.org/jira/browse/SPARK-4871] has already 
> added the sql statement in job description for using spark-sql. But it has 
> some problems:
> # long sql statement cannot be displayed in description column.
>  !Screen Shot 2018-03-12 at 14.25.51.png! 
> # sql statement submitted in spark-shell or spark-submit cannot be covered.
>  !Screen Shot 2018-03-12 at 20.16.23.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23653) Show sql statement in spark SQL UI

2018-03-12 Thread Lantao Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395241#comment-16395241
 ] 

Lantao Jin commented on SPARK-23653:


The best way is to display the intact SQL statement in the SQL tab, whether it 
is submitted via spark-sql or spark-submit.
 !Screen Shot 2018-03-12 at 20.16.14.png! 
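
Until then, one hedged workaround sketch for spark-submit / pyspark jobs is to 
attach the statement text as the job description yourself before running it 
(the query below is a placeholder):

{code:python}
# Workaround sketch: set the job description to the SQL text so it shows up
# against the jobs that statement triggers in the UI.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
statement = "SELECT count(*) FROM range(10)"   # placeholder query

spark.sparkContext.setJobDescription(statement)
spark.sql(statement).show()
{code}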

> Show sql statement in spark SQL UI
> --
>
> Key: SPARK-23653
> URL: https://issues.apache.org/jira/browse/SPARK-23653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2018-03-12 at 14.25.51.png, Screen Shot 
> 2018-03-12 at 20.16.14.png, Screen Shot 2018-03-12 at 20.16.23.png
>
>
> [SPARK-4871|https://issues.apache.org/jira/browse/SPARK-4871] has already 
> added the sql statement in job description for using spark-sql. But it has 
> some problems:
> # long sql statement cannot be displayed in description column.
>  !Screen Shot 2018-03-12 at 14.25.51.png! 
> # sql statement submitted in spark-shell or spark-submit cannot be covered.
>  !Screen Shot 2018-03-12 at 20.16.23.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23653) Show sql statement in spark SQL UI

2018-03-12 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-23653:
---
Attachment: Screen Shot 2018-03-12 at 20.16.14.png

> Show sql statement in spark SQL UI
> --
>
> Key: SPARK-23653
> URL: https://issues.apache.org/jira/browse/SPARK-23653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2018-03-12 at 14.25.51.png, Screen Shot 
> 2018-03-12 at 20.16.14.png, Screen Shot 2018-03-12 at 20.16.23.png
>
>
> [SPARK-4871|https://issues.apache.org/jira/browse/SPARK-4871] has already 
> added the sql statement in job description for using spark-sql. But it has 
> some problems:
> # long sql statement cannot be displayed in description column.
>  !Screen Shot 2018-03-12 at 14.25.51.png! 
> # sql statement submitted in spark-shell or spark-submit cannot be covered.
>  !Screen Shot 2018-03-12 at 20.16.23.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23653) Show sql statement in spark SQL UI

2018-03-12 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-23653:
---
Attachment: Screen Shot 2018-03-12 at 14.25.51.png

> Show sql statement in spark SQL UI
> --
>
> Key: SPARK-23653
> URL: https://issues.apache.org/jira/browse/SPARK-23653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2018-03-12 at 14.25.51.png, Screen Shot 
> 2018-03-12 at 20.16.23.png
>
>
> [SPARK-4871|https://issues.apache.org/jira/browse/SPARK-4871] has already 
> added the sql statement in job description for using spark-sql. But it has 
> some problems:
> # long sql statement cannot be displayed in description column.
> # sql statement submitted in spark-shell or spark-submit cannot be covered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23653) Show sql statement in spark SQL UI

2018-03-12 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-23653:
---
Description: 
[SPARK-4871|https://issues.apache.org/jira/browse/SPARK-4871] has already added 
the sql statement in job description for using spark-sql. But it has some 
problems:
# long sql statement cannot be displayed in description column.
 !Screen Shot 2018-03-12 at 14.25.51.png! 
# sql statement submitted in spark-shell or spark-submit cannot be covered.
 !Screen Shot 2018-03-12 at 20.16.23.png! 

  was:
[SPARK-4871|https://issues.apache.org/jira/browse/SPARK-4871] has already added 
the sql statement in job description for using spark-sql. But it has some 
problems:
# long sql statement cannot be displayed in description column.
# sql statement submitted in spark-shell or spark-submit cannot be covered.


> Show sql statement in spark SQL UI
> --
>
> Key: SPARK-23653
> URL: https://issues.apache.org/jira/browse/SPARK-23653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2018-03-12 at 14.25.51.png, Screen Shot 
> 2018-03-12 at 20.16.23.png
>
>
> [SPARK-4871|https://issues.apache.org/jira/browse/SPARK-4871] has already 
> added the sql statement in job description for using spark-sql. But it has 
> some problems:
> # long sql statement cannot be displayed in description column.
>  !Screen Shot 2018-03-12 at 14.25.51.png! 
> # sql statement submitted in spark-shell or spark-submit cannot be covered.
>  !Screen Shot 2018-03-12 at 20.16.23.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23653) Show sql statement in spark SQL UI

2018-03-12 Thread Lantao Jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-23653:
---
Attachment: Screen Shot 2018-03-12 at 20.16.23.png

> Show sql statement in spark SQL UI
> --
>
> Key: SPARK-23653
> URL: https://issues.apache.org/jira/browse/SPARK-23653
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Lantao Jin
>Priority: Major
> Attachments: Screen Shot 2018-03-12 at 20.16.23.png
>
>
> [SPARK-4871|https://issues.apache.org/jira/browse/SPARK-4871] has already 
> added the sql statement in job description for using spark-sql. But it has 
> some problems:
> # long sql statement cannot be displayed in description column.
> # sql statement submitted in spark-shell or spark-submit cannot be covered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23653) Show sql statement in spark SQL UI

2018-03-12 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-23653:
--

 Summary: Show sql statement in spark SQL UI
 Key: SPARK-23653
 URL: https://issues.apache.org/jira/browse/SPARK-23653
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Lantao Jin


[SPARK-4871|https://issues.apache.org/jira/browse/SPARK-4871] has already added 
the sql statement in job description for using spark-sql. But it has some 
problems:
# long sql statement cannot be displayed in description column.
# sql statement submitted in spark-shell or spark-submit cannot be covered.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23462) Improve the error message in `StructType`

2018-03-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-23462:


Assignee: Xiayun Sun

> Improve the error message in `StructType`
> -
>
> Key: SPARK-23462
> URL: https://issues.apache.org/jira/browse/SPARK-23462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiayun Sun
>Priority: Major
>  Labels: starter
> Fix For: 2.3.1, 2.4.0
>
>
> The error message {{s"""Field "$name" does not exist."""}} is thrown when 
> looking up an unknown field in StructType. In the error message, we should 
> also contain the information about which columns/fields exist in this struct. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23462) Improve the error message in `StructType`

2018-03-12 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23462.
--
   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0

Issue resolved by pull request 20649
[https://github.com/apache/spark/pull/20649]

> Improve the error message in `StructType`
> -
>
> Key: SPARK-23462
> URL: https://issues.apache.org/jira/browse/SPARK-23462
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiayun Sun
>Priority: Major
>  Labels: starter
> Fix For: 2.4.0, 2.3.1
>
>
> The error message {{s"""Field "$name" does not exist."""}} is thrown when 
> looking up an unknown field in StructType. In the error message, we should 
> also contain the information about which columns/fields exist in this struct. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23652) Spark Connection with S3

2018-03-12 Thread Abhishek Shrivastava (JIRA)
Abhishek Shrivastava created SPARK-23652:


 Summary: Spark Connection with S3
 Key: SPARK-23652
 URL: https://issues.apache.org/jira/browse/SPARK-23652
 Project: Spark
  Issue Type: Question
  Components: Spark Shell, Spark Submit
Affects Versions: 1.6.0
Reporter: Abhishek Shrivastava


In below spark-shell I am trying to connect to S3 and load file to create 
dataframe:

 

{{spark-shell --packages com.databricks:spark-csv_2.10:1.5.0 scala> val 
sqlContext = new org.apache.spark.sql.SQLContext(sc) scala> 
sc.hadoopConfiguration.set("fs.s3a.access.key", "") scala> 
sc.hadoopConfiguration.set("fs.s3a.secret.key", "") scala> val weekly = 
sqlContext.read.format("com.databricks.spark.csv").option("header", 
"true").option("delimiter", ",").load("s3://usr_bucket/data/file.csv") scala> 
print(weekly) scala> weekly.show()}}

 

 

{{Error:}}

{{java.lang.VerifyError: Bad type on operand stack Exception Details: Location: 
org/apache/hadoop/fs/s3/Jets3tFileSystemStore.initialize(Ljava/net/URI;Lorg/apache/hadoop/conf/Configuration;)V
 @43: invokespecial Reason: Type 'org/jets3t/service/security/AWSCredentials' 
(current frame, stack[3]) is not assignable to 
'org/jets3t/service/security/ProviderCredentials' Current Frame: bci: @43 
flags: \{ } locals: \{ 'org/apache/hadoop/fs/s3/Jets3tFileSystemStore', 
'java/net/URI', 'org/apache/hadoop/conf/Configuration', 
'org/apache/hadoop/fs/s3/S3Credentials', 
'org/jets3t/service/security/AWSCredentials' } stack: \{ 
'org/apache/hadoop/fs/s3/Jets3tFileSystemStore', uninitialized 37, 
uninitialized 37, 'org/jets3t/service/security/AWSCredentials' } Bytecode: 
000: 2a2c b500 02bb 0003 59b7 0004 4e2d 2b2c 010: b600 05bb 0006 592d 
b600 072d b600 08b7 020: 0009 3a04 2abb 000a 5919 04b7 000b b500 030: 
0ca7 0023 3a04 1904 b600 0ec1 000f 9900 040: 0c19 04b6 000e c000 0fbf bb00 
1059 1904 050: b700 11bf 2abb 0012 592b b600 13b7 0014 060: b500 152a 
2c12 1611 1000 b600 17b5 0018 070: b1 Exception Handler Table: bci [19, 49] 
=> handler: 52 Stackmap Table: 
full_frame(@52,\{Object[#194],Object[#195],Object[#196],Object[#197]},\{Object[#198]})
 append_frame(@74,Object[#198]) chop_frame(@84,1) at 
org.apache.hadoop.fs.s3.S3FileSystem.createDefaultStore(S3FileSystem.java:119) 
at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:109) at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2816) at 
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:98) at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2853) at 
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2835) at 
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:387) at 
org.apache.hadoop.fs.Path.getFileSystem(Path.java:296) at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:258)
 at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:229) 
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:315) 
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:202) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35) 
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) at 
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) at 
scala.Option.getOrElse(Option.scala:120) at 
org.apache.spark.rdd.RDD.partitions(RDD.scala:237) at 
org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1307) at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at 
org.apache.spark.rdd.RDD.take(RDD.scala:1302) at 
org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1342) at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) 
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111) 
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316) at 
org.apache.spark.rdd.RDD.first(RDD.scala:1341) at 
com.databricks.spark.csv.CsvRelation.firstLine$lzycompute(CsvRelation.scala:269)
 at com.databricks.spark.csv.CsvRelation.firstLine(CsvRelation.scala:265) at 

[jira] [Commented] (SPARK-23651) Add a check for host name

2018-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395094#comment-16395094
 ] 

Apache Spark commented on SPARK-23651:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/20802

> Add a check for host name
> --
>
> Key: SPARK-23651
> URL: https://issues.apache.org/jira/browse/SPARK-23651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> I encountered an error like this:
> _org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@ci_164:42849_
>     _at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)_
>     _at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)_
>     _at 
> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)_
>     _at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)_
>     _at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)_
>     _at org.apache.spark.executor.Executor.(Executor.scala:155)_
>     _at 
> org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)_
>     _at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)_
>     _at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)_
>  
> I didn't know why this _URL_ (spark://HeartbeatReceiver@ci_164:42849) is 
> invalid, so I think we should give a clearer message for this error.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23565) Improved error message for when the number of sources for a query changes

2018-03-12 Thread Roman Maier (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395023#comment-16395023
 ] 

Roman Maier commented on SPARK-23565:
-

I would like to take up this ticket, in case no one is working on it.

> Improved error message for when the number of sources for a query changes
> -
>
> Key: SPARK-23565
> URL: https://issues.apache.org/jira/browse/SPARK-23565
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Patrick McGloin
>Priority: Minor
>
> If you change the number of sources for a Structured Streaming query then you 
> will get an assertion error as the number of sources in the checkpoint does 
> not match the number of sources in the query that is starting.  This can 
> happen if, for example, you add a union to the input of the query.  This is 
> of course correct but the error is a bit cryptic and requires investigation.
> Suggestion for a more informative error message =>
> The number of sources for this query has changed.  There are [x] sources in 
> the checkpoint offsets and now there are [y] sources requested by the query.  
> Cannot continue.
> This is the current message.
> 02-03-2018 13:14:22 ERROR StreamExecution:91 - Query ORPositionsState to 
> Kafka [id = 35f71e63-dbd0-49e9-98b2-a4c72a7da80e, runId = 
> d4439aca-549c-4ef6-872e-29fbfde1df78] terminated with error 
> java.lang.AssertionError: assertion failed at 
> scala.Predef$.assert(Predef.scala:156) at 
> org.apache.spark.sql.execution.streaming.OffsetSeq.toStreamProgress(OffsetSeq.scala:38)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$populateStartOffsets(StreamExecution.scala:429)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(StreamExecution.scala:297)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1$$anonfun$apply$mcZ$sp$1.apply(StreamExecution.scala:294)
>  at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:279)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>  at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:294)
>  at 
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-03-12 Thread Valeriy Avanesov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16395014#comment-16395014
 ] 

Valeriy Avanesov commented on SPARK-23437:
--

So, the basic implementation is ready. Please feel free to try it out.

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Assignee: Apache Spark
>Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity; however, more recent 
> techniques (Sparse GP) allow for linear complexity. The field continues to 
> attract the interest of researchers – several papers devoted to GP were 
> presented at NIPS 2017. 
> Unfortunately, the non-parametric regression techniques that come with mllib 
> are restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of the so-called robust Bayesian Committee Machine proposed and 
> investigated in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on 
> International Conference on Machine Learning - Volume 37_ (ICML'15), Francis 
> Bach and David Blei (Eds.), Vol. 37. JMLR.org 1481-1490.
>  
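
For reference, a rough sketch of the rBCM aggregation rule from [2], assuming the 
experts have already produced posterior means and variances for a single test point; 
the function and parameter names are illustrative and not part of any existing MLlib API.
{code:scala}
// Hedged sketch of robust BCM prediction aggregation (Deisenroth & Ng, 2015), assuming
// mu(k) and s2(k) are the k-th expert's posterior mean and variance at one test point
// and priorVar is the GP prior variance at that point.
def rbcmCombine(mu: Array[Double], s2: Array[Double], priorVar: Double): (Double, Double) = {
  // Differential-entropy weights: beta_k = 0.5 * (log priorVar - log s2_k)
  val beta = s2.map(v => 0.5 * (math.log(priorVar) - math.log(v)))
  // Combined precision: sum_k beta_k / s2_k + (1 - sum_k beta_k) / priorVar
  val precision = beta.zip(s2).map { case (b, v) => b / v }.sum + (1.0 - beta.sum) / priorVar
  val variance = 1.0 / precision
  // Combined mean: variance * sum_k beta_k * mu_k / s2_k
  val mean = variance * beta.indices.map(k => beta(k) * mu(k) / s2(k)).sum
  (mean, variance)
}
{code}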



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16859) History Server storage information is missing

2018-03-12 Thread Andrei Ivanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394993#comment-16394993
 ] 

Andrei Ivanov commented on SPARK-16859:
---

[~Yohan123], it really looks like this was implemented in 2.3. I haven't tested 
it yet though.

> History Server storage information is missing
> -
>
> Key: SPARK-16859
> URL: https://issues.apache.org/jira/browse/SPARK-16859
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Andrei Ivanov
>Priority: Major
>  Labels: historyserver, newbie
>
> It looks like the job history Storage tab in the History Server has been 
> broken for completed jobs since *1.6.2*. 
> More specifically, it has been broken since 
> [SPARK-13845|https://issues.apache.org/jira/browse/SPARK-13845].
> I've fixed it for my installation by effectively reverting the above patch 
> ([see|https://github.com/EinsamHauer/spark/commit/3af62ea09af8bb350c8c8a9117149c09b8feba08]).
> IMHO, the most straightforward fix would be to implement 
> _SparkListenerBlockUpdated_ serialization to JSON in _JsonProtocol_, making 
> sure it works from _ReplayListenerBus_.
> The downside is that it will still behave incorrectly with pre-patch job 
> histories. But then, it hasn't worked since *1.6.2* anyhow.
> PS: I'd really love to have this fixed eventually, but I'm pretty new to 
> Apache Spark and lack hands-on Scala experience, so I'd prefer that it be 
> fixed by someone experienced and with roadmap vision. If nobody volunteers 
> I'll try to patch it myself.
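
A hedged sketch of the JsonProtocol idea described above, under the assumption that 
the event would be serialized with json4s the way other events are; the method name 
and JSON field names are illustrative, not the shape any Spark release actually settled on.
{code:scala}
import org.json4s.JValue
import org.json4s.JsonDSL._

import org.apache.spark.scheduler.SparkListenerBlockUpdated

// Illustrative only: emit block-update events to the event log so that
// ReplayListenerBus can rebuild storage state for the History Server.
def blockUpdatedToJson(event: SparkListenerBlockUpdated): JValue = {
  val info = event.blockUpdatedInfo
  ("Event" -> "SparkListenerBlockUpdated") ~
    ("Executor ID" -> info.blockManagerId.executorId) ~
    ("Host" -> info.blockManagerId.host) ~
    ("Block ID" -> info.blockId.toString) ~
    ("Storage Level" -> info.storageLevel.toString) ~
    ("Memory Size" -> info.memSize) ~
    ("Disk Size" -> info.diskSize)
}
{code}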



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23603) When the length of the JSON is in a range, get_json_object will result in missing tail data

2018-03-12 Thread dzcxzl (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-23603:
---
Description: 
Jackson (>= 2.7.7) fixes a bug that can drop the tail of a value when the 
value's length falls within a certain range:

[https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7]

[https://github.com/FasterXML/jackson-core/issues/307]

spark-shell:
{code:java}
val value = "x" * 3000
val json = s"""{"big": "$value"}"""
spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect

res0: Array[org.apache.spark.sql.Row] = Array([2991])
{code}
expected result: 3000
 actual result: 2991

There are two possible solutions.
 One is to
*bump Jackson from 2.6.7 & 2.6.7.1 to 2.7.7*.
 The other is to
 *replace writeRaw(char[] text, int offset, int len) with writeRaw(String text)*.

 

  was:
Jackson(>=2.7.7) fixes the possibility of missing tail data when the length of 
the value is in a range

[https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7]

[https://github.com/FasterXML/jackson-core/issues/307]

 

spark-shell:
{code:java}
val value = "x" * 3000
val json = s"""{"big": "$value"}"""
spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect

res0: Array[org.apache.spark.sql.Row] = Array([2991])
{code}
expect result : 3000 
actual result  : 2991

There are two solutions
 One is
 bump jackson version to 2.7.7
 The other one is
 Replace writeRaw(char[] text, int offset, int len) with writeRaw(String text)

 


> When the length of the JSON is in a range, get_json_object will result in 
> missing tail data
> --
>
> Key: SPARK-23603
> URL: https://issues.apache.org/jira/browse/SPARK-23603
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Major
>
> Jackson (>= 2.7.7) fixes a bug that can drop the tail of a value when the 
> value's length falls within a certain range:
> [https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7]
> [https://github.com/FasterXML/jackson-core/issues/307]
> spark-shell:
> {code:java}
> val value = "x" * 3000
> val json = s"""{"big": "$value"}"""
> spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect
> res0: Array[org.apache.spark.sql.Row] = Array([2991])
> {code}
> expected result: 3000
>  actual result: 2991
> There are two possible solutions.
>  One is to
> *bump Jackson from 2.6.7 & 2.6.7.1 to 2.7.7*.
>  The other is to
>  *replace writeRaw(char[] text, int offset, int len) with writeRaw(String 
> text)*.
>  
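
A minimal sketch of the second option, assuming the affected code streams a matched 
string value back out through a Jackson JsonGenerator; the helper name is hypothetical, 
and only the choice of writeRaw overload is the point.
{code:scala}
import com.fasterxml.jackson.core.{JsonGenerator, JsonParser}

// Hedged sketch: in Jackson < 2.7.7 the char[] overload below can drop the tail of a
// value that straddles an internal buffer boundary, so copy via a String instead.
//   affected pattern:
//     generator.writeRaw(parser.getTextCharacters, parser.getTextOffset, parser.getTextLength)
def copyRawText(parser: JsonParser, generator: JsonGenerator): Unit = {
  generator.writeRaw(parser.getText) // materializes the full value before writing
}
{code}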



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23603) When the length of the JSON is in a range, get_json_object will result in missing tail data

2018-03-12 Thread dzcxzl (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dzcxzl updated SPARK-23603:
---
Description: 
Jackson(>=2.7.7) fixes the possibility of missing tail data when the length of 
the value is in a range

[https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7]

[https://github.com/FasterXML/jackson-core/issues/307]

 

spark-shell:
{code:java}
val value = "x" * 3000
val json = s"""{"big": "$value"}"""
spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect

res0: Array[org.apache.spark.sql.Row] = Array([2991])
{code}
expect result : 3000 
actual result  : 2991

There are two solutions
 One is
 bump jackson version to 2.7.7
 The other one is
 Replace writeRaw(char[] text, int offset, int len) with writeRaw(String text)

 

  was:
Jackson(>=2.7.7) fixes the possibility of missing tail data when the length of 
the value is in a range

[https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7]

[https://github.com/FasterXML/jackson-core/issues/307]

 

spark-shell:

 
{code:java}
val value = "x" * 3000
val json = s"""{"big": "$value"}"""
spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect

res0: Array[org.apache.spark.sql.Row] = Array([2991])
{code}
correct result : 3000

 

 

There are two solutions
One is
bump jackson version to 2.7.7
The other one is
Replace writeRaw(char[] text, int offset, int len) with writeRaw(String text)

 


> When the length of the JSON is in a range, get_json_object will result in 
> missing tail data
> --
>
> Key: SPARK-23603
> URL: https://issues.apache.org/jira/browse/SPARK-23603
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Major
>
> Jackson(>=2.7.7) fixes the possibility of missing tail data when the length 
> of the value is in a range
> [https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7]
> [https://github.com/FasterXML/jackson-core/issues/307]
>  
> spark-shell:
> {code:java}
> val value = "x" * 3000
> val json = s"""{"big": "$value"}"""
> spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect
> res0: Array[org.apache.spark.sql.Row] = Array([2991])
> {code}
> expect result : 3000 
> actual result  : 2991
> There are two solutions
>  One is
>  bump jackson version to 2.7.7
>  The other one is
>  Replace writeRaw(char[] text, int offset, int len) with writeRaw(String text)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23651) Add a check for host name

2018-03-12 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394926#comment-16394926
 ] 

Apache Spark commented on SPARK-23651:
--

User '10110346' has created a pull request for this issue:
https://github.com/apache/spark/pull/20801

> Add a check for host name
> --
>
> Key: SPARK-23651
> URL: https://issues.apache.org/jira/browse/SPARK-23651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> I encountered an error like this:
> _org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@ci_164:42849_
>     _at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)_
>     _at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)_
>     _at 
> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)_
>     _at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)_
>     _at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)_
>     _at org.apache.spark.executor.Executor.(Executor.scala:155)_
>     _at 
> org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)_
>     _at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)_
>     _at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)_
>  
> I didn't know why this _URL_ (spark://HeartbeatReceiver@ci_164:42849) is 
> invalid, so I think we should give a clearer hint for this error.
>  
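
One possible shape for such a check, sketched under the assumption that the underscore 
in the host name is what breaks parsing (java.net.URI returns a null host for names 
like ci_164, which is why the RPC address cannot be built); the helper name and message 
wording are illustrative, not Spark's actual code.
{code:scala}
import java.net.URI

import scala.util.Try

// Hedged sketch: reject an unusable host name up front with an explicit reason,
// instead of failing later with a bare "Invalid Spark URL".
def checkHostName(host: String): Unit = {
  val parsable = host != null &&
    Try(new URI(s"spark://$host:0").getHost).toOption.exists(_ != null)
  require(parsable,
    s"Invalid host name '$host': URI host names may only contain letters, digits, '.' and '-'")
}
{code}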



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23651) Add a check for host name

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23651:


Assignee: (was: Apache Spark)

> Add a check for host name
> --
>
> Key: SPARK-23651
> URL: https://issues.apache.org/jira/browse/SPARK-23651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liuxian
>Priority: Minor
>
> I encountered an error like this:
> _org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@ci_164:42849_
>     _at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)_
>     _at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)_
>     _at 
> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)_
>     _at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)_
>     _at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)_
>     _at org.apache.spark.executor.Executor.(Executor.scala:155)_
>     _at 
> org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)_
>     _at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)_
>     _at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)_
>  
> I didn't know why this _URL_ (spark://HeartbeatReceiver@ci_164:42849) is 
> invalid, so I think we should give a clearer hint for this error.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23651) Add a check for host name

2018-03-12 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23651:


Assignee: Apache Spark

> Add a check for host name
> --
>
> Key: SPARK-23651
> URL: https://issues.apache.org/jira/browse/SPARK-23651
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: liuxian
>Assignee: Apache Spark
>Priority: Minor
>
> I encountered an error like this:
> _org.apache.spark.SparkException: Invalid Spark URL: 
> spark://HeartbeatReceiver@ci_164:42849_
>     _at 
> org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)_
>     _at 
> org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)_
>     _at 
> org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)_
>     _at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)_
>     _at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)_
>     _at org.apache.spark.executor.Executor.(Executor.scala:155)_
>     _at 
> org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)_
>     _at 
> org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)_
>     _at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)_
>  
> I didn't know why this _URL_ (spark://HeartbeatReceiver@ci_164:42849) is 
> invalid, so I think we should give a clearer hint for this error.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23651) Add a check for host name

2018-03-12 Thread liuxian (JIRA)
liuxian created SPARK-23651:
---

 Summary: Add a check for host name
 Key: SPARK-23651
 URL: https://issues.apache.org/jira/browse/SPARK-23651
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: liuxian


I encountered an error like this:

_org.apache.spark.SparkException: Invalid Spark URL: 
spark://HeartbeatReceiver@ci_164:42849_
    _at 
org.apache.spark.rpc.RpcEndpointAddress$.apply(RpcEndpointAddress.scala:66)_
    _at 
org.apache.spark.rpc.netty.NettyRpcEnv.asyncSetupEndpointRefByURI(NettyRpcEnv.scala:134)_
    _at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)_
    _at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:109)_
    _at org.apache.spark.util.RpcUtils$.makeDriverRef(RpcUtils.scala:32)_
    _at org.apache.spark.executor.Executor.(Executor.scala:155)_
    _at 
org.apache.spark.scheduler.local.LocalEndpoint.(LocalSchedulerBackend.scala:59)_
    _at 
org.apache.spark.scheduler.local.LocalSchedulerBackend.start(LocalSchedulerBackend.scala:126)_
    _at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:164)_

 

I didn't know why this _URL_ (spark://HeartbeatReceiver@ci_164:42849) is 
invalid, so I think we should give a clearer hint for this error.

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23632) sparkR.session() error with spark packages - JVM is not ready after 10 seconds

2018-03-12 Thread Jaehyeon Kim (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16394865#comment-16394865
 ] 

Jaehyeon Kim commented on SPARK-23632:
--

It wouldn't be an issue if the code were run interactively or a Spark session 
had been created previously. For the former, I can just repeat _sparkR.session()_ 
and, for the latter, the packages that were already downloaded will be used.

However, say a new Spark cluster is spun up and the code is run via _Rscript 
app-code.R_; it will then fail due to the timeout.

> sparkR.session() error with spark packages - JVM is not ready after 10 seconds
> --
>
> Key: SPARK-23632
> URL: https://issues.apache.org/jira/browse/SPARK-23632
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0, 2.2.1, 2.3.0
>Reporter: Jaehyeon Kim
>Priority: Minor
>
> Hi
> When I execute _sparkR.session()_ with _org.apache.hadoop:hadoop-aws:2.8.2_ 
> as follows,
> {code:java}
> library(SparkR, lib.loc=file.path(Sys.getenv('SPARK_HOME'),'R', 'lib'))
> ext_opts <- '-Dhttp.proxyHost=10.74.1.25 -Dhttp.proxyPort=8080 
> -Dhttps.proxyHost=10.74.1.25 -Dhttps.proxyPort=8080'
> sparkR.session(master = "spark://master:7077",
>appName = 'ml demo',
>sparkConfig = list(spark.driver.memory = '2g'), 
>sparkPackages = 'org.apache.hadoop:hadoop-aws:2.8.2',
>spark.driver.extraJavaOptions = ext_opts)
> {code}
> I see a *JVM is not ready after 10 seconds* error. Below are some of the log 
> messages.
> {code:java}
> Ivy Default Cache set to: /home/rstudio/.ivy2/cache
> The jars for the packages stored in: /home/rstudio/.ivy2/jars
> :: loading settings :: url = 
> jar:file:/usr/local/spark-2.2.1/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
> org.apache.hadoop#hadoop-aws added as a dependency
> :: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
>   confs: [default]
>   found org.apache.hadoop#hadoop-aws;2.8.2 in central
> ...
> ...
>   found javax.servlet.jsp#jsp-api;2.1 in central
> Error in sparkR.sparkContext(master, appName, sparkHome, sparkConfigMap,  : 
>   JVM is not ready after 10 seconds
> ...
> ...
>   found joda-time#joda-time;2.9.4 in central
> downloading 
> https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.8.2/hadoop-aws-2.8.2.jar
>  ...
> ...
> ...
>   xmlenc#xmlenc;0.52 from central in [default]
>   -
>   |  |modules||   artifacts   |
>   |   conf   | number| search|dwnlded|evicted|| number|dwnlded|
>   -
>   |  default |   76  |   76  |   76  |   0   ||   76  |   76  |
>   -
> :: retrieving :: org.apache.spark#spark-submit-parent
>   confs: [default]
>   76 artifacts copied, 0 already retrieved (27334kB/56ms)
> {code}
> It's fine if I re-execute it after the package and its dependencies are 
> downloaded.
> I believe it's because of this part: 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L181
> {code:java}
> if (!file.exists(path)) {
>   stop("JVM is not ready after 10 seconds")
> }
> {code}
> I just wonder if it might be possible to update this so that a user can 
> determine how long to wait?
> Thanks.
> Regards
> Jaehyeon



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org