[jira] [Commented] (SPARK-10903) Make sqlContext global

2015-10-20 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964672#comment-14964672
 ] 

Sun Rui commented on SPARK-10903:
-

I figured out the reason. You have to export the S3 methods in NAMESPACE.
You can take a look at 
https://github.com/apache/spark/compare/master...sun-rui:refactor_createDataFrame?expand=1


> Make sqlContext global 
> ---
>
> Key: SPARK-10903
> URL: https://issues.apache.org/jira/browse/SPARK-10903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Make sqlContext global so that we don't have to always specify it.
> e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)






[jira] [Commented] (SPARK-11200) NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"

2015-10-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964729#comment-14964729
 ] 

Sean Owen commented on SPARK-11200:
---

I have never seen this message or heard reports of it, so it sounds like it must 
be specific to your environment. Unless you can provide more information, we'd 
have to close this, as there isn't enough detail to diagnose the cause.

> NettyRpcEnv endless message "cannot send ${message} because RpcEnv is closed"
> -
>
> Key: SPARK-11200
> URL: https://issues.apache.org/jira/browse/SPARK-11200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: hujiayin
>
> The message "cannot send ${message} because RpcEnv is closed" pops up 
> endlessly after starting any of the MLlib workloads, and keeps doing so until 
> someone stops the job manually. The environment is hadoop-cdh-5.3.2 with the 
> Spark master version running in yarn-client mode. The error comes from 
> NettyRpcEnv.scala. I don't have enough time to look into this issue right now, 
> but I can verify a fix in this environment with you if you have one. 
>  






[jira] [Resolved] (SPARK-11110) Scala 2.11 build fails due to compiler errors

2015-10-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11110.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Should be resolved by https://github.com/apache/spark/pull/9126

> Scala 2.11 build fails due to compiler errors
> -
>
> Key: SPARK-11110
> URL: https://issues.apache.org/jira/browse/SPARK-11110
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Patrick Wendell
>Assignee: Jakob Odersky
>Priority: Critical
> Fix For: 1.6.0
>
>
> Right now the 2.11 build is failing due to compiler errors in SBT (though not 
> in Maven). I have updated our 2.11 compile test harness to catch this.
> https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1667/consoleFull
> {code}
> [error] 
> /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/rpc/netty/NettyRpcEnv.scala:308:
>  no valid targets for annotation on value conf - it is discarded unused. You 
> may specify targets with meta-annotations, e.g. @(transient @param)
> [error] private[netty] class NettyRpcEndpointRef(@transient conf: SparkConf)
> [error] 
> {code}
> This is one error, but there may be others past this point (the compile fails 
> fast).
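For illustration only (the actual fix went in through the pull request above), here is a 
minimal example of the meta-annotation mechanism the compiler message suggests; the class 
and field names are made up, not Spark's:

{code}
import scala.annotation.meta.field

// Targeting @transient explicitly at the generated field gives the annotation a
// valid target, instead of letting scalac 2.11 discard it with the warning above.
class Endpoint(@(transient @field) val conf: Map[String, String]) extends Serializable
{code}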






[jira] [Updated] (SPARK-11199) Improve R context management story and add getOrCreate

2015-10-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-11199:
---
Assignee: Felix Cheung

> Improve R context management story and add getOrCreate
> --
>
> Key: SPARK-11199
> URL: https://issues.apache.org/jira/browse/SPARK-11199
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
>
> Similar to SPARK-4
> Also from discussion in SPARK-10903:
> "
> Hossein Falaki added a comment - 08/Oct/15 13:06
> +1 We have seen a lot of questions from new SparkR users about the life cycle 
> of the context. 
> My question is: are we going to remove or deprecate sparkRSQL.init()? I 
> suggest we should, because right now calling that method creates a new Java 
> SQLContext object, and having two of them prevents users from viewing temp 
> tables.
> Felix Cheung added a comment - 08/Oct/15 17:13
> +1 perhaps sparkR.init() should create sqlContext and/or hiveCtx together.
> But Hossein Falaki, as of now calling sparkRSQL.init() should return the same 
> one as you can see 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L224
> Hossein Falaki added a comment - 08/Oct/15 17:16
> I meant the SQL Context: 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L236
> This call should have been "getOrCreate."
> "






[jira] [Updated] (SPARK-10876) display total application time in spark history UI

2015-10-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10876:
--
Priority: Minor  (was: Major)

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Jean-Baptiste Onofré
>Priority: Minor
> Fix For: 1.6.0
>
>
> The history file has application start and application end events. It would 
> be nice if we could use these to display the total run time for the 
> application in the history UI.
> It could be displayed similarly to "Total Uptime" for a running application.
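For illustration, a minimal Scala sketch (the listener name and wiring are assumptions, 
not the actual patch) of how the two timestamps already present in the event log could be 
turned into a duration:

{code}
import org.apache.spark.scheduler._

// Derive the application duration from the SparkListenerApplicationStart/End
// events that the history file already contains.
class AppDurationListener extends SparkListener {
  private var startTime: Option[Long] = None
  @volatile var durationMs: Option[Long] = None

  override def onApplicationStart(event: SparkListenerApplicationStart): Unit = {
    startTime = Some(event.time)
  }

  override def onApplicationEnd(event: SparkListenerApplicationEnd): Unit = {
    durationMs = startTime.map(event.time - _)
  }
}
{code}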






[jira] [Resolved] (SPARK-10876) display total application time in spark history UI

2015-10-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10876.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9059
[https://github.com/apache/spark/pull/9059]

> display total application time in spark history UI
> --
>
> Key: SPARK-10876
> URL: https://issues.apache.org/jira/browse/SPARK-10876
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Jean-Baptiste Onofré
> Fix For: 1.6.0
>
>
> The history file has application start and application end events. It would 
> be nice if we could use these to display the total run time for the 
> application in the history UI.
> It could be displayed similarly to "Total Uptime" for a running application.






[jira] [Reopened] (SPARK-11166) HIVE ON SPARK: in yarn-cluster mode, if memory is busy and there are not enough resources, the app will fail

2015-10-20 Thread yindu_asan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yindu_asan reopened SPARK-11166:


The cause of the issue is not only insufficient resources:
1. a jar error, e.g. if Spark was built with -Phive added
2. wrong arguments

The jar error has been ruled out; the arguments are still being checked.

But I think one problem may be the Hive client.

> HIVE ON SPARK: in yarn-cluster mode, if memory is busy and there are not 
> enough resources, the app will fail
> ---
>
> Key: SPARK-11166
> URL: https://issues.apache.org/jira/browse/SPARK-11166
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: yindu_asan
>  Labels: test
>
> HIVE ON SPARK: in yarn-cluster mode, if memory is busy and there are not 
> enough resources, the app will fail.
>  ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting 
> for 10 ms. Please check earlier log output for errors. Failing the 
> application.






[jira] [Commented] (SPARK-11205) Delegate to scala DataFrame API rather than print in python

2015-10-20 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964821#comment-14964821
 ] 

Jeff Zhang commented on SPARK-11205:


Will create PR soon. 

> Delegate to scala DataFrame API rather than print in python
> ---
>
> Key: SPARK-11205
> URL: https://issues.apache.org/jira/browse/SPARK-11205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> When I use DataFrame#explain(), I found the output is a little different from 
> scala API. Here's one example.
> {noformat}
> == Physical Plan ==// this line is removed in pyspark API
> Scan 
> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
> {noformat}
> After looking at the code, I found that pyspark will print the output by 
> itself rather than delegate it to spark-sql. This cause the difference 
> between scala api and python api. I think both python api and scala api try 
> to print it to standard out, so the python api can be delegated to scala api. 
> Here's some api I found that can be delegated to scala api directly:
> * printSchema()
> * explain()
> * show()






[jira] [Created] (SPARK-11206) Support SQL UI on the history server

2015-10-20 Thread Carson Wang (JIRA)
Carson Wang created SPARK-11206:
---

 Summary: Support SQL UI on the history server
 Key: SPARK-11206
 URL: https://issues.apache.org/jira/browse/SPARK-11206
 Project: Spark
  Issue Type: New Feature
  Components: SQL, Web UI
Reporter: Carson Wang


On the live web UI, there is a SQL tab which provides valuable information for 
the SQL query. But once the workload is finished, we won't see the SQL tab on 
the history server. It will be helpful if we support SQL UI on the history 
server so we can analyze it even after its execution.






[jira] [Assigned] (SPARK-11205) Delegate to scala DataFrame API rather than print in python

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11205:


Assignee: (was: Apache Spark)

> Delegate to scala DataFrame API rather than print in python
> ---
>
> Key: SPARK-11205
> URL: https://issues.apache.org/jira/browse/SPARK-11205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> When I use DataFrame#explain(), I found the output is a little different from 
> scala API. Here's one example.
> {noformat}
> == Physical Plan ==// this line is removed in pyspark API
> Scan 
> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
> {noformat}
> After looking at the code, I found that pyspark will print the output by 
> itself rather than delegate it to spark-sql. This cause the difference 
> between scala api and python api. I think both python api and scala api try 
> to print it to standard out, so the python api can be delegated to scala api. 
> Here's some api I found that can be delegated to scala api directly:
> * printSchema()
> * explain()
> * show()






[jira] [Commented] (SPARK-11205) Delegate to scala DataFrame API rather than print in python

2015-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964840#comment-14964840
 ] 

Apache Spark commented on SPARK-11205:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9177

> Delegate to scala DataFrame API rather than print in python
> ---
>
> Key: SPARK-11205
> URL: https://issues.apache.org/jira/browse/SPARK-11205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> When I use DataFrame#explain(), I found the output is a little different from 
> scala API. Here's one example.
> {noformat}
> == Physical Plan ==// this line is removed in pyspark API
> Scan 
> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
> {noformat}
> After looking at the code, I found that pyspark will print the output by 
> itself rather than delegate it to spark-sql. This cause the difference 
> between scala api and python api. I think both python api and scala api try 
> to print it to standard out, so the python api can be delegated to scala api. 
> Here's some api I found that can be delegated to scala api directly:
> * printSchema()
> * explain()
> * show()






[jira] [Commented] (SPARK-11166) HIVE ON SPARK: in yarn-cluster mode, if memory is busy and there are not enough resources, the app will fail

2015-10-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964850#comment-14964850
 ] 

Sean Owen commented on SPARK-11166:
---

[~673310629] please do not reopen an issue that has been closed by a committer 
when there is no change in the discussion. This is an issue with your config; in 
any event it pertains to Hive on Spark, which is part of Hive, not Spark.

> HIVE ON SPARK: in yarn-cluster mode, if memory is busy and there are not 
> enough resources, the app will fail
> ---
>
> Key: SPARK-11166
> URL: https://issues.apache.org/jira/browse/SPARK-11166
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: yindu_asan
>  Labels: test
>
> HIVE ON SPARK: in yarn-cluster mode, if memory is busy and there are not 
> enough resources, the app will fail.
>  ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting 
> for 10 ms. Please check earlier log output for errors. Failing the 
> application.






[jira] [Updated] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions

2015-10-20 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-11179:

Target Version/s: 1.6.0

> Push filters through aggregate if filters are subset of 'group by' expressions
> --
>
> Key: SPARK-11179
> URL: https://issues.apache.org/jira/browse/SPARK-11179
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Nitin Goyal
>Priority: Minor
>
> Push filters through an aggregate if the filters are a subset of the 'group 
> by' expressions. This optimisation can be added to Spark SQL's Optimizer class.
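For illustration, a small, self-contained Scala sketch (column names and the local-mode 
setup are made up; this uses the public DataFrame API, not the Optimizer rule itself) of 
the equivalence the proposed rule relies on:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("pushdown-sketch").setMaster("local[1]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = Seq(("sales", 10), ("hr", 5), ("sales", 7)).toDF("dept", "amount")

// A filter on a grouping column gives the same result whether it runs after or
// before the aggregate; pushing it below the aggregate processes fewer rows.
val filteredAfter  = df.groupBy("dept").sum("amount").filter($"dept" === "sales")
val filteredBefore = df.filter($"dept" === "sales").groupBy("dept").sum("amount")

filteredAfter.show()   // [sales, 17]
filteredBefore.show()  // [sales, 17], but the "hr" rows never reach the aggregate
{code}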






[jira] [Updated] (SPARK-10709) When loading a json dataset as a data frame, if the input path is wrong, the error message is very confusing

2015-10-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-10709:
--
Target Version/s:   (was: 1.6.0)
Priority: Minor  (was: Major)
  Issue Type: Improvement  (was: Bug)

> When loading a json dataset as a data frame, if the input path is wrong, the 
> error message is very confusing
> 
>
> Key: SPARK-10709
> URL: https://issues.apache.org/jira/browse/SPARK-10709
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Priority: Minor
>
> If you do something like {{sqlContext.read.json("a wrong path")}}, when we 
> actually read data, the error message is 
> {code}
> java.io.IOException: No input paths specified in job
>   at 
> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:198)
>   at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
>   at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
>   at org.apache.spark.ShuffleDependency.(Dependency.scala:91)
>   at 
> org.apache.spark.sql.execution.ShuffledRowRDD.getDependencies(ShuffledRowRDD.scala:59)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:226)
>   at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:224)
>   at scala.Option.getOrElse(Option.scala:120)
>   at org.apache.spark.rdd.RDD.dependencies(RDD.scala:224)
>   at 
> org.apache.spark.scheduler.DAGScheduler.visit$2(DAGScheduler.scala:427)
>   at 
> 

[jira] [Created] (SPARK-11204) Delegate to scala DataFrame API rather than print in python

2015-10-20 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-11204:
--

 Summary: Delegate to scala DataFrame API rather than print in 
python
 Key: SPARK-11204
 URL: https://issues.apache.org/jira/browse/SPARK-11204
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.5.1
Reporter: Jeff Zhang
Priority: Minor


When I use DataFrame#explain(), I found the output is a little different from 
scala API. Here's one example.
{noformat}
== Physical Plan ==// this line is removed in pyspark API
Scan 
JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
{noformat}

After looking at the code, I found that pyspark will print the output by itself 
rather than delegate it to spark-sql. This cause the difference between scala 
api and python api. I think both python api and scala api try to print it to 
standard out, so the python api can be deleted to scala api. Here's some api I 
found that can be delegated to scala api directly:
* printSchema()
* explain()
* show()






[jira] [Updated] (SPARK-11205) Delegate to scala DataFrame API rather than print in python

2015-10-20 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-11205:
---
Description: 
When I use DataFrame#explain(), I found the output is a little different from 
scala API. Here's one example.
{noformat}
== Physical Plan ==// this line is removed in pyspark API
Scan 
JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
{noformat}

After looking at the code, I found that pyspark will print the output by itself 
rather than delegate it to spark-sql. This cause the difference between scala 
api and python api. I think both python api and scala api try to print it to 
standard out, so the python api can be delegated to scala api. Here's some api 
I found that can be delegated to scala api directly:
* printSchema()
* explain()
* show()

  was:
When I use DataFrame#explain(), I found the output is a little different from 
scala API. Here's one example.
{noformat}
== Physical Plan ==// this line is removed in pyspark API
Scan 
JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
{noformat}

After looking at the code, I found that pyspark will print the output by itself 
rather than delegate it to spark-sql. This cause the difference between scala 
api and python api. I think both python api and scala api try to print it to 
standard out, so the python api can be deleted to scala api. Here's some api I 
found that can be delegated to scala api directly:
* printSchema()
* explain()
* show()


> Delegate to scala DataFrame API rather than print in python
> ---
>
> Key: SPARK-11205
> URL: https://issues.apache.org/jira/browse/SPARK-11205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
>
> When I use DataFrame#explain(), I found the output is a little different from 
> scala API. Here's one example.
> {noformat}
> == Physical Plan ==// this line is removed in pyspark API
> Scan 
> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
> {noformat}
> After looking at the code, I found that pyspark will print the output by 
> itself rather than delegate it to spark-sql. This cause the difference 
> between scala api and python api. I think both python api and scala api try 
> to print it to standard out, so the python api can be delegated to scala api. 
> Here's some api I found that can be delegated to scala api directly:
> * printSchema()
> * explain()
> * show()






[jira] [Created] (SPARK-11205) Delegate to scala DataFrame API rather than print in python

2015-10-20 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-11205:
--

 Summary: Delegate to scala DataFrame API rather than print in 
python
 Key: SPARK-11205
 URL: https://issues.apache.org/jira/browse/SPARK-11205
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.5.1
Reporter: Jeff Zhang
Priority: Minor


When I use DataFrame#explain(), I found the output is a little different from 
scala API. Here's one example.
{noformat}
== Physical Plan ==// this line is removed in pyspark API
Scan 
JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
{noformat}

After looking at the code, I found that pyspark will print the output by itself 
rather than delegate it to spark-sql. This cause the difference between scala 
api and python api. I think both python api and scala api try to print it to 
standard out, so the python api can be deleted to scala api. Here's some api I 
found that can be delegated to scala api directly:
* printSchema()
* explain()
* show()






[jira] [Assigned] (SPARK-11205) Delegate to scala DataFrame API rather than print in python

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11205:


Assignee: Apache Spark

> Delegate to scala DataFrame API rather than print in python
> ---
>
> Key: SPARK-11205
> URL: https://issues.apache.org/jira/browse/SPARK-11205
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>
> When I use DataFrame#explain(), I found the output is a little different from 
> scala API. Here's one example.
> {noformat}
> == Physical Plan ==// this line is removed in pyspark API
> Scan 
> JSONRelation[file:/Users/hadoop/github/spark/examples/src/main/resources/people.json][age#0L,name#1]
> {noformat}
> After looking at the code, I found that pyspark will print the output by 
> itself rather than delegate it to spark-sql. This cause the difference 
> between scala api and python api. I think both python api and scala api try 
> to print it to standard out, so the python api can be delegated to scala api. 
> Here's some api I found that can be delegated to scala api directly:
> * printSchema()
> * explain()
> * show()






[jira] [Commented] (SPARK-11206) Support SQL UI on the history server

2015-10-20 Thread Carson Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964841#comment-14964841
 ] 

Carson Wang commented on SPARK-11206:
-

I am working on this. I plan to add an onOtherEvent method to the SparkListener 
trait and post all SQL-related events to the same event bus. That is, we'll 
write SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd events to 
the event log.
One problem I've encountered is that the core module can't import classes from 
the sql module, so I need to define these SQL events and the related SQL data 
structures (like the Spark plan info) in the core module. It seems I also need 
to move SQLTab, SQLListener, etc. to the core module so that I can attach the 
SQL tab to the Spark history UI. Any suggestion is appreciated. [~zsxwing] 
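For illustration, a rough Scala sketch of the shape described above; the names follow the 
comment, but the signatures are assumptions, not the final design:

{code}
// Stand-in for org.apache.spark.scheduler.SparkListenerEvent, which the real
// change would extend instead.
trait ListenerEvent

// SQL execution events that would be written to the event log like any other event.
case class SparkListenerSQLExecutionStart(executionId: Long, description: String,
    time: Long) extends ListenerEvent
case class SparkListenerSQLExecutionEnd(executionId: Long, time: Long)
  extends ListenerEvent

trait SqlAwareListener {
  // Proposed catch-all hook, so the history server's replay bus can route SQL
  // events without the core module having to know about the sql module's types.
  def onOtherEvent(event: ListenerEvent): Unit = event match {
    case SparkListenerSQLExecutionStart(id, desc, _) =>
      println(s"SQL execution $id started: $desc")
    case SparkListenerSQLExecutionEnd(id, _) =>
      println(s"SQL execution $id finished")
    case _ => () // unknown events are ignored on replay
  }
}
{code}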

> Support SQL UI on the history server
> 
>
> Key: SPARK-11206
> URL: https://issues.apache.org/jira/browse/SPARK-11206
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Web UI
>Reporter: Carson Wang
>
> On the live web UI, there is a SQL tab which provides valuable information 
> for the SQL query. But once the workload is finished, we won't see the SQL 
> tab on the history server. It will be helpful if we support SQL UI on the 
> history server so we can analyze it even after its execution.






[jira] [Closed] (SPARK-11166) HIVE ON SPARK: in yarn-cluster mode, if memory is busy and there are not enough resources, the app will fail

2015-10-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-11166.
-
Resolution: Invalid

> HIVE ON SPARK: in yarn-cluster mode, if memory is busy and there are not 
> enough resources, the app will fail
> ---
>
> Key: SPARK-11166
> URL: https://issues.apache.org/jira/browse/SPARK-11166
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: yindu_asan
>  Labels: test
>
> HIVE ON SPARK: in yarn-cluster mode, if memory is busy and there are not 
> enough resources, the app will fail.
>  ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting 
> for 10 ms. Please check earlier log output for errors. Failing the 
> application.






[jira] [Commented] (SPARK-2654) Leveled logging in PySpark

2015-10-20 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964586#comment-14964586
 ] 

Jeff Zhang commented on SPARK-2654:
---

[~davies] I think spark-core currently also has no logging-level control through 
command-line arguments, only through log4j.properties. Although the 
implementation for controlling the logging level in spark-core and PySpark will 
be different, the configuration should be the same. 

Here's my initial thinking about it:
* use "spark.driver.log.level" to control the log level of the driver and 
"spark.executor.log.level" to control the log level of the executors
* "spark.driver.log.level" and "spark.executor.log.level" can be set in 2 ways
** Simple Configuration: just a log level like INFO/DEBUG/ERROR/...
** Advanced Configuration: a Simple Configuration followed by package-level 
configuration, like 
"DEBUG;org.apache.spark.sql=INFO;org.apache.spark.shuffle=INFO" (see the sketch 
below)

I am not sure whether there's an existing JIRA for logging-level control in 
spark-core; if you know of one, please help link them together. I'd like to help 
contribute this. Thanks



> Leveled logging in PySpark
> --
>
> Key: SPARK-2654
> URL: https://issues.apache.org/jira/browse/SPARK-2654
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Davies Liu
>
> Add more leveled logging in PySpark, the logging level should be easy 
> controlled by configuration and command line arguments.






[jira] [Created] (SPARK-11202) Unsupported dataType

2015-10-20 Thread whc (JIRA)
whc created SPARK-11202:
---

 Summary: Unsupported dataType
 Key: SPARK-11202
 URL: https://issues.apache.org/jira/browse/SPARK-11202
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.0
Reporter: whc


I read data from Oracle and save it as Parquet, then I get the following error:
java.lang.IllegalArgumentException: Unsupported dataType: 
{"type":"struct","fields":[{"name":"DOMAIN_NAME","type":"string","nullable":true,"metadata":{"name":"DOMAIN_NAME"}},{"name":"DOMAIN_ID","type":"decimal(0,-127)","nullable":true,"metadata":{"name":"DOMAIN_ID"}}]},
 [1.1] failure: `TimestampType' expected but `{' found

{"type":"struct","fields":[{"name":"DOMAIN_NAME","type":"string","nullable":true,"metadata":{"name":"DOMAIN_NAME"}},{"name":"DOMAIN_ID","type":"decimal(0,-127)","nullable":true,"metadata":{"name":"DOMAIN_ID"}}]}
^
at 
org.apache.spark.sql.types.DataType$CaseClassStringParser$.apply(DataType.scala:245)
at 
org.apache.spark.sql.types.DataType$.fromCaseClassString(DataType.scala:102)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$$anonfun$3.apply(ParquetTypesConverter.scala:62)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$$anonfun$3.apply(ParquetTypesConverter.scala:62)
at scala.util.Try.getOrElse(Try.scala:77)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromString(ParquetTypesConverter.scala:62)
at 
org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:51)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
at 
org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
at 
org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:234)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


I checked the types, but there is no Timestamp or Date type in the Oracle table.
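
One possible workaround, sketched below on the assumption that an explicit 
precision/scale is acceptable for DOMAIN_ID (the column name is taken from the error 
above; the connection details, table name, and target precision are placeholders), is to 
cast the unconstrained Oracle NUMBER column to a concrete decimal type before writing 
Parquet. This assumes a spark-shell session where sqlContext is predefined:

{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

// Read from Oracle over JDBC (placeholder connection details).
val jdbcDF = sqlContext.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")
  .option("dbtable", "MY_DOMAIN_TABLE")
  .option("driver", "oracle.jdbc.OracleDriver")
  .load()

// Give the unconstrained NUMBER column a supported precision/scale, then save.
val fixedDF = jdbcDF.withColumn("DOMAIN_ID", col("DOMAIN_ID").cast(DecimalType(38, 10)))
fixedDF.write.parquet("/tmp/domains.parquet")
{code}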






[jira] [Assigned] (SPARK-11201) StreamingContext.getOrCreate is broken in yarn-client mode

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11201:


Assignee: Apache Spark

> StreamingContext.getOrCreate is broken in yarn-client mode
> ---
>
> Key: SPARK-11201
> URL: https://issues.apache.org/jira/browse/SPARK-11201
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Hari Shreedharan
>Assignee: Apache Spark
>
> If {{StreamingContext.getOrCreate}} (or a constructor that creates the 
> Hadoop {{Configuration}}) is used, {{SparkHadoopUtil.get.conf}} is called 
> before {{SparkContext}} is created, which is when SPARK_YARN_MODE is set. So 
> in that case {{SparkHadoopUtil.get}} creates a {{SparkHadoopUtil}} instance 
> instead of a {{YarnSparkHadoopUtil}} instance.
> So, in yarn-client mode, a class cast exception gets thrown from 
> {{Client.scala}}:
> {code}
> java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot 
> be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
>   at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:169)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:266)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:631)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:120)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:523)
>   at 
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:854)
>   at 
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
>   at 
> com.cloudera.test.LongRunningApp$.com$cloudera$test$LongRunningApp$$createCheckpoint$1(LongRunningApp.scala:33)
>   at 
> com.cloudera.test.LongRunningApp$$anonfun$4.apply(LongRunningApp.scala:90)
>   at 
> com.cloudera.test.LongRunningApp$$anonfun$4.apply(LongRunningApp.scala:90)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:844)
>   at com.cloudera.test.LongRunningApp$.main(LongRunningApp.scala:90)
>   at com.cloudera.test.LongRunningApp.main(LongRunningApp.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
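
As an aside, a self-contained Scala illustration (not Spark's actual code) of the failure 
mode the description outlines: a helper whose concrete type is chosen by a flag at first 
access and then cached, so an access that happens before the flag is set makes a later 
downcast fail:

{code}
// Illustrative stand-ins for SparkHadoopUtil / YarnSparkHadoopUtil.
class HadoopUtil { def conf: String = "hadoop-conf" }
class YarnHadoopUtil extends HadoopUtil

object HadoopUtil {
  // The concrete type is decided once, at first access, based on a flag.
  private lazy val instance: HadoopUtil =
    if (sys.props.contains("EXAMPLE_YARN_MODE")) new YarnHadoopUtil else new HadoopUtil
  def get: HadoopUtil = instance
}

object Demo extends App {
  HadoopUtil.get.conf                       // early access: flag not yet set, base type is cached
  sys.props("EXAMPLE_YARN_MODE") = "true"   // the flag is only set later (cf. SparkContext creation)
  // The later downcast now fails, mirroring the ClassCastException above.
  println(scala.util.Try(HadoopUtil.get.asInstanceOf[YarnHadoopUtil]))
}
{code}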






[jira] [Assigned] (SPARK-11201) StreamingContext.getOrCreate is broken in yarn-client mode

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11201:


Assignee: (was: Apache Spark)

> StreamingContext.getOrCreate is broken in yarn-client mode
> ---
>
> Key: SPARK-11201
> URL: https://issues.apache.org/jira/browse/SPARK-11201
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Hari Shreedharan
>
> If {{StreamingContext.getOrCreate}} (or a constructor that creates the 
> Hadoop {{Configuration}}) is used, {{SparkHadoopUtil.get.conf}} is called 
> before {{SparkContext}} is created, which is when SPARK_YARN_MODE is set. So 
> in that case {{SparkHadoopUtil.get}} creates a {{SparkHadoopUtil}} instance 
> instead of a {{YarnSparkHadoopUtil}} instance.
> So, in yarn-client mode, a class cast exception gets thrown from 
> {{Client.scala}}:
> {code}
> java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot 
> be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
>   at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:169)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:266)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:631)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:120)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:523)
>   at 
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:854)
>   at 
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
>   at 
> com.cloudera.test.LongRunningApp$.com$cloudera$test$LongRunningApp$$createCheckpoint$1(LongRunningApp.scala:33)
>   at 
> com.cloudera.test.LongRunningApp$$anonfun$4.apply(LongRunningApp.scala:90)
>   at 
> com.cloudera.test.LongRunningApp$$anonfun$4.apply(LongRunningApp.scala:90)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:844)
>   at com.cloudera.test.LongRunningApp$.main(LongRunningApp.scala:90)
>   at com.cloudera.test.LongRunningApp.main(LongRunningApp.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}






[jira] [Updated] (SPARK-11202) Unsupported dataType

2015-10-20 Thread whc (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

whc updated SPARK-11202:

Component/s: SQL

> Unsupported dataType
> 
>
> Key: SPARK-11202
> URL: https://issues.apache.org/jira/browse/SPARK-11202
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: whc
>
> I read data from Oracle and save it as Parquet, then I get the following error:
> java.lang.IllegalArgumentException: Unsupported dataType: 
> {"type":"struct","fields":[{"name":"DOMAIN_NAME","type":"string","nullable":true,"metadata":{"name":"DOMAIN_NAME"}},{"name":"DOMAIN_ID","type":"decimal(0,-127)","nullable":true,"metadata":{"name":"DOMAIN_ID"}}]},
>  [1.1] failure: `TimestampType' expected but `{' found
> {"type":"struct","fields":[{"name":"DOMAIN_NAME","type":"string","nullable":true,"metadata":{"name":"DOMAIN_NAME"}},{"name":"DOMAIN_ID","type":"decimal(0,-127)","nullable":true,"metadata":{"name":"DOMAIN_ID"}}]}
> ^
> at 
> org.apache.spark.sql.types.DataType$CaseClassStringParser$.apply(DataType.scala:245)
> at 
> org.apache.spark.sql.types.DataType$.fromCaseClassString(DataType.scala:102)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$$anonfun$3.apply(ParquetTypesConverter.scala:62)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$$anonfun$3.apply(ParquetTypesConverter.scala:62)
> at scala.util.Try.getOrElse(Try.scala:77)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetTypesConverter$.convertFromString(ParquetTypesConverter.scala:62)
> at 
> org.apache.spark.sql.execution.datasources.parquet.RowWriteSupport.init(ParquetTableSupport.scala:51)
> at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:288)
> at 
> org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.(ParquetRelation.scala:94)
> at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
> at 
> org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:234)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> I checked the types, but there is no Timestamp or Date type in the Oracle table.






[jira] [Commented] (SPARK-11201) StreamingContext.getOrCreate is broken in yarn-client mode

2015-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964608#comment-14964608
 ] 

Apache Spark commented on SPARK-11201:
--

User 'harishreedharan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9174

> StreamingContext.getOrCreate is broken in yarn-client mode
> ---
>
> Key: SPARK-11201
> URL: https://issues.apache.org/jira/browse/SPARK-11201
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Hari Shreedharan
>
> If {{StreamingContext.getOrCreate}} (or a constructor that creates the 
> Hadoop {{Configuration}}) is used, {{SparkHadoopUtil.get.conf}} is called 
> before {{SparkContext}} is created, which is when SPARK_YARN_MODE is set. So 
> in that case {{SparkHadoopUtil.get}} creates a {{SparkHadoopUtil}} instance 
> instead of a {{YarnSparkHadoopUtil}} instance.
> So, in yarn-client mode, a class cast exception gets thrown from 
> {{Client.scala}}:
> {code}
> java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot 
> be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
>   at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:169)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:266)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:631)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:120)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:523)
>   at 
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:854)
>   at 
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
>   at 
> com.cloudera.test.LongRunningApp$.com$cloudera$test$LongRunningApp$$createCheckpoint$1(LongRunningApp.scala:33)
>   at 
> com.cloudera.test.LongRunningApp$$anonfun$4.apply(LongRunningApp.scala:90)
>   at 
> com.cloudera.test.LongRunningApp$$anonfun$4.apply(LongRunningApp.scala:90)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:844)
>   at com.cloudera.test.LongRunningApp$.main(LongRunningApp.scala:90)
>   at com.cloudera.test.LongRunningApp.main(LongRunningApp.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}






[jira] [Commented] (SPARK-11166) HIVE ON SPARK: in yarn-cluster mode, if memory is busy and there are not enough resources, the app will fail

2015-10-20 Thread yindu_asan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964857#comment-14964857
 ] 

yindu_asan commented on SPARK-11166:


Thanks @Marcelo Vanzin @Sean Owen, sorry!

> HIVE ON SPARK: in yarn-cluster mode, if memory is busy and there are not 
> enough resources, the app will fail
> ---
>
> Key: SPARK-11166
> URL: https://issues.apache.org/jira/browse/SPARK-11166
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: yindu_asan
>  Labels: test
>
> HIVE ON SPARK: in yarn-cluster mode, if memory is busy and there are not 
> enough resources, the app will fail.
>  ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting 
> for 10 ms. Please check earlier log output for errors. Failing the 
> application.






[jira] [Created] (SPARK-11207) Add test cases for normal LinearRegression solver as followup.

2015-10-20 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-11207:
--

 Summary: Add test cases for normal LinearRegression solver as 
followup.
 Key: SPARK-11207
 URL: https://issues.apache.org/jira/browse/SPARK-11207
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Kai Sasaki


This is the follow-up work for SPARK-10668.

* Fix minor style issues.
* Add a test case checking that the solver is selected properly (a sketch 
follows below).
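
For illustration, a minimal Scala sketch of such a check (the solver values follow 
SPARK-10668; the setter/getter names and the assertion style are assumptions, not the 
actual test to be added):

{code}
import org.apache.spark.ml.regression.LinearRegression

// Check that the documented solver values are accepted and round-trip through the
// param; "auto" is expected to choose between the normal-equation and L-BFGS paths.
val lr = new LinearRegression()
Seq("auto", "l-bfgs", "normal").foreach { solver =>
  assert(lr.setSolver(solver).getSolver == solver)
}
{code}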






[jira] [Commented] (SPARK-4368) Ceph integration?

2015-10-20 Thread Serge Smertin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964972#comment-14964972
 ] 

Serge Smertin commented on SPARK-4368:
--

If it's decided that this should be hosted outside of the project, is there any 
documented way to add a new storage abstraction?

> Ceph integration?
> -
>
> Key: SPARK-4368
> URL: https://issues.apache.org/jira/browse/SPARK-4368
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Serge Smertin
>
> There is a use case of storing a large number of relatively small BLOB 
> objects (2-20 MB), which requires some ugly workarounds in HDFS environments. 
> There is a need to process those BLOBs close to the data themselves, which is 
> why the MapReduce paradigm is a good fit, as it guarantees data locality.
> Ceph seems to be one of the systems that maintains both of the properties 
> (small files and data locality) -  
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I 
> know already that Spark supports GlusterFS - 
> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E
> So I wonder: could there be an integration with this storage solution, and 
> what would be the effort of doing that? 






[jira] [Assigned] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10971:


Assignee: (was: Apache Spark)

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> I'm running Spark on YARN and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes it's on the path. But on our YARN deployment 
> R isn't installed on the nodes, so it needs to be distributed along with the 
> job, and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript; 
> RRunner should have something similar so it works in cluster mode.






[jira] [Updated] (SPARK-11182) HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode

2015-10-20 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-11182:
---
Affects Version/s: 1.5.1

> HDFS Delegation Token will be expired when calling 
> "UserGroupInformation.getCurrentUser.addCredentials" in HA mode
> --
>
> Key: SPARK-11182
> URL: https://issues.apache.org/jira/browse/SPARK-11182
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Liangliang Gu
>
> In HA mode, DFSClient automatically generates an HDFS delegation token for 
> each NameNode, and these tokens are not updated when Spark updates the 
> credentials for the current user.
> Spark should update these tokens in order to avoid token-expired errors.






[jira] [Commented] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964983#comment-14964983
 ] 

Apache Spark commented on SPARK-10971:
--

User 'sun-rui' has created a pull request for this issue:
https://github.com/apache/spark/pull/9179

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes it's in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10971:


Assignee: Apache Spark

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Apache Spark
>
> I'm running spark on yarn and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes it's in the path. But on our YARN deployment 
> R isn't installed on the nodes so it needs to be distributed along with the 
> job and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript. 
> RRunner should have something similar so it works in cluster mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11207) Add test cases for solver selection of LinearRegression as followup.

2015-10-20 Thread Kai Sasaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Sasaki updated SPARK-11207:
---
Summary: Add test cases for solver selection of LinearRegression as 
followup.  (was: Add test cases for normal LinearRegression solver as followup.)

> Add test cases for solver selection of LinearRegression as followup.
> 
>
> Key: SPARK-11207
> URL: https://issues.apache.org/jira/browse/SPARK-11207
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>  Labels: ML
>
> This is the follow up work of SPARK-10668.
> * Fix minor style issues.
> * Add test case for checking whether solver is selected properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11208) Filter out 'hive.metastore.rawstore.impl' from executionHive temporary config

2015-10-20 Thread Artem Aliev (JIRA)
Artem Aliev created SPARK-11208:
---

 Summary: Filter out 'hive.metastore.rawstore.impl' from 
executionHive temporary config
 Key: SPARK-11208
 URL: https://issues.apache.org/jira/browse/SPARK-11208
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.5.1, 1.6.0
Reporter: Artem Aliev


Spark uses two Hive metastores: an external one for storing tables and an internal 
one (executionHive):
{code}
/**
 * The copy of the hive client that is used for execution. Currently this must always be
 * Hive 13 as this is the version of Hive that is packaged with Spark SQL. This copy of the
 * client is used for execution related tasks like registering temporary functions or ensuring
 * that the ThreadLocal SessionState is correctly populated. This copy of Hive is not used
 * for storing persistent metadata, and only point to a dummy metastore in a
 * temporary directory.
 */
{code}
The executionHive is assumed to be a standard metastore located in a temporary 
directory as a Derby DB. But hive.metastore.rawstore.impl was not filtered out, 
so any custom metastore implementation with other storage properties (not JDO) 
will persist those temporary functions. CassandraMetaStore from DataStax 
Enterprise is one example.
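
As a rough illustration of the fix being requested (the helper below is a made-up sketch, not the actual Spark code), the configuration handed to the temporary executionHive could simply drop the custom rawstore key so that the execution client falls back to the default Derby-backed store:
{code}
// Hypothetical sketch: filter metastore-implementation overrides out of the
// configuration used to build the temporary executionHive client.
def executionHiveConf(userConf: Map[String, String]): Map[String, String] = {
  val keysToDrop = Set("hive.metastore.rawstore.impl")
  userConf.filter { case (key, _) => !keysToDrop.contains(key) }
}

// Example: the custom override is removed before the temporary client is created.
// executionHiveConf(Map("hive.metastore.rawstore.impl" -> "com.example.CustomStore"))
{code}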



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11208) Filter out 'hive.metastore.rawstore.impl' from executionHive temporary config

2015-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964974#comment-14964974
 ] 

Apache Spark commented on SPARK-11208:
--

User 'artem-aliev' has created a pull request for this issue:
https://github.com/apache/spark/pull/9178

> Filter out 'hive.metastore.rawstore.impl' from executionHive temporary config
> -
>
> Key: SPARK-11208
> URL: https://issues.apache.org/jira/browse/SPARK-11208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Artem Aliev
>
> Spark use two hive meta stores: external one for storing tables and internal 
> one (executionHive):
> {code}
> /**
> The copy of the hive client that is used for execution. Currently this must 
> always be
> Hive 13 as this is the version of Hive that is packaged with Spark SQL. This 
> copy of the
> client is used for execution related tasks like registering temporary 
> functions or ensuring
> that the ThreadLocal SessionState is correctly populated. This copy of Hive 
> is not used
> for storing persistent metadata, and only point to a dummy metastore in a 
> temporary directory. */
> {code}
> The executionHive assumed to be a standard meta store located in temporary 
> directory as a derby db. But hive.metastore.rawstore.impl was not filtered 
> out so any custom implementation of the metastore with other storage 
> properties (not JDO) will persist that temporary functions. 
> CassandraMetaStore from DataStax Enterprise is one of examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11208) Filter out 'hive.metastore.rawstore.impl' from executionHive temporary config

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11208:


Assignee: (was: Apache Spark)

> Filter out 'hive.metastore.rawstore.impl' from executionHive temporary config
> -
>
> Key: SPARK-11208
> URL: https://issues.apache.org/jira/browse/SPARK-11208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Artem Aliev
>
> Spark use two hive meta stores: external one for storing tables and internal 
> one (executionHive):
> {code}
> /**
> The copy of the hive client that is used for execution. Currently this must 
> always be
> Hive 13 as this is the version of Hive that is packaged with Spark SQL. This 
> copy of the
> client is used for execution related tasks like registering temporary 
> functions or ensuring
> that the ThreadLocal SessionState is correctly populated. This copy of Hive 
> is not used
> for storing persistent metadata, and only point to a dummy metastore in a 
> temporary directory. */
> {code}
> The executionHive assumed to be a standard meta store located in temporary 
> directory as a derby db. But hive.metastore.rawstore.impl was not filtered 
> out so any custom implementation of the metastore with other storage 
> properties (not JDO) will persist that temporary functions. 
> CassandraMetaStore from DataStax Enterprise is one of examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11209) Add window functions into SparkR [step 1]

2015-10-20 Thread Sun Rui (JIRA)
Sun Rui created SPARK-11209:
---

 Summary: Add window functions into SparkR [step 1]
 Key: SPARK-11209
 URL: https://issues.apache.org/jira/browse/SPARK-11209
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.1
Reporter: Sun Rui


Add the following 4 window functions into SparkR (a sketch of the corresponding Scala API follows the list):

- lead
- cumuDist
- lag
- ntile
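
For reference, a minimal sketch of the underlying Scala DataFrame API that these wrappers would expose (it assumes a HiveContext, since window functions in Spark 1.5 need Hive support; the data and column names are made up):
{code}
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lead, lag, ntile, cumeDist}

// `sc` is an existing SparkContext, e.g. the one provided by spark-shell.
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

val df = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("grp", "v")
val w = Window.partitionBy("grp").orderBy("v")

df.select($"grp", $"v",
  lead("v", 1).over(w).as("lead_v"),
  lag("v", 1).over(w).as("lag_v"),
  ntile(2).over(w).as("ntile_2"),
  cumeDist().over(w).as("cume_dist")
).show()
{code}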




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11209) Add window functions into SparkR [step 1]

2015-10-20 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964995#comment-14964995
 ] 

Sun Rui commented on SPARK-11209:
-

I am working on this.

> Add window functions into SparkR [step 1]
> -
>
> Key: SPARK-11209
> URL: https://issues.apache.org/jira/browse/SPARK-11209
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Sun Rui
>
> Add the following 4 window functions into SparkR:
> lead
> cumuDist
> lag
> ntile



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4368) Ceph integration?

2015-10-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964994#comment-14964994
 ] 

Steve Loughran commented on SPARK-4368:
---

Looking at the Ceph code, the strategy is:

1. add the ceph JAR to the classpath via {{--jars}}
2. in your Spark config or Hadoop site file, declare the new FS:

{code}
<property>
  <name>fs.AbstractFileSystem.ceph.impl</name>
  <value>org.apache.hadoop.fs.ceph.CephFs</value>
</property>

<property>
  <name>fs.ceph.impl</name>
  <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
</property>
{code}

3. use a {{ceph://}} URL for referring to source or dest.

The cephfs JAR should really be listing its filesystem in the resource 
{{src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem}} for 
auto-loading without editing the config files, but that's something to take up 
with them (or you can create the relevant resource in your own source tree; look at a 
[hadoop 
example|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-azure/src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem]).

4. Finally, CephFS may not be a real FS as far as the Hadoop APIs are concerned: 
some operations may be non-atomic, non-concurrent, or fail differently. In 
particular, on a non-kerberized YARN cluster your work runs as user 'yarn', so 
you don't get access to your own data. And on a secure cluster, you are on your 
own w.r.t. ticket renewal. Nobody but you will be testing Spark on Ceph. YMMV.
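
To make step 3 concrete, here is a minimal sketch of a Spark job once the JAR and configuration are in place (the monitor address and path are made up, and I have not run this against a real Ceph cluster):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Assumes the cephfs-hadoop JAR was shipped with --jars; the two fs.ceph
// properties above are also set programmatically here for completeness.
val sc = new SparkContext(new SparkConf().setAppName("ceph-blob-count"))
sc.hadoopConfiguration.set("fs.ceph.impl", "org.apache.hadoop.fs.ceph.CephFileSystem")
sc.hadoopConfiguration.set("fs.AbstractFileSystem.ceph.impl", "org.apache.hadoop.fs.ceph.CephFs")

// Read the small BLOBs as (path, stream) pairs and do something trivial with them.
val blobs = sc.binaryFiles("ceph://mon-host:6789/data/blobs/")
println(s"number of blobs: ${blobs.count()}")
{code}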

> Ceph integration?
> -
>
> Key: SPARK-4368
> URL: https://issues.apache.org/jira/browse/SPARK-4368
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Serge Smertin
>
> There is a use-case of storing big number of relatively small BLOB objects 
> (2-20Mb), which has to have some ugly workarounds in HDFS environments. There 
> is a need to process those BLOBs close to data themselves, so that's why 
> MapReduce paradigm is good, as it guarantees data locality.
> Ceph seems to be one of the systems that maintains both of the properties 
> (small files and data locality) -  
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I 
> know already that Spark supports GlusterFS - 
> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E
> So I wonder, could there be an integration with this storage solution and 
> what could be the effort of doing that? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10043) Add window functions into SparkR

2015-10-20 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964990#comment-14964990
 ] 

Sun Rui commented on SPARK-10043:
-

Break this issue down into 2 sub-issues.

> Add window functions into SparkR
> 
>
> Key: SPARK-10043
> URL: https://issues.apache.org/jira/browse/SPARK-10043
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yu Ishikawa
>
> Add window functions as follows in SparkR. I think we should improve 
> {{collect}} function in SparkR.
> - lead
> - cumuDist
> - denseRank
> - lag
> - ntile
> - percentRank
> - rank
> - rowNumber



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4368) Ceph integration?

2015-10-20 Thread Serge Smertin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964970#comment-14964970
 ] 

Serge Smertin commented on SPARK-4368:
--

Here you go: a Hadoop FileSystem implementation on top of Ceph. It even supports 
data locality, and the project also ships a Vagrant setup: 
https://github.com/ceph/cephfs-hadoop/blob/master/src/main/java/org/apache/hadoop/fs/ceph/CephFileSystem.java#L538

> Ceph integration?
> -
>
> Key: SPARK-4368
> URL: https://issues.apache.org/jira/browse/SPARK-4368
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Serge Smertin
>
> There is a use-case of storing big number of relatively small BLOB objects 
> (2-20Mb), which has to have some ugly workarounds in HDFS environments. There 
> is a need to process those BLOBs close to data themselves, so that's why 
> MapReduce paradigm is good, as it guarantees data locality.
> Ceph seems to be one of the systems that maintains both of the properties 
> (small files and data locality) -  
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I 
> know already that Spark supports GlusterFS - 
> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E
> So I wonder, could there be an integration with this storage solution and 
> what could be the effort of doing that? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11208) Filter out 'hive.metastore.rawstore.impl' from executionHive temporary config

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11208:


Assignee: Apache Spark

> Filter out 'hive.metastore.rawstore.impl' from executionHive temporary config
> -
>
> Key: SPARK-11208
> URL: https://issues.apache.org/jira/browse/SPARK-11208
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Artem Aliev
>Assignee: Apache Spark
>
> Spark use two hive meta stores: external one for storing tables and internal 
> one (executionHive):
> {code}
> /**
> The copy of the hive client that is used for execution. Currently this must 
> always be
> Hive 13 as this is the version of Hive that is packaged with Spark SQL. This 
> copy of the
> client is used for execution related tasks like registering temporary 
> functions or ensuring
> that the ThreadLocal SessionState is correctly populated. This copy of Hive 
> is not used
> for storing persistent metadata, and only point to a dummy metastore in a 
> temporary directory. */
> {code}
> The executionHive assumed to be a standard meta store located in temporary 
> directory as a derby db. But hive.metastore.rawstore.impl was not filtered 
> out so any custom implementation of the metastore with other storage 
> properties (not JDO) will persist that temporary functions. 
> CassandraMetaStore from DataStax Enterprise is one of examples.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11210) Add window functions into SparkR [step 2]

2015-10-20 Thread Sun Rui (JIRA)
Sun Rui created SPARK-11210:
---

 Summary: Add window functions into SparkR [step 2]
 Key: SPARK-11210
 URL: https://issues.apache.org/jira/browse/SPARK-11210
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.1
Reporter: Sun Rui


Add the following 4 window functions into SparkR:

- denseRank
- percentRank
- rank
- rowNumber



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11206) Support SQL UI on the history server

2015-10-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964997#comment-14964997
 ] 

Steve Loughran commented on SPARK-11206:


There's another strategy which would have to be taken up with the Spark 
developers: unwind the dependencies to make the history server downstream of SQL.

> Support SQL UI on the history server
> 
>
> Key: SPARK-11206
> URL: https://issues.apache.org/jira/browse/SPARK-11206
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL, Web UI
>Reporter: Carson Wang
>
> On the live web UI, there is a SQL tab which provides valuable information 
> for the SQL query. But once the workload is finished, we won't see the SQL 
> tab on the history server. It will be helpful if we support SQL UI on the 
> history server so we can analyze it even after its execution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11207) Add test cases for solver selection of LinearRegression as followup.

2015-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965117#comment-14965117
 ] 

Apache Spark commented on SPARK-11207:
--

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/9180

> Add test cases for solver selection of LinearRegression as followup.
> 
>
> Key: SPARK-11207
> URL: https://issues.apache.org/jira/browse/SPARK-11207
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>  Labels: ML
>
> This is the follow up work of SPARK-10668.
> * Fix minor style issues.
> * Add test case for checking whether solver is selected properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965118#comment-14965118
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

To be thread-safe, the HashMap blockIdToSeqNumRanges "extends" SynchronizedMap 
(it is constructed with mutable.SynchronizedMap mixed in).

In the onStart() function, we clear the blockIdToSeqNumRanges map. There, the 
clear() method is the one from SynchronizedMap: that's where the 
ClassCastException happens.

I'm checking why. Do you use Scala 2.11 or 2.10?
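
For anyone following along, a minimal illustration of the pattern involved (not the actual Spark code or fix): the synchronized behaviour only exists on instances constructed with the trait mixed in, so a plain HashMap obtained some other way cannot be cast to it.
{code}
import scala.collection.mutable

// Thread-safe map built by mixing SynchronizedMap into HashMap at construction.
val blockIdToSeqNumRanges: mutable.Map[String, Int] =
  new mutable.HashMap[String, Int] with mutable.SynchronizedMap[String, Int]

blockIdToSeqNumRanges.put("block-0", 42)
blockIdToSeqNumRanges.clear() // fine: clear() here resolves to SynchronizedMap.clear()

// A plain HashMap created elsewhere is NOT a SynchronizedMap, so a cast such as
// the one implied by the stack trace fails at runtime:
// plainMap.asInstanceOf[mutable.SynchronizedMap[String, Int]] // ClassCastException
{code}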

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11207) Add test cases for solver selection of LinearRegression as followup.

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11207:


Assignee: Apache Spark

> Add test cases for solver selection of LinearRegression as followup.
> 
>
> Key: SPARK-11207
> URL: https://issues.apache.org/jira/browse/SPARK-11207
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>Assignee: Apache Spark
>  Labels: ML
>
> This is the follow up work of SPARK-10668.
> * Fix minor style issues.
> * Add test case for checking whether solver is selected properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11207) Add test cases for solver selection of LinearRegression as followup.

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11207:


Assignee: (was: Apache Spark)

> Add test cases for solver selection of LinearRegression as followup.
> 
>
> Key: SPARK-11207
> URL: https://issues.apache.org/jira/browse/SPARK-11207
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Kai Sasaki
>  Labels: ML
>
> This is the follow up work of SPARK-10668.
> * Fix minor style issues.
> * Add test case for checking whether solver is selected properly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965125#comment-14965125
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

By the way, with Spark 1.4.x, the KinesisReceiver didn't use SynchronizedMap 
(the code was simpler).

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11173) Cannot save data via MSSQL JDBC

2015-10-20 Thread Gianluca Salvo (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gianluca Salvo reopened SPARK-11173:


Hi,
I suppose I've found the problem: I downloaded the source code and followed the 
exception stack trace. The exception starts in DataFrameWriter.scala, row 275, 
because it is trying to create a table that already exists.
{code}
if (!tableExists) {
  val schema = JdbcUtils.schemaString(df, url)
  val sql = s"CREATE TABLE $table ($schema)"
  conn.prepareStatement(sql).executeUpdate()
}
{code}
In the same file, at row 256, the following is invoked:
{code}
var tableExists = JdbcUtils.tableExists(conn, table)
{code}
In JdbcUtils.scala, the function tableExists uses the following code to check 
for the presence of the table:
{code}
Try(conn.prepareStatement(s"SELECT 1 FROM $table LIMIT 1").executeQuery().next()).isSuccess
{code}
_This code is not valid on SQL Server, which implements *TOP* to limit the number 
of rows._
Could you change the test query to:
{code}
select 1 from $table where 1=0
{code}
You get an exception if the table does not exist and you don't touch the data, 
but you're compliant with more RDBMSs.
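
A sketch of what the suggested change could look like (the method name follows the existing JdbcUtils.tableExists; the rest is illustrative, not a patch):
{code}
import java.sql.Connection
import scala.util.Try

// Probe with a query that returns no rows; WHERE 1=0 is accepted by SQL Server
// as well as by databases that support LIMIT, so the Try only fails when the
// table itself is missing.
def tableExists(conn: Connection, table: String): Boolean =
  Try(conn.prepareStatement(s"SELECT 1 FROM $table WHERE 1=0").executeQuery().next()).isSuccess
{code}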

Thanks,

Best regards

> Cannot save data via MSSQL JDBC
> ---
>
> Key: SPARK-11173
> URL: https://issues.apache.org/jira/browse/SPARK-11173
> Project: Spark
>  Issue Type: Bug
>  Components: Java API, PySpark, SQL
>Affects Versions: 1.5.1
> Environment: Windows 7 sp1 x64, java version "1.8.0_60", Spark 1.5.1, 
> hadoop 2.6, microsoft jdbc 4.2, pyspark
>Reporter: Gianluca Salvo
>  Labels: patch
>
> Hello,
> I'm experiencing an issue writing a dataframe via JDBC. My code is:
> {code:title=Example.python|borderStyle=solid}
> from pyspark import SparkContext
> from pyspark.sql import SQLContext
> import sys
> sc=SparkContext(appName="SQL Query")
> sqlctx=SQLContext(sc)
> serverName="SQLIPAddress"
> serverPort="SQL Port"
> serverUsername="username"
> serverPassword="password"
> serverDatabase="database"
> #
> connString="jdbc:sqlserver://{SERVER}:{PORT};user={USER};password={PASSWORD};databasename={DATABASENAME}"
> connString=connString.format(SERVER=serverName,PORT=serverPort,USER=serverUsername,PASSWORD=serverPassword,DATABASENAME=serverDatabase)
> df=sqlctx.read.format("jdbc").options(url=connString,dbtable="(select * from 
> TestTable) as test_Table").load()
> df.show()
> try:
> df.write.jdbc(connString,"Test_Target","append")
> print("saving completed")
> except:
> print("Error in saving data",sys.exc_info()[0])
> sc.stop()
> {code}
> Even if I specify *append*, the code throws an exception saying it is trying 
> to create the table *Test_Target*, but the table is already present.
> If I target the script at MariaDB, all is fine:
> {code:title=New Connection string|borderStyle=solid}
> connString="jdbc:mysql://{SERVER}:{PORT}/{DATABASENAME}?user={USER}={PASSWORD}";
> {code}
> The problem seems to be the Microsoft JDBC driver. Can you suggest or 
> implement some workaround?
> Best regards



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11212) Make RDD's preferred locations support the executor location and fix ReceiverTracker for multiple executors in a host

2015-10-20 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-11212:


 Summary: Make RDD's preferred locations support the executor 
location and fix ReceiverTracker for multiple executors in a host  
 Key: SPARK-11212
 URL: https://issues.apache.org/jira/browse/SPARK-11212
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Streaming
Reporter: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965133#comment-14965133
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

I guess you used Scala 2.10.4 (default) for the compilation. Can you try to add 
scala-2.11 in the pom.xml properties to build with scala 2.11 and see if it 
helps ?

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965133#comment-14965133
 ] 

Jean-Baptiste Onofré edited comment on SPARK-11193 at 10/20/15 1:58 PM:


I guess you used Scala 2.10.4 (default) for the compilation. Can you try to 
build with -Pscala-2.11 to see if it helps ?


was (Author: jbonofre):
I guess you used Scala 2.10.4 (default) for the compilation. Can you try to add 
scala-2.11 in the pom.xml properties to build with scala 2.11 and see if it 
helps ?

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11212) Make RDD's preferred locations support the executor location and fix ReceiverTracker for multiple executors in a host

2015-10-20 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965136#comment-14965136
 ] 

Apache Spark commented on SPARK-11212:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/9181

> Make RDD's preferred locations support the executor location and fix 
> ReceiverTracker for multiple executors in a host  
> ---
>
> Key: SPARK-11212
> URL: https://issues.apache.org/jira/browse/SPARK-11212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11212) Make RDD's preferred locations support the executor location and fix ReceiverTracker for multiple executors in a host

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11212:


Assignee: (was: Apache Spark)

> Make RDD's preferred locations support the executor location and fix 
> ReceiverTracker for multiple executors in a host  
> ---
>
> Key: SPARK-11212
> URL: https://issues.apache.org/jira/browse/SPARK-11212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: Shixiong Zhu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11212) Make RDD's preferred locations support the executor location and fix ReceiverTracker for multiple executors in a host

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11212:


Assignee: Apache Spark

> Make RDD's preferred locations support the executor location and fix 
> ReceiverTracker for multiple executors in a host  
> ---
>
> Key: SPARK-11212
> URL: https://issues.apache.org/jira/browse/SPARK-11212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Streaming
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes

2015-10-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965007#comment-14965007
 ] 

Steve Loughran commented on SPARK-11177:


0-byte files are a trouble spot in object stores, as they are often used/abused 
by Hadoop FS clients to mimic directories.

One thing to consider is actually skipping 0-byte files, on the basis that they 
have no relevant data whatsoever.
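
As a stopgap on the user side, something along these lines should work (a rough sketch with a made-up path: it lists the directory through the Hadoop FileSystem API and only hands non-empty files to wholeTextFiles):
{code}
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkContext

// Build a comma-separated list of the non-empty files under `dir`.
def nonEmptyFiles(sc: SparkContext, dir: String): String = {
  val path = new Path(dir)
  val fs = path.getFileSystem(sc.hadoopConfiguration)
  fs.listStatus(path)
    .filter(status => status.isFile && status.getLen > 0)
    .map(_.getPath.toString)
    .mkString(",")
}

// val texts = sc.wholeTextFiles(nonEmptyFiles(sc, "s3n://some-bucket/input/"))
{code}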

> sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero 
> bytes
> ---
>
> Key: SPARK-11177
> URL: https://issues.apache.org/jira/browse/SPARK-11177
> Project: Spark
>  Issue Type: Sub-task
>  Components: Input/Output
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> From a user report:
> {quote}
> When I upload a series of text files to an S3 directory and one of the files 
> is empty (0 bytes). The sc.wholeTextFiles method stack traces.
> java.lang.ArrayIndexOutOfBoundsException: 0
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:506)
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:285)
> at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:245)
> at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
> at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
> at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
> {quote}
> It looks like this has been a longstanding issue:
> * 
> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-wholeTextFiles-error-td8872.html
> * 
> https://stackoverflow.com/questions/31051107/read-multiple-files-from-a-directory-using-spark
> * 
> https://forums.databricks.com/questions/1799/arrayindexoutofboundsexception-with-wholetextfiles.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11128) strange NPE when writing in non-existing S3 bucket

2015-10-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965016#comment-14965016
 ] 

Steve Loughran commented on SPARK-11128:


Which Hadoop version is this? This JIRA could be moved to Hadoop, as it is highly 
likely this NPE point still exists. At the very least, the methods could do some 
precondition checks and fail with a meaningful message.
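
Purely as an illustration of the kind of precondition check meant here (not actual NativeS3FileSystem code, and the names are made up):
{code}
// Hypothetical guard: fail fast with a clear message instead of an NPE when the
// bucket lookup has nothing to work with.
def checkedGetFileStatus(bucketName: String, key: String): Unit = {
  require(bucketName != null && bucketName.nonEmpty, "S3 bucket name must not be empty")
  require(key != null, s"Null key requested in bucket '$bucketName'")
  // ...the real lookup would follow; a missing bucket should surface as a
  // FileNotFoundException naming the bucket, not a NullPointerException.
}
{code}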

> strange NPE when writing in non-existing S3 bucket
> --
>
> Key: SPARK-11128
> URL: https://issues.apache.org/jira/browse/SPARK-11128
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.1
>Reporter: mathieu despriee
>Priority: Minor
>
> For the record, as it's relatively minor, and related to s3n (not tested with 
> s3a).
> By mistake, we tried writing a parquet dataframe to a non-existing s3 bucket, 
> with a simple df.write.parquet(s3path).
> We got a NPE (see stack trace below), which is very misleading.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:73)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
> at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11119) cleanup unsafe array and map

2015-10-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9:
--
Assignee: Wenchen Fan

> cleanup unsafe array and map
> 
>
> Key: SPARK-9
> URL: https://issues.apache.org/jira/browse/SPARK-9
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 1.6.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11195) Exception thrown on executor throws ClassNotFound on driver

2015-10-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965044#comment-14965044
 ] 

Sean Owen commented on SPARK-11195:
---

What is in your assembly jar? I assume you're sure it has MyException? What 
other third-party classes are in it? I wonder if you have some class that Spark 
uses that isn't shaded and is interfering.

> Exception thrown on executor throws ClassNotFound on driver
> ---
>
> Key: SPARK-11195
> URL: https://issues.apache.org/jira/browse/SPARK-11195
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Hurshal Patel
>
> I have a minimal repro job
> {code:title=Repro.scala}
> package repro
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkException
> class MyException(message: String) extends Exception(message: String)
> object Repro {
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("MyException ClassNotFound Repro")
> val sc = new SparkContext(conf)
> sc.parallelize(List(1)).map { x =>
>   throw new repro.MyException("this is a failure")
>   true
> }.collect()
>   }
> }
> {code}
> On Spark 1.4.1, I get a task failure with the reason correctly set to 
> MyException.
> On Spark 1.5.1, I _expect_ the same behavior, but instead I get a task 
> failure with an UnknownReason caused by ClassNotFoundException.
>  
> here is the job on vanilla Spark 1.4.1:
> {code:title=spark_1.5.1_log}
> $ ./bin/spark-submit --master local --deploy-mode client --class repro.Repro 
> /home/nix/repro/target/scala-2.10/repro-assembly-0.0.1.jar
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/10/19 11:55:20 INFO SparkContext: Running Spark version 1.4.1
> 15/10/19 11:55:21 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/10/19 11:55:22 WARN Utils: Your hostname, choochootrain resolves to a 
> loopback address: 127.0.1.1; using 10.0.1.97 instead (on interface wlan0)
> 15/10/19 11:55:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 15/10/19 11:55:22 INFO SecurityManager: Changing view acls to: root
> 15/10/19 11:55:22 INFO SecurityManager: Changing modify acls to: root
> 15/10/19 11:55:22 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); users 
> with modify permissions: Set(root)
> 15/10/19 11:55:24 INFO Slf4jLogger: Slf4jLogger started
> 15/10/19 11:55:24 INFO Remoting: Starting remoting
> 15/10/19 11:55:24 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriver@10.0.1.97:46683]
> 15/10/19 11:55:24 INFO Utils: Successfully started service 'sparkDriver' on 
> port 46683.
> 15/10/19 11:55:24 INFO SparkEnv: Registering MapOutputTracker
> 15/10/19 11:55:24 INFO SparkEnv: Registering BlockManagerMaster
> 15/10/19 11:55:24 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-0348a320-0ca3-4528-9ab5-9ba37d3c2e07/blockmgr-08496143-1d9d-41c8-a581-b6220edf00d5
> 15/10/19 11:55:24 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
> 15/10/19 11:55:25 INFO HttpFileServer: HTTP File server directory is 
> /tmp/spark-0348a320-0ca3-4528-9ab5-9ba37d3c2e07/httpd-52c396d2-b47f-45a5-bb76-d10aa864e6d5
> 15/10/19 11:55:25 INFO HttpServer: Starting HTTP Server
> 15/10/19 11:55:25 INFO Utils: Successfully started service 'HTTP file server' 
> on port 47915.
> 15/10/19 11:55:25 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/10/19 11:55:25 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 15/10/19 11:55:25 INFO SparkUI: Started SparkUI at http://10.0.1.97:4040
> 15/10/19 11:55:25 INFO SparkContext: Added JAR 
> file:/home/nix/repro/target/scala-2.10/repro-assembly-0.0.1.jar at 
> http://10.0.1.97:47915/jars/repro-assembly-0.0.1.jar with timestamp 
> 1445280925969
> 15/10/19 11:55:26 INFO Executor: Starting executor ID driver on host localhost
> 15/10/19 11:55:26 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46569.
> 15/10/19 11:55:26 INFO NettyBlockTransferService: Server created on 46569
> 15/10/19 11:55:26 INFO BlockManagerMaster: Trying to register BlockManager
> 15/10/19 11:55:26 INFO BlockManagerMasterEndpoint: Registering block manager 
> localhost:46569 with 265.4 MB RAM, BlockManagerId(driver, localhost, 46569)
> 15/10/19 11:55:26 INFO BlockManagerMaster: Registered BlockManager
> 15/10/19 11:55:27 INFO SparkContext: Starting job: collect at repro.scala:18
> 15/10/19 11:55:27 INFO DAGScheduler: Got job 0 (collect at repro.scala:18) 
> with 1 output partitions (allowLocal=false)
> 15/10/19 11:55:27 INFO DAGScheduler: Final 

[jira] [Updated] (SPARK-11192) Spark sql seems to leak org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time

2015-10-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11192:
--
Target Version/s:   (was: 1.5.1)

[~blivingston] don't set Target; it can't target a released version anyway. 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

> Spark sql seems to leak org.apache.spark.sql.execution.ui.SQLTaskMetrics 
> objects over time
> --
>
> Key: SPARK-11192
> URL: https://issues.apache.org/jira/browse/SPARK-11192
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> org.apache.spark/spark-sql_2.10 "1.5.1"
> Embedded, in-process spark. Have not tested on standalone or yarn clusters.
>Reporter: Blake Livingston
>Priority: Minor
>
> Noticed that slowly, over the course of a day or two, heap memory usage on a 
> long running spark process increased monotonically.
> After doing a heap dump and examining in jvisualvm, saw there were over 15M 
> org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking 
> over 500MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11166) HIVE ON SPARK : yarn-cluster mode , if memory is busy,have no enough resource. app wil failed

2015-10-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964998#comment-14964998
 ] 

Steve Loughran commented on SPARK-11166:


see also http://wiki.apache.org/hadoop/InvalidJiraIssues , which is related

> HIVE ON SPARK :  yarn-cluster mode , if  memory is busy,have no enough 
> resource. app wil failed
> ---
>
> Key: SPARK-11166
> URL: https://issues.apache.org/jira/browse/SPARK-11166
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: yindu_asan
>  Labels: test
>
> HIVE ON SPARK :  yarn-cluster mode , if  memory is busy,have no enough 
> resource. app wil failed
>  ERROR yarn.ApplicationMaster: SparkContext did not initialize after waiting 
> for 10 ms. Please check earlier log output for errors. Failing the 
> application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11195) Exception thrown on executor throws ClassNotFound on driver

2015-10-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11195:
--
Component/s: Spark Core

> Exception thrown on executor throws ClassNotFound on driver
> ---
>
> Key: SPARK-11195
> URL: https://issues.apache.org/jira/browse/SPARK-11195
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Hurshal Patel
>
> I have a minimal repro job
> {code:title=Repro.scala}
> package repro
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkException
> class MyException(message: String) extends Exception(message: String)
> object Repro {
>   def main(args: Array[String]) {
> val conf = new SparkConf().setAppName("MyException ClassNotFound Repro")
> val sc = new SparkContext(conf)
> sc.parallelize(List(1)).map { x =>
>   throw new repro.MyException("this is a failure")
>   true
> }.collect()
>   }
> }
> {code}
> On Spark 1.4.1, I get a task failure with the reason correctly set to 
> MyException.
> On Spark 1.5.1, I _expect_ the same behavior, but instead I get a task 
> failure with an UnknownReason caused by ClassNotFoundException.
>  
> here is the job on vanilla Spark 1.4.1:
> {code:title=spark_1.5.1_log}
> $ ./bin/spark-submit --master local --deploy-mode client --class repro.Repro 
> /home/nix/repro/target/scala-2.10/repro-assembly-0.0.1.jar
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/10/19 11:55:20 INFO SparkContext: Running Spark version 1.4.1
> 15/10/19 11:55:21 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 15/10/19 11:55:22 WARN Utils: Your hostname, choochootrain resolves to a 
> loopback address: 127.0.1.1; using 10.0.1.97 instead (on interface wlan0)
> 15/10/19 11:55:22 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> 15/10/19 11:55:22 INFO SecurityManager: Changing view acls to: root
> 15/10/19 11:55:22 INFO SecurityManager: Changing modify acls to: root
> 15/10/19 11:55:22 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(root); users 
> with modify permissions: Set(root)
> 15/10/19 11:55:24 INFO Slf4jLogger: Slf4jLogger started
> 15/10/19 11:55:24 INFO Remoting: Starting remoting
> 15/10/19 11:55:24 INFO Remoting: Remoting started; listening on addresses 
> :[akka.tcp://sparkDriver@10.0.1.97:46683]
> 15/10/19 11:55:24 INFO Utils: Successfully started service 'sparkDriver' on 
> port 46683.
> 15/10/19 11:55:24 INFO SparkEnv: Registering MapOutputTracker
> 15/10/19 11:55:24 INFO SparkEnv: Registering BlockManagerMaster
> 15/10/19 11:55:24 INFO DiskBlockManager: Created local directory at 
> /tmp/spark-0348a320-0ca3-4528-9ab5-9ba37d3c2e07/blockmgr-08496143-1d9d-41c8-a581-b6220edf00d5
> 15/10/19 11:55:24 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
> 15/10/19 11:55:25 INFO HttpFileServer: HTTP File server directory is 
> /tmp/spark-0348a320-0ca3-4528-9ab5-9ba37d3c2e07/httpd-52c396d2-b47f-45a5-bb76-d10aa864e6d5
> 15/10/19 11:55:25 INFO HttpServer: Starting HTTP Server
> 15/10/19 11:55:25 INFO Utils: Successfully started service 'HTTP file server' 
> on port 47915.
> 15/10/19 11:55:25 INFO SparkEnv: Registering OutputCommitCoordinator
> 15/10/19 11:55:25 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 15/10/19 11:55:25 INFO SparkUI: Started SparkUI at http://10.0.1.97:4040
> 15/10/19 11:55:25 INFO SparkContext: Added JAR 
> file:/home/nix/repro/target/scala-2.10/repro-assembly-0.0.1.jar at 
> http://10.0.1.97:47915/jars/repro-assembly-0.0.1.jar with timestamp 
> 1445280925969
> 15/10/19 11:55:26 INFO Executor: Starting executor ID driver on host localhost
> 15/10/19 11:55:26 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46569.
> 15/10/19 11:55:26 INFO NettyBlockTransferService: Server created on 46569
> 15/10/19 11:55:26 INFO BlockManagerMaster: Trying to register BlockManager
> 15/10/19 11:55:26 INFO BlockManagerMasterEndpoint: Registering block manager 
> localhost:46569 with 265.4 MB RAM, BlockManagerId(driver, localhost, 46569)
> 15/10/19 11:55:26 INFO BlockManagerMaster: Registered BlockManager
> 15/10/19 11:55:27 INFO SparkContext: Starting job: collect at repro.scala:18
> 15/10/19 11:55:27 INFO DAGScheduler: Got job 0 (collect at repro.scala:18) 
> with 1 output partitions (allowLocal=false)
> 15/10/19 11:55:27 INFO DAGScheduler: Final stage: ResultStage 0(collect at 
> repro.scala:18)
> 15/10/19 11:55:27 INFO DAGScheduler: Parents of final stage: List()
> 15/10/19 11:55:27 INFO DAGScheduler: Missing 

[jira] [Commented] (SPARK-6527) sc.binaryFiles can not access files on s3

2015-10-20 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965006#comment-14965006
 ] 

Steve Loughran commented on SPARK-6527:
---

Try using s3a instead of s3n (ideally on Hadoop 2.7+); it may have better 
character support. Otherwise, file a JIRA on Hadoop Common with component = 
{{fs/s3}}, listing an example path which isn't valid.
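For illustration, a minimal sketch of switching to the s3a connector, assuming an 
existing SparkContext {{sc}} and that hadoop-aws 2.7+ with its AWS SDK is on the 
classpath; the bucket, prefix and credential values are placeholders:

{code}
// Assuming an existing SparkContext `sc` (e.g. in spark-shell); bucket, prefix
// and credential values below are placeholders.
sc.hadoopConfiguration.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

// Same read as before, but through the s3a connector instead of s3n.
val blobs = sc.binaryFiles("s3a://my-bucket/some/prefix/*")
println(blobs.count())
{code}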

> sc.binaryFiles can not access files on s3
> -
>
> Key: SPARK-6527
> URL: https://issues.apache.org/jira/browse/SPARK-6527
> Project: Spark
>  Issue Type: Bug
>  Components: EC2, Input/Output
>Affects Versions: 1.2.0, 1.3.0
> Environment: I am running Spark on EC2
>Reporter: Zhao Zhang
>Priority: Minor
>
> The sc.binaryFiles() method cannot access the files stored on S3. It can correctly 
> list the number of files, but reports "file does not exist" when processing 
> them. I also tried sc.textFile(), which works fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11190) SparkR support for cassandra collection types.

2015-10-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11190:
--
Target Version/s:   (was: 1.5.1)
   Fix Version/s: (was: 1.5.2)
 Component/s: SparkR

[~bilindHajer] please don't set Fix/Target version, but set Component. You 
should read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first.

> SparkR support for cassandra collection types. 
> ---
>
> Key: SPARK-11190
> URL: https://issues.apache.org/jira/browse/SPARK-11190
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
> Environment: SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
>Reporter: Bilind Hajer
>  Labels: cassandra, dataframe, sparkR
>
> I want to create a data frame from a Cassandra keyspace and column family in 
> SparkR. 
> I am able to create data frames from tables which do not include any 
> Cassandra collection datatypes, 
> such as Map, Set and List. But many of the schemas that I need data from 
> do include these collection data types. 
> Here is my local environment. 
> SparkR Version: 1.5.1
> Cassandra Version: 2.1.6
> R Version: 3.2.2 
> Cassandra Connector version: 1.5.0-M2
> To test this issue, I did the following iterative process. 
> sudo ./sparkR --packages 
> com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf 
> spark.cassandra.connection.host=127.0.0.1
> Running this command with sparkR gives me access to the spark-cassandra-connector 
> package I need, 
> and connects me to my local Cassandra server (which is up and running while 
> running this code in the sparkR shell). 
> CREATE TABLE test_table (
>   column_1 int,
>   column_2 text,
>   column_3 float,
>   column_4 uuid,
>   column_5 timestamp,
>   column_6 boolean,
>   column_7 timeuuid,
>   column_8 bigint,
>   column_9 blob,
>   column_10   ascii,
>   column_11   decimal,
>   column_12   double,
>   column_13   inet,
>   column_14   varchar,
>   column_15   varint,
>   PRIMARY KEY( ( column_1, column_2 ) )
> ); 
> All of the above data types are supported. I insert dummy data after creating 
> this test schema. 
> For example, now in my sparkR shell, I run the following code. 
> df.test  <- read.df(sqlContext,  source = "org.apache.spark.sql.cassandra", 
> keyspace = "datahub", table = "test_table")
> assigns with no errors, then, 
> > schema(df.test)
> StructType
> |-name = "column_1", type = "IntegerType", nullable = TRUE
> |-name = "column_2", type = "StringType", nullable = TRUE
> |-name = "column_10", type = "StringType", nullable = TRUE
> |-name = "column_11", type = "DecimalType(38,18)", nullable = TRUE
> |-name = "column_12", type = "DoubleType", nullable = TRUE
> |-name = "column_13", type = "InetAddressType", nullable = TRUE
> |-name = "column_14", type = "StringType", nullable = TRUE
> |-name = "column_15", type = "DecimalType(38,0)", nullable = TRUE
> |-name = "column_3", type = "FloatType", nullable = TRUE
> |-name = "column_4", type = "UUIDType", nullable = TRUE
> |-name = "column_5", type = "TimestampType", nullable = TRUE
> |-name = "column_6", type = "BooleanType", nullable = TRUE
> |-name = "column_7", type = "UUIDType", nullable = TRUE
> |-name = "column_8", type = "LongType", nullable = TRUE
> |-name = "column_9", type = "BinaryType", nullable = TRUE
> Schema is correct. 
> > class(df.test)
> [1] "DataFrame"
> attr(,"package")
> [1] "SparkR"
> df.test is clearly defined to be a DataFrame Object. 
> > head(df.test)
>   column_1 column_2 column_10 column_11 column_12 column_13 column_14 
> column_15
> 1        1    hello        NA        NA        NA        NA        NA        NA
>   column_3 column_4 column_5 column_6 column_7 column_8 column_9
> 1  3.4   NA   NA   NA   NA   NA   NA
> sparkR is reading from the column family correctly, but now let's add a 
> collection data type to the schema. 
> Now I will drop that test_table, and recreate the table with an extra 
> column of data type map.
> CREATE TABLE test_table (
>   column_1 int,
>   column_2 text,
>   column_3 float,
>   column_4 uuid,
>   column_5 timestamp,
>   column_6 

[jira] [Commented] (SPARK-4368) Ceph integration?

2015-10-20 Thread Serge Smertin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965026#comment-14965026
 ] 

Serge Smertin commented on SPARK-4368:
--

Thank you, Steve, for all the details.

> Ceph integration?
> -
>
> Key: SPARK-4368
> URL: https://issues.apache.org/jira/browse/SPARK-4368
> Project: Spark
>  Issue Type: Improvement
>  Components: Input/Output
>Reporter: Serge Smertin
>
> There is a use case of storing a big number of relatively small BLOB objects 
> (2-20Mb), which requires some ugly workarounds in HDFS environments. There 
> is a need to process those BLOBs close to the data themselves, which is why the 
> MapReduce paradigm is good, as it guarantees data locality.
> Ceph seems to be one of the systems that maintains both of the properties 
> (small files and data locality) -  
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2013-July/032119.html. I 
> already know that Spark supports GlusterFS - 
> http://mail-archives.apache.org/mod_mbox/spark-user/201404.mbox/%3ccf657f2b.5b3a1%25ven...@yarcdata.com%3E
> So I wonder, could there be an integration with this storage solution, and 
> what would be the effort of doing that? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11211) Kafka - offsetOutOfRange forces to largest

2015-10-20 Thread Daniel Strassler (JIRA)
Daniel Strassler created SPARK-11211:


 Summary: Kafka - offsetOutOfRange forces to largest
 Key: SPARK-11211
 URL: https://issues.apache.org/jira/browse/SPARK-11211
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.5.1, 1.3.1
Reporter: Daniel Strassler


This problem relates to how DStreams using the Direct Approach of connecting to 
a Kafka topic behave when they request an offset that does not exist on the 
topic.  Currently it appears the "auto.offset.reset" configuration value is 
being ignored and the default value of “largest” is always being used.
 
When using the Direct Approach of connecting to a Kafka topic with a DStream, 
even if you have the Kafka configuration "auto.offset.reset" set to smallest, 
the behavior in the event of a kafka.common.OffsetOutOfRangeException 
is to move the next offset to be consumed to the largest value on the 
Kafka topic.  It appears that the exception is also being swallowed rather than 
propagated up to the driver, so a workaround triggered by the propagation of the 
error cannot be implemented either.
 
The current behavior of resetting to largest means that any data on the Kafka 
topic at the time the exception is thrown is skipped (lost) to consumption; 
only data produced to the topic after the exception will be consumed.  Two 
possible fixes are listed below.
 
Fix 1:  When “auto.offset.reset" is set to “smallest”, the DStream should set 
the next consumed offset to be the smallest offset value on the Kafka topic.
 
Fix 2:  Propagate the error to the Driver to allow it to react as it deems 
appropriate.
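 
For illustration, a minimal sketch of an application-side mitigation with the direct 
Kafka API, assuming Spark 1.3+: pin explicit starting offsets via {{fromOffsets}} so 
the stream does not depend on "auto.offset.reset" at startup. The broker, topic and 
offsets below are placeholders (in practice the smallest offsets would be fetched 
from Kafka), and this only controls where consumption starts; it does not change how 
a mid-stream OffsetOutOfRangeException is handled.

{code}
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Sketch only: assumes an existing StreamingContext `ssc`; broker list, topic
// and the starting offsets are placeholders.
def buildStream(ssc: StreamingContext): Unit = {
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
  val fromOffsets = Map(TopicAndPartition("my-topic", 0) -> 0L)
  val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message())

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc, kafkaParams, fromOffsets, messageHandler)
  stream.count().print()
}
{code}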



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11203) UDF doesn't support charType column and lit function doesn't allow charType as argument

2015-10-20 Thread M Bharat lal (JIRA)
M Bharat lal created SPARK-11203:


 Summary: UDF doesn't support charType column and lit function 
doesn't allow charType as argument
 Key: SPARK-11203
 URL: https://issues.apache.org/jira/browse/SPARK-11203
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.5.0
Reporter: M Bharat lal
Priority: Minor


We have two issues

1) We cannot create a dataframe with Char type; see the example below.

scala> val employee = 
sqlContext.createDataFrame(Seq((1,"John"))).toDF("id","name")
employee: org.apache.spark.sql.DataFrame = [id: int, name: string]

scala> employee.withColumn("grade",lit('A'))
java.lang.RuntimeException: Unsupported literal type class java.lang.Character A
at 
org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)
at org.apache.spark.sql.functions$.lit(functions.scala:89)


2) We have a function which takes a string and a char as input parameters and 
returns the position of the char in the given string.  
 
We registered the function as a UDF and called it with a character literal, 
which gave the exception below. The lit function doesn't support a character as 
an argument.


scala> def strPos(x:String,c:Char):Integer = {x.indexOf(c)}
strPos: (x: String, c: Char)Integer

scala> sqlContext.udf.register("strPos",strPos _)
res13: org.apache.spark.sql.UserDefinedFunction = 
UserDefinedFunction(,IntegerType,List())

scala> df.select( callUDF("strPos",$"name",lit('J')))
java.lang.RuntimeException: Unsupported literal type class java.lang.Character J
at 
org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:49)

Can you please add this support, or let us know if there is any other 
workaround to achieve this?
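 
For illustration, a minimal sketch of a possible workaround until Character literals 
are supported: represent the character as a one-character String on both the literal 
and the UDF side. It assumes the same sqlContext and example data as above.

{code}
import org.apache.spark.sql.functions.{callUDF, lit}
import sqlContext.implicits._

// Workaround sketch: use a 1-character String wherever a Char was intended.
sqlContext.udf.register("strPos", (x: String, c: String) => x.indexOf(c))

val employee = sqlContext.createDataFrame(Seq((1, "John"))).toDF("id", "name")
val graded = employee.withColumn("grade", lit("A"))         // String literal instead of Char
graded.select(callUDF("strPos", $"name", lit("J"))).show()
{code}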




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer

2015-10-20 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-8332.
--
Resolution: Not A Problem

> NoSuchMethodError: 
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
> --
>
> Key: SPARK-8332
> URL: https://issues.apache.org/jira/browse/SPARK-8332
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
> Environment: spark 1.4 & hadoop 2.3.0-cdh5.0.0
>Reporter: Tao Li
>Priority: Critical
>  Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson
>
> I compiled the new Spark 1.4.0 version. 
> But when I run a simple WordCount demo, it throws a NoSuchMethodError: 
> {code}
> java.lang.NoSuchMethodError: 
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
> {code}
> I found out that the default "fasterxml.jackson.version" is 2.4.4. 
> Is there anything wrong, or a conflict with the jackson version? 
> Or is there possibly some project maven dependency containing the wrong 
> version of jackson?
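
For illustration, this error typically comes from a second, older jackson-module-scala 
pulled in by another dependency of the application. A sketch of pinning the version, 
assuming an sbt-built application and that 2.4.4 is the version Spark was built against 
(as stated above); Maven users would do the equivalent in dependencyManagement.

{code}
// build.sbt fragment (sketch): force the jackson artifacts the application sees
// to the version Spark expects.
dependencyOverrides += "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.4.4"
dependencyOverrides += "com.fasterxml.jackson.core"   %  "jackson-databind"     % "2.4.4"
{code}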



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-20 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964762#comment-14964762
 ] 

Sean Owen commented on SPARK-11016:
---

Yeah, that's a good start. It's bridging Kryo with the particular serialization 
methods exposed by this class, though I think so far you are only supporting 
{{RoaringBitmap}}? Because these methods aren't defined by an interface, I think 
you'd have to write a glue class like the one you have here for each one.

However, because they implement {{Externalizable}}, which just delegates to these 
custom methods, it seems like you could do something quite similar with 
{{ObjectInput}} and {{ObjectOutput}} to support any {{Externalizable}} object 
and trivially support the other RoaringBitmap classes, which will be necessary 
anyway. How about going that way? I'd support that.
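 
For illustration, a minimal sketch of that generic direction, assuming Kryo's 
{{Output}}/{{Input}} can be wrapped in {{ObjectOutputStream}}/{{ObjectInputStream}} 
(they extend the java.io stream classes) and that the target class has a public 
no-arg constructor; the registration line at the end is hypothetical.

{code}
import java.io.{Externalizable, ObjectInputStream, ObjectOutputStream}

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}

// Generic glue serializer: Kryo's Output/Input extend java.io.OutputStream /
// InputStream, so they can be wrapped in ObjectOutputStream / ObjectInputStream,
// which satisfy the ObjectOutput / ObjectInput arguments that writeExternal /
// readExternal expect. Requires a public no-arg constructor on T.
class ExternalizableSerializer[T <: Externalizable] extends Serializer[T] {
  override def write(kryo: Kryo, output: Output, obj: T): Unit = {
    val oos = new ObjectOutputStream(output)
    obj.writeExternal(oos)
    oos.flush()
  }

  override def read(kryo: Kryo, input: Input, clazz: Class[T]): T = {
    val instance = clazz.newInstance()
    instance.readExternal(new ObjectInputStream(input))
    instance
  }
}

// Hypothetical registration, e.g. for the RoaringBitmap family:
// kryo.register(classOf[RoaringBitmap], new ExternalizableSerializer[RoaringBitmap])
{code}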

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of Roaring bitmaps is required by a job. 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11192) When graphite metric sink is enabled, spark sql leaks org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time

2015-10-20 Thread Blake Livingston (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Blake Livingston updated SPARK-11192:
-
Description: 
Noticed that slowly, over the course of a day or two, heap memory usage on a 
long running spark process increased monotonically.
After doing a heap dump and examining in jvisualvm, saw there were over 15M 
org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking over 
500MB.



  was:
Noticed that slowly, over the course of a day or two, heap memory usage on a 
long running spark process increased monotonically.
After doing a heap dump and examining in jvisualvm, saw there were over 15M 
org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking over 
500MB.

Accumulation does not occur when I removed metrics.properties.

metrics.properties content:

*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=x
*.sink.graphite.port=2003
*.sink.graphite.period=10

master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource


> When graphite metric sink is enabled, spark sql leaks 
> org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time
> 
>
> Key: SPARK-11192
> URL: https://issues.apache.org/jira/browse/SPARK-11192
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> org.apache.spark/spark-sql_2.10 "1.5.1"
> Embedded, in-process spark. Have not tested on standalone or yarn clusters.
>Reporter: Blake Livingston
>Priority: Minor
>
> Noticed that slowly, over the course of a day or two, heap memory usage on a 
> long running spark process increased monotonically.
> After doing a heap dump and examining in jvisualvm, saw there were over 15M 
> org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking 
> over 500MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11192) Spark sql seems to leak org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time

2015-10-20 Thread Blake Livingston (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Blake Livingston updated SPARK-11192:
-
Summary: Spark sql seems to leak 
org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time  (was: When 
graphite metric sink is enabled, spark sql leaks 
org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time)

> Spark sql seems to leak org.apache.spark.sql.execution.ui.SQLTaskMetrics 
> objects over time
> --
>
> Key: SPARK-11192
> URL: https://issues.apache.org/jira/browse/SPARK-11192
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: java version "1.8.0_60"
> Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
> Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode)
> org.apache.spark/spark-sql_2.10 "1.5.1"
> Embedded, in-process spark. Have not tested on standalone or yarn clusters.
>Reporter: Blake Livingston
>Priority: Minor
>
> Noticed that slowly, over the course of a day or two, heap memory usage on a 
> long running spark process increased monotonically.
> After doing a heap dump and examining in jvisualvm, saw there were over 15M 
> org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking 
> over 500MB.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10903) Make sqlContext global

2015-10-20 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964642#comment-14964642
 ] 

Felix Cheung commented on SPARK-10903:
--

Working from SparkR package version in this commit: 
https://github.com/felixcheung/spark/commit/efedce53a315d7ce23a53145e3de100d2a471690


> Make sqlContext global 
> ---
>
> Key: SPARK-10903
> URL: https://issues.apache.org/jira/browse/SPARK-10903
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Narine Kokhlikyan
>Priority: Minor
>
> Make sqlContext global so that we don't have to always specify it.
> e.g. createDataFrame(iris) instead of createDataFrame(sqlContext, iris)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-8071) Run PySpark dataframe.rollup/cube test failed

2015-10-20 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-8071.
-
Resolution: Won't Fix

> Run PySpark dataframe.rollup/cube test failed
> -
>
> Key: SPARK-8071
> URL: https://issues.apache.org/jira/browse/SPARK-8071
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
> Environment: OS: SUSE 11 SP3; JDK: 1.8.0_40; Python: 2.6.8; Hadoop: 
> 2.7.0; Spark: master branch
>Reporter: Weizhong
>Priority: Minor
>
> I run test for Spark, and failed on PySpark, details are:
> {code}
> File "/xxx/Spark/python/pyspark/sql/dataframe.py", line 837, in 
> pyspark.sql.dataframe.DataFrame.cube
> Failed example:
>   df.cube('name', df.age).count().show()
> Exception raised:
>   Traceback (most recent call last):
>File "/usr/lib64/python2.6/doctest.py", line 1253, in __run
> compileflags, 1) in test.globs
>File "", line 1, in 
> 
> df.cube('name', df.age).count().show()
>File "/xxx/Spark/python/pyspark/sql/dataframe.py", line 291, in show
> print(self._jdf.showString(n))
>File "/xxx/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", 
> line 538, in __call__
> self.target_id, self.name)
>File "/xxx/Spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 
> 300, in get_return_value
> format(target_id, '.', name), value)
>   Py4JJavaError: An error occurred while calling o212.showString.
>   : java.lang.AssertionError: assertion failed: No plan for Cube 
> [name#1,age#0], [name#1,age#0,COUNT(1) AS count#27L], grouping__id#28
>LogicalRDD [age#0,name#1], MapPartitionsRDD[7] at applySchemaToPythonRDD 
> at NativeMethodAccessorImpl.java:-2
> at scala.Predef$.assert(Predef.scala:179)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:54)
> at 
> org.apache.spark.sql.execution.SparkStrategies$BasicOperators$.apply(SparkStrategies.scala:312)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:58)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at 
> org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:59)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan$lzycompute(SQLContext.scala:913)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.sparkPlan(SQLContext.scala:911)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan$lzycompute(SQLContext.scala:917)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.executedPlan(SQLContext.scala:917)
> at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1255)
> at org.apache.spark.sql.DataFrame.head(DataFrame.scala:1189)
> at org.apache.spark.sql.DataFrame.take(DataFrame.scala:1248)
> at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:176)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
> at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
> at py4j.Gateway.invoke(Gateway.java:259)
> at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
> at py4j.commands.CallCommand.execute(CallCommand.java:79)
> at py4j.GatewayConnection.run(GatewayConnection.java:207)
> at java.lang.Thread.run(Thread.java:745)
> **
>1 of   1 in pyspark.sql.dataframe.DataFrame.cube
>1 of   1 in pyspark.sql.dataframe.DataFrame.rollup
> ***Test Failed*** 2 failures.
> {code}
> cc [~davies]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9299) percentile and percentile_approx aggregate functions

2015-10-20 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964656#comment-14964656
 ] 

Jeff Zhang commented on SPARK-9299:
---

Link with SPARK-6761 as they can share the same algorithm

> percentile and percentile_approx aggregate functions
> 
>
> Key: SPARK-9299
> URL: https://issues.apache.org/jira/browse/SPARK-9299
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Yin Huai
>
> A short introduction on how to build aggregate functions based on our new 
> interface can be found at 
> https://issues.apache.org/jira/browse/SPARK-4366?focusedCommentId=14639921=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14639921.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10045) Add support for DataFrameStatFunctions in SparkR

2015-10-20 Thread Sun Rui (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sun Rui closed SPARK-10045.
---
Resolution: Fixed

all sub-tasks are finished.

> Add support for DataFrameStatFunctions in SparkR
> 
>
> Key: SPARK-10045
> URL: https://issues.apache.org/jira/browse/SPARK-10045
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Sun Rui
>
> The stat functions are defined in 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions.
> Currently only crosstab() is supported.
> Functions to be supported include:
> corr, cov, freqItems
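
For reference, a sketch of the Scala-side {{DataFrameStatFunctions}} calls that such 
SparkR wrappers would map onto, assuming an existing sqlContext; the data and column 
names are placeholders only.

{code}
// Placeholder DataFrame; the column names are illustrative only.
val df = sqlContext.createDataFrame(Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.5))).toDF("price", "quantity")

val corrValue: Double = df.stat.corr("price", "quantity")   // Pearson correlation
val covValue: Double  = df.stat.cov("price", "quantity")    // sample covariance
val frequent          = df.stat.freqItems(Array("price"))   // approximate frequent items
frequent.show()
{code}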



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11213) Documentation for remote spark Submit for R Scripts from 1.5 on CDH 5.4

2015-10-20 Thread Ankit (JIRA)
Ankit created SPARK-11213:
-

 Summary: Documentation for remote spark Submit for R Scripts from 
1.5 on CDH 5.4
 Key: SPARK-11213
 URL: https://issues.apache.org/jira/browse/SPARK-11213
 Project: Spark
  Issue Type: Bug
Reporter: Ankit


Hello Guys,

We have a Cloudera distribution 5.4 and it ships the Spark 1.3 version.

Issue 

We have data scientists working on R scripts, so I was searching for a way to submit an R 
script using Oozie or a local spark-submit to a remote YARN resource manager. Can 
anyone share the steps to do the same? It is really difficult to guess the steps. 

Thanks in advance 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965210#comment-14965210
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

It seems to work fine for me:

{code}
jbonofre@latitude:~/Workspace/spark/bin$ ./run-example 
streaming.KinesisWordCountASL jbonofre-test kinesis-connector 
https://kinesis.us-east-1.amazonaws.com
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/10/20 16:48:09 INFO StreamingExamples: Setting log level to [WARN] for 
streaming example. To override add a custom log4j.properties to the classpath.
15/10/20 16:48:10 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/10/20 16:48:10 WARN Utils: Your hostname, latitude resolves to a loopback 
address: 127.0.1.1; using 192.168.134.10 instead (on interface eth0)
15/10/20 16:48:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another 
address
15/10/20 16:48:11 WARN MetricsSystem: Using default name DAGScheduler for 
source because spark.app.id is not set.
---
Time: 1445352496000 ms
---

---
Time: 1445352498000 ms
---

---
Time: 144535250 ms
---

{code}

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10447) Upgrade pyspark to use py4j 0.9

2015-10-20 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10447.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8615
[https://github.com/apache/spark/pull/8615]

> Upgrade pyspark to use py4j 0.9
> ---
>
> Key: SPARK-10447
> URL: https://issues.apache.org/jira/browse/SPARK-10447
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: Justin Uang
> Fix For: 1.6.0
>
>
> This was recently released, and it has many improvements, especially the 
> following:
> {quote}
> Python side: IDEs and interactive interpreters such as IPython can now get 
> help text/autocompletion for Java classes, objects, and members. This makes 
> Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). 
> Thanks to @jonahkichwacoders
> {quote}
> Normally we wrap all the APIs in spark, but for the ones that aren't, this 
> would make it easier to offroad by using the java proxy objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10447) Upgrade pyspark to use py4j 0.9

2015-10-20 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-10447:
---
Assignee: holdenk

> Upgrade pyspark to use py4j 0.9
> ---
>
> Key: SPARK-10447
> URL: https://issues.apache.org/jira/browse/SPARK-10447
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: Justin Uang
>Assignee: holdenk
> Fix For: 1.6.0
>
>
> This was recently released, and it has many improvements, especially the 
> following:
> {quote}
> Python side: IDEs and interactive interpreters such as IPython can now get 
> help text/autocompletion for Java classes, objects, and members. This makes 
> Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). 
> Thanks to @jonahkichwacoders
> {quote}
> Normally we wrap all the APIs in spark, but for the ones that aren't, this 
> would make it easier to offroad by using the java proxy objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11214) Join with Unicode-String results wrong empty

2015-10-20 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-11214:
--

Assignee: Josh Rosen

> Join with Unicode-String results wrong empty
> 
>
> Key: SPARK-11214
> URL: https://issues.apache.org/jira/browse/SPARK-11214
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Hans Fischer
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.5.1
>
>
> I created a join that should clearly result in a single row, but it returns 
> empty. Could someone validate this bug?
> hiveContext.sql('SELECT * FROM (SELECT "c" AS a) AS a JOIN (SELECT "c" AS b) 
> AS b ON a.a = b.b').take(10)
> result: []
> kind regards
> Hans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11201) StreamContext.getOrCreate is broken is yarn-client mode

2015-10-20 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965529#comment-14965529
 ] 

Marcelo Vanzin commented on SPARK-11201:


As I mentioned in the PR, I believe the fix for SPARK-10812 takes care of this; 
we can just backport it.

> StreamContext.getOrCreate is broken is yarn-client mode
> ---
>
> Key: SPARK-11201
> URL: https://issues.apache.org/jira/browse/SPARK-11201
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Hari Shreedharan
>
> If {{StreamingContext.getOrCreate}} (or one of the constructors that create the 
> Hadoop {{Configuration}}) is used, {{SparkHadoopUtil.get.conf}} is called 
> before {{SparkContext}} is created - when SPARK_YARN_MODE is set. So in that 
> case {{SparkHadoopUtil.get}} creates a {{SparkHadoopUtil}} instance instead 
> of a {{YarnSparkHadoopUtil}} instance.
> So, in yarn-client mode, a class cast exception gets thrown from 
> {{Client.scala}}:
> {code}
> java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot 
> be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
>   at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:169)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:266)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:631)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:120)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:523)
>   at 
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:854)
>   at 
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
>   at 
> com.cloudera.test.LongRunningApp$.com$cloudera$test$LongRunningApp$$createCheckpoint$1(LongRunningApp.scala:33)
>   at 
> com.cloudera.test.LongRunningApp$$anonfun$4.apply(LongRunningApp.scala:90)
>   at 
> com.cloudera.test.LongRunningApp$$anonfun$4.apply(LongRunningApp.scala:90)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:844)
>   at com.cloudera.test.LongRunningApp$.main(LongRunningApp.scala:90)
>   at com.cloudera.test.LongRunningApp.main(LongRunningApp.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
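
For illustration only, a sketch of one possible application-side workaround (not the 
backport fix discussed above), based on the mechanism described: pass a Hadoop 
{{Configuration}} explicitly to {{getOrCreate}} so its default argument 
({{SparkHadoopUtil.get.conf}}) is never evaluated before the {{SparkContext}} exists. 
The checkpoint path, app name and batch interval are placeholders.

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch of a possible workaround: supply the Hadoop Configuration explicitly
// so the SparkHadoopUtil.get.conf default argument is not evaluated early.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("long-running-app")
  new StreamingContext(conf, Seconds(10))
}

val ssc = StreamingContext.getOrCreate(
  "hdfs:///checkpoints/long-running-app",  // checkpoint path (placeholder)
  createContext _,
  new Configuration(),                     // explicit Hadoop Configuration
  createOnError = false)
{code}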



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965561#comment-14965561
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

I only have one warning during compilation (Scala 2.10.4 / Java 8), and it's not 
the same one as yours:

{code}
[INFO] --- scala-maven-plugin:3.2.2:compile (scala-compile-first) @ 
spark-streaming-kinesis-asl_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal 
incremental compile
[INFO] Using incremental compilation
[INFO] Compiling 8 Scala sources and 1 Java source to 
/home/jbonofre/Workspace/spark/extras/kinesis-asl/target/scala-2.10/classes...
[WARNING] warning: [options] bootstrap class path not set in conjunction with 
-source 1.7
[WARNING] 1 warning
[INFO]
[INFO] --- maven-compiler-plugin:3.3:compile (default-compile) @ 
spark-streaming-kinesis-asl_2.10 ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 1 source file to 
/home/jbonofre/Workspace/spark/extras/kinesis-asl/target/scala-2.10/classes
[INFO]
[INFO] --- maven-antrun-plugin:1.8:run (create-tmp-dir) @ 
spark-streaming-kinesis-asl_2.10 ---
[INFO] Executing tasks

main:
[mkdir] Created dir: 
/home/jbonofre/Workspace/spark/extras/kinesis-asl/target/tmp
[INFO] Executed tasks
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ 
spark-streaming-kinesis-asl_2.10 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] Copying 1 resource
[INFO] Copying 3 resources
[INFO]
[INFO] --- scala-maven-plugin:3.2.2:testCompile (scala-test-compile-first) @ 
spark-streaming-kinesis-asl_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal 
incremental compile
[INFO] Using incremental compilation
[INFO] Compiling 4 Scala sources and 1 Java source to 
/home/jbonofre/Workspace/spark/extras/kinesis-asl/target/scala-2.10/test-classes...
[WARNING] 
/home/jbonofre/Workspace/spark/extras/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala:100:
 method createStream in object KinesisUtils is deprecated: use other forms of 
createStream
[WARNING] val kinesisStream1 = KinesisUtils.createStream(ssc, 
"mySparkStream",
[WARNING]   ^
[WARNING] one warning found
[WARNING] warning: [options] bootstrap class path not set in conjunction with 
-source 1.7
[WARNING] 1 warning
[INFO]
[INFO] --- maven-compiler-plugin:3.3:testCompile (default-testCompile) @ 
spark-streaming-kinesis-asl_2.10 ---
[INFO] Changes detected - recompiling the module!
[INFO] Compiling 1 source file to 
/home/jbonofre/Workspace/spark/extras/kinesis-asl/target/scala-2.10/test-classes
[INFO]
{code}

Let me try the other combinations (Java7/Scala 2.11).

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> 

[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965567#comment-14965567
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

No warning with Java 7 either.

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-20 Thread Phil Kallos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965599#comment-14965599
 ] 

Phil Kallos commented on SPARK-11193:
-

Have you tried the last combination, Scala 2.10.4 + Java 7?

Anecdotally, I have at least one other developer on my team who is able to 
reproduce this. Let me know if there's specific environment information I could 
give you that would help.

Meanwhile I will try different Java/Scala/Spark version combinations to see if 
I can get it running on my box.

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965584#comment-14965584
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

OK, I have the same warning as you if I use Scala 2.11. Let me try with 
Scala 2.11 to see if I can reproduce your issue.

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11105) Disitribute the log4j.properties files from the client to the executors

2015-10-20 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-11105.

   Resolution: Fixed
 Assignee: Srinivasa Reddy Vundela
Fix Version/s: 1.6.0

> Disitribute the log4j.properties files from the client to the executors
> ---
>
> Key: SPARK-11105
> URL: https://issues.apache.org/jira/browse/SPARK-11105
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
>Reporter: Srinivasa Reddy Vundela
>Assignee: Srinivasa Reddy Vundela
>Priority: Minor
> Fix For: 1.6.0
>
>
> The log4j.properties file from the client is not distributed to the 
> executors. This means that the client settings are not applied to the 
> executors and they run with the default settings.
> This affects troubleshooting and data gathering.
> The workaround is to use the --files option for spark-submit to propagate the 
> log4j.properties file



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10309) Some tasks failed with Unable to acquire memory

2015-10-20 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965510#comment-14965510
 ] 

Jerry Lam edited comment on SPARK-10309 at 10/20/15 6:25 PM:
-

Same issue; I got the following stack trace:
{noformat}
15/10/20 18:20:43 INFO UnsafeExternalSorter: Thread 64 spilling sort data of 
64.0 KB to disk (0  time so far)
15/10/20 18:20:43 ERROR Executor: Exception in task 11.3 in stage 2.0 (TID 514)
java.io.IOException: Unable to acquire 67108864 bytes of memory
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:332)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertKVRecord(UnsafeExternalSorter.java:461)
at 
org.apache.spark.sql.execution.UnsafeKVExternalSorter.insertKV(UnsafeKVExternalSorter.java:139)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.switchToSortBasedAggregation(TungstenAggregationIterator.scala:489)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:379)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1.org$apache$spark$sql$execution$aggregate$TungstenAggregate$$anonfun$$executePartition$1(TungstenAggregate.scala:110)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.sql.execution.aggregate.TungstenAggregate$$anonfun$doExecute$1$$anonfun$2.apply(TungstenAggregate.scala:119)
at 
org.apache.spark.rdd.MapPartitionsWithPreparationRDD.compute(MapPartitionsWithPreparationRDD.scala:64)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{noformat}



[jira] [Assigned] (SPARK-11199) Improve R context management story and add getOrCreate

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11199:


Assignee: Felix Cheung  (was: Apache Spark)

> Improve R context management story and add getOrCreate
> --
>
> Key: SPARK-11199
> URL: https://issues.apache.org/jira/browse/SPARK-11199
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
>
> Similar to SPARK-4
> Also from discussion in SPARK-10903:
> "
> Hossein Falaki added a comment - 08/Oct/15 13:06
> +1 We have seen a lot of questions from new SparkR users about the life cycle 
> of the context. 
> My question is: are we going to remove or deprecate sparkRSQL.init()? I 
> suggest we should, because right now calling that method creates a new Java 
> SQLContext object, and having two of them prevents users from viewing temp 
> tables.
> Felix Cheung added a comment - 08/Oct/15 17:13
> +1 perhaps sparkR.init() should create sqlContext and/or hiveCtx together.
> But Hossein Falaki, as of now calling sparkRSQL.init() should return the same 
> one as you can see 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L224
> Hossein Falaki added a comment - 08/Oct/15 17:16
> I meant the SQL Context: 
> https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R#L236
> This call should have been "getOrCreate."
> "



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965605#comment-14965605
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

Can you provide:
- the Java version and vendor
- the Scala version used for compilation
- the Spark topology (standalone, YARN, Mesos)?

That would help me try to reproduce the issue.

I'm testing with Java 7 (Oracle) now.

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10646) Bivariate Statistics: Pearson's Chi-Squared goodness of fit test

2015-10-20 Thread Jihong MA (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965298#comment-14965298
 ] 

Jihong MA commented on SPARK-10646:
---

[~mengxr] To add chi-squared test support through the UDAF framework, I would need 
to keep a HashMap around to track the count for each category encountered. I 
prototyped this with the ImperativeAggregate interface and realized that the 
current UDAF infrastructure doesn't support variable-length aggregation buffers, 
and it causes GC pressure even with that support in place. I discussed this with 
[~yhuai] offline some time back. It looks like adding it through a UDAF is not 
feasible at this point; please let me know how we should proceed. Thanks!
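
For what it's worth, once the per-category counts exist the statistic itself is a simple fold; a standalone sketch of Pearson's chi-squared goodness-of-fit over such a count map (plain Scala, deliberately not tied to the UDAF/ImperativeAggregate interfaces, since that is the problematic part):

{code}
import scala.collection.mutable

// Illustration of the statistic only: sum over categories of
// (observed - expected)^2 / expected, where expected = p * total under H0.
def chiSquaredGoodnessOfFit(
    observedCounts: mutable.HashMap[String, Long],
    expectedProbs: Map[String, Double]): Double = {
  val total = observedCounts.values.sum.toDouble
  expectedProbs.iterator.map { case (category, p) =>
    val observed = observedCounts.getOrElse(category, 0L).toDouble
    val expected = p * total
    (observed - expected) * (observed - expected) / expected
  }.sum
}

val counts = mutable.HashMap("a" -> 48L, "b" -> 52L)
val uniform = Map("a" -> 0.5, "b" -> 0.5)
println(chiSquaredGoodnessOfFit(counts, uniform))   // 0.16 for this toy input
{code}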

> Bivariate Statistics: Pearson's Chi-Squared goodness of fit test
> 
>
> Key: SPARK-10646
> URL: https://issues.apache.org/jira/browse/SPARK-10646
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, SQL
>Reporter: Jihong MA
>Assignee: Jihong MA
>
> Pearson's chi-squared goodness of fit test for observed against the expected 
> distribution.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11128) strange NPE when writing in non-existing S3 bucket

2015-10-20 Thread mathieu despriee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965351#comment-14965351
 ] 

mathieu despriee commented on SPARK-11128:
--

Hadoop 2.4.

I just tested with Hadoop 2.6 and hadoop-aws-2.6.0.jar, and it fails with a 
meaningful error, as one would expect:
org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.ServiceException: 
Service Error Message. -- ResponseCode: 404, ResponseStatus: Not Found, XML 
Error Message: NoSuchBucket...



> strange NPE when writing in non-existing S3 bucket
> --
>
> Key: SPARK-11128
> URL: https://issues.apache.org/jira/browse/SPARK-11128
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.5.1
>Reporter: mathieu despriee
>Priority: Minor
>
> For the record, as it's relatively minor, and related to s3n (not tested with 
> s3a).
> By mistake, we tried writing a parquet dataframe to a non-existing s3 bucket, 
> with a simple df.write.parquet(s3path).
> We got an NPE (see stack trace below), which is very misleading.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:73)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
> at 
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933)
> at 
> org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146)
> at 
> org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137)
> at 
> org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10953) Benchmark codegen vs. hand-written code for univariate statistics

2015-10-20 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14954447#comment-14954447
 ] 

Reynold Xin edited comment on SPARK-10953 at 10/20/15 7:13 PM:
---

[~mengxr] I had a quick run on my laptop with a stddev implementation based on 
ImperativeAggregate vs. DeclarativeAggregate, where ImperativeAggregate still 
uses SortBasedAggregate at runtime and DeclarativeAggregate uses 
TungstenAggregate.

With a single double column of a cached DataFrame:

{code}
#rows    ImperativeAggregate    DeclarativeAggregate
100      58ms                   0.1s
1000     0.4s                   0.6s
1        4s                     7s
{code}

Overall it seems ImperativeAggregate performs better. If enabling 
TungstenAggregate support for ImperativeAggregate is in good shape (PR 9038), I 
will merge it in and have another try.
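
For reproducibility, a rough sketch of how such per-row-count timings can be taken in spark-shell, assuming a build where stddev_pop is available (as in the snippet in the issue description below); the row count is a placeholder, not the one used for the numbers above:

{code}
// Rough timing sketch, not the exact harness used for the table above.
import org.apache.spark.sql.functions._

def timeMs[T](body: => T): Long = {
  val start = System.nanoTime()
  body
  (System.nanoTime() - start) / 1000000
}

val df = sqlContext.range(10000000L).cache()   // placeholder row count
df.count()                                     // materialize the cache before timing
println(s"stddev_pop: ${timeMs(df.select(stddev_pop(col("id"))).collect())} ms")
{code}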



> Benchmark codegen vs. hand-written code for univariate statistics
> -
>
> Key: SPARK-10953
> URL: https://issues.apache.org/jira/browse/SPARK-10953
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Xiangrui Meng
>Assignee: Jihong MA
> Fix For: 1.6.0
>
>
> I checked the generated code for a simple stddev_pop call:
> {code}
> val df = sqlContext.range(100)
> df.select(stddev_pop(col("id"))).show()
> {code}
> This is the generated code for the merge part, which is very long and 
> complex. I'm not sure whether we can get any benefit from code generation for 
> univariate statistics. We should benchmark it against a hand-written Scala 
> implementation.
> {code}
> 15/10/06 10:10:57 DEBUG GenerateMutableProjection: code for if 
> (isnull(input[1, DoubleType])) cast(0 as double) else input[1, DoubleType],if 
> (isnull(input[1, DoubleType])) input[6, DoubleType] else if (isnull(input[6, 
> DoubleType])) input[1, DoubleType] else (input[1, DoubleType] + input[6, 
> DoubleType]),if (isnull(input[3, DoubleType])) cast(0 as double) else 
> input[3, DoubleType],if (isnull(input[3, DoubleType])) input[8, DoubleType] 
> else if (isnull(input[8, DoubleType])) input[3, DoubleType] else (((input[3, 
> DoubleType] * input[0, DoubleType]) + (input[8, DoubleType] * input[6, 
> DoubleType])) / (input[0, DoubleType] + input[6, DoubleType])),if 
> (isnull(input[4, DoubleType])) input[9, DoubleType] else if (isnull(input[9, 
> DoubleType])) input[4, DoubleType] else ((input[4, DoubleType] + input[9, 
> DoubleType]) + input[8, DoubleType] - input[2, DoubleType]) * (input[8, 
> DoubleType] - input[2, DoubleType])) * (input[0, DoubleType] * input[6, 
> DoubleType])) / (input[0, DoubleType] + input[6, DoubleType]))):
> public Object generate(org.apache.spark.sql.catalyst.expressions.Expression[] 
> expr) {
>   return new SpecificMutableProjection(expr);
> }
> class SpecificMutableProjection extends 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
>   private org.apache.spark.sql.catalyst.expressions.Expression[] expressions;
>   private org.apache.spark.sql.catalyst.expressions.MutableRow mutableRow;
>   public 
> SpecificMutableProjection(org.apache.spark.sql.catalyst.expressions.Expression[]
>  expr) {
> expressions = expr;
> mutableRow = new 
> org.apache.spark.sql.catalyst.expressions.GenericMutableRow(5);
>   }
>   public 
> org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection 
> target(org.apache.spark.sql.catalyst.expressions.MutableRow row) {
> mutableRow = row;
> return this;
>   }
>   /* Provide immutable access to the last projected row. */
>   public InternalRow currentValue() {
> return (InternalRow) mutableRow;
>   }
>   public Object apply(Object _i) {
> InternalRow i = (InternalRow) _i;
> /* if (isnull(input[1, DoubleType])) cast(0 as double) else input[1, 
> DoubleType] */
> /* isnull(input[1, DoubleType]) */
> /* input[1, DoubleType] */
> boolean isNull4 = i.isNullAt(1);
> 

[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-20 Thread Phil Kallos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965612#comment-14965612
 ] 

Phil Kallos commented on SPARK-11193:
-

{noformat}
$ java -version
java version "1.7.0_75"
Java(TM) SE Runtime Environment (build 1.7.0_75-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.75-b04, mixed mode)
{noformat}

{noformat}
$ scalac -version
Scala compiler version 2.10.4 -- Copyright 2002-2013, LAMP/EPFL
{noformat}

Spark topology is generally YARN (again because of Amazon EMR) but I have also 
tried standalone.

The exact command I built with was
{noformat}
spark git:(4f894dd) $ mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean 
package
{noformat}

Have also tried the following:
{noformat}
mvn -Pyarn -Pkinesis-asl -Phadoop-2.2 -DskipTests clean package
{noformat}

{noformat}
./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.6 -Dscala-2.11 -Pkinesis-asl -DskipTests clean package
{noformat}

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3842) Remove the hacks for Python callback server in py4j

2015-10-20 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3842:
--
Assignee: holdenk  (was: Davies Liu)

> Remove the hacks for Python callback server in py4j 
> 
>
> Key: SPARK-3842
> URL: https://issues.apache.org/jira/browse/SPARK-3842
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Streaming
>Reporter: Davies Liu
>Assignee: holdenk
>Priority: Minor
>
> There are three hacks  while create Python API for Streaming 
> (https://github.com/apache/spark/pull/2538) :
> 1. daemonize the callback server thread, by 'thread.daemon = True' before 
> start it  https://github.com/bartdag/py4j/issues/147
> 2. let callback server bind to random port, then update the Java callback 
> client with real port. https://github.com/bartdag/py4j/issues/148
> 3. start the callback server later. https://github.com/bartdag/py4j/issues/149
> These hacks should be removed after py4j has fix these issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3842) Remove the hacks for Python callback server in py4j

2015-10-20 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3842:
---

Assignee: Davies Liu  (was: Apache Spark)

> Remove the hacks for Python callback server in py4j 
> 
>
> Key: SPARK-3842
> URL: https://issues.apache.org/jira/browse/SPARK-3842
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Streaming
>Reporter: Davies Liu
>Assignee: Davies Liu
>Priority: Minor
>
> There are three hacks  while create Python API for Streaming 
> (https://github.com/apache/spark/pull/2538) :
> 1. daemonize the callback server thread, by 'thread.daemon = True' before 
> start it  https://github.com/bartdag/py4j/issues/147
> 2. let callback server bind to random port, then update the Java callback 
> client with real port. https://github.com/bartdag/py4j/issues/148
> 3. start the callback server later. https://github.com/bartdag/py4j/issues/149
> These hacks should be removed after py4j has fix these issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8332) NoSuchMethodError: com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer

2015-10-20 Thread Jonathan Kelly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965484#comment-14965484
 ] 

Jonathan Kelly commented on SPARK-8332:
---

Oh, you might be right. I was thinking that I had tried compiling Spark with a 
newer version of Jackson (2.5.3) and still ran into problems, but now I'm 
second-guessing myself. It might just have been that my classpath had both 
versions.

> NoSuchMethodError: 
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
> --
>
> Key: SPARK-8332
> URL: https://issues.apache.org/jira/browse/SPARK-8332
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
> Environment: spark 1.4 & hadoop 2.3.0-cdh5.0.0
>Reporter: Tao Li
>Priority: Critical
>  Labels: 1.4.0, NoSuchMethodError, com.fasterxml.jackson
>
> I compiled the new Spark 1.4.0 version.
> But when I run a simple WordCount demo, it throws a NoSuchMethodError:
> {code}
> java.lang.NoSuchMethodError: 
> com.fasterxml.jackson.module.scala.deser.BigDecimalDeserializer
> {code}
> I found out that the default "fasterxml.jackson.version" is 2.4.4.
> Is there anything wrong with, or a conflict involving, the Jackson version?
> Or does some project Maven dependency possibly contain the wrong version of 
> Jackson?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver

2015-10-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965530#comment-14965530
 ] 

Jean-Baptiste Onofré commented on SPARK-11193:
--

The warning is interesting; let me check whether I see the same on my build.

I'll keep you posted.

> Spark 1.5+ Kinesis Streaming - ClassCastException when starting 
> KinesisReceiver
> ---
>
> Key: SPARK-11193
> URL: https://issues.apache.org/jira/browse/SPARK-11193
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.0, 1.5.1
>Reporter: Phil Kallos
> Attachments: screen.png
>
>
> After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis 
> Spark Streaming application, and am being consistently greeted with this 
> exception:
> java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast 
> to scala.collection.mutable.SynchronizedMap
>   at 
> org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148)
>   at 
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at 
> org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:88)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>   at java.lang.Thread.run(Thread.java:745)
> Worth noting that I am able to reproduce this issue locally, and also on 
> Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0).
> Also, I am not able to run the included kinesis-asl example.
> Built locally using:
> git checkout v1.5.1
> mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
> Example run command:
> bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector 
> https://kinesis.us-east-1.amazonaws.com



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11201) StreamingContext.getOrCreate is broken in yarn-client mode

2015-10-20 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965533#comment-14965533
 ] 

Hari Shreedharan commented on SPARK-11201:
--

Yep, closed the PR. So I will close this JIRA as a duplicate.


> StreamingContext.getOrCreate is broken in yarn-client mode
> ---
>
> Key: SPARK-11201
> URL: https://issues.apache.org/jira/browse/SPARK-11201
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.5.1
>Reporter: Hari Shreedharan
>
> If {{StreamingContext.getOrCreate}} (or one of the constructors that create the 
> Hadoop {{Configuration}}) is used, {{SparkHadoopUtil.get.conf}} is called 
> before the {{SparkContext}} is created - when SPARK_YARN_MODE is set. In that 
> case {{SparkHadoopUtil.get}} creates a {{SparkHadoopUtil}} instance instead 
> of a {{YarnSparkHadoopUtil}} instance.
> So, in yarn-client mode, a class cast exception gets thrown from 
> {{Client.scala}}:
> {code}
> java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot 
> be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
>   at 
> org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:169)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:266)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:631)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:120)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:523)
>   at 
> org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:854)
>   at 
> org.apache.spark.streaming.StreamingContext.(StreamingContext.scala:81)
>   at 
> com.cloudera.test.LongRunningApp$.com$cloudera$test$LongRunningApp$$createCheckpoint$1(LongRunningApp.scala:33)
>   at 
> com.cloudera.test.LongRunningApp$$anonfun$4.apply(LongRunningApp.scala:90)
>   at 
> com.cloudera.test.LongRunningApp$$anonfun$4.apply(LongRunningApp.scala:90)
>   at scala.Option.getOrElse(Option.scala:120)
>   at 
> org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:844)
>   at com.cloudera.test.LongRunningApp$.main(LongRunningApp.scala:90)
>   at com.cloudera.test.LongRunningApp.main(LongRunningApp.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> {code}
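
For reference, the code path above is reached through the standard checkpoint-recovery pattern; a minimal sketch follows, where the app name, checkpoint directory, and batch interval are placeholders rather than values from this report:

{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GetOrCreateSketch {
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("getOrCreate-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Per the description above, SparkHadoopUtil.get.conf is evaluated here,
    // before any SparkContext exists, which is where the wrong (non-YARN)
    // SparkHadoopUtil instance gets created in yarn-client mode.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}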



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


