[jira] [Resolved] (SPARK-11286) Make Outbox stopped exception singleton

2015-10-23 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu resolved SPARK-11286.

Resolution: Won't Fix

> Make Outbox stopped exception singleton
> ---
>
> Key: SPARK-11286
> URL: https://issues.apache.org/jira/browse/SPARK-11286
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Trivial
>
> In two places in Outbox.scala, a new SparkException is created for the 
> Outbox-stopped condition.
> Create a singleton for the Outbox-stopped exception and use it instead of 
> creating the exception every time.
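
A minimal sketch of the kind of change being proposed (hypothetical names and message, not the actual Outbox.scala code):

{code}
import org.apache.spark.SparkException

// Hypothetical sketch: hoist the repeated exception into one reusable instance
// instead of constructing it at each call site. Note that a shared instance
// carries a single fixed stack trace, which is a common argument against this
// kind of change.
object OutboxErrors {
  val outboxStoppedError = new SparkException("Outbox is stopped")
}
{code}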






[jira] [Commented] (SPARK-11162) Allow enabling debug logging from the command line

2015-10-23 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971520#comment-14971520
 ] 

Ryan Williams commented on SPARK-11162:
---

Makes sense; some googling has left me with the impression that log4j doesn't 
support this, so I guess I'll just modify {{log4j.properties}} going forward. 
Thanks.

Answering my other question from earlier: modifying {{conf/log4j.properties}} 
seems to work in yarn-client mode; I guess I'd only tried 
{{$SPARK_HOME/log4j.properties}} previously.
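
For reference, a hedged example of the kind of {{conf/log4j.properties}} change being discussed (logger names follow Spark's bundled log4j.properties.template; treat them as an assumption):

{code}
# Raise the root logger to DEBUG (the template ships with INFO, console)
log4j.rootCategory=DEBUG, console

# Or, less noisily, enable DEBUG for a specific package only
log4j.logger.org.apache.spark.scheduler=DEBUG
{code}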

> Allow enabling debug logging from the command line
> --
>
> Key: SPARK-11162
> URL: https://issues.apache.org/jira/browse/SPARK-11162
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> Per [~vanzin] on [the user 
> list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html],
>  it would be nice if debug-logging could be enabled from the command line.






[jira] [Updated] (SPARK-11289) Substitute code examples in ML features with include_example

2015-10-23 Thread Xusen Yin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xusen Yin updated SPARK-11289:
--
Description: Substitute code examples with include_example.  (was: 
[~mengxr] I have one question: there are some code examples in the doc that 
do not exist in our example code dir. How should we solve that? Should I add 
new examples under examples/src/main/scala/org/apache/spark/examples to 
support those docs?)

> Substitute code examples in ML features with include_example
> 
>
> Key: SPARK-11289
> URL: https://issues.apache.org/jira/browse/SPARK-11289
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xusen Yin
>Priority: Minor
>
> Substitute code examples with include_example.






[jira] [Commented] (SPARK-11289) Substitute code examples in ML features with include_example

2015-10-23 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971536#comment-14971536
 ] 

Xusen Yin commented on SPARK-11289:
---

[~mengxr] I have one question: there are some code examples in the doc that 
do not exist in our example code dir. How should we solve that? Should I add 
new examples under examples/src/main/scala/org/apache/spark/examples to 
support those docs?

> Substitute code examples in ML features with include_example
> 
>
> Key: SPARK-11289
> URL: https://issues.apache.org/jira/browse/SPARK-11289
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xusen Yin
>Priority: Minor
>
> Substitute code examples with include_example.






[jira] [Commented] (SPARK-5210) Support log rolling in EventLogger

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970670#comment-14970670
 ] 

Apache Spark commented on SPARK-5210:
-

User 'XuTingjun' has created a pull request for this issue:
https://github.com/apache/spark/pull/9246

> Support log rolling in EventLogger
> --
>
> Key: SPARK-5210
> URL: https://issues.apache.org/jira/browse/SPARK-5210
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Reporter: Josh Rosen
>
> For long-running Spark applications (e.g. running for days / weeks), the 
> Spark event log may grow to be very large.
> As a result, it would be useful if EventLoggingListener supported log file 
> rolling / rotation.  Adding this feature will involve changes to the 
> HistoryServer in order to be able to load event logs from a sequence of files 
> instead of a single file.
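
As a generic illustration of the "sequence of files" idea (this is not Spark's EventLoggingListener API, just a sketch of size-based rolling):

{code}
import java.io.{File, FileOutputStream, OutputStream}

// Close the current part and open the next one once it exceeds maxBytes,
// so a reader (here, the History Server) can load the parts in order.
class RollingLogWriter(dir: File, maxBytes: Long) {
  private var part = 0
  private var written = 0L
  private var out: OutputStream = open()

  private def open(): OutputStream =
    new FileOutputStream(new File(dir, s"events-part-$part"))

  def writeLine(line: String): Unit = {
    val bytes = (line + "\n").getBytes("UTF-8")
    if (written + bytes.length > maxBytes) {
      out.close(); part += 1; written = 0L; out = open()
    }
    out.write(bytes)
    written += bytes.length
  }

  def close(): Unit = out.close()
}
{code}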






[jira] [Created] (SPARK-11289) Substitute code examples in ML features with include_example

2015-10-23 Thread Xusen Yin (JIRA)
Xusen Yin created SPARK-11289:
-

 Summary: Substitute code examples in ML features with 
include_example
 Key: SPARK-11289
 URL: https://issues.apache.org/jira/browse/SPARK-11289
 Project: Spark
  Issue Type: Sub-task
Reporter: Xusen Yin
Priority: Minor


[~mengxr] I have one question: there are some code examples in the doc that 
do not exist in our example code dir. How should we solve that? Should I add 
new examples under examples/src/main/scala/org/apache/spark/examples to 
support those docs?






[jira] [Assigned] (SPARK-11277) sort_array throws exception scala.MatchError

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11277:


Assignee: Apache Spark

> sort_array throws exception scala.MatchError
> 
>
> Key: SPARK-11277
> URL: https://issues.apache.org/jira/browse/SPARK-11277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Jia Li
>Assignee: Apache Spark
>Priority: Minor
>
> I was trying out the sort_array function and hit this exception. 
> I looked into the Spark source code and found that the root cause is that 
> sort_array does not check for an array of NULLs. It's not meaningful to sort 
> an array made up entirely of NULLs anyway.
> I already have a fix for this issue and I'm going to create a pull request 
> for it. 
> scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show()
> scala.MatchError: ArrayType(NullType,true) (of class 
> org.apache.spark.sql.types.ArrayType)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   
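
A rough sketch of the shape of the problem described above (illustrative only, not the actual patch; the real fix belongs in SortArray's type handling in collectionOperations.scala):

{code}
import org.apache.spark.sql.types.{ArrayType, DataType, NullType}

// The element type is pattern-matched to pick an ordering, and
// ArrayType(NullType, _) has no case, hence the scala.MatchError above.
// A guard like this would reject or short-circuit that input.
def sortable(dt: DataType): Boolean = dt match {
  case ArrayType(NullType, _) => false // an array of only NULLs cannot be ordered
  case ArrayType(_, _)        => true
  case _                      => false
}
{code}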






[jira] [Created] (SPARK-11279) Add DataFrame#toDF in PySpark

2015-10-23 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-11279:
--

 Summary: Add DataFrame#toDF in PySpark
 Key: SPARK-11279
 URL: https://issues.apache.org/jira/browse/SPARK-11279
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Jeff Zhang
Priority: Minor









[jira] [Commented] (SPARK-10857) SQL injection bug in JdbcDialect.getTableExistsQuery()

2015-10-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970836#comment-14970836
 ] 

Sean Owen commented on SPARK-10857:
---

Rick, you're saying that this code path only comes up when the parser is 
certainly dealing with a table name, as in DDL statements, and not just in 
parsing "SELECT * from (table)"? (You probably know the code best here, given 
you've studied it at close range.)

> SQL injection bug in JdbcDialect.getTableExistsQuery()
> --
>
> Key: SPARK-10857
> URL: https://issues.apache.org/jira/browse/SPARK-10857
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: Rick Hillegas
>Priority: Minor
>
> All of the implementations of this method involve constructing a query by 
> concatenating boilerplate text with a user-supplied name. This looks like a 
> SQL injection bug to me.
> A better solution would be to call java.sql.DatabaseMetaData.getTables() to 
> implement this method, using the catalog and schema which are available from 
> Connection.getCatalog() and Connection.getSchema(). This would not work on 
> Java 6 because Connection.getSchema() was introduced in Java 7. However, the 
> solution would work for more modern JVMs. Limiting the vulnerability to 
> obsolete JVMs would at least be an improvement over the current situation. 
> Java 6 has been end-of-lifed and is not an appropriate platform for users who 
> are concerned about security.
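
A hedged sketch of the suggested alternative using only the standard JDBC metadata API (illustrative, not the actual Spark change):

{code}
import java.sql.Connection

// Check for table existence via DatabaseMetaData instead of splicing the
// user-supplied name into a SQL string. Requires Java 7+ for getSchema().
def tableExists(conn: Connection, table: String): Boolean = {
  val rs = conn.getMetaData.getTables(conn.getCatalog, conn.getSchema, table, Array("TABLE"))
  try rs.next() finally rs.close()
}
{code}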






[jira] [Resolved] (SPARK-11229) NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11229.
---
   Resolution: Cannot Reproduce
Fix Version/s: (was: 1.6.0)

> NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0
> -
>
> Key: SPARK-11229
> URL: https://issues.apache.org/jira/browse/SPARK-11229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux
>Reporter: Romi Kuntsman
>
> Steps to reproduce:
> 1. set spark.shuffle.memoryFraction=0
> 2. load dataframe from parquet file
> 3. see it's read correctly by calling dataframe.show()
> 4. call dataframe.count()
> Expected behaviour:
> get count of rows in dataframe
> OR, if memoryFraction=0 is an invalid setting, get notified about it
> Actual behaviour:
> CatalystReadSupport doesn't read the schema (even though there is one) and 
> then there's a NullPointerException.
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:177)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
>   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
>   at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
>   ... 14 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:194)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:192)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622)
>   at 
> 
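
The reproduction steps above translate roughly to the following (a sketch only; the parquet path and session setup are placeholders):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("spark-11229-repro")
  .setMaster("local[2]")
  .set("spark.shuffle.memoryFraction", "0")   // step 1
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val df = sqlContext.read.parquet("/path/to/data.parquet")  // step 2
df.show()    // step 3: reads and displays correctly
df.count()   // step 4: reported to fail with the NPE shown above
{code}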

[jira] [Commented] (SPARK-11167) Incorrect type resolution on heterogeneous data structures

2015-10-23 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970839#comment-14970839
 ] 

Maciej Szymkiewicz commented on SPARK-11167:


spark-csv has a much simpler job to do, and everything it does is already 
covered by basic R behavior. The tightest type here would most likely be Any, 
which is neither allowed nor useful.

I think the best solution in this case would be a warning when a data frame 
contains complex types and the user doesn't provide a schema, and maybe some 
tool that could replace debug.TypeCheck. Can anyone explain why it 'no longer 
applies in the new "Tungsten" world'? 

https://github.com/apache/spark/pull/8043


> Incorrect type resolution on heterogeneous data structures
> --
>
> Key: SPARK-11167
> URL: https://issues.apache.org/jira/browse/SPARK-11167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Maciej Szymkiewicz
>
> If a structure contains heterogeneous data, the type of the encountered 
> element is incorrectly assigned as the type of the whole structure. This 
> problem affects both lists:
> {code}
> SparkR:::infer_type(list(a=1, b="a")
> ## [1] "array"
> SparkR:::infer_type(list(a="a", b=1))
> ##  [1] "array"
> {code}
> and environments:
> {code}
> SparkR:::infer_type(as.environment(list(a=1, b="a")))
> ## [1] "map"
> SparkR:::infer_type(as.environment(list(a="a", b=1)))
> ## [1] "map"
> {code}
> This results in errors during data collection and other operations on 
> DataFrames:
> {code}
> ldf <- data.frame(row.names=1:2)
> ldf$foo <- list(list("1", 2), list(3, 4))
> sdf <- createDataFrame(sqlContext, ldf)
> collect(sdf)
> ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 
> 9)
> ## scala.MatchError: 2.0 (of class java.lang.Double)
> ## ...
> {code}






[jira] [Commented] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster

2015-10-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971188#comment-14971188
 ] 

Steve Loughran commented on SPARK-11265:


Pull request is : https://github.com/apache/spark/pull/9232

> YarnClient can't get tokens to talk to Hive in a secure cluster
> ---
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized hadoop cluster fails. This appears to be because the 
> constructor of the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.
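
To illustrate the shape of the change (a hedged sketch, not the actual YarnSparkHadoopUtil code): the reflection has to call the static {{Hive.get(HiveConf)}} factory instead of the now-private constructor.

{code}
// Hypothetical sketch of reflection against the factory method; the
// surrounding token-fetching plumbing in Spark's YARN client is omitted.
val hiveClass = Class.forName("org.apache.hadoop.hive.ql.metadata.Hive")
val hiveConfClass = Class.forName("org.apache.hadoop.hive.conf.HiveConf")
val hiveConf = hiveConfClass.newInstance().asInstanceOf[Object]

// Reflecting on the constructor fails once it is private; use the factory:
val hive = hiveClass.getMethod("get", hiveConfClass).invoke(null, hiveConf)
{code}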






[jira] [Resolved] (SPARK-10277) Add @since annotation to pyspark.mllib.regression

2015-10-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10277.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8684
[https://github.com/apache/spark/pull/8684]

> Add @since annotation to pyspark.mllib.regression
> -
>
> Key: SPARK-10277
> URL: https://issues.apache.org/jira/browse/SPARK-10277
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971217#comment-14971217
 ] 

Apache Spark commented on SPARK-7970:
-

User 'nitin2goyal' has created a pull request for this issue:
https://github.com/apache/spark/pull/9253

> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> --
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Nitin Goyal
> Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
> 2015-05-27 at 11.07.02 pm.png
>
>
> The closure cleaner slows down the execution of Spark SQL queries fired on a 
> union of RDDs. The time increases linearly on the driver side with the number 
> of RDDs unioned. Refer to the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, a lot of time is being 
> consumed in the "getClassReader" method of ClosureCleaner and the rest in 
> "ensureSerializable" (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create 
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
> ClosureCleaner's clean method (see PR 
> https://github.com/apache/spark/pull/6256).
> 2. Fix at the Spark core level -
>   (i) Make "checkSerializable" property-driven in SparkContext's clean method
>   (ii) Somehow cache the class reader for the last 'n' classes
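
A rough sketch of the kind of workload described above (assumes a spark-shell session where sqlContext is predefined; the Spark 1.3-era API calls and paths are placeholders):

{code}
// Union many per-partition DataFrames and query them as one table; the
// driver-side closure-cleaning cost grows linearly with the number of
// unioned RDDs, as described in the issue.
val parts = (1 to 200).map(i => sqlContext.parquetFile(s"/data/part-$i"))
val unioned = parts.reduce(_ unionAll _)
unioned.registerTempTable("events")
sqlContext.sql("SELECT count(*) FROM events").collect()
{code}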






[jira] [Assigned] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7970:
---

Assignee: (was: Apache Spark)

> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> --
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Nitin Goyal
> Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
> 2015-05-27 at 11.07.02 pm.png
>
>
> The closure cleaner slows down the execution of Spark SQL queries fired on a 
> union of RDDs. The time increases linearly on the driver side with the number 
> of RDDs unioned. Refer to the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, a lot of time is being 
> consumed in the "getClassReader" method of ClosureCleaner and the rest in 
> "ensureSerializable" (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create 
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
> ClosureCleaner's clean method (see PR 
> https://github.com/apache/spark/pull/6256).
> 2. Fix at the Spark core level -
>   (i) Make "checkSerializable" property-driven in SparkContext's clean method
>   (ii) Somehow cache the class reader for the last 'n' classes






[jira] [Assigned] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7970:
---

Assignee: Apache Spark

> Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
> --
>
> Key: SPARK-7970
> URL: https://issues.apache.org/jira/browse/SPARK-7970
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 1.2.0, 1.3.0
>Reporter: Nitin Goyal
>Assignee: Apache Spark
> Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot 
> 2015-05-27 at 11.07.02 pm.png
>
>
> The closure cleaner slows down the execution of Spark SQL queries fired on a 
> union of RDDs. The time increases linearly on the driver side with the number 
> of RDDs unioned. Refer to the following thread for more context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html
> As can be seen in the attached JProfiler screenshots, a lot of time is being 
> consumed in the "getClassReader" method of ClosureCleaner and the rest in 
> "ensureSerializable" (at least in my case).
> This can be fixed in two ways (as per my current understanding):
> 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create 
> MapPartitionsRDD directly instead of calling rdd.mapPartitions, which invokes 
> ClosureCleaner's clean method (see PR 
> https://github.com/apache/spark/pull/6256).
> 2. Fix at the Spark core level -
>   (i) Make "checkSerializable" property-driven in SparkContext's clean method
>   (ii) Somehow cache the class reader for the last 'n' classes






[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-10-23 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971228#comment-14971228
 ] 

Jerry Lam edited comment on SPARK-4940 at 10/23/15 4:01 PM:


I just want to weigh in on the importance of this issue. My observation is that, 
using coarse-grained mode, if I configure the total core max to 20, I could end 
up having ONE executor with 20 cores. This is not ideal when I have 5 slaves 
with 32 cores each. It would make more sense to have ONE executor per slave, 
with each executor having 4 cores. 

Is there a workaround at the moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Thanks!


was (Author: superwai):
I just want to weigh in on the importance of this issue. My observation is that, 
using coarse-grained mode, if I configure the total core max to 20, I could end 
up having ONE executor with 20 cores. This is not ideal even though I have 5 
slaves with 32 cores each. It would make more sense to have ONE executor per 
slave, with each executor having 4 cores. 

Is there a workaround at the moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Thanks!

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently in Coarse grain mode the spark scheduler simply takes all the 
> resources it can on each node, but can cause uneven distribution based on 
> resources available on each slave.






[jira] [Assigned] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11265:


Assignee: Apache Spark

> YarnClient can't get tokens to talk to Hive in a secure cluster
> ---
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>Assignee: Apache Spark
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized hadoop cluster fails. This appears to be because the 
> constructor of the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.






[jira] [Commented] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971190#comment-14971190
 ] 

Apache Spark commented on SPARK-11265:
--

User 'steveloughran' has created a pull request for this issue:
https://github.com/apache/spark/pull/9232

> YarnClient can't get tokens to talk to Hive in a secure cluster
> ---
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized hadoop cluster fails. This appears to be because the 
> constructor of the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.






[jira] [Assigned] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11265:


Assignee: (was: Apache Spark)

> YarnClient can't get tokens to talk to Hive in a secure cluster
> ---
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized hadoop cluster fails. This appears to be because the 
> constructor of the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.






[jira] [Resolved] (SPARK-6723) Model import/export for ChiSqSelector

2015-10-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6723.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 6785
[https://github.com/apache/spark/pull/6785]

> Model import/export for ChiSqSelector
> -
>
> Key: SPARK-6723
> URL: https://issues.apache.org/jira/browse/SPARK-6723
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
> Fix For: 1.6.0
>
>







[jira] [Updated] (SPARK-10610) Using AppName instead of AppId in the name of all metrics

2015-10-23 Thread Yi Tian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yi Tian updated SPARK-10610:

Summary: Using AppName instead of AppId in the name of all metrics  (was: 
Using AppName instead AppId in the name of all metrics)

> Using AppName instead of AppId in the name of all metrics
> -
>
> Key: SPARK-10610
> URL: https://issues.apache.org/jira/browse/SPARK-10610
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Yi Tian
>Priority: Minor
>
> When we use {{JMX}} to monitor a Spark system, we have to configure the name 
> of the target metrics in the monitoring system. But the current metric name is 
> {{appId}} + {{executorId}} + {{source}}, so whenever the Spark program is 
> restarted, we have to update the metric names in the monitoring system.
> We should add an optional configuration property to control whether the 
> appName is used instead of the appId in the Spark metrics system.
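
A hedged sketch of what the proposal amounts to (the property name and the name-building code are assumptions for illustration, not Spark's MetricsSystem implementation):

{code}
import org.apache.spark.SparkConf

// Choose the metric prefix from the app name or the app id.
def metricsPrefix(conf: SparkConf, executorId: String, source: String): String = {
  val useAppName = conf.getBoolean("spark.metrics.useAppName", defaultValue = false) // assumed key
  val prefix = if (useAppName) conf.get("spark.app.name") else conf.get("spark.app.id")
  s"$prefix.$executorId.$source" // current format: appId.executorId.source
}
{code}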






[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-10-23 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971228#comment-14971228
 ] 

Jerry Lam edited comment on SPARK-4940 at 10/23/15 4:02 PM:


I just want to weigh in on the importance of this issue. My observation is that, 
using coarse-grained mode, if I configure the total core max to 20, I could end 
up having ONE executor with 20 cores. This is not ideal when I have 5 slaves 
with 32 cores each. It would make more sense to have ONE executor per slave, 
with each executor having 4 cores. 

Is there a workaround at the moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Also, I notice that there are much better features for Spark on YARN. Does that 
mean it is better to run Spark on YARN than on Mesos? 

Thanks!


was (Author: superwai):
I just want to weigh in on the importance of this issue. My observation is that, 
using coarse-grained mode, if I configure the total core max to 20, I could end 
up having ONE executor with 20 cores. This is not ideal when I have 5 slaves 
with 32 cores each. It would make more sense to have ONE executor per slave, 
with each executor having 4 cores. 

Is there a workaround at the moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Thanks!

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently in Coarse grain mode the spark scheduler simply takes all the 
> resources it can on each node, but can cause uneven distribution based on 
> resources available on each slave.






[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-10-23 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971228#comment-14971228
 ] 

Jerry Lam edited comment on SPARK-4940 at 10/23/15 4:15 PM:


I just want to weigh in on the importance of this issue. My observation is that, 
using coarse-grained mode, if I configure the total core max to 20, I could end 
up having ONE executor with 20 cores. This is not ideal when I have 5 slaves 
with 32 cores each. It would make more sense to have ONE executor per slave, 
with each executor having 4 cores. 

It is very difficult to use because an executor configured with 10GB of RAM 
could have 20 tasks or 1 task allocated to it (assuming 1 CPU per task). Say 
each task could use up to 2GB of RAM; it would be an OOM for 20 tasks (40GB 
required) and underutilized for 1 task (2GB required). 

Is there a workaround at the moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Also, I notice that there are much better features for Spark on YARN. Does that 
mean it is better to run Spark on YARN than on Mesos? 

Thanks!


was (Author: superwai):
I just want to weigh in on the importance of this issue. My observation is that, 
using coarse-grained mode, if I configure the total core max to 20, I could end 
up having ONE executor with 20 cores. This is not ideal when I have 5 slaves 
with 32 cores each. It would make more sense to have ONE executor per slave, 
with each executor having 4 cores. 

Is there a workaround at the moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Also, I notice that there are much better features for Spark on YARN. Does that 
mean it is better to run Spark on YARN than on Mesos? 

Thanks!

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently in Coarse grain mode the spark scheduler simply takes all the 
> resources it can on each node, but can cause uneven distribution based on 
> resources available on each slave.






[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-10-23 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971228#comment-14971228
 ] 

Jerry Lam edited comment on SPARK-4940 at 10/23/15 4:16 PM:


I just want to weigh in on the importance of this issue. My observation is that, 
using coarse-grained mode, if I configure the total core max to 20, I could end 
up having ONE executor with 20 cores. This is not ideal when I have 5 slaves 
with 32 cores each. It would make more sense to have ONE executor per slave, 
with each executor having 4 cores. 

It is very difficult to use (or is it impossible to use?) because an executor 
configured with 10GB of RAM could have 20 tasks or 1 task allocated to it 
(assuming 1 CPU per task). Say each task could use up to 2GB of RAM; it would 
be an OOM for 20 tasks (40GB required) and underutilized for 1 task (2GB 
required). 

Is there a workaround at the moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Also, I notice that there are much better features for Spark on YARN. Does that 
mean it is better to run Spark on YARN than on Mesos? 

Thanks!


was (Author: superwai):
I just want to weigh in on the importance of this issue. My observation is that, 
using coarse-grained mode, if I configure the total core max to 20, I could end 
up having ONE executor with 20 cores. This is not ideal when I have 5 slaves 
with 32 cores each. It would make more sense to have ONE executor per slave, 
with each executor having 4 cores. 

It is very difficult to use because an executor configured with 10GB of RAM 
could have 20 tasks or 1 task allocated to it (assuming 1 CPU per task). Say 
each task could use up to 2GB of RAM; it would be an OOM for 20 tasks (40GB 
required) and underutilized for 1 task (2GB required). 

Is there a workaround at the moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Also, I notice that there are much better features for Spark on YARN. Does that 
mean it is better to run Spark on YARN than on Mesos? 

Thanks!

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently in Coarse grain mode the spark scheduler simply takes all the 
> resources it can on each node, but can cause uneven distribution based on 
> resources available on each slave.






[jira] [Updated] (SPARK-6723) Model import/export for ChiSqSelector

2015-10-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-6723:
-
Assignee: Jayant Shekhar

> Model import/export for ChiSqSelector
> -
>
> Key: SPARK-6723
> URL: https://issues.apache.org/jira/browse/SPARK-6723
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Jayant Shekhar
>Priority: Minor
> Fix For: 1.6.0
>
>







[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-10-23 Thread Jerry Lam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971228#comment-14971228
 ] 

Jerry Lam commented on SPARK-4940:
--

I just want to weigh in on the importance of this issue. My observation is that, 
using coarse-grained mode, if I configure the total core max to 20, I could end 
up having ONE executor with 20 cores. This is not ideal even though I have 5 
slaves with 32 cores each. It would make more sense to have ONE executor per 
slave, with each executor having 4 cores. 

Is there a workaround at the moment, using Spark 1.5.1, to make the load more 
evenly distributed on Mesos? How do people actually use Spark on Mesos when the 
resources are not distributed evenly?

Thanks!

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently in Coarse grain mode the spark scheduler simply takes all the 
> resources it can on each node, but can cause uneven distribution based on 
> resources available on each slave.






[jira] [Commented] (SPARK-10975) Shuffle files left behind on Mesos without dynamic allocation

2015-10-23 Thread Chris Bannister (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971265#comment-14971265
 ] 

Chris Bannister commented on SPARK-10975:
-

Spark will use MESOS_DIRECTORY sandbox when not using shuffle service now that 
SPARK-9708 is merged. Is this a duplicate?

> Shuffle files left behind on Mesos without dynamic allocation
> -
>
> Key: SPARK-10975
> URL: https://issues.apache.org/jira/browse/SPARK-10975
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.1
>Reporter: Iulian Dragos
>Priority: Blocker
>
> (from mailing list)
> Running on Mesos in coarse-grained mode. No dynamic allocation or shuffle 
> service. 
> I see that there are two types of temporary files under the /tmp folder 
> associated with every executor: /tmp/spark- and /tmp/blockmgr-. 
> When the job is finished, /tmp/spark- is gone, but the blockmgr directory is 
> left behind with all its gigabytes in it. 
> The reason is that logic to clean up files is only enabled when the shuffle 
> service is running, see https://github.com/apache/spark/pull/7820
> The shuffle files should be placed in the Mesos sandbox or under `tmp/spark` 
> unless the shuffle service is enabled.






[jira] [Updated] (SPARK-11280) Mesos cluster deployment using only one node

2015-10-23 Thread Iulian Dragos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iulian Dragos updated SPARK-11280:
--
Attachment: Screen Shot 2015-10-23 at 11.37.43.png

> Mesos cluster deployment using only one node
> 
>
> Key: SPARK-11280
> URL: https://issues.apache.org/jira/browse/SPARK-11280
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Iulian Dragos
> Attachments: Screen Shot 2015-10-23 at 11.37.43.png
>
>
> I submit the SparkPi example in Mesos cluster mode, and I notice that all 
> tasks fail except the ones that run on the same node as the driver. The 
> others fail with
> {code}
> sh: 1: 
> /tmp/mesos/slaves/1521e408-d8fe-416d-898b-3801e73a8293-S0/frameworks/1521e408-d8fe-416d-898b-3801e73a8293-0003/executors/driver-20151023113121-0006/runs/2abefd29-7386-4d81-a025-9d794780db23/spark-1.5.0-bin-hadoop2.6/bin/spark-class:
>  not found
> {code}
> The path exists only on the machine that launched the driver, and the sandbox 
> of the executor where this task died is completely empty.
> I launch the task like this:
> {code}
>  $ spark-submit --deploy-mode cluster --master mesos://sagitarius.local:7077 
> --conf 
> spark.executor.uri="ftp://sagitarius.local/ftp/spark-1.5.0-bin-hadoop2.6.tgz" 
> --conf spark.mesos.coarse=true --class org.apache.spark.examples.SparkPi 
> ftp://sagitarius.local/ftp/spark-examples-1.5.0-hadoop2.6.0.jar
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 15/10/23 11:31:21 INFO RestSubmissionClient: Submitting a request to launch 
> an application in mesos://sagitarius.local:7077.
> 15/10/23 11:31:21 INFO RestSubmissionClient: Submission successfully created 
> as driver-20151023113121-0006. Polling submission state...
> 15/10/23 11:31:21 INFO RestSubmissionClient: Submitting a request for the 
> status of submission driver-20151023113121-0006 in 
> mesos://sagitarius.local:7077.
> 15/10/23 11:31:21 INFO RestSubmissionClient: State of driver 
> driver-20151023113121-0006 is now QUEUED.
> 15/10/23 11:31:21 INFO RestSubmissionClient: Server responded with 
> CreateSubmissionResponse:
> {
>   "action" : "CreateSubmissionResponse",
>   "serverSparkVersion" : "1.5.0",
>   "submissionId" : "driver-20151023113121-0006",
>   "success" : true
> }
> {code}
> I can see the driver in the Dispatcher UI and the job succeeds eventually, 
> but running only on the node where the driver was launched (see attachment).






[jira] [Updated] (SPARK-7021) JUnit output for Python tests

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7021:
-
Assignee: Gabor Liptak

> JUnit output for Python tests
> -
>
> Key: SPARK-7021
> URL: https://issues.apache.org/jira/browse/SPARK-7021
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Project Infra
>Reporter: Brennon York
>Assignee: Gabor Liptak
>Priority: Minor
>  Labels: starter
> Fix For: 1.6.0
>
>
> Currently python returns its test output in its own format. What would be 
> preferred is if the Python test runner could output its test results in JUnit 
> format to better match the rest of the Jenkins test output.






[jira] [Comment Edited] (SPARK-9325) Support `collect` on DataFrame columns

2015-10-23 Thread Russell Pierce (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970917#comment-14970917
 ] 

Russell Pierce edited comment on SPARK-9325 at 10/23/15 12:53 PM:
--

You're right, Spark had been producing an error because the df$col in question 
was a TINYINT stored in Parquet, not that the command itself didn't work; that 
problem seems to have been addressed in another Issue 
(https://issues.apache.org/jira/browse/SPARK-3575).


was (Author: rpierce):
You're right, Spark had been producing an error because the df$col in question 
was a TINYINT stored in Parquet, not that the command itself didn't work.

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.






[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread Maciej Bryński (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11282:
---
Description: 
Hi,
I found some very strange broadcast join behaviour.

As described in this Jira, https://issues.apache.org/jira/browse/SPARK-10577, 
I'm using a hint for the broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that whether this feature works depends on executor memory.
In my case the broadcast join works with up to 31G. 

Example:

{code}
spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]
{code}

Please find example code attached.

  was:
Hi,
I found some very strange broadcast join behaviour.

As described in this Jira, https://issues.apache.org/jira/browse/SPARK-10577, 
I'm using a hint for the broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that whether this feature works depends on executor memory.
In my case the broadcast join works with up to 31G. 

Example:


spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.


> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found some very strange broadcast join behaviour.
> As described in this Jira, https://issues.apache.org/jira/browse/SPARK-10577, 
> I'm using a hint for the broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that whether this feature works depends on executor memory.
> In my case the broadcast join works with up to 31G. 
> Example:
> {code}
> spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=5, val2=5)]
> spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
> true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> {code}
> Please find example code attached.
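
For context, the broadcast hint being discussed looks roughly like this on the Scala side (a sketch with placeholder DataFrames; the issue itself exercises the PySpark equivalent in the attached script):

{code}
import org.apache.spark.sql.functions.broadcast

// Ask the planner to broadcast the smaller table regardless of its size
// statistics (the hint function is available from Spark 1.5 on).
// largeDf and smallDf are placeholder DataFrames joined on an "id" column.
val joined = largeDf.join(broadcast(smallDf), "id")
joined.filter("id = 5").show()
{code}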






[jira] [Resolved] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11282.
---
Resolution: Duplicate

[~maver1ck] this could use a better title, and there is no code attached. I 
also strongly suspect it duplicates 
https://issues.apache.org/jira/browse/SPARK-10914

> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found some very strange broadcast join behaviour.
> As described in this Jira, https://issues.apache.org/jira/browse/SPARK-10577, 
> I'm using a hint for the broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that whether this feature works depends on executor memory.
> In my case the broadcast join works with up to 31G. 
> Example:
>   spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
>   Creating test tables...
>   Joining tables...
>   Joined table schema:
>   root
>|-- id: long (nullable = true)
>|-- val: long (nullable = true)
>|-- id2: long (nullable = true)
>|-- val2: long (nullable = true)
>   Selecting data for id = 5...
>   [Row(id=5, val=5, id2=5, val2=5)]
>   spark$ ~/spark/bin/spark-submit --executor-memory 32G 
> debug_broadcast_join.py true
>   Creating test tables...
>   Joining tables...
>   Joined table schema:
>   root
>|-- id: long (nullable = true)
>|-- val: long (nullable = true)
>|-- id2: long (nullable = true)
>|-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster

2015-10-23 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970930#comment-14970930
 ] 

Steve Loughran commented on SPARK-11265:


I can trigger a failure in a unit test now: once you get past Hive failing to 
load (a classpath issue), the {{get()}} operation fails
{code}
 obtain Tokens For HiveMetastore *** FAILED ***  
java.lang.IllegalArgumentException: wrong number of arguments
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokenForHiveMetastoreInner(YarnSparkHadoopUtil.scala:203)
  at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtilSuite$$anonfun$22.apply(YarnSparkHadoopUtilSuite.scala:254)
  at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtilSuite$$anonfun$22.apply(YarnSparkHadoopUtilSuite.scala:249)
  at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
{code}

> YarnClient can't get tokens to talk to Hive in a secure cluster
> ---
>
> Key: SPARK-11265
> URL: https://issues.apache.org/jira/browse/SPARK-11265
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.1
> Environment: Kerberized Hadoop cluster
>Reporter: Steve Loughran
>
> As reported on the dev list, trying to run a YARN client which wants to talk 
> to Hive in a Kerberized hadoop cluster fails. This appears to be because the 
> constructor of the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was 
> made private and replaced with a factory method. The YARN client uses 
> reflection to get the tokens, so the signature changes weren't picked up in 
> SPARK-8064.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970931#comment-14970931
 ] 

Maciej Bryński commented on SPARK-11282:


We had race condition here.
I was attaching file when you answered.

I'll try solution of 10914

> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
> {code}
> spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=5, val2=5)]
> spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
> true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> {code}
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970952#comment-14970952
 ] 

Maciej Bryński commented on SPARK-11282:


UPDATE:
It looks like {{-XX:-UseCompressedOops}} solves the problem.
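
For reference, a minimal PySpark sketch of one way to apply that flag 
consistently; {{spark.driver.extraJavaOptions}} and 
{{spark.executor.extraJavaOptions}} are standard configuration keys, and 
depending on deploy mode the driver option may need to be passed on the 
{{spark-submit}} command line instead of being set in code:

{code}
from pyspark import SparkConf, SparkContext

# SPARK-10914: UnsafeRow layouts differ between JVMs with and without
# compressed oops, and the JVM turns compressed oops off on its own for
# heaps of roughly 32G and above. Disabling the flag everywhere keeps the
# layouts consistent across driver and executors.
conf = (SparkConf()
        .setAppName("broadcast-join-repro")
        .set("spark.executor.memory", "32g")
        .set("spark.driver.extraJavaOptions", "-XX:-UseCompressedOops")
        .set("spark.executor.extraJavaOptions", "-XX:-UseCompressedOops"))

sc = SparkContext(conf=conf)
{code}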

> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
> {code}
> spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=5, val2=5)]
> spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
> true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> {code}
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set

2015-10-23 Thread Glyton Camilleri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971004#comment-14971004
 ] 

Glyton Camilleri commented on SPARK-6847:
-

Hi,
we managed to get rid of the overflow issues by setting checkpoints on more 
streams than we thought we needed to, in addition to implementing a small 
change following your suggestion; before the fix, the setup was similar to 
what you describe:

{code}
val dStream1 = // create kafka stream and do some preprocessing
val dStream2 = dStream1.updateStateByKey { func }.checkpoint(timeWindow * 2)
val dStream3 = dStream2.map { ... }

// (1) perform some side-effect on the state
if (certainConditionsAreMet) dStream2.foreachRDD { 
  _.foreachPartition { ... }
}

// (2) publish final results to a set of Kafka topics
dStream3.transform { ... }.foreachRDD {
  _.foreachPartition { ... }
}
{code}

There were two things we did (a rough sketch of the resulting setup follows 
below):
a) set separate checkpoints for {{dStream2}} and {{dStream3}}, whereas before 
we were only setting the checkpoint on {{dStream2}}
b) changed (1) above such that when {{!certainConditionsAreMet}}, we just 
consume the stream, as you describe in your suggestion

I honestly think that b) was the more influential change in removing the 
StackOverflowError, but we decided to keep the checkpoint settings from a) 
anyway.
Apologies for the late follow-up, but we needed to make sure the issue had 
actually been resolved.
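
For illustration only, a rough PySpark rendering of the pattern in a) and b); 
the real job above is Scala, and the socket source, interval values and the 
{{certain_conditions_are_met}} placeholder here are made up:

{code}
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="state-checkpoint-sketch")
ssc = StreamingContext(sc, 10)
ssc.checkpoint("checkpoint")

d_stream1 = ssc.socketTextStream("localhost", 9999).map(lambda line: (1, line))

# a) checkpoint both the stateful stream and the stream derived from it
d_stream2 = d_stream1.updateStateByKey(
    lambda new_values, old: new_values[0] if new_values else old)
d_stream2.checkpoint(20)
d_stream3 = d_stream2.map(lambda kv: kv[1])
d_stream3.checkpoint(20)

certain_conditions_are_met = False  # placeholder for the real condition

def handle_state(rdd):
    if certain_conditions_are_met:
        rdd.foreachPartition(lambda partition: None)  # real side-effect goes here
    else:
        rdd.count()  # b) still consume the stream so its lineage is truncated

d_stream2.foreachRDD(handle_state)
d_stream3.foreachRDD(lambda rdd: rdd.count())  # stands in for the publish step

ssc.start()
ssc.awaitTermination()
{code}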

> Stack overflow on updateStateByKey which followed by a dstream with 
> checkpoint set
> --
>
> Key: SPARK-6847
> URL: https://issues.apache.org/jira/browse/SPARK-6847
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.3.0
>Reporter: Jack Hu
>  Labels: StackOverflowError, Streaming
>
> The issue happens with the following sample code: uses {{updateStateByKey}} 
> followed by a {{map}} with checkpoint interval 10 seconds
> {code}
> val sparkConf = new SparkConf().setAppName("test")
> val streamingContext = new StreamingContext(sparkConf, Seconds(10))
> streamingContext.checkpoint("""checkpoint""")
> val source = streamingContext.socketTextStream("localhost", )
> val updatedResult = source.map(
> (1,_)).updateStateByKey(
> (newlist : Seq[String], oldstate : Option[String]) => 
> newlist.headOption.orElse(oldstate))
> updatedResult.map(_._2)
> .checkpoint(Seconds(10))
> .foreachRDD((rdd, t) => {
>   println("Deep: " + rdd.toDebugString.split("\n").length)
>   println(t.toString() + ": " + rdd.collect.length)
> })
> streamingContext.start()
> streamingContext.awaitTermination()
> {code}
> From the output, we can see that the dependency chain keeps growing over 
> time, the {{updateStateByKey}} stream never gets checkpointed, and finally a 
> stack overflow happens. 
> Note:
> * The RDD in {{updatedResult.map(_._2)}} gets checkpointed in this case, but 
> not the {{updateStateByKey}} stream
> * If the {{checkpoint(Seconds(10))}} is removed from the map result 
> ( {{updatedResult.map(_._2)}} ), the stack overflow will not happen



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11283) List column gets additional level of nesting when converted to Spark DataFrame

2015-10-23 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-11283:
--

 Summary: List column gets additional level of nesting when 
converted to Spark DataFrame
 Key: SPARK-11283
 URL: https://issues.apache.org/jira/browse/SPARK-11283
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.6.0
 Environment: R 3.2.2, Spark build from master 
487d409e71767c76399217a07af8de1bb0da7aa8
Reporter: Maciej Szymkiewicz


When the input data frame contains a list column, there is an additional level 
of nesting in the Spark DataFrame and, as a result, the collected data is no 
longer identical to the input:

{code}
ldf <- data.frame(row.names=1:2)
ldf$x <- list(list(1), list(2))
sdf <- createDataFrame(sqlContext, ldf)

printSchema(sdf)
## root
##  |-- x: array (nullable = true)
##  ||-- element: array (containsNull = true)
##  |||-- element: double (containsNull = true)

identical(ldf, collect(sdf))
## [1] FALSE
{code}

Comparing structure:

Local df

{code}
unclass(ldf)
## $x
## $x[[1]]
## $x[[1]][[1]]
## [1] 1
##
## $x[[2]]
## $x[[2]][[1]]
## [1] 2
##
## attr(,"row.names")
## [1] 1 2
{code}

Collected

{code}
unclass(collect(sdf))
## $x
## $x[[1]]
## $x[[1]][[1]]
## $x[[1]][[1]][[1]]
## [1] 1
## 
## $x[[2]]
## $x[[2]][[1]]
## $x[[2]][[1]][[1]]
## [1] 2
##
## attr(,"row.names")
## [1] 1 2
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-10-23 Thread Russell Pierce (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970917#comment-14970917
 ] 

Russell Pierce commented on SPARK-9325:
---

You're right: Spark had been producing an error because the df$col in question 
was a TINYINT stored in Parquet, not because the command itself didn't work.

> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, which has no functions supported.
> Similarly we might consider supporting `head(df$Age)` etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11167) Incorrect type resolution on heterogeneous data structures

2015-10-23 Thread Maciej Szymkiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970922#comment-14970922
 ] 

Maciej Szymkiewicz commented on SPARK-11167:


Related problem: https://issues.apache.org/jira/browse/SPARK-11281


> Incorrect type resolution on heterogeneous data structures
> --
>
> Key: SPARK-11167
> URL: https://issues.apache.org/jira/browse/SPARK-11167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Maciej Szymkiewicz
>
> If a structure contains heterogeneous elements, the type of the first 
> encountered element is incorrectly assigned as the type of the whole 
> structure. This problem affects both lists:
> {code}
> SparkR:::infer_type(list(a=1, b="a")
> ## [1] "array"
> SparkR:::infer_type(list(a="a", b=1))
> ##  [1] "array"
> {code}
> and environments:
> {code}
> SparkR:::infer_type(as.environment(list(a=1, b="a")))
> ## [1] "map"
> SparkR:::infer_type(as.environment(list(a="a", b=1)))
> ## [1] "map"
> {code}
> This results in errors during data collection and other operations on 
> DataFrames:
> {code}
> ldf <- data.frame(row.names=1:2)
> ldf$foo <- list(list("1", 2), list(3, 4))
> sdf <- createDataFrame(sqlContext, ldf)
> collect(sdf)
> ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 
> 9)
> ## scala.MatchError: 2.0 (of class java.lang.Double)
> ## ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled

2015-10-23 Thread Jim Haughwout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970915#comment-14970915
 ] 

Jim Haughwout commented on SPARK-6270:
--

[~tdas]: Can the team update this issue to reflect that this _also_ affects 
Versions 1.3.1, 1.4.0, 1.4.1, 1.5.0, and 1.5.1?

> Standalone Master hangs when streaming job completes and event logging is 
> enabled
> -
>
> Key: SPARK-6270
> URL: https://issues.apache.org/jira/browse/SPARK-6270
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming
>Affects Versions: 1.2.0, 1.2.1, 1.3.0
>Reporter: Tathagata Das
>Priority: Critical
>
> If the event logging is enabled, the Spark Standalone Master tries to 
> recreate the web UI of a completed Spark application from its event logs. 
> However if this event log is huge (e.g. for a Spark Streaming application), 
> then the master hangs in its attempt to read and recreate the web ui. This 
> hang causes the whole standalone cluster to be unusable. 
> Workaround is to disable the event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11282:
---
Description: 
Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:


spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.

  was:
Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:

spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.


> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
>   
>   spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
>   Creating test tables...
>   Joining tables...
>   Joined table schema:
>   root
>|-- id: long (nullable = true)
>|-- val: long (nullable = true)
>|-- id2: long (nullable = true)
>|-- val2: long (nullable = true)
>   Selecting data for id = 5...
>   [Row(id=5, val=5, id2=5, val2=5)]
>   spark$ ~/spark/bin/spark-submit --executor-memory 32G 
> debug_broadcast_join.py true
>   Creating test tables...
>   Joining tables...
>   Joined table schema:
>   root
>|-- id: long (nullable = true)
>|-- val: long (nullable = true)
>|-- id2: long (nullable = true)
>|-- val2: long (nullable = true)
>   Selecting data for id = 5...
>   [Row(id=5, val=5, id2=None, val2=None)]
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971027#comment-14971027
 ] 

Apache Spark commented on SPARK-11016:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/9243

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of Roaring bitmaps is required by a job. 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11016:


Assignee: (was: Apache Spark)

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of Roaring bitmaps is required by a job. 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11282:
---
Description: 
Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:

spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.

  was:
Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:

spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.


> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
>   spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
>   Creating test tables...
>   Joining tables...
>   Joined table schema:
>   root
>|-- id: long (nullable = true)
>|-- val: long (nullable = true)
>|-- id2: long (nullable = true)
>|-- val2: long (nullable = true)
>   Selecting data for id = 5...
>   [Row(id=5, val=5, id2=5, val2=5)]
>   spark$ ~/spark/bin/spark-submit --executor-memory 32G 
> debug_broadcast_join.py true
>   Creating test tables...
>   Joining tables...
>   Joined table schema:
>   root
>|-- id: long (nullable = true)
>|-- val: long (nullable = true)
>|-- id2: long (nullable = true)
>|-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Bryński updated SPARK-11282:
---
Attachment: SPARK-11282.py

> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
> spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=5, val2=5)]
> spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
> true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970931#comment-14970931
 ] 

Maciej Bryński edited comment on SPARK-11282 at 10/23/15 1:07 PM:
--

We had race condition here.
I was attaching file when you answered.

You're probably right.
I'll try solution of https://issues.apache.org/jira/browse/SPARK-10914


was (Author: maver1ck):
We had race condition here.
I was attaching file when you answered.

Uue're probably right.
I'll try solution of https://issues.apache.org/jira/browse/SPARK-10914

> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
> {code}
> spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=5, val2=5)]
> spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
> true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> {code}
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970931#comment-14970931
 ] 

Maciej Bryński edited comment on SPARK-11282 at 10/23/15 1:07 PM:
--

We had race condition here.
I was attaching file when you answered.

Uue're probably right.
I'll try solution of https://issues.apache.org/jira/browse/SPARK-10914


was (Author: maver1ck):
We had race condition here.
I was attaching file when you answered.

I'll try solution of 10914

> Very strange broadcast join behaviour
> -
>
> Key: SPARK-11282
> URL: https://issues.apache.org/jira/browse/SPARK-11282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.1
>Reporter: Maciej Bryński
>Priority: Critical
> Attachments: SPARK-11282.py
>
>
> Hi,
> I found very strange broadcast join behaviour.
> According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
> I'm using hint for broadcast join. (I patched 1.5.1 with 
> https://github.com/apache/spark/pull/8801/files )
> I found that working of this feature depends on Executor Memory.
> In my case broadcast join is working up to 31G. 
> Example:
> {code}
> spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
> debug_broadcast_join.py true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=5, val2=5)]
> spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
> true
> Creating test tables...
> Joining tables...
> Joined table schema:
> root
>  |-- id: long (nullable = true)
>  |-- val: long (nullable = true)
>  |-- id2: long (nullable = true)
>  |-- val2: long (nullable = true)
> Selecting data for id = 5...
> [Row(id=5, val=5, id2=None, val2=None)]
> {code}
> Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11258) Remove quadratic runtime complexity for converting a Spark DataFrame into an R data.frame

2015-10-23 Thread Frank Rosner (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971025#comment-14971025
 ] 

Frank Rosner commented on SPARK-11258:
--

Actually, I am pretty confused now. Thinking about it, having a for loop and a 
map should not access every element more than once. However, it still seems 
more complex than necessary to me. Let me try to reproduce the fact that we 
could not load the data with the old function but could with the new one. Maybe 
the .toArray method is a memory problem, as it first recreates the whole 
shebang and then copies it to another array?
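
To make the m-passes-vs-one-pass point concrete, a tiny standalone sketch 
(plain Python purely for illustration; {{dfToCols}} itself is Scala), not the 
actual implementation:

{code}
rows = [(1, "a"), (2, "b"), (3, "c")]   # row-wise data: n rows, m columns
m = len(rows[0])

# per-column approach: the row collection is traversed once per column,
# i.e. m separate traversals plus any intermediate materialisation
cols_per_column = [[row[j] for row in rows] for j in range(m)]

# single-pass approach: each row is visited exactly once and its fields
# are appended straight into the column buffers
cols_single_pass = [[] for _ in range(m)]
for row in rows:
    for j, value in enumerate(row):
        cols_single_pass[j].append(value)

assert cols_per_column == cols_single_pass
{code}

Both variants touch n * m cells in total; the practical difference is the 
number of traversals of the row array and the intermediate copies, which seems 
consistent with the confusion expressed above.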

> Remove quadratic runtime complexity for converting a Spark DataFrame into an 
> R data.frame
> -
>
> Key: SPARK-11258
> URL: https://issues.apache.org/jira/browse/SPARK-11258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Frank Rosner
>
> h4. Introduction
> We tried to collect a DataFrame with > 1 million rows and a few hundred 
> columns in SparkR. This took a huge amount of time (much more than in the 
> Spark REPL). When looking into the code, I found that the 
> {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run 
> time complexity (it goes through the complete data set _m_ times, where _m_ 
> is the number of columns.
> h4. Problem
> The {{dfToCols}} method is transposing the row-wise representation of the 
> Spark DataFrame (array of rows) into a column wise representation (array of 
> columns) to then be put into a data frame. This is done in a very inefficient 
> way, yielding to huge performance (and possibly also memory) problems when 
> collecting bigger data frames.
> h4. Solution
> Directly transpose the row wise representation to the column wise 
> representation with one pass through the data. I will create a pull request 
> for this.
> h4. Runtime comparison
> On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
> method takes average 2267 ms to complete. My implementation takes only 554 ms 
> on average. This effect gets even bigger, the more columns you have.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11282) Very strange broadcast join behaviour

2015-10-23 Thread JIRA
Maciej Bryński created SPARK-11282:
--

 Summary: Very strange broadcast join behaviour
 Key: SPARK-11282
 URL: https://issues.apache.org/jira/browse/SPARK-11282
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.5.1
Reporter: Maciej Bryński
Priority: Critical


Hi,
I found very strange broadcast join behaviour.

According to this Jira https://issues.apache.org/jira/browse/SPARK-10577
I'm using hint for broadcast join. (I patched 1.5.1 with 
https://github.com/apache/spark/pull/8801/files )

I found that working of this feature depends on Executor Memory.
In my case broadcast join is working up to 31G. 

Example:

spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G 
debug_broadcast_join.py true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=5, val2=5)]
spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py 
true
Creating test tables...
Joining tables...
Joined table schema:
root
 |-- id: long (nullable = true)
 |-- val: long (nullable = true)
 |-- id2: long (nullable = true)
 |-- val2: long (nullable = true)

Selecting data for id = 5...
[Row(id=5, val=5, id2=None, val2=None)]

Please find example code attached.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11281) Issue with creating and collecting DataFrame using environments

2015-10-23 Thread Maciej Szymkiewicz (JIRA)
Maciej Szymkiewicz created SPARK-11281:
--

 Summary: Issue with creating and collecting DataFrame using 
environments 
 Key: SPARK-11281
 URL: https://issues.apache.org/jira/browse/SPARK-11281
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.6.0
 Environment: R 3.2.2, Spark build from master  
487d409e71767c76399217a07af8de1bb0da7aa8
Reporter: Maciej Szymkiewicz


It is not possible to access a Map field created from an environment. Assuming 
a local data frame is created as follows:

{code}
ldf <- data.frame(row.names=1:2)
ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3)))
str(ldf)
## 'data.frame':2 obs. of  1 variable:
##  $ x:List of 2
##   ..$ : 
##   ..$ : 

get("a", ldf$x[[1]])
## [1] 1

get("c", ldf$x[[2]])
## [1] 3
{code}

It is possible to create a Spark data frame:

{code}
sdf <- createDataFrame(sqlContext, ldf)
printSchema(sdf)

## root
##  |-- x: array (nullable = true)
##  ||-- element: map (containsNull = true)
##  |||-- key: string
##  |||-- value: double (valueContainsNull = true)
{code}

but it throws:

{code}
java.lang.IllegalArgumentException: Invalid array type e
{code}

on collect / head. 

The problem seems to be specific to environments and cannot be reproduced when 
the Map comes, for example, from a Cassandra table.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11167) Incorrect type resolution on heterogeneous data structures

2015-10-23 Thread Maciej Szymkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-11167:
---
Comment: was deleted

(was: Related problem: https://issues.apache.org/jira/browse/SPARK-11281
)

> Incorrect type resolution on heterogeneous data structures
> --
>
> Key: SPARK-11167
> URL: https://issues.apache.org/jira/browse/SPARK-11167
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Maciej Szymkiewicz
>
> If a structure contains heterogeneous elements, the type of the first 
> encountered element is incorrectly assigned as the type of the whole 
> structure. This problem affects both lists:
> {code}
> SparkR:::infer_type(list(a=1, b="a")
> ## [1] "array"
> SparkR:::infer_type(list(a="a", b=1))
> ##  [1] "array"
> {code}
> and environments:
> {code}
> SparkR:::infer_type(as.environment(list(a=1, b="a")))
> ## [1] "map"
> SparkR:::infer_type(as.environment(list(a="a", b=1)))
> ## [1] "map"
> {code}
> This results in errors during data collection and other operations on 
> DataFrames:
> {code}
> ldf <- data.frame(row.names=1:2)
> ldf$foo <- list(list("1", 2), list(3, 4))
> sdf <- createDataFrame(sqlContext, ldf)
> collect(sdf)
> ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 
> 9)
> ## scala.MatchError: 2.0 (of class java.lang.Double)
> ## ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled

2015-10-23 Thread Steve Loughran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Loughran updated SPARK-6270:
--
Affects Version/s: 1.5.1

> Standalone Master hangs when streaming job completes and event logging is 
> enabled
> -
>
> Key: SPARK-6270
> URL: https://issues.apache.org/jira/browse/SPARK-6270
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.5.1
>Reporter: Tathagata Das
>Priority: Critical
>
> If the event logging is enabled, the Spark Standalone Master tries to 
> recreate the web UI of a completed Spark application from its event logs. 
> However if this event log is huge (e.g. for a Spark Streaming application), 
> then the master hangs in its attempt to read and recreate the web ui. This 
> hang causes the whole standalone cluster to be unusable. 
> Workaround is to disable the event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11284) ALS produces predictions as floats and should be double

2015-10-23 Thread Dominik Dahlem (JIRA)
Dominik Dahlem created SPARK-11284:
--

 Summary: ALS produces predictions as floats and should be double
 Key: SPARK-11284
 URL: https://issues.apache.org/jira/browse/SPARK-11284
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 1.5.1
 Environment: All
Reporter: Dominik Dahlem


Using pyspark.ml and DataFrames, the ALS recommender cannot be evaluated using 
the RegressionEvaluator because of a type mismatch between the model 
transformation and the evaluation APIs. One can work around this by casting the 
prediction column to double before passing it into the evaluator (a minimal 
sketch of that cast follows the traceback below). However, this does not work 
with pipelines and cross validation.

Code and traceback below:

{code}
als = ALS(rank=10, maxIter=30, regParam=0.1, userCol='userID', 
itemCol='movieID', ratingCol='rating')
model = als.fit(training)
predictions = model.transform(validation)
evaluator = RegressionEvaluator(predictionCol='prediction', 
labelCol='rating')
validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 
'rmse'})
{code}

Traceback:
validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 
'rmse'})
  File 
"/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
 line 63, in evaluate
  File 
"/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py",
 line 94, in _evaluate
  File 
"/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
 line 813, in __call__
  File 
"/Users/dominikdahlem/projects/repositories/spark/python/pyspark/sql/utils.py", 
line 42, in deco
raise IllegalArgumentException(s.split(': ', 1)[1])
pyspark.sql.utils.IllegalArgumentException: requirement failed: Column 
prediction must be of type DoubleType but was actually FloatType.
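
A minimal sketch of the cast workaround mentioned in the description, reusing 
{{model}} and {{validation}} from the snippet above ({{withColumn}}, {{col}} 
and {{cast}} are standard DataFrame APIs):

{code}
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.functions import col

# Cast the float predictions to double so RegressionEvaluator accepts them;
# this helps for a standalone evaluation, but not inside a Pipeline/CrossValidator.
predictions = model.transform(validation) \
    .withColumn('prediction', col('prediction').cast('double'))

evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='rating')
validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 'rmse'})
{code}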




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10562) .partitionBy() creates the metastore partition columns in all lowercase, but persists the data path as MixedCase resulting in an error when the data is later attempted

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971042#comment-14971042
 ] 

Apache Spark commented on SPARK-10562:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9251

> .partitionBy() creates the metastore partition columns in all lowercase, but 
> persists the data path as MixedCase resulting in an error when the data is 
> later attempted to query.
> -
>
> Key: SPARK-10562
> URL: https://issues.apache.org/jira/browse/SPARK-10562
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Jason Pohl
>Assignee: Wenchen Fan
> Attachments: MixedCasePartitionBy.dbc
>
>
> When using DataFrame.write.partitionBy().saveAsTable() it creates the 
> partition-by columns in all lowercase in the metastore.  However, it writes 
> the data to the filesystem using mixed-case.
> This causes an error when running a select against the table.
> --
> from pyspark.sql import Row
> # Create a data frame with mixed case column names
> myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
>Row(Name="Frank Lampard", Goals=15, Year=2012)])
> myDF = sqlContext.createDataFrame(myRDD)
> # Write this data out to a parquet file and partition by the Year (which is a 
> mixedCase name)
> myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")
> %sql show create table chelsea_goals;
> --The metastore is showing a partition column name of all lowercase "year"
> # Verify that the data is written with appropriate partitions
> display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
> %sql
> --Now try to run a query against this table
> select * from chelsea_goals
> Error in SQL statement: UncheckedExecutionException: 
> java.lang.RuntimeException: Partition column year not found in schema 
> StructType(StructField(Goals,LongType,true), 
> StructField(Name,StringType,true), StructField(Year,LongType,true))
> # Now let's try this again using a lowercase column name
> myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
>  Row(Name="Frank Lampard", Goals=15, year=2012)])
> myDF2 = sqlContext.createDataFrame(myRDD2)
> myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
> %sql select * from chelsea_goals2;
> --Now everything works



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10947) With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970777#comment-14970777
 ] 

Apache Spark commented on SPARK-10947:
--

User 'stephend-realitymine' has created a pull request for this issue:
https://github.com/apache/spark/pull/9249

> With schema inference from JSON into a Dataframe, add option to infer all 
> primitive object types as strings
> ---
>
> Key: SPARK-10947
> URL: https://issues.apache.org/jira/browse/SPARK-10947
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Ewan Leith
>Priority: Minor
>
> Currently, when a schema is inferred from a JSON file using 
> sqlContext.read.json, the primitive object types are inferred as string, 
> long, boolean, etc.
> However, if the inferred type is too specific (JSON obviously does not 
> enforce types itself), this causes issues with merging dataframe schemas.
> Instead, we would like an option in the JSON inferField function to treat all 
> primitive objects as strings.
> We'll create and submit a pull request for this for review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)

2015-10-23 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970870#comment-14970870
 ] 

Nick Pentreath commented on SPARK-7008:
---

Is this now going into 1.6 (as per SPARK-10324)? If so, is there a PR? I 
cannot find a related one.

> An implementation of Factorization Machine (LibFM)
> --
>
> Key: SPARK-7008
> URL: https://issues.apache.org/jira/browse/SPARK-7008
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: zhengruifeng
>  Labels: features
> Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, 
> QQ20150421-2.png
>
>
> An implementation of Factorization Machines based on Scala and Spark MLlib.
> FM is a kind of machine learning algorithm for multi-linear regression, and 
> is widely used for recommendation.
> FM works well in recent years' recommendation competitions.
> Ref:
> http://libfm.org/
> http://doi.acm.org/10.1145/2168752.2168771
> http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11278) PageRank fails with unified memory manager

2015-10-23 Thread Nishkam Ravi (JIRA)
Nishkam Ravi created SPARK-11278:


 Summary: PageRank fails with unified memory manager
 Key: SPARK-11278
 URL: https://issues.apache.org/jira/browse/SPARK-11278
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.2, 1.6.0
Reporter: Nishkam Ravi


PageRank (6-nodes, 32GB input) runs very slow and eventually fails with 
ExecutorLostFailure. Traced it back to the 'unified memory manager' commit from 
Oct 13th. Took a quick look at the code and couldn't see the problem (changes 
look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to spot the 
problem quickly. Can be reproduced by running PageRank on a large enough input 
dataset if needed. Sorry for not being of much help here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11277) sort_array throws exception scala.MatchError

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970694#comment-14970694
 ] 

Apache Spark commented on SPARK-11277:
--

User 'jliwork' has created a pull request for this issue:
https://github.com/apache/spark/pull/9247

> sort_array throws exception scala.MatchError
> 
>
> Key: SPARK-11277
> URL: https://issues.apache.org/jira/browse/SPARK-11277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Jia Li
>Priority: Minor
>
> I was trying out the sort_array function then hit this exception. 
> I looked into the spark source code. I found the root cause is that 
> sort_array does not check for an array of NULLs. It's not meaningful to sort 
> an array of entirely NULLs anyway.
> I already have a fix for this issue and I'm going to create a pull request 
> for it. 
> scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show()
> scala.MatchError: ArrayType(NullType,true) (of class 
> org.apache.spark.sql.types.ArrayType)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   
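For illustration only (this is not the change in the pull request linked above): the root 
cause described here is the missing handling of an element type of NullType, which a guard 
along these lines would cover.

{code}
import org.apache.spark.sql.types._

// Hypothetical helper: decide whether an array type has a sortable element
// type before trying to look up an ordering for it.
def isSortableArray(dt: DataType): Boolean = dt match {
  case ArrayType(NullType, _) => false // an array of only NULLs has nothing to sort
  case ArrayType(_, _)        => true
  case _                      => false
}

// isSortableArray(ArrayType(NullType, containsNull = true))    // false
// isSortableArray(ArrayType(IntegerType, containsNull = true)) // true
{code}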



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11277) sort_array throws exception scala.MatchError

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11277:


Assignee: (was: Apache Spark)

> sort_array throws exception scala.MatchError
> 
>
> Key: SPARK-11277
> URL: https://issues.apache.org/jira/browse/SPARK-11277
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Linux
>Reporter: Jia Li
>Priority: Minor
>
> I was trying out the sort_array function and then hit this exception. 
> I looked into the Spark source code and found that the root cause is that 
> sort_array does not check for an array of NULLs. It's not meaningful to sort 
> an array consisting entirely of NULLs anyway.
> I already have a fix for this issue and I'm going to create a pull request 
> for it. 
> scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show()
> scala.MatchError: ArrayType(NullType,true) (of class 
> org.apache.spark.sql.types.ArrayType)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67)
>   at 
> org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111)
>   at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-23 Thread patcharee (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

patcharee closed SPARK-11087.
-
Resolution: Not A Problem

The predicate is indeed generated and can be found in the executor log

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query from the table with where clause>
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate should be generated (because of the where 
> clause) though
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_name            data_type   comment
>
> date                  int
> hh                    int
> x                     int
> y                     int
> height                float
> u                     float
> v                     float
> w                     float
> ph                    float
> phb                   float
> t                     float
> p                     float
> pb                    float
> qvapor                float
> qgraup                float
> qnice                 float
> qnrain                float
> tke_pbl               float
> el_pbl                float
> qcloud                float
>
> # Partition Information
> # col_name            data_type   comment
>
> zone                  int
> z                     int
> year                  int
> month                 int
>
> # Detailed Table Information
> Database:             default
> Owner:                patcharee
> CreateTime:           Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:       UNKNOWN
> Protect Mode:         None
> Retention:            0
> Location:             hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D
> Table Type:           EXTERNAL_TABLE
> Table Parameters:
>   EXTERNAL              TRUE
>   comment               this table is imported from rwf_data/*/wrf/*
>   last_modified_by      patcharee
>   last_modified_time    1439806692
>   orc.compress          ZLIB
>   transient_lastDdlTime 1439806692
>
> # Storage Information
> SerDe Library:        org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:          org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:         org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed:           No
> Num Buckets:          -1
> Bucket Columns:       []
> Sort Columns:         []
> Storage Desc Params:
>   serialization.format  1
> Time taken: 0.388 seconds, Fetched: 58 row(s)
> 
> Data was inserted into this table by another spark job>
> 

[jira] [Commented] (SPARK-9265) Dataframe.limit joined with another dataframe can be non-deterministic

2015-10-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970801#comment-14970801
 ] 

Yanbo Liang commented on SPARK-9265:


I'm working on it.

> Dataframe.limit joined with another dataframe can be non-deterministic
> --
>
> Key: SPARK-9265
> URL: https://issues.apache.org/jira/browse/SPARK-9265
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1
>Reporter: Tathagata Das
>Priority: Critical
>
> {code}
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> val recentFailures = table("failed_suites").cache()
> val topRecentFailures = 
> recentFailures.groupBy('suiteName).agg(count("*").as('failCount)).orderBy('failCount.desc).limit(10)
> topRecentFailures.show(100)
> val mot = topRecentFailures.as("a").join(recentFailures.as("b"), 
> $"a.suiteName" === $"b.suiteName")
>   
> (1 to 10).foreach { i => 
>   println(s"$i: " + mot.count())
> }
> {code}
> This shows.
> {code}
> ++-+
> |   suiteName|failCount|
> ++-+
> |org.apache.spark|   85|
> |org.apache.spark|   26|
> |org.apache.spark|   26|
> |org.apache.spark|   17|
> |org.apache.spark|   17|
> |org.apache.spark|   15|
> |org.apache.spark|   13|
> |org.apache.spark|   13|
> |org.apache.spark|   11|
> |org.apache.spark|9|
> ++-+
> 1: 174
> 2: 166
> 3: 174
> 4: 106
> 5: 158
> 6: 110
> 7: 174
> 8: 158
> 9: 166
> 10: 106
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11262:
--
Target Version/s:   (was: 1.6.0)
   Fix Version/s: (was: 1.5.1)

[~avulanov] don't set Fix/Target version please

> Unit test for gradient, loss layers, memory management for multilayer 
> perceptron
> 
>
> Key: SPARK-11262
> URL: https://issues.apache.org/jira/browse/SPARK-11262
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.5.1
>Reporter: Alexander Ulanov
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Multi-layer perceptron requires more rigorous tests and refactoring of layer 
> interfaces to accommodate development of new features.
> 1)Implement unit test for gradient and loss
> 2)Refactor the internal layer interface to extract "loss function" 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11267) NettyRpcEnv and sparkDriver services report the same port in the logs

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11267:
--
Component/s: Spark Core

> NettyRpcEnv and sparkDriver services report the same port in the logs
> -
>
> Key: SPARK-11267
> URL: https://issues.apache.org/jira/browse/SPARK-11267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
> Environment: the version built from today's sources - Spark version 
> 1.6.0-SNAPSHOT
>Reporter: Jacek Laskowski
>Priority: Minor
>
> When starting {{./bin/spark-shell --conf spark.driver.port=}} Spark 
> reports two services - NettyRpcEnv and sparkDriver - using the same {{}} 
> port:
> {code}
> 15/10/22 23:09:32 INFO SparkContext: Running Spark version 1.6.0-SNAPSHOT
> 15/10/22 23:09:32 INFO SparkContext: Spark configuration:
> spark.app.name=Spark shell
> spark.driver.port=
> spark.home=/Users/jacek/dev/oss/spark
> spark.jars=
> spark.logConf=true
> spark.master=local[*]
> spark.repl.class.uri=http://192.168.1.4:52645
> spark.submit.deployMode=client
> ...
> 15/10/22 23:09:33 INFO Utils: Successfully started service 'NettyRpcEnv' on 
> port .
> ...
> 15/10/22 23:09:33 INFO Utils: Successfully started service 'sparkDriver' on 
> port .
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11270) Add improved equality testing for TopicAndPartition from the Kafka Streaming API

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11270:
--
Target Version/s:   (was: 1.5.1)
   Fix Version/s: (was: 1.5.1)

[~manygrams] have a look at 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark  Among 
other things, don't set Target/Fix version.

> Add improved equality testing for TopicAndPartition from the Kafka Streaming 
> API
> 
>
> Key: SPARK-11270
> URL: https://issues.apache.org/jira/browse/SPARK-11270
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Streaming
>Affects Versions: 1.5.1
>Reporter: Nick Evans
>Priority: Minor
>
> Hey, sorry, new to contributing to Spark! Let me know if I'm doing anything 
> wrong.
> This issue is in relation to equality testing of a TopicAndPartition object. 
> It allows you to test that the topics and partitions of two of these objects 
> are equal, as opposed to checking that the two objects are the same instance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate

2015-10-23 Thread patcharee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970786#comment-14970786
 ] 

patcharee commented on SPARK-11087:
---

[~zzhan] I found the predicate generated in the executor log for the case using 
dataframe (not hiveContext.sql). Sorry for my mistake, and thanks for your help!
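For reference, a minimal sketch of the DataFrame-based variant mentioned in this comment, 
assuming the data is read as ORC files directly from the table location shown in the schema 
below (not necessarily the exact code that was used):

{code}
// With spark.sql.orc.filterPushdown enabled, the pushdown predicate shows up
// in the executor logs when filtering through the DataFrame API.
hiveContext.setConf("spark.sql.orc.filterPushdown", "true")

val df = hiveContext.read.format("orc")
  .load("hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D")

df.filter(df("zone") === 2 && df("x") === 320 && df("y") === 117)
  .select("u", "v")
  .show()
{code}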

> spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
> -
>
> Key: SPARK-11087
> URL: https://issues.apache.org/jira/browse/SPARK-11087
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: orc file version 0.12 with HIVE_8732
> hive version 1.2.1.2.3.0.0-2557
>Reporter: patcharee
>Priority: Minor
>
> I have an external hive table stored as partitioned orc file (see the table 
> schema below). I tried to query from the table with where clause>
> hiveContext.setConf("spark.sql.orc.filterPushdown", "true")
> hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = 
> 117")). 
> But from the log file with debug logging level on, the ORC pushdown predicate 
> was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I 
> expected the ORC pushdown predicate should be generated (because of the where 
> clause) though
> Table schema
> 
> hive> describe formatted 4D;
> OK
> # col_name            data_type   comment
>
> date                  int
> hh                    int
> x                     int
> y                     int
> height                float
> u                     float
> v                     float
> w                     float
> ph                    float
> phb                   float
> t                     float
> p                     float
> pb                    float
> qvapor                float
> qgraup                float
> qnice                 float
> qnrain                float
> tke_pbl               float
> el_pbl                float
> qcloud                float
>
> # Partition Information
> # col_name            data_type   comment
>
> zone                  int
> z                     int
> year                  int
> month                 int
>
> # Detailed Table Information
> Database:             default
> Owner:                patcharee
> CreateTime:           Thu Jul 09 16:46:54 CEST 2015
> LastAccessTime:       UNKNOWN
> Protect Mode:         None
> Retention:            0
> Location:             hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D
> Table Type:           EXTERNAL_TABLE
> Table Parameters:
>   EXTERNAL              TRUE
>   comment               this table is imported from rwf_data/*/wrf/*
>   last_modified_by      patcharee
>   last_modified_time    1439806692
>   orc.compress          ZLIB
>   transient_lastDdlTime 1439806692
>
> # Storage Information
> SerDe Library:        org.apache.hadoop.hive.ql.io.orc.OrcSerde
> InputFormat:          org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
> OutputFormat:         org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
> Compressed:           No
> Num Buckets:          -1
> Bucket Columns:       []
> Sort Columns:         []
> Storage Desc Params:
>   serialization.format  1
> Time taken: 0.388 seconds, Fetched: 58 row(s)
> 

[jira] [Commented] (SPARK-11279) Add DataFrame#toDF in PySpark

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970705#comment-14970705
 ] 

Apache Spark commented on SPARK-11279:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9248

> Add DataFrame#toDF in PySpark
> -
>
> Key: SPARK-11279
> URL: https://issues.apache.org/jira/browse/SPARK-11279
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11259) Params.validateParams() should be called automatically

2015-10-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11259:

Description: 
Params.validateParams() is not called automatically at the moment. For example, the 
following code snippet does not throw an exception, which is not the expected behaviour.
{code}
val df = sqlContext.createDataFrame(
  Seq(
(1, Vectors.dense(0.0, 1.0, 4.0), 1.0),
(2, Vectors.dense(1.0, 0.0, 4.0), 2.0),
(3, Vectors.dense(1.0, 0.0, 5.0), 3.0),
(4, Vectors.dense(0.0, 0.0, 5.0), 4.0))
).toDF("id", "features", "label")

val scaler = new MinMaxScaler()
 .setInputCol("features")
 .setOutputCol("features_scaled")
 .setMin(10)
 .setMax(0)
val pipeline = new Pipeline().setStages(Array(scaler))
pipeline.fit(df)
{code}
validateParams() should be called by 
PipelineStage(Pipeline/Estimator/Transformer) automatically, so I propose to 
put it in transformSchema(). 

  was:
Params.validateParams() not be called automatically currently. Such as the 
following code snippet will not throw exception which is not as expected.
{code}
val df = sqlContext.createDataFrame(
  Seq(
(1, Vectors.dense(0.0, 1.0, 4.0), 1.0),
(2, Vectors.dense(1.0, 0.0, 4.0), 2.0),
(3, Vectors.dense(1.0, 0.0, 5.0), 3.0),
(4, Vectors.dense(0.0, 0.0, 5.0), 4.0))
).toDF("id", "features", "label")

val scaler = new MinMaxScaler()
 .setInputCol("features")
 .setOutputCol("features_scaled")
 .setMin(10)
 .setMax(0)
val pipeline = new Pipeline().setStages(Array(scaler))
pipeline.fit(df)
{code}
validateParams() should be called by 
PipelineStage(Pipeline/Estimator/Transformer) automatically, so I propose to 
put it in transformSchema().


> Params.validateParams() should be called automatically
> --
>
> Key: SPARK-11259
> URL: https://issues.apache.org/jira/browse/SPARK-11259
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> Params.validateParams() is not called automatically at the moment. For example, 
> the following code snippet does not throw an exception, which is not the expected 
> behaviour.
> {code}
> val df = sqlContext.createDataFrame(
>   Seq(
> (1, Vectors.dense(0.0, 1.0, 4.0), 1.0),
> (2, Vectors.dense(1.0, 0.0, 4.0), 2.0),
> (3, Vectors.dense(1.0, 0.0, 5.0), 3.0),
> (4, Vectors.dense(0.0, 0.0, 5.0), 4.0))
> ).toDF("id", "features", "label")
> val scaler = new MinMaxScaler()
>  .setInputCol("features")
>  .setOutputCol("features_scaled")
>  .setMin(10)
>  .setMax(0)
> val pipeline = new Pipeline().setStages(Array(scaler))
> pipeline.fit(df)
> {code}
> validateParams() should be called by 
> PipelineStage(Pipeline/Estimator/Transformer) automatically, so I propose to 
> put it in transformSchema(). 
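A small hedged sketch of the behaviour being asked for: today the check only fires if 
validateParams() is invoked explicitly, e.g. with the MinMaxScaler from the snippet above 
(illustrative only, assuming the 1.5 ML API):

{code}
import org.apache.spark.ml.feature.MinMaxScaler

// Calling validateParams() by hand surfaces the bad min/max configuration
// that the Pipeline in the snippet above silently accepts.
val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("features_scaled")
  .setMin(10)
  .setMax(0)

scaler.validateParams()  // expected to throw here, since min >= max
{code}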



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth

2015-10-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970887#comment-14970887
 ] 

Yanbo Liang commented on SPARK-6724:


[~MeethuMathew] I will take over this task and send a PR; you are welcome to comment 
on it.

> Model import/export for FPGrowth
> 
>
> Key: SPARK-6724
> URL: https://issues.apache.org/jira/browse/SPARK-6724
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Note: experimental model API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11279) Add DataFrame#toDF in PySpark

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11279:


Assignee: (was: Apache Spark)

> Add DataFrame#toDF in PySpark
> -
>
> Key: SPARK-11279
> URL: https://issues.apache.org/jira/browse/SPARK-11279
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11279) Add DataFrame#toDF in PySpark

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11279:


Assignee: Apache Spark

> Add DataFrame#toDF in PySpark
> -
>
> Key: SPARK-11279
> URL: https://issues.apache.org/jira/browse/SPARK-11279
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11280) Mesos cluster deployment using only one node

2015-10-23 Thread Iulian Dragos (JIRA)
Iulian Dragos created SPARK-11280:
-

 Summary: Mesos cluster deployment using only one node
 Key: SPARK-11280
 URL: https://issues.apache.org/jira/browse/SPARK-11280
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.5.1, 1.6.0
Reporter: Iulian Dragos


I submit the SparkPi example in Mesos cluster mode, and I notice that all tasks 
fail except the ones that run on the same node as the driver. The others fail 
with

{code}
sh: 1: 
/tmp/mesos/slaves/1521e408-d8fe-416d-898b-3801e73a8293-S0/frameworks/1521e408-d8fe-416d-898b-3801e73a8293-0003/executors/driver-20151023113121-0006/runs/2abefd29-7386-4d81-a025-9d794780db23/spark-1.5.0-bin-hadoop2.6/bin/spark-class:
 not found
{code}

The path exists only on the machine that launched the driver, and the sandbox 
of the executor where this task died is completely empty.

I launch the task like this:

{code}
 $ spark-submit --deploy-mode cluster --master mesos://sagitarius.local:7077 
--conf 
spark.executor.uri="ftp://sagitarius.local/ftp/spark-1.5.0-bin-hadoop2.6.tgz; 
--conf spark.mesos.coarse=true --class org.apache.spark.examples.SparkPi 
ftp://sagitarius.local/ftp/spark-examples-1.5.0-hadoop2.6.0.jar
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/10/23 11:31:21 INFO RestSubmissionClient: Submitting a request to launch an 
application in mesos://sagitarius.local:7077.
15/10/23 11:31:21 INFO RestSubmissionClient: Submission successfully created as 
driver-20151023113121-0006. Polling submission state...
15/10/23 11:31:21 INFO RestSubmissionClient: Submitting a request for the 
status of submission driver-20151023113121-0006 in 
mesos://sagitarius.local:7077.
15/10/23 11:31:21 INFO RestSubmissionClient: State of driver 
driver-20151023113121-0006 is now QUEUED.
15/10/23 11:31:21 INFO RestSubmissionClient: Server responded with 
CreateSubmissionResponse:
{
  "action" : "CreateSubmissionResponse",
  "serverSparkVersion" : "1.5.0",
  "submissionId" : "driver-20151023113121-0006",
  "success" : true
}
{code}

I can see the driver in the Dispatcher UI and the job succeeds eventually, but it 
runs only on the node where the driver was launched (see attachment).





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps

2015-10-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970783#comment-14970783
 ] 

Sean Owen commented on SPARK-11016:
---

NB: the resolution here may be to simply remove usage of roaringbitmaps: 
https://github.com/apache/spark/pull/9243

> Spark fails when running with a task that requires a more recent version of 
> RoaringBitmaps
> --
>
> Key: SPARK-11016
> URL: https://issues.apache.org/jira/browse/SPARK-11016
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0
>Reporter: Charles Allen
>
> The following error appears during Kryo init whenever a more recent version 
> (>0.5.0) of Roaring bitmaps is required by a job. 
> org/roaringbitmap/RoaringArray$Element was removed in 0.5.0
> {code}
> A needed class was not found. This could be due to an error in your runpath. 
> Missing class: org/roaringbitmap/RoaringArray$Element
> java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338)
>   at 
> org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala)
>   at 
> org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237)
>   at 
> org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222)
>   at 
> org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102)
>   at 
> org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85)
>   at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>   at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
>   at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006)
>   at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818)
>   at 
> org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.SparkContext.withScope(SparkContext.scala:700)
>   at org.apache.spark.SparkContext.textFile(SparkContext.scala:816)
> {code}
> See https://issues.apache.org/jira/browse/SPARK-5949 for related info



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-11229) NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0

2015-10-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-11229:
---

"Fixed" implies there was a change attached to this JIRA that resolved the 
issue, and we don't have that here. If it were probably resolved by another 
JIRA, "duplicate" would be appropriate. Otherwise, *shrug* doesn't really 
matter but "cannot reproduce" is maybe most accurate. 

> NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0
> -
>
> Key: SPARK-11229
> URL: https://issues.apache.org/jira/browse/SPARK-11229
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
> Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux
>Reporter: Romi Kuntsman
>
> Steps to reproduce:
> 1. set spark.shuffle.memoryFraction=0
> 2. load dataframe from parquet file
> 3. see it's read correctly by calling dataframe.show()
> 4. call dataframe.count()
> Expected behaviour:
> get count of rows in dataframe
> OR, if memoryFraction=0 is an invalid setting, get notified about it
> Actual behaviour:
> CatalystReadSupport doesn't read the schema (even thought there is one) and 
> then there's a NullPointerException.
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at 
> org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:904)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:177)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
>   at 
> org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903)
>   at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384)
>   at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402)
>   ... 14 more
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:194)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:192)
>   at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368)
>   at 
> 
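A minimal sketch of the reproduction steps listed above (the parquet path and app name are 
hypothetical):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("spark-11229-repro")
  .set("spark.shuffle.memoryFraction", "0")        // step 1

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

val df = sqlContext.read.parquet("/path/to/data.parquet")  // step 2
df.show()    // step 3: rows are read and printed correctly
df.count()   // step 4: NPE in JoinedRow.isNullAt as in the stack trace above
{code}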

[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode

2015-10-23 Thread Martin Tapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971416#comment-14971416
 ] 

Martin Tapp commented on SPARK-4940:


No real workaround for now, as we need the round-robin strategy. You can beef up 
the executors' allocated memory to prevent OOMs.

Mesos features are catching up with YARN on some fronts, but Mesos offers better 
Docker support and is more general-purpose for maximizing cluster resource usage.
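A sketch of the memory workaround mentioned above (values and the app name are illustrative 
only):

{code}
import org.apache.spark.SparkConf

// Coarse-grained Mesos mode with more per-executor memory headroom to avoid
// OOMs until a more even allocation strategy is available.
val conf = new SparkConf()
  .setAppName("my-app")                     // hypothetical app name
  .set("spark.mesos.coarse", "true")
  .set("spark.executor.memory", "8g")       // illustrative value
{code}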

> Support more evenly distributing cores for Mesos mode
> -
>
> Key: SPARK-4940
> URL: https://issues.apache.org/jira/browse/SPARK-4940
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Reporter: Timothy Chen
> Attachments: mesos-config-difference-3nodes-vs-2nodes.png
>
>
> Currently, in coarse-grained mode, the Spark scheduler simply takes all the 
> resources it can on each node, which can cause uneven distribution depending on 
> the resources available on each slave.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11286) Make Outbox stopped exception singleton

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11286:


Assignee: (was: Apache Spark)

> Make Outbox stopped exception singleton
> ---
>
> Key: SPARK-11286
> URL: https://issues.apache.org/jira/browse/SPARK-11286
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Trivial
>
> In two places in Outbox.scala , new SparkException is created for Outbox 
> stopped condition.
> Create a singleton for Outbox stopped exception and use it instead of 
> creating exception every time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11286) Make Outbox stopped exception singleton

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11286:


Assignee: Apache Spark

> Make Outbox stopped exception singleton
> ---
>
> Key: SPARK-11286
> URL: https://issues.apache.org/jira/browse/SPARK-11286
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Assignee: Apache Spark
>Priority: Trivial
>
> In two places in Outbox.scala , new SparkException is created for Outbox 
> stopped condition.
> Create a singleton for Outbox stopped exception and use it instead of 
> creating exception every time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11286) Make Outbox stopped exception singleton

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971307#comment-14971307
 ] 

Apache Spark commented on SPARK-11286:
--

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/9254

> Make Outbox stopped exception singleton
> ---
>
> Key: SPARK-11286
> URL: https://issues.apache.org/jira/browse/SPARK-11286
> Project: Spark
>  Issue Type: Improvement
>Reporter: Ted Yu
>Priority: Trivial
>
> In two places in Outbox.scala , new SparkException is created for Outbox 
> stopped condition.
> Create a singleton for Outbox stopped exception and use it instead of 
> creating exception every time.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11162) Allow enabling debug logging from the command line

2015-10-23 Thread Ryan Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971361#comment-14971361
 ] 

Ryan Williams commented on SPARK-11162:
---

Do you know how I might enable debug logging with -D flags?


> Allow enabling debug logging from the command line
> --
>
> Key: SPARK-11162
> URL: https://issues.apache.org/jira/browse/SPARK-11162
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> Per [~vanzin] on [the user 
> list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html],
>  it would be nice if debug-logging could be enabled from the command line.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11278) PageRank fails with unified memory manager

2015-10-23 Thread Michael Malak (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Malak updated SPARK-11278:
--
Component/s: GraphX

> PageRank fails with unified memory manager
> --
>
> Key: SPARK-11278
> URL: https://issues.apache.org/jira/browse/SPARK-11278
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Nishkam Ravi
>
> PageRank (6 nodes, 32GB input) runs very slowly and eventually fails with 
> ExecutorLostFailure. Traced it back to the 'unified memory manager' commit 
> from Oct 13th. Took a quick look at the code and couldn't see the problem 
> (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to 
> spot the problem quickly. Can be reproduced by running PageRank on a large 
> enough input dataset if needed. Sorry for not being of much help here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11287) Executing deploy.client TestClient fails with bad class name

2015-10-23 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-11287:


 Summary: Executing deploy.client TestClient fails with bad class 
name
 Key: SPARK-11287
 URL: https://issues.apache.org/jira/browse/SPARK-11287
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.5.1
Reporter: Bryan Cutler
Priority: Trivial


Execution of deploy.client.TestClient creates an ApplicationDescription to 
start a TestExecutor which fails due to a bad class name.  

Currently it is "spark.deploy.client.TestExecutor" but should be 
"org.apache.spark.deploy.client.TestExecutor".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11287) Executing deploy.client TestClient fails with bad class name

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11287:


Assignee: (was: Apache Spark)

> Executing deploy.client TestClient fails with bad class name
> 
>
> Key: SPARK-11287
> URL: https://issues.apache.org/jira/browse/SPARK-11287
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Bryan Cutler
>Priority: Trivial
>
> Execution of deploy.client.TestClient creates an ApplicationDescription to 
> start a TestExecutor which fails due to a bad class name.  
> Currently it is "spark.deploy.client.TestExecutor" but should be 
> "org.apache.spark.deploy.client.TestExecutor".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11287) Executing deploy.client TestClient fails with bad class name

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971454#comment-14971454
 ] 

Apache Spark commented on SPARK-11287:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/9255

> Executing deploy.client TestClient fails with bad class name
> 
>
> Key: SPARK-11287
> URL: https://issues.apache.org/jira/browse/SPARK-11287
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Bryan Cutler
>Priority: Trivial
>
> Execution of deploy.client.TestClient creates an ApplicationDescription to 
> start a TestExecutor which fails due to a bad class name.  
> Currently it is "spark.deploy.client.TestExecutor" but should be 
> "org.apache.spark.deploy.client.TestExecutor".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11287) Executing deploy.client TestClient fails with bad class name

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11287:


Assignee: Apache Spark

> Executing deploy.client TestClient fails with bad class name
> 
>
> Key: SPARK-11287
> URL: https://issues.apache.org/jira/browse/SPARK-11287
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Trivial
>
> Execution of deploy.client.TestClient creates an ApplicationDescription to 
> start a TestExecutor which fails due to a bad class name.  
> Currently it is "spark.deploy.client.TestExecutor" but should be 
> "org.apache.spark.deploy.client.TestExecutor".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11286) Make Outbox stopped exception singleton

2015-10-23 Thread Ted Yu (JIRA)
Ted Yu created SPARK-11286:
--

 Summary: Make Outbox stopped exception singleton
 Key: SPARK-11286
 URL: https://issues.apache.org/jira/browse/SPARK-11286
 Project: Spark
  Issue Type: Improvement
Reporter: Ted Yu
Priority: Trivial


In two places in Outbox.scala , new SparkException is created for Outbox 
stopped condition.

Create a singleton for Outbox stopped exception and use it instead of creating 
exception every time.
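A minimal sketch of what the ticket proposes (the object and field names here are 
hypothetical, not the names in Outbox.scala):

{code}
import org.apache.spark.SparkException

// Hypothetical sketch: one shared exception instance for the "stopped"
// condition instead of allocating a new SparkException at each call site.
// (Note that a shared instance carries a single, pre-captured stack trace.)
object OutboxStoppedException {
  val instance = new SparkException("Outbox is stopped")
}
{code}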



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10975) Shuffle files left behind on Mesos without dynamic allocation

2015-10-23 Thread Iulian Dragos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971268#comment-14971268
 ] 

Iulian Dragos commented on SPARK-10975:
---

No, it's not a duplicate, but fixed by the same PR :)

> Shuffle files left behind on Mesos without dynamic allocation
> -
>
> Key: SPARK-10975
> URL: https://issues.apache.org/jira/browse/SPARK-10975
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.5.1
>Reporter: Iulian Dragos
>Priority: Blocker
>
> (from mailing list)
> Running on Mesos in coarse-grained mode. No dynamic allocation or shuffle 
> service. 
> I see that there are two types of temporary files under /tmp folder 
> associated with every executor: /tmp/spark- and /tmp/blockmgr-. 
> When job is finished /tmp/spark- is gone, but blockmgr directory is 
> left with all gigabytes in it. 
> The reason is that logic to clean up files is only enabled when the shuffle 
> service is running, see https://github.com/apache/spark/pull/7820
> The shuffle files should be placed in the Mesos sandbox or under `tmp/spark` 
> unless the shuffle service is enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11258) Converting a Spark DataFrame into an R data.frame is slow / requires a lot of memory

2015-10-23 Thread Frank Rosner (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Frank Rosner updated SPARK-11258:
-
Description: 
h4. Problem

We tried to collect a DataFrame with > 1 million rows and a few hundred columns 
in SparkR. This took a huge amount of time (much more than in the Spark REPL). 
When looking into the code, I found that the 
{{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method does some map and then 
{{.toArray}} which might cause the problem.

h4. Solution

Directly transpose the row wise representation to the column wise 
representation with one pass through the data. I will create a pull request for 
this.

h4. Runtime comparison

On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
method takes 2267 ms on average to complete, while my implementation takes only 
554 ms on average. This effect might be due to garbage collection, especially if 
you consider that the old implementation didn't complete on an even bigger data 
frame.

  was:
h4. Problem

We tried to collect a DataFrame with > 1 million rows and a few hundred columns 
in SparkR. This took a huge amount of time (much more than in the Spark REPL). 
When looking into the code, I found that the 
{{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method does some map and then 
{{.toArray}} which might cause the problem.

h4. Solution

Directly transpose the row wise representation to the column wise 
representation with one pass through the data. I will create a pull request for 
this.

h4. Runtime comparison

On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
method takes average 2267 ms to complete. My implementation takes only 554 ms 
on average. This effect gets even bigger, the more columns you have.


> Converting a Spark DataFrame into an R data.frame is slow / requires a lot of 
> memory
> 
>
> Key: SPARK-11258
> URL: https://issues.apache.org/jira/browse/SPARK-11258
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Frank Rosner
>
> h4. Problem
> We tried to collect a DataFrame with > 1 million rows and a few hundred 
> columns in SparkR. This took a huge amount of time (much more than in the 
> Spark REPL). When looking into the code, I found that the 
> {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method does some map and 
> then {{.toArray}} which might cause the problem.
> h4. Solution
> Directly transpose the row wise representation to the column wise 
> representation with one pass through the data. I will create a pull request 
> for this.
> h4. Runtime comparison
> On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} 
> method takes 2267 ms on average to complete, while my implementation takes only 
> 554 ms on average. This effect might be due to garbage collection, especially if 
> you consider that the old implementation didn't complete on an even bigger data 
> frame.
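A minimal sketch of the single-pass transpose described above (illustrative only, not the 
code in the pull request):

{code}
import org.apache.spark.sql.Row

// Turn a row-wise Array[Row] into a column-wise Array[Array[Any]] in one pass,
// instead of building each column separately with a map followed by .toArray.
def rowsToColumns(rows: Array[Row], numCols: Int): Array[Array[Any]] = {
  val cols = Array.fill(numCols)(new Array[Any](rows.length))
  var i = 0
  while (i < rows.length) {
    val row = rows(i)
    var j = 0
    while (j < numCols) {
      cols(j)(i) = row(j)
      j += 1
    }
    i += 1
  }
  cols
}
{code}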



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11162) Allow enabling debug logging from the command line

2015-10-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971438#comment-14971438
 ] 

Sean Owen commented on SPARK-11162:
---

Hm, can you set logger levels with syntax like {{-Dlog4j.logger.com.foo=WARN}}? 
That's what I'm thinking of, at least. I know you can specify a config file 
this way.
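For what it's worth, a programmatic workaround (a sketch of the non-command-line route, not 
an answer to the -D question above):

{code}
import org.apache.log4j.{Level, Logger}

// Raise log verbosity for Spark's packages from code rather than from a flag.
Logger.getLogger("org.apache.spark").setLevel(Level.DEBUG)

// Recent Spark versions also have a helper on an existing SparkContext:
// sc.setLogLevel("DEBUG")
{code}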

> Allow enabling debug logging from the command line
> --
>
> Key: SPARK-11162
> URL: https://issues.apache.org/jira/browse/SPARK-11162
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Ryan Williams
>Priority: Minor
>
> Per [~vanzin] on [the user 
> list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html],
>  it would be nice if debug-logging could be enabled from the command line.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11278) PageRank fails with unified memory manager

2015-10-23 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971442#comment-14971442
 ] 

Andrew Or commented on SPARK-11278:
---

are there any exceptions in the executor logs? Does the problem go away if you 
run it again with `spark.memory.useLegacyMode = true`?
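For reference, a sketch of how that legacy-mode check could be run (the configuration key 
comes from the comment above; the rest is illustrative):

{code}
import org.apache.spark.SparkConf

// Re-run the same PageRank job with the pre-unified memory manager to see
// whether the regression follows the unified memory manager.
val conf = new SparkConf()
  .setAppName("pagerank-legacy-memory-check")   // hypothetical name
  .set("spark.memory.useLegacyMode", "true")
{code}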

> PageRank fails with unified memory manager
> --
>
> Key: SPARK-11278
> URL: https://issues.apache.org/jira/browse/SPARK-11278
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX, Spark Core
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Nishkam Ravi
>
> PageRank (6 nodes, 32GB input) runs very slowly and eventually fails with 
> ExecutorLostFailure. Traced it back to the 'unified memory manager' commit 
> from Oct 13th. Took a quick look at the code and couldn't see the problem 
> (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to 
> spot the problem quickly. Can be reproduced by running PageRank on a large 
> enough input dataset if needed. Sorry for not being of much help here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11288) Specify the return type for UDF in Scala

2015-10-23 Thread Davies Liu (JIRA)
Davies Liu created SPARK-11288:
--

 Summary: Specify the return type for UDF in Scala
 Key: SPARK-11288
 URL: https://issues.apache.org/jira/browse/SPARK-11288
 Project: Spark
  Issue Type: New Feature
Reporter: Davies Liu


The return type is inferred from the function signature, which may not be what the 
user wants; for example, the default DecimalType is (38, 18) while the user may want 
(38, 0).

The older, deprecated callUDF could do that, so we should figure out a way to 
support it.

cc [~marmbrus]
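A small illustration of the inference being described (assumed behaviour of the Scala udf 
helper; names are illustrative):

{code}
import org.apache.spark.sql.functions.udf

// The Catalyst return type is derived from the Scala result type alone:
// a BigDecimal-returning function gets the default DecimalType(38, 18),
// and there is currently no argument for requesting, say, DecimalType(38, 0).
val parseDecimal = udf((s: String) => BigDecimal(s))

// df.select(parseDecimal(df("amount")))   // hypothetical usage
{code}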



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11294) Improve R doc for read.df, write.df, saveAsTable

2015-10-23 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-11294:
--
Assignee: Felix Cheung

> Improve R doc for read.df, write.df, saveAsTable
> 
>
> Key: SPARK-11294
> URL: https://issues.apache.org/jira/browse/SPARK-11294
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Assignee: Felix Cheung
>Priority: Minor
> Fix For: 1.5.2, 1.6.0
>
>
> API doc lacks example and has several formatting issues



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11294) Improve R doc for read.df, write.df, saveAsTable

2015-10-23 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-11294.
---
   Resolution: Fixed
Fix Version/s: 1.5.2
   1.6.0

Issue resolved by pull request 9261
[https://github.com/apache/spark/pull/9261]

> Improve R doc for read.df, write.df, saveAsTable
> 
>
> Key: SPARK-11294
> URL: https://issues.apache.org/jira/browse/SPARK-11294
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
> Fix For: 1.6.0, 1.5.2
>
>
> API doc lacks example and has several formatting issues



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11289) Substitute code examples in ML features with include_example

2015-10-23 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972345#comment-14972345
 ] 

Xusen Yin commented on SPARK-11289:
---

A feasible way to do it is to create new example files in spark/examples and move 
those code snippets from the docs there.

> Substitute code examples in ML features with include_example
> 
>
> Key: SPARK-11289
> URL: https://issues.apache.org/jira/browse/SPARK-11289
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Xusen Yin
>Priority: Minor
>
> Substitute code examples with include_example.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11294) Improve R doc for read.df, write.df, saveAsTable

2015-10-23 Thread Felix Cheung (JIRA)
Felix Cheung created SPARK-11294:


 Summary: Improve R doc for read.df, write.df, saveAsTable
 Key: SPARK-11294
 URL: https://issues.apache.org/jira/browse/SPARK-11294
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.5.1
Reporter: Felix Cheung
Priority: Minor


API doc lacks example and has several formatting issues



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1

2015-10-23 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972418#comment-14972418
 ] 

Sun Rui commented on SPARK-11255:
-

+1 for this request. Alternatively, we can update the supported R version to a 
newer one, but either way the version used in Jenkins should be the same as the 
lowest version that is claimed to be supported.

> R Test build should run on R 3.1.1
> --
>
> Key: SPARK-11255
> URL: https://issues.apache.org/jira/browse/SPARK-11255
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Minor
>
> Tests should run on R 3.1.1, which is the version listed as supported.
> Apparently there are a few R changes that can go undetected since the Jenkins 
> test build is running something newer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11294) Improve R doc for read.df, write.df, saveAsTable

2015-10-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11294:


Assignee: (was: Apache Spark)

> Improve R doc for read.df, write.df, saveAsTable
> 
>
> Key: SPARK-11294
> URL: https://issues.apache.org/jira/browse/SPARK-11294
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
>
> The API doc lacks examples and has several formatting issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11294) Improve R doc for read.df, write.df, saveAsTable

2015-10-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972330#comment-14972330
 ] 

Apache Spark commented on SPARK-11294:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/9261

> Improve R doc for read.df, write.df, saveAsTable
> 
>
> Key: SPARK-11294
> URL: https://issues.apache.org/jira/browse/SPARK-11294
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Felix Cheung
>Priority: Minor
>
> The API doc lacks examples and has several formatting issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11125) Unreadable exception when running spark-sql without building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set

2015-10-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-11125.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9134
[https://github.com/apache/spark/pull/9134]

> Unreadable exception when running spark-sql without building with 
> -Phive-thriftserver and SPARK_PREPEND_CLASSES is set
> --
>
> Key: SPARK-11125
> URL: https://issues.apache.org/jira/browse/SPARK-11125
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Priority: Minor
> Fix For: 1.6.0
>
>
> In a development environment where Spark is built without -Phive-thriftserver 
> and SPARK_PREPEND_CLASSES is set, the following exception is thrown:
> SparkSQLCliDriver can be loaded, but the Hive-related code cannot be loaded.
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/hadoop/hive/cli/CliDriver
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.cli.CliDriver
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   ... 21 more
> {code}
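
The improvement implied by the title is to detect the missing class early and report 
it in plain language instead of surfacing the raw NoClassDefFoundError. A minimal 
sketch of that idea, assuming a hypothetical helper and a plain IllegalStateException 
rather than whatever exception type the actual patch uses:

{code}
// Hypothetical pre-flight check, run before control is handed to SparkSQLCLIDriver.
object HiveThriftServerCheck {
  private val HiveCliClass = "org.apache.hadoop.hive.cli.CliDriver"

  /** Fails with a readable message when the build lacks the Hive thrift-server classes. */
  def requireHiveCli(): Unit = {
    try {
      // Probe only; do not initialize the class.
      Class.forName(HiveCliClass, false, Thread.currentThread().getContextClassLoader)
    } catch {
      case _: ClassNotFoundException | _: NoClassDefFoundError =>
        throw new IllegalStateException(
          s"$HiveCliClass was not found on the classpath. Build Spark with " +
            "-Phive and -Phive-thriftserver to run spark-sql, or unset " +
            "SPARK_PREPEND_CLASSES so the prebuilt assembly is used.")
    }
  }
}
{code}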



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11125) Unreadable exception when running spark-sql without building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set

2015-10-23 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-11125:
-
Assignee: Jeff Zhang

> Unreadable exception when running spark-sql without building with 
> -Phive-thriftserver and SPARK_PREPEND_CLASSES is set
> --
>
> Key: SPARK-11125
> URL: https://issues.apache.org/jira/browse/SPARK-11125
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Minor
> Fix For: 1.6.0
>
>
> In a development environment where Spark is built without -Phive-thriftserver 
> and SPARK_PREPEND_CLASSES is set, the following exception is thrown:
> SparkSQLCliDriver can be loaded, but the Hive-related code cannot be loaded.
> {code}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/apache/hadoop/hive/cli/CliDriver
>   at java.lang.ClassLoader.defineClass1(Native Method)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
>   at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   at java.lang.Class.forName0(Native Method)
>   at java.lang.Class.forName(Class.java:270)
>   at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.lang.ClassNotFoundException: 
> org.apache.hadoop.hive.cli.CliDriver
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>   ... 21 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript

2015-10-23 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-10971.
---
   Resolution: Fixed
Fix Version/s: 1.5.2
   1.6.0

Issue resolved by pull request 9179
[https://github.com/apache/spark/pull/9179]

> sparkR: RRunner should allow setting path to Rscript
> 
>
> Key: SPARK-10971
> URL: https://issues.apache.org/jira/browse/SPARK-10971
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
> Fix For: 1.6.0, 1.5.2
>
>
> I'm running Spark on YARN and trying to use R in cluster mode. RRunner seems 
> to just call Rscript and assumes it's on the PATH. But on our YARN deployment 
> R isn't installed on the nodes, so it needs to be distributed along with the 
> job, and we need the ability to point to where it gets installed. sparkR in 
> client mode has the config spark.sparkr.r.command to point to Rscript; 
> RRunner should have something similar so that it works in cluster mode.
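
A minimal sketch of the kind of lookup RRunner could perform instead of assuming 
Rscript is on the PATH, reusing the client-mode property named above; the helper 
object and the SPARKR_RSCRIPT environment fallback are hypothetical:

{code}
import org.apache.spark.SparkConf

// Hypothetical resolver for the R interpreter command used by RRunner.
object RscriptResolver {
  def resolve(conf: SparkConf): String = {
    conf.getOption("spark.sparkr.r.command")   // explicit override, e.g. a path distributed with the job
      .orElse(sys.env.get("SPARKR_RSCRIPT"))   // hypothetical environment fallback
      .getOrElse("Rscript")                    // last resort: rely on PATH
  }
}
{code}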



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


