[jira] [Resolved] (SPARK-11286) Make Outbox stopped exception singleton
[ https://issues.apache.org/jira/browse/SPARK-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu resolved SPARK-11286. Resolution: Won't Fix > Make Outbox stopped exception singleton > --- > > Key: SPARK-11286 > URL: https://issues.apache.org/jira/browse/SPARK-11286 > Project: Spark > Issue Type: Improvement >Reporter: Ted Yu >Priority: Trivial > > In two places in Outbox.scala, a new SparkException is created for the Outbox > stopped condition. > Create a singleton for the Outbox stopped exception and use it instead of > creating the exception every time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
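For context, a reusable exception instance on the JVM is typically a static field with fillInStackTrace() overridden, so the shared instance does not recapture a stack trace on every throw. The usual objection to the pattern is exactly that: every throw reports the same stale stack trace, which may be why this was closed as Won't Fix. A hypothetical Java sketch (not Spark's actual Outbox code):

```java
// Hypothetical sketch, not Spark's Outbox implementation.
public class OutboxDemo {
    // The singleton is created once; overriding fillInStackTrace() makes
    // throwing it cheap, but every throw shows the same (stale) trace.
    static final RuntimeException STOPPED =
        new RuntimeException("Outbox is stopped") {
            @Override
            public synchronized Throwable fillInStackTrace() {
                return this; // skip stack capture; the instance is shared
            }
        };

    static void send(boolean stopped) {
        if (stopped) {
            throw STOPPED; // same instance every time, no per-failure allocation
        }
    }

    public static void main(String[] args) {
        RuntimeException first = null, second = null;
        try { send(true); } catch (RuntimeException e) { first = e; }
        try { send(true); } catch (RuntimeException e) { second = e; }
        // Both catch blocks observed the one shared instance.
        System.out.println(first == second);
    }
}
```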
[jira] [Commented] (SPARK-11162) Allow enabling debug logging from the command line
[ https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971520#comment-14971520 ] Ryan Williams commented on SPARK-11162: --- Makes sense, some googling has left me with the impression that log4j doesn't support this, so I guess I'll just modify {{log4j.properties}} going forward, thanks. Answering my other question from earlier: modifying {{conf/log4j.properties}} seems to work in yarn-client mode; I guess I'd only tried {{$SPARK_HOME/log4j.properties}} previously. > Allow enabling debug logging from the command line > -- > > Key: SPARK-11162 > URL: https://issues.apache.org/jira/browse/SPARK-11162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > Per [~vanzin] on [the user > list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html], > it would be nice if debug-logging could be enabled from the command line.
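For reference, the workaround described above amounts to editing {{conf/log4j.properties}} (log4j 1.x syntax, which Spark 1.x ships); the package chosen for DEBUG below is just an example:

```properties
# conf/log4j.properties
# Root logger stays at INFO; DEBUG on the root is very verbose.
log4j.rootCategory=INFO, console

# Enable DEBUG only for a chosen package (example category shown).
log4j.logger.org.apache.spark.scheduler=DEBUG
```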
[jira] [Updated] (SPARK-11289) Substitute code examples in ML features with include_example
[ https://issues.apache.org/jira/browse/SPARK-11289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xusen Yin updated SPARK-11289: -- Description: Substitute code examples with include_example. (was: [~mengxr] I have one question, there are some code examples in the doc that does not exist in our example code dir. How to solve the problem? Should I add new examples in the examples/src/main/scala/org/apache/spark/examples to support those docs?) > Substitute code examples in ML features with include_example > > > Key: SPARK-11289 > URL: https://issues.apache.org/jira/browse/SPARK-11289 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xusen Yin >Priority: Minor > > Substitute code examples with include_example.
[jira] [Commented] (SPARK-11289) Substitute code examples in ML features with include_example
[ https://issues.apache.org/jira/browse/SPARK-11289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971536#comment-14971536 ] Xusen Yin commented on SPARK-11289: --- [~mengxr] I have one question: there are some code examples in the doc that do not exist in our example code dir. How should we solve this? Should I add new examples under examples/src/main/scala/org/apache/spark/examples to support those docs? > Substitute code examples in ML features with include_example > > > Key: SPARK-11289 > URL: https://issues.apache.org/jira/browse/SPARK-11289 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xusen Yin >Priority: Minor > > Substitute code examples with include_example.
[jira] [Commented] (SPARK-5210) Support log rolling in EventLogger
[ https://issues.apache.org/jira/browse/SPARK-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970670#comment-14970670 ] Apache Spark commented on SPARK-5210: - User 'XuTingjun' has created a pull request for this issue: https://github.com/apache/spark/pull/9246 > Support log rolling in EventLogger > -- > > Key: SPARK-5210 > URL: https://issues.apache.org/jira/browse/SPARK-5210 > Project: Spark > Issue Type: New Feature > Components: Spark Core, Web UI >Reporter: Josh Rosen > > For long-running Spark applications (e.g. running for days / weeks), the > Spark event log may grow to be very large. > As a result, it would be useful if EventLoggingListener supported log file > rolling / rotation. Adding this feature will involve changes to the > HistoryServer in order to be able to load event logs from a sequence of files > instead of a single file.
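The rolling requested above is standard size-based rotation. As a rough, hypothetical illustration in Java (not EventLoggingListener's design; the file naming and size limit are invented for the sketch):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class RollingLogDemo {
    // Hypothetical size-based roller: once the current file exceeds
    // maxBytes, subsequent writes go to the next numbered file.
    private final Path dir;
    private final long maxBytes;
    private int index = 0;

    RollingLogDemo(Path dir, long maxBytes) {
        this.dir = dir;
        this.maxBytes = maxBytes;
    }

    void append(String line) throws IOException {
        Path current = dir.resolve("events-" + index + ".log");
        if (Files.exists(current) && Files.size(current) > maxBytes) {
            index++; // roll to the next file in the sequence
            current = dir.resolve("events-" + index + ".log");
        }
        Files.write(current, (line + "\n").getBytes(),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("rolldemo");
        RollingLogDemo log = new RollingLogDemo(dir, 32); // tiny limit to force rolling
        for (int i = 0; i < 10; i++) {
            log.append("event " + i + " ......................");
        }
        // More than one file means the log rolled; a reader (as the
        // HistoryServer would need to) must stitch the sequence together.
        System.out.println(Files.list(dir).count() > 1);
    }
}
```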
[jira] [Created] (SPARK-11289) Substitute code examples in ML features with include_example
Xusen Yin created SPARK-11289: - Summary: Substitute code examples in ML features with include_example Key: SPARK-11289 URL: https://issues.apache.org/jira/browse/SPARK-11289 Project: Spark Issue Type: Sub-task Reporter: Xusen Yin Priority: Minor [~mengxr] I have one question: there are some code examples in the doc that do not exist in our example code dir. How should we solve this? Should I add new examples under examples/src/main/scala/org/apache/spark/examples to support those docs?
[jira] [Assigned] (SPARK-11277) sort_array throws exception scala.MatchError
[ https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11277: Assignee: Apache Spark > sort_array throws exception scala.MatchError > > > Key: SPARK-11277 > URL: https://issues.apache.org/jira/browse/SPARK-11277 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Linux >Reporter: Jia Li >Assignee: Apache Spark >Priority: Minor > > I was trying out the sort_array function then hit this exception. > I looked into the spark source code. I found the root cause is that > sort_array does not check for an array of NULLs. It's not meaningful to sort > an array of entirely NULLs anyway. > I already have a fix for this issue and I'm going to create a pull request > for it. > scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show() > scala.MatchError: ArrayType(NullType,true) (of class > org.apache.spark.sql.types.ArrayType) > at > org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68) > at > org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67) > at > org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341) > at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440) > at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226) 
> at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
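Independent of the Spark-internal fix above, the general JVM technique for ordering data that may contain NULLs is a null-safe comparator. A small self-contained Java illustration (not Spark code):

```java
import java.util.Arrays;
import java.util.Comparator;

public class NullSafeSortDemo {
    public static void main(String[] args) {
        // An array that is partly (or entirely) nulls, like the failing
        // sort_array(array(null, null)) case in the report above.
        Integer[] values = {null, 3, null, 1};

        // nullsFirst wraps the natural-order comparator so nulls sort
        // ahead of real values instead of throwing NullPointerException.
        Arrays.sort(values, Comparator.nullsFirst(Comparator.naturalOrder()));

        System.out.println(Arrays.toString(values)); // [null, null, 1, 3]
    }
}
```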
[jira] [Created] (SPARK-11279) Add DataFrame#toDF in PySpark
Jeff Zhang created SPARK-11279: -- Summary: Add DataFrame#toDF in PySpark Key: SPARK-11279 URL: https://issues.apache.org/jira/browse/SPARK-11279 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Jeff Zhang Priority: Minor
[jira] [Commented] (SPARK-10857) SQL injection bug in JdbcDialect.getTableExistsQuery()
[ https://issues.apache.org/jira/browse/SPARK-10857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970836#comment-14970836 ] Sean Owen commented on SPARK-10857: --- Rick, you're saying that this code path only comes up when the parser is certainly dealing with a table name, like in DDL statements, and not just in parsing "SELECT * from (table)"? (You probably know the code best here given you've studied it at close range.) > SQL injection bug in JdbcDialect.getTableExistsQuery() > -- > > Key: SPARK-10857 > URL: https://issues.apache.org/jira/browse/SPARK-10857 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: Rick Hillegas >Priority: Minor > > All of the implementations of this method involve constructing a query by > concatenating boilerplate text with a user-supplied name. This looks like a > SQL injection bug to me. > A better solution would be to call java.sql.DatabaseMetaData.getTables() to > implement this method, using the catalog and schema which are available from > Connection.getCatalog() and Connection.getSchema(). This would not work on > Java 6 because Connection.getSchema() was introduced in Java 7. However, the > solution would work for more modern JVMs. Limiting the vulnerability to > obsolete JVMs would at least be an improvement over the current situation. > Java 6 has been end-of-lifed and is not an appropriate platform for users who > are concerned about security.
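To make the concatenation risk concrete, a small self-contained Java sketch (illustrative only, not Spark's actual JdbcDialect code) shows how a crafted table name smuggles extra SQL into the generated query; the metadata-based alternative the report suggests is noted in the comments:

```java
public class TableExistsDemo {
    // Mirrors the concatenation pattern the issue describes (illustrative,
    // not Spark's exact code): the table name is pasted into the SQL text.
    static String getTableExistsQuery(String table) {
        return "SELECT 1 FROM " + table + " LIMIT 1";
    }

    public static void main(String[] args) {
        // A "table name" that carries an extra statement.
        String malicious = "t; DROP TABLE users; --";
        String query = getTableExistsQuery(malicious);
        System.out.println(query);
        // The injected statement survives intact inside the generated query,
        // which is the vulnerability. A metadata lookup such as
        // connection.getMetaData().getTables(catalog, schema, table, null)
        // passes the name as an API parameter rather than as SQL text.
        System.out.println(query.contains("DROP TABLE"));
    }
}
```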
[jira] [Resolved] (SPARK-11229) NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0
[ https://issues.apache.org/jira/browse/SPARK-11229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11229. --- Resolution: Cannot Reproduce Fix Version/s: (was: 1.6.0) > NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0 > - > > Key: SPARK-11229 > URL: https://issues.apache.org/jira/browse/SPARK-11229 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux >Reporter: Romi Kuntsman > > Steps to reproduce: > 1. set spark.shuffle.memoryFraction=0 > 2. load dataframe from parquet file > 3. see it's read correctly by calling dataframe.show() > 4. call dataframe.count() > Expected behaviour: > get count of rows in dataframe > OR, if memoryFraction=0 is an invalid setting, get notified about it > Actual behaviour: > CatalystReadSupport doesn't read the schema (even though there is one) and > then there's a NullPointerException. > Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:177) > at > org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) > at > org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384) > at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402) > ... 
14 more > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:194) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:192) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.start(TungstenAggregationIterator.scala:622) > at >
[jira] [Commented] (SPARK-11167) Incorrect type resolution on heterogeneous data structures
[ https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970839#comment-14970839 ] Maciej Szymkiewicz commented on SPARK-11167: spark-csv has a much simpler job to do and everything it does is already covered by basic R behavior. The tightest type here would most likely mean Any, which is neither allowed nor useful. I think the best solution in this case could be a warning when a data frame contains complex types and the user doesn't provide a schema. And maybe some tool which could replace debug.TypeCheck. Can anyone explain why it 'no longer applies in the new "Tungsten" world'? https://github.com/apache/spark/pull/8043 > Incorrect type resolution on heterogeneous data structures > -- > > Key: SPARK-11167 > URL: https://issues.apache.org/jira/browse/SPARK-11167 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Maciej Szymkiewicz > > If a structure contains heterogeneous data, the type of the first encountered > element is incorrectly assigned as the type of the whole structure. This problem affects both > lists: > {code} > SparkR:::infer_type(list(a=1, b="a")) > ## [1] "array" > SparkR:::infer_type(list(a="a", b=1)) > ## [1] "array" > {code} > and environments: > {code} > SparkR:::infer_type(as.environment(list(a=1, b="a"))) > ## [1] "map" > SparkR:::infer_type(as.environment(list(a="a", b=1))) > ## [1] "map " > {code} > This results in errors during data collection and other operations on > DataFrames: > {code} > ldf <- data.frame(row.names=1:2) > ldf$foo <- list(list("1", 2), list(3, 4)) > sdf <- createDataFrame(sqlContext, ldf) > collect(sdf) > ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID > 9) > ## scala.MatchError: 2.0 (of class java.lang.Double) > ## ... > {code}
[jira] [Commented] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster
[ https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971188#comment-14971188 ] Steve Loughran commented on SPARK-11265: Pull request is: https://github.com/apache/spark/pull/9232 > YarnClient can't get tokens to talk to Hive in a secure cluster > --- > > Key: SPARK-11265 > URL: https://issues.apache.org/jira/browse/SPARK-11265 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 > Environment: Kerberized Hadoop cluster >Reporter: Steve Loughran > > As reported on the dev list, trying to run a YARN client which wants to talk > to Hive in a Kerberized hadoop cluster fails. This appears to be because the > constructor of the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was > made private and replaced with a factory method. The YARN client uses > reflection to get the tokens, so the signature changes weren't picked up in > SPARK-8064.
[jira] [Resolved] (SPARK-10277) Add @since annotation to pyspark.mllib.regression
[ https://issues.apache.org/jira/browse/SPARK-10277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10277. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8684 [https://github.com/apache/spark/pull/8684] > Add @since annotation to pyspark.mllib.regression > - > > Key: SPARK-10277 > URL: https://issues.apache.org/jira/browse/SPARK-10277 > Project: Spark > Issue Type: Sub-task > Components: Documentation, MLlib, PySpark >Reporter: Xiangrui Meng >Assignee: Yu Ishikawa >Priority: Minor > Labels: starter > Fix For: 1.6.0
[jira] [Commented] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
[ https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971217#comment-14971217 ] Apache Spark commented on SPARK-7970: - User 'nitin2goyal' has created a pull request for this issue: https://github.com/apache/spark/pull/9253 > Optimize code for SQL queries fired on Union of RDDs (closure cleaner) > -- > > Key: SPARK-7970 > URL: https://issues.apache.org/jira/browse/SPARK-7970 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: Nitin Goyal > Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot > 2015-05-27 at 11.07.02 pm.png > > > Closure cleaner slows down the execution of Spark SQL queries fired on a union > of RDDs. The time increases linearly at the driver side with the number of RDDs > unioned. Refer to the following thread for more context: > http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html > As can be seen in the attached JProfiler screenshots, a lot of time is getting > consumed in the "getClassReader" method of ClosureCleaner and the rest in > "ensureSerializable" (at least in my case). > This can be fixed in two ways (as per my current understanding): > 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create > MapPartitionsRDD directly instead of doing rdd.mapPartitions, which calls the > ClosureCleaner clean method (See PR - > https://github.com/apache/spark/pull/6256). > 2. Fix at the Spark core level - > (i) Make the "checkSerializable" property driven in SparkContext's clean method > (ii) Somehow cache the class reader for the last 'n' classes
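Option 2(ii) above, caching the class reader for the last 'n' classes, is essentially an LRU cache. A generic Java sketch using LinkedHashMap's access order (hypothetical illustration, not the actual ClosureCleaner change):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache keyed by class name; values would be the parsed
// class readers in the ClosureCleaner scenario.
public class ClassReaderCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public ClassReaderCache(int maxEntries) {
        // accessOrder=true: iteration order is least-recently-used first.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict once we exceed the budget of "last n" entries.
        return size() > maxEntries;
    }

    public static void main(String[] args) {
        ClassReaderCache<String, String> cache = new ClassReaderCache<>(2);
        cache.put("A", "readerA");
        cache.put("B", "readerB");
        cache.get("A");            // touch A so B becomes least recently used
        cache.put("C", "readerC"); // exceeds the budget; evicts B
        System.out.println(cache.keySet()); // [A, C]
    }
}
```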
[jira] [Assigned] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
[ https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7970: --- Assignee: (was: Apache Spark) > Optimize code for SQL queries fired on Union of RDDs (closure cleaner) > -- > > Key: SPARK-7970 > URL: https://issues.apache.org/jira/browse/SPARK-7970 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: Nitin Goyal > Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot > 2015-05-27 at 11.07.02 pm.png > > > Closure cleaner slows down the execution of Spark SQL queries fired on a union > of RDDs. The time increases linearly at the driver side with the number of RDDs > unioned. Refer to the following thread for more context: > http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html > As can be seen in the attached JProfiler screenshots, a lot of time is getting > consumed in the "getClassReader" method of ClosureCleaner and the rest in > "ensureSerializable" (at least in my case). > This can be fixed in two ways (as per my current understanding): > 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create > MapPartitionsRDD directly instead of doing rdd.mapPartitions, which calls the > ClosureCleaner clean method (See PR - > https://github.com/apache/spark/pull/6256). > 2. Fix at the Spark core level - > (i) Make the "checkSerializable" property driven in SparkContext's clean method > (ii) Somehow cache the class reader for the last 'n' classes
[jira] [Assigned] (SPARK-7970) Optimize code for SQL queries fired on Union of RDDs (closure cleaner)
[ https://issues.apache.org/jira/browse/SPARK-7970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7970: --- Assignee: Apache Spark > Optimize code for SQL queries fired on Union of RDDs (closure cleaner) > -- > > Key: SPARK-7970 > URL: https://issues.apache.org/jira/browse/SPARK-7970 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 1.2.0, 1.3.0 >Reporter: Nitin Goyal >Assignee: Apache Spark > Attachments: Screen Shot 2015-05-27 at 11.01.03 pm.png, Screen Shot > 2015-05-27 at 11.07.02 pm.png > > > Closure cleaner slows down the execution of Spark SQL queries fired on a union > of RDDs. The time increases linearly at the driver side with the number of RDDs > unioned. Refer to the following thread for more context: > http://apache-spark-developers-list.1001551.n3.nabble.com/ClosureCleaner-slowing-down-Spark-SQL-queries-tt12466.html > As can be seen in the attached JProfiler screenshots, a lot of time is getting > consumed in the "getClassReader" method of ClosureCleaner and the rest in > "ensureSerializable" (at least in my case). > This can be fixed in two ways (as per my current understanding): > 1. Fix at the Spark SQL level - As pointed out by yhuai, we can create > MapPartitionsRDD directly instead of doing rdd.mapPartitions, which calls the > ClosureCleaner clean method (See PR - > https://github.com/apache/spark/pull/6256). > 2. Fix at the Spark core level - > (i) Make the "checkSerializable" property driven in SparkContext's clean method > (ii) Somehow cache the class reader for the last 'n' classes
[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971228#comment-14971228 ] Jerry Lam edited comment on SPARK-4940 at 10/23/15 4:01 PM: I just want to weigh in on the importance of this issue. My observation is that using coarse-grained mode, it is possible that if I configure total core max to 20, I could end up having ONE executor with 20 cores. This is not ideal when I have 5 slaves with 32 cores each. It would make more sense to have ONE executor per slave, each with 4 cores. Is there a workaround at this moment, using Spark 1.5.1, to make the load more evenly distributed on Mesos? How do people actually use Spark on Mesos when the resources are not distributed evenly? Thanks! was (Author: superwai): I just want to weight in the importance of this issue. My observation is that using coarse grained mode, it is possible that if I configure total core max to 20, I could end up having ONE executor with 20 cores. This is not ideal even I have 5 slaves with 32 cores each. It would makes more sense to have ONE executor per slave and each executor has 4 cores. Is there a workaround at this moment using Spark 1.5.1. to make load more evenly distributed on mesos. How people actually use spark on mesos when the resource is not distributed evenly? Thanks! > Support more evenly distributing cores for Mesos mode > - > > Key: SPARK-4940 > URL: https://issues.apache.org/jira/browse/SPARK-4940 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen > Attachments: mesos-config-difference-3nodes-vs-2nodes.png > > > Currently in coarse-grained mode the Spark scheduler simply takes all the > resources it can on each node, which can cause uneven distribution based on the > resources available on each slave.
[jira] [Assigned] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster
[ https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11265: Assignee: Apache Spark > YarnClient can't get tokens to talk to Hive in a secure cluster > --- > > Key: SPARK-11265 > URL: https://issues.apache.org/jira/browse/SPARK-11265 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 > Environment: Kerberized Hadoop cluster >Reporter: Steve Loughran >Assignee: Apache Spark > > As reported on the dev list, trying to run a YARN client which wants to talk > to Hive in a Kerberized hadoop cluster fails. This appears to be because the > constructor of the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was > made private and replaced with a factory method. The YARN client uses > reflection to get the tokens, so the signature changes weren't picked up in > SPARK-8064.
[jira] [Commented] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster
[ https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971190#comment-14971190 ] Apache Spark commented on SPARK-11265: -- User 'steveloughran' has created a pull request for this issue: https://github.com/apache/spark/pull/9232 > YarnClient can't get tokens to talk to Hive in a secure cluster > --- > > Key: SPARK-11265 > URL: https://issues.apache.org/jira/browse/SPARK-11265 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 > Environment: Kerberized Hadoop cluster >Reporter: Steve Loughran > > As reported on the dev list, trying to run a YARN client which wants to talk > to Hive in a Kerberized hadoop cluster fails. This appears to be because the > constructor of the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was > made private and replaced with a factory method. The YARN client uses > reflection to get the tokens, so the signature changes weren't picked up in > SPARK-8064.
[jira] [Assigned] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster
[ https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11265: Assignee: (was: Apache Spark) > YarnClient can't get tokens to talk to Hive in a secure cluster > --- > > Key: SPARK-11265 > URL: https://issues.apache.org/jira/browse/SPARK-11265 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 > Environment: Kerberized Hadoop cluster >Reporter: Steve Loughran > > As reported on the dev list, trying to run a YARN client which wants to talk > to Hive in a Kerberized hadoop cluster fails. This appears to be because the > constructor of the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was > made private and replaced with a factory method. The YARN client uses > reflection to get the tokens, so the signature changes weren't picked up in > SPARK-8064.
[jira] [Resolved] (SPARK-6723) Model import/export for ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6723. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 6785 [https://github.com/apache/spark/pull/6785] > Model import/export for ChiSqSelector > - > > Key: SPARK-6723 > URL: https://issues.apache.org/jira/browse/SPARK-6723 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > Fix For: 1.6.0
[jira] [Updated] (SPARK-10610) Using AppName instead of AppId in the name of all metrics
[ https://issues.apache.org/jira/browse/SPARK-10610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yi Tian updated SPARK-10610: Summary: Using AppName instead of AppId in the name of all metrics (was: Using AppName instead AppId in the name of all metrics) > Using AppName instead of AppId in the name of all metrics > - > > Key: SPARK-10610 > URL: https://issues.apache.org/jira/browse/SPARK-10610 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.5.0 >Reporter: Yi Tian >Priority: Minor > > When we use {{JMX}} to monitor a Spark system, we have to configure the names > of the target metrics in the monitoring system. But the current metric name is > {{appId}} + {{executorId}} + {{source}}, so when the Spark program > restarts, we have to update the metric names in the monitoring system. > We should add an optional configuration property to control whether to use the > appName instead of the appId in the Spark metrics system.
[jira] [Comment Edited] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971228#comment-14971228 ] Jerry Lam edited comment on SPARK-4940 at 10/23/15 4:16 PM: I just want to weigh in on the importance of this issue. My observation is that in coarse-grained mode, if I configure the total core max to 20, I could end up having ONE executor with 20 cores. This is not ideal when I have 5 slaves with 32 cores each; it would make more sense to have ONE executor per slave, each with 4 cores. It is very difficult to use (or perhaps impossible to use?) because an executor configured with 10GB of RAM could have 20 tasks or 1 task allocated to it (assuming 1 CPU per task). Say each task could use up to 2GB of RAM: that would be an OOM for 20 tasks (40GB required) and underutilization for 1 task (2GB required). Is there a workaround at the moment, using Spark 1.5.1, to make the load more evenly distributed on Mesos? How do people actually use Spark on Mesos when resources are not distributed evenly? Also, I notice that Spark has much better features on YARN. Does that mean it is better to run Spark on YARN than on Mesos? Thanks! was (Author: superwai): I just want to weight in the importance of this issue. My observation is that using coarse grained mode, it is possible that if I configure total core max to 20, I could end up having ONE executor with 20 cores. This is not ideal when I have 5 slaves with 32 cores each. It would makes more sense to have ONE executor per slave and each executor has 4 cores. It is very difficult to use because an executor configures with 10GB ram could have 20 tasks or 1 task allocated to it (assuming 1 cpu per task). Say each task could use up to 2GB of RAM, it would be a OOM for 20 tasks (40GB required) and underutilized for 1 task (2GB required). Is there a workaround at this moment using Spark 1.5.1. to make load more evenly distributed on mesos.
How people actually use spark on mesos when the resource is not distributed evenly? Also, I notice that there is much better features on Spark with Yarn. Does it mean it is better to run spark on Yarn than Mesos? Thanks! > Support more evenly distributing cores for Mesos mode > - > > Key: SPARK-4940 > URL: https://issues.apache.org/jira/browse/SPARK-4940 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen > Attachments: mesos-config-difference-3nodes-vs-2nodes.png > > > Currently in coarse-grained mode the Spark scheduler simply takes all the > resources it can on each node, which can cause uneven distribution depending on the > resources available on each slave.
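The even spread requested above (a 20-core cap over 5 slaves ending up as 4 cores per slave, rather than one 20-core executor) can be illustrated with a round-robin allocator. This is a sketch of the desired behaviour only, not Spark's actual Mesos scheduler code:

```python
def distribute_cores(max_cores, offers):
    """Round-robin max_cores across slave offers instead of packing them
    onto the first slave that can hold them all.

    offers maps slave id -> available cores; returns slave id -> cores
    granted.  Grants one core per slave per pass until the cap is reached
    or every offer is exhausted.
    """
    granted = {slave: 0 for slave in offers}
    remaining = max_cores
    while remaining > 0:
        progress = False
        for slave, avail in offers.items():
            if remaining == 0:
                break
            if granted[slave] < avail:
                granted[slave] += 1
                remaining -= 1
                progress = True
        if not progress:  # all offers exhausted below the cap
            break
    return granted

# The scenario from the comment: 5 slaves with 32 cores each and a
# total-core cap of 20 -> 4 cores on every slave.
offers = {f"slave-{i}": 32 for i in range(5)}
granted = distribute_cores(20, offers)
```

The trade-off is more, smaller executors: per-executor memory then matches a predictable number of tasks, avoiding the OOM-versus-underutilization problem described in the comment.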
[jira] [Updated] (SPARK-6723) Model import/export for ChiSqSelector
[ https://issues.apache.org/jira/browse/SPARK-6723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-6723: - Assignee: Jayant Shekhar > Model import/export for ChiSqSelector > - > > Key: SPARK-6723 > URL: https://issues.apache.org/jira/browse/SPARK-6723 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Assignee: Jayant Shekhar >Priority: Minor > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10975) Shuffle files left behind on Mesos without dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971265#comment-14971265 ] Chris Bannister commented on SPARK-10975: - Spark will use MESOS_DIRECTORY sandbox when not using shuffle service now that SPARK-9708 is merged. Is this a duplicate? > Shuffle files left behind on Mesos without dynamic allocation > - > > Key: SPARK-10975 > URL: https://issues.apache.org/jira/browse/SPARK-10975 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.5.1 >Reporter: Iulian Dragos >Priority: Blocker > > (from mailing list) > Running on Mesos in coarse-grained mode. No dynamic allocation or shuffle > service. > I see that there are two types of temporary files under /tmp folder > associated with every executor: /tmp/spark- and /tmp/blockmgr-. > When job is finished /tmp/spark- is gone, but blockmgr directory is > left with all gigabytes in it. > The reason is that logic to clean up files is only enabled when the shuffle > service is running, see https://github.com/apache/spark/pull/7820 > The shuffle files should be placed in the Mesos sandbox or under `tmp/spark` > unless the shuffle service is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
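The placement rule discussed in this ticket (shuffle and block-manager files belong in the Mesos sandbox, which the agent garbage-collects, unless the external shuffle service needs them to outlive the executor) can be sketched as a small decision function. The environment-variable names below are illustrative of the idea, not Spark's real configuration handling:

```python
def shuffle_dir(env, shuffle_service_enabled):
    """Pick where shuffle files should live for a Mesos executor.

    Without the external shuffle service, files can go in the Mesos
    sandbox (MESOS_DIRECTORY), which Mesos reclaims when the container
    is gone -- so nothing is orphaned under /tmp.  With the service,
    executors may exit before their shuffle output is consumed, so the
    files must live in a node-local directory that outlives them.
    """
    if shuffle_service_enabled:
        return env.get("SPARK_LOCAL_DIRS", "/tmp")
    # Sandbox path is provided by the Mesos agent and reclaimed by its GC.
    return env.get("MESOS_DIRECTORY", "/tmp")
```

Under this rule the gigabytes of leftover `blockmgr-` data described in the report would disappear with the sandbox instead of accumulating in `/tmp`.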
[jira] [Updated] (SPARK-11280) Mesos cluster deployment using only one node
[ https://issues.apache.org/jira/browse/SPARK-11280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iulian Dragos updated SPARK-11280: -- Attachment: Screen Shot 2015-10-23 at 11.37.43.png > Mesos cluster deployment using only one node > > > Key: SPARK-11280 > URL: https://issues.apache.org/jira/browse/SPARK-11280 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.5.1, 1.6.0 >Reporter: Iulian Dragos > Attachments: Screen Shot 2015-10-23 at 11.37.43.png > > > I submit the SparkPi example in Mesos cluster mode, and I notice that all > tasks fail except the ones that run on the same node as the driver. The > others fail with > {code} > sh: 1: > /tmp/mesos/slaves/1521e408-d8fe-416d-898b-3801e73a8293-S0/frameworks/1521e408-d8fe-416d-898b-3801e73a8293-0003/executors/driver-20151023113121-0006/runs/2abefd29-7386-4d81-a025-9d794780db23/spark-1.5.0-bin-hadoop2.6/bin/spark-class: > not found > {code} > The path exists only on the machine that launched the driver, and the sandbox > of the executor where this task died is completely empty. > I launch the task like this: > {code} > $ spark-submit --deploy-mode cluster --master mesos://sagitarius.local:7077 > --conf > spark.executor.uri="ftp://sagitarius.local/ftp/spark-1.5.0-bin-hadoop2.6.tgz" > --conf spark.mesos.coarse=true --class org.apache.spark.examples.SparkPi > ftp://sagitarius.local/ftp/spark-examples-1.5.0-hadoop2.6.0.jar > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > 15/10/23 11:31:21 INFO RestSubmissionClient: Submitting a request to launch > an application in mesos://sagitarius.local:7077. > 15/10/23 11:31:21 INFO RestSubmissionClient: Submission successfully created > as driver-20151023113121-0006. Polling submission state... > 15/10/23 11:31:21 INFO RestSubmissionClient: Submitting a request for the > status of submission driver-20151023113121-0006 in > mesos://sagitarius.local:7077. 
> 15/10/23 11:31:21 INFO RestSubmissionClient: State of driver > driver-20151023113121-0006 is now QUEUED. > 15/10/23 11:31:21 INFO RestSubmissionClient: Server responded with > CreateSubmissionResponse: > { > "action" : "CreateSubmissionResponse", > "serverSparkVersion" : "1.5.0", > "submissionId" : "driver-20151023113121-0006", > "success" : true > } > {code} > I can see the driver in the Dispatcher UI and the job succeeds eventually, > but running only on the node where the driver was launched (see attachment). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7021) JUnit output for Python tests
[ https://issues.apache.org/jira/browse/SPARK-7021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7021: - Assignee: Gabor Liptak > JUnit output for Python tests > - > > Key: SPARK-7021 > URL: https://issues.apache.org/jira/browse/SPARK-7021 > Project: Spark > Issue Type: Improvement > Components: Build, Project Infra >Reporter: Brennon York >Assignee: Gabor Liptak >Priority: Minor > Labels: starter > Fix For: 1.6.0 > > > Currently python returns its test output in its own format. What would be > preferred is if the Python test runner could output its test results in JUnit > format to better match the rest of the Jenkins test output. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9325) Support `collect` on DataFrame columns
[ https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970917#comment-14970917 ] Russell Pierce edited comment on SPARK-9325 at 10/23/15 12:53 PM: -- You're right, Spark had been producing an error because the df$col in question was a TINYINT stored in Parquet, not that the command itself didn't work; that problem seems to have been addressed in another Issue (https://issues.apache.org/jira/browse/SPARK-3575). was (Author: rpierce): You're right, Spark had been producing an error because the df$col in question was a TINYINT stored in Parquet, not that the command itself didn't work. > Support `collect` on DataFrame columns > -- > > Key: SPARK-9325 > URL: https://issues.apache.org/jira/browse/SPARK-9325 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > This is to support code of the form > ``` > ages <- collect(df$Age) > ``` > Right now `df$Age` returns a Column, which has no functions supported. > Similarly we might consider supporting `head(df$Age)` etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour
[ https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-11282: --- Description: Hi, I found very strange broadcast join behaviour. According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 I'm using hint for broadcast join. (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files ) I found that working of this feature depends on Executor Memory. In my case broadcast join is working up to 31G. Example: {code} spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... [Row(id=5, val=5, id2=5, val2=5)] spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... [Row(id=5, val=5, id2=None, val2=None)] {code} Please find example code attached. was: Hi, I found very strange broadcast join behaviour. According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 I'm using hint for broadcast join. (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files ) I found that working of this feature depends on Executor Memory. In my case broadcast join is working up to 31G. Example: spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... 
[Row(id=5, val=5, id2=5, val2=5)] spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... [Row(id=5, val=5, id2=None, val2=None)] Please find example code attached. > Very strange broadcast join behaviour > - > > Key: SPARK-11282 > URL: https://issues.apache.org/jira/browse/SPARK-11282 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 >Reporter: Maciej Bryński >Priority: Critical > Attachments: SPARK-11282.py > > > Hi, > I found very strange broadcast join behaviour. > According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 > I'm using hint for broadcast join. (I patched 1.5.1 with > https://github.com/apache/spark/pull/8801/files ) > I found that working of this feature depends on Executor Memory. > In my case broadcast join is working up to 31G. > Example: > {code} > spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G > debug_broadcast_join.py true > Creating test tables... > Joining tables... > Joined table schema: > root > |-- id: long (nullable = true) > |-- val: long (nullable = true) > |-- id2: long (nullable = true) > |-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=5, val2=5)] > spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py > true > Creating test tables... > Joining tables... > Joined table schema: > root > |-- id: long (nullable = true) > |-- val: long (nullable = true) > |-- id2: long (nullable = true) > |-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=None, val2=None)] > {code} > Please find example code attached. 
[jira] [Resolved] (SPARK-11282) Very strange broadcast join behaviour
[ https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11282. --- Resolution: Duplicate [~maver1ck] this could use a better title, and there is no code attached. I also strongly suspect it duplicates https://issues.apache.org/jira/browse/SPARK-10914 > Very strange broadcast join behaviour > - > > Key: SPARK-11282 > URL: https://issues.apache.org/jira/browse/SPARK-11282 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 >Reporter: Maciej Bryński >Priority: Critical > Attachments: SPARK-11282.py > > > Hi, > I found very strange broadcast join behaviour. > According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 > I'm using hint for broadcast join. (I patched 1.5.1 with > https://github.com/apache/spark/pull/8801/files ) > I found that working of this feature depends on Executor Memory. > In my case broadcast join is working up to 31G. > Example: > spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G > debug_broadcast_join.py true > Creating test tables... > Joining tables... > Joined table schema: > root >|-- id: long (nullable = true) >|-- val: long (nullable = true) >|-- id2: long (nullable = true) >|-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=5, val2=5)] > spark$ ~/spark/bin/spark-submit --executor-memory 32G > debug_broadcast_join.py true > Creating test tables... > Joining tables... > Joined table schema: > root >|-- id: long (nullable = true) >|-- val: long (nullable = true) >|-- id2: long (nullable = true) >|-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=None, val2=None)] > Please find example code attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11265) YarnClient can't get tokens to talk to Hive in a secure cluster
[ https://issues.apache.org/jira/browse/SPARK-11265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970930#comment-14970930 ] Steve Loughran commented on SPARK-11265: I can trigger a failure in a unit test now; once you get past Hive failing to load (a classpath issue), the {{get()}} operation fails: {code} obtain Tokens For HiveMetastore *** FAILED *** java.lang.IllegalArgumentException: wrong number of arguments at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil.obtainTokenForHiveMetastoreInner(YarnSparkHadoopUtil.scala:203) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtilSuite$$anonfun$22.apply(YarnSparkHadoopUtilSuite.scala:254) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtilSuite$$anonfun$22.apply(YarnSparkHadoopUtilSuite.scala:249) at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) {code} > YarnClient can't get tokens to talk to Hive in a secure cluster > --- > > Key: SPARK-11265 > URL: https://issues.apache.org/jira/browse/SPARK-11265 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 1.5.1 > Environment: Kerberized Hadoop cluster >Reporter: Steve Loughran > > As reported on the dev list, trying to run a YARN client which wants to talk > to Hive in a Kerberized Hadoop cluster fails. This appears to be because the > constructor of the {{ org.apache.hadoop.hive.ql.metadata.Hive}} class was > made private and replaced with a factory method. The YARN client uses > reflection to get the tokens, so the signature changes weren't picked up in > SPARK-8064. 
[jira] [Commented] (SPARK-11282) Very strange broadcast join behaviour
[ https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970931#comment-14970931 ] Maciej Bryński commented on SPARK-11282: We had a race condition here. I was attaching the file when you answered. I'll try the solution from SPARK-10914. > Very strange broadcast join behaviour > - > > Key: SPARK-11282 > URL: https://issues.apache.org/jira/browse/SPARK-11282 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 >Reporter: Maciej Bryński >Priority: Critical > Attachments: SPARK-11282.py > > > Hi, > I found very strange broadcast join behaviour. > According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 > I'm using hint for broadcast join. (I patched 1.5.1 with > https://github.com/apache/spark/pull/8801/files ) > I found that working of this feature depends on Executor Memory. > In my case broadcast join is working up to 31G. > Example: > {code} > spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G > debug_broadcast_join.py true > Creating test tables... > Joining tables... > Joined table schema: > root > |-- id: long (nullable = true) > |-- val: long (nullable = true) > |-- id2: long (nullable = true) > |-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=5, val2=5)] > spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py > true > Creating test tables... > Joining tables... > Joined table schema: > root > |-- id: long (nullable = true) > |-- val: long (nullable = true) > |-- id2: long (nullable = true) > |-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=None, val2=None)] > {code} > Please find example code attached.
[jira] [Commented] (SPARK-11282) Very strange broadcast join behaviour
[ https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970952#comment-14970952 ] Maciej Bryński commented on SPARK-11282: UPDATE: Looks like -XX:-UseCompressedOops solves the problem. > Very strange broadcast join behaviour > - > > Key: SPARK-11282 > URL: https://issues.apache.org/jira/browse/SPARK-11282 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 >Reporter: Maciej Bryński >Priority: Critical > Attachments: SPARK-11282.py > > > Hi, > I found very strange broadcast join behaviour. > According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 > I'm using hint for broadcast join. (I patched 1.5.1 with > https://github.com/apache/spark/pull/8801/files ) > I found that working of this feature depends on Executor Memory. > In my case broadcast join is working up to 31G. > Example: > {code} > spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G > debug_broadcast_join.py true > Creating test tables... > Joining tables... > Joined table schema: > root > |-- id: long (nullable = true) > |-- val: long (nullable = true) > |-- id2: long (nullable = true) > |-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=5, val2=5)] > spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py > true > Creating test tables... > Joining tables... > Joined table schema: > root > |-- id: long (nullable = true) > |-- val: long (nullable = true) > |-- id2: long (nullable = true) > |-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=None, val2=None)] > {code} > Please find example code attached.
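The 31G-works / 32G-fails boundary reported here lines up with HotSpot's compressed-oops cutoff: at roughly 32 GB of max heap the JVM switches from compressed 4-byte to plain 8-byte object references, which changes object layout and sizes. A sketch of that rule, for illustration only (the exact JVM cutoff sits slightly below 32 GB and depends on heap alignment):

```python
GIB = 1024 ** 3
# Approximate HotSpot behaviour: compressed oops are dropped once the
# max heap reaches ~32 GB.  The real threshold is implementation-defined.
COMPRESSED_OOPS_MAX_HEAP = 32 * GIB

def uses_compressed_oops(max_heap_bytes, explicitly_disabled=False):
    """Mirror the heap-size rule behind the 31G/32G observation above.

    With -XX:-UseCompressedOops (explicitly_disabled=True) the JVM uses
    8-byte references regardless of heap size -- which is why forcing
    the flag made both executor-memory settings behave the same way in
    this report.
    """
    if explicitly_disabled:
        return False
    return max_heap_bytes < COMPRESSED_OOPS_MAX_HEAP
```

In other words, `--executor-memory 31G` and `32G` land on opposite sides of the reference-compression boundary, so any size estimate that assumes one pointer width silently breaks on the other side.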
[jira] [Commented] (SPARK-6847) Stack overflow on updateStateByKey which followed by a dstream with checkpoint set
[ https://issues.apache.org/jira/browse/SPARK-6847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971004#comment-14971004 ] Glyton Camilleri commented on SPARK-6847: - Hi, we managed to get rid of the overflow issues by setting checkpoints on more streams than we thought we needed, in addition to implementing a small change following your suggestion; before the fix, the setup was similar to what you describe: {code} val dStream1 = // create kafka stream and do some preprocessing val dStream2 = dStream1.updateStateByKey { func }.checkpoint(timeWindow * 2) val dStream3 = dStream2.map { ... } // (1) perform some side-effect on the state if (certainConditionsAreMet) dStream2.foreachRDD { _.foreachPartition { ... } } // (2) publish final results to a set of Kafka topics dStream3.transform { ... }.foreachRDD { _.foreachPartition { ... } } {code} There were two things we did: a) set different checkpoints for {{dStream2}} and {{dStream3}}, whereas before we were only setting the checkpoint for {{dStream2}}; b) changed (1) above such that when {{!certainConditionsAreMet}}, we just consume the stream as you describe in your suggestion. I honestly think that b) was the more likely cause of the StackOverflowError going away, but we decided to leave the checkpoint settings from a) in place anyway. Apologies for the late follow-up, but we needed to make sure the issue had actually been resolved. 
> Stack overflow on updateStateByKey which followed by a dstream with > checkpoint set > -- > > Key: SPARK-6847 > URL: https://issues.apache.org/jira/browse/SPARK-6847 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.3.0 >Reporter: Jack Hu > Labels: StackOverflowError, Streaming > > The issue happens with the following sample code: uses {{updateStateByKey}} > followed by a {{map}} with checkpoint interval 10 seconds > {code} > val sparkConf = new SparkConf().setAppName("test") > val streamingContext = new StreamingContext(sparkConf, Seconds(10)) > streamingContext.checkpoint("""checkpoint""") > val source = streamingContext.socketTextStream("localhost", ) > val updatedResult = source.map( > (1,_)).updateStateByKey( > (newlist : Seq[String], oldstate : Option[String]) => > newlist.headOption.orElse(oldstate)) > updatedResult.map(_._2) > .checkpoint(Seconds(10)) > .foreachRDD((rdd, t) => { > println("Deep: " + rdd.toDebugString.split("\n").length) > println(t.toString() + ": " + rdd.collect.length) > }) > streamingContext.start() > streamingContext.awaitTermination() > {code} > From the output, we can see that the dependency will be increasing time over > time, the {{updateStateByKey}} never get check-pointed, and finally, the > stack overflow will happen. > Note: > * The rdd in {{updatedResult.map(_._2)}} get check-pointed in this case, but > not the {{updateStateByKey}} > * If remove the {{checkpoint(Seconds(10))}} from the map result ( > {{updatedResult.map(_._2)}} ), the stack overflow will not happen -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11283) List column gets additional level of nesting when converted to Spark DataFrame
Maciej Szymkiewicz created SPARK-11283: -- Summary: List column gets additional level of nesting when converted to Spark DataFrame Key: SPARK-11283 URL: https://issues.apache.org/jira/browse/SPARK-11283 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.6.0 Environment: R 3.2.2, Spark build from master 487d409e71767c76399217a07af8de1bb0da7aa8 Reporter: Maciej Szymkiewicz When an input data frame contains a list column, the resulting Spark DataFrame gains an additional level of nesting, and as a result the collected data is no longer identical to the input: {code} ldf <- data.frame(row.names=1:2) ldf$x <- list(list(1), list(2)) sdf <- createDataFrame(sqlContext, ldf) printSchema(sdf) ## root ## |-- x: array (nullable = true) ## ||-- element: array (containsNull = true) ## |||-- element: double (containsNull = true) identical(ldf, collect(sdf)) ## [1] FALSE {code} Comparing the structures: Local df {code} unclass(ldf) ## $x ## $x[[1]] ## $x[[1]][[1]] ## [1] 1 ## ## $x[[2]] ## $x[[2]][[1]] ## [1] 2 ## ## attr(,"row.names") ## [1] 1 2 {code} Collected {code} unclass(collect(sdf)) ## $x ## $x[[1]] ## $x[[1]][[1]] ## $x[[1]][[1]][[1]] ## [1] 1 ## ## $x[[2]] ## $x[[2]][[1]] ## $x[[2]][[1]][[1]] ## [1] 2 ## ## attr(,"row.names") ## [1] 1 2 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
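The shape difference in the report can be shown with plain Python lists standing in for the R structures (the values are the ones from the issue; this is an illustration of the extra nesting level, not SparkR code):

```python
# ldf$x holds one single-element list per row.
original = [[1.0], [2.0]]

# What collect(sdf) reportedly returns: each cell wrapped in one extra list.
collected = [[[1.0]], [[2.0]]]

# The round trip is no longer the identity (R's identical(ldf, collect(sdf)) is FALSE).
assert collected != original

# Stripping the spurious level restores the input.
unwrapped = [cell[0] for cell in collected]
assert unwrapped == original
print(unwrapped)  # [[1.0], [2.0]]
```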
[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns
[ https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970917#comment-14970917 ] Russell Pierce commented on SPARK-9325: --- You're right, Spark had been producing an error because the df$col in question was a TINYINT stored in Parquet, not because the command itself didn't work. > Support `collect` on DataFrame columns > -- > > Key: SPARK-9325 > URL: https://issues.apache.org/jira/browse/SPARK-9325 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Reporter: Shivaram Venkataraman > > This is to support code of the form > ``` > ages <- collect(df$Age) > ``` > Right now `df$Age` returns a Column, which has no functions supported. > Similarly we might consider supporting `head(df$Age)` etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11167) Incorrect type resolution on heterogeneous data structures
[ https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970922#comment-14970922 ] Maciej Szymkiewicz commented on SPARK-11167: Related problem: https://issues.apache.org/jira/browse/SPARK-11281 > Incorrect type resolution on heterogeneous data structures > -- > > Key: SPARK-11167 > URL: https://issues.apache.org/jira/browse/SPARK-11167 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Maciej Szymkiewicz > > If a structure contains heterogeneous types, type inference incorrectly assigns > the type of the first encountered element as the type of the whole structure. > This problem affects both lists: > {code} > SparkR:::infer_type(list(a=1, b="a")) > ## [1] "array" > SparkR:::infer_type(list(a="a", b=1)) > ## [1] "array" > {code} > and environments: > {code} > SparkR:::infer_type(as.environment(list(a=1, b="a"))) > ## [1] "map" > SparkR:::infer_type(as.environment(list(a="a", b=1))) > ## [1] "map " > {code} > This results in errors during data collection and other operations on > DataFrames: > {code} > ldf <- data.frame(row.names=1:2) > ldf$foo <- list(list("1", 2), list(3, 4)) > sdf <- createDataFrame(sqlContext, ldf) > collect(sdf) > ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID > 9) > ## scala.MatchError: 2.0 (of class java.lang.Double) > ## ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
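The reported behaviour — committing to the first element's type and never checking the rest — can be sketched in plain Python (function names are illustrative; this is not the SparkR implementation):

```python
def infer_type_first(values):
    """Mimics the reported bug: the first element's type is taken as the
    element type of the whole array, regardless of later elements."""
    return f"array<{type(values[0]).__name__}>"

def infer_type_checked(values):
    """A safer sketch: verify all elements agree before committing to a type."""
    kinds = {type(v).__name__ for v in values}
    if len(kinds) > 1:
        raise TypeError(f"heterogeneous element types: {sorted(kinds)}")
    return f"array<{kinds.pop()}>"

print(infer_type_first([1.0, "a"]))  # array<float> -- wrong for the second element
print(infer_type_first(["a", 1.0]))  # array<str>   -- wrong for the second element
```

The downstream `scala.MatchError: 2.0 (of class java.lang.Double)` in the issue is the runtime face of the same mismatch: data arrives with a type the inferred schema did not promise.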
[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970915#comment-14970915 ] Jim Haughwout commented on SPARK-6270: -- [~tdas]: Can the team update this issue to reflect that this _also_ affects Versions 1.3.1, 1.4.0, 1.4.1, 1.5.0, and 1.5.1? > Standalone Master hangs when streaming job completes and event logging is > enabled > - > > Key: SPARK-6270 > URL: https://issues.apache.org/jira/browse/SPARK-6270 > Project: Spark > Issue Type: Bug > Components: Deploy, Streaming >Affects Versions: 1.2.0, 1.2.1, 1.3.0 >Reporter: Tathagata Das >Priority: Critical > > If the event logging is enabled, the Spark Standalone Master tries to > recreate the web UI of a completed Spark application from its event logs. > However if this event log is huge (e.g. for a Spark Streaming application), > then the master hangs in its attempt to read and recreate the web ui. This > hang causes the whole standalone cluster to be unusable. > Workaround is to disable the event logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour
[ https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-11282: --- Description: Hi, I found very strange broadcast join behaviour. According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 I'm using hint for broadcast join. (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files ) I found that working of this feature depends on Executor Memory. In my case broadcast join is working up to 31G. Example: spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... [Row(id=5, val=5, id2=5, val2=5)] spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... [Row(id=5, val=5, id2=None, val2=None)] Please find example code attached. was: Hi, I found very strange broadcast join behaviour. According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 I'm using hint for broadcast join. (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files ) I found that working of this feature depends on Executor Memory. In my case broadcast join is working up to 31G. Example: spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... 
[Row(id=5, val=5, id2=5, val2=5)] spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... [Row(id=5, val=5, id2=None, val2=None)] Please find example code attached. > Very strange broadcast join behaviour > - > > Key: SPARK-11282 > URL: https://issues.apache.org/jira/browse/SPARK-11282 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 >Reporter: Maciej Bryński >Priority: Critical > Attachments: SPARK-11282.py > > > Hi, > I found very strange broadcast join behaviour. > According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 > I'm using hint for broadcast join. (I patched 1.5.1 with > https://github.com/apache/spark/pull/8801/files ) > I found that working of this feature depends on Executor Memory. > In my case broadcast join is working up to 31G. > Example: > > spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G > debug_broadcast_join.py true > Creating test tables... > Joining tables... > Joined table schema: > root >|-- id: long (nullable = true) >|-- val: long (nullable = true) >|-- id2: long (nullable = true) >|-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=5, val2=5)] > spark$ ~/spark/bin/spark-submit --executor-memory 32G > debug_broadcast_join.py true > Creating test tables... > Joining tables... > Joined table schema: > root >|-- id: long (nullable = true) >|-- val: long (nullable = true) >|-- id2: long (nullable = true) >|-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=None, val2=None)] > Please find example code attached. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps
[ https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971027#comment-14971027 ] Apache Spark commented on SPARK-11016: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/9243 > Spark fails when running with a task that requires a more recent version of > RoaringBitmaps > -- > > Key: SPARK-11016 > URL: https://issues.apache.org/jira/browse/SPARK-11016 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Charles Allen > > The following error appears during Kryo init whenever a more recent version > (>0.5.0) of Roaring bitmaps is required by a job. > org/roaringbitmap/RoaringArray$Element was removed in 0.5.0 > {code} > A needed class was not found. This could be due to an error in your runpath. > Missing class: org/roaringbitmap/RoaringArray$Element > java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338) > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237) > at > org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) > at 
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.textFile(SparkContext.scala:816) > {code} > See https://issues.apache.org/jira/browse/SPARK-5949 for related info -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
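The stack trace above is a symbol-removed-on-upgrade failure: Kryo setup references {{org/roaringbitmap/RoaringArray$Element}}, which no longer exists in RoaringBitmap > 0.5.0. A hypothetical Python analogue of probing a dependency for a required symbol up front, so a version mismatch surfaces as a clear error instead of a deep runtime crash (all names here are illustrative stand-ins, not Spark or RoaringBitmap APIs):

```python
import types

def require_symbol(module, name):
    """Fail fast with a version-mismatch hint if a dependency lacks a symbol."""
    if not hasattr(module, name):
        raise RuntimeError(
            f"{module.__name__} lacks '{name}': likely an incompatible "
            "dependency version on the path")
    return getattr(module, name)

# Stand-in for a library module; 'Element' models the class present in <= 0.5.0.
old_lib = types.SimpleNamespace(__name__="roaring_like")
old_lib.Element = object
require_symbol(old_lib, "Element")     # ok while the symbol exists

del old_lib.Element                    # model the upgrade that removed it
try:
    require_symbol(old_lib, "Element")
except RuntimeError as e:
    print("caught:", e)
```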
[jira] [Assigned] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps
[ https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11016: Assignee: (was: Apache Spark) > Spark fails when running with a task that requires a more recent version of > RoaringBitmaps > -- > > Key: SPARK-11016 > URL: https://issues.apache.org/jira/browse/SPARK-11016 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Charles Allen > > The following error appears during Kryo init whenever a more recent version > (>0.5.0) of Roaring bitmaps is required by a job. > org/roaringbitmap/RoaringArray$Element was removed in 0.5.0 > {code} > A needed class was not found. This could be due to an error in your runpath. > Missing class: org/roaringbitmap/RoaringArray$Element > java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338) > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237) > at > org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318) > at > 
org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.textFile(SparkContext.scala:816) > {code} > See https://issues.apache.org/jira/browse/SPARK-5949 for related info -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour
[ https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-11282: --- Description: Hi, I found very strange broadcast join behaviour. According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 I'm using hint for broadcast join. (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files ) I found that working of this feature depends on Executor Memory. In my case broadcast join is working up to 31G. Example: spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... [Row(id=5, val=5, id2=5, val2=5)] spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... [Row(id=5, val=5, id2=None, val2=None)] Please find example code attached. was: Hi, I found very strange broadcast join behaviour. According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 I'm using hint for broadcast join. (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files ) I found that working of this feature depends on Executor Memory. In my case broadcast join is working up to 31G. Example: spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... 
[Row(id=5, val=5, id2=5, val2=5)] spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... [Row(id=5, val=5, id2=None, val2=None)] Please find example code attached. > Very strange broadcast join behaviour > - > > Key: SPARK-11282 > URL: https://issues.apache.org/jira/browse/SPARK-11282 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 >Reporter: Maciej Bryński >Priority: Critical > Attachments: SPARK-11282.py > > > Hi, > I found very strange broadcast join behaviour. > According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 > I'm using hint for broadcast join. (I patched 1.5.1 with > https://github.com/apache/spark/pull/8801/files ) > I found that working of this feature depends on Executor Memory. > In my case broadcast join is working up to 31G. > Example: > spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G > debug_broadcast_join.py true > Creating test tables... > Joining tables... > Joined table schema: > root >|-- id: long (nullable = true) >|-- val: long (nullable = true) >|-- id2: long (nullable = true) >|-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=5, val2=5)] > spark$ ~/spark/bin/spark-submit --executor-memory 32G > debug_broadcast_join.py true > Creating test tables... > Joining tables... > Joined table schema: > root >|-- id: long (nullable = true) >|-- val: long (nullable = true) >|-- id2: long (nullable = true) >|-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=None, val2=None)] > Please find example code attached. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11282) Very strange broadcast join behaviour
[ https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Bryński updated SPARK-11282: --- Attachment: SPARK-11282.py > Very strange broadcast join behaviour > - > > Key: SPARK-11282 > URL: https://issues.apache.org/jira/browse/SPARK-11282 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 >Reporter: Maciej Bryński >Priority: Critical > Attachments: SPARK-11282.py > > > Hi, > I found very strange broadcast join behaviour. > According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 > I'm using hint for broadcast join. (I patched 1.5.1 with > https://github.com/apache/spark/pull/8801/files ) > I found that working of this feature depends on Executor Memory. > In my case broadcast join is working up to 31G. > Example: > spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G > debug_broadcast_join.py true > Creating test tables... > Joining tables... > Joined table schema: > root > |-- id: long (nullable = true) > |-- val: long (nullable = true) > |-- id2: long (nullable = true) > |-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=5, val2=5)] > spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py > true > Creating test tables... > Joining tables... > Joined table schema: > root > |-- id: long (nullable = true) > |-- val: long (nullable = true) > |-- id2: long (nullable = true) > |-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=None, val2=None)] > Please find example code attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11282) Very strange broadcast join behaviour
[ https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970931#comment-14970931 ] Maciej Bryński edited comment on SPARK-11282 at 10/23/15 1:07 PM: -- We had race condition here. I was attaching file when you answered. You're probably right. I'll try solution of https://issues.apache.org/jira/browse/SPARK-10914 was (Author: maver1ck): We had race condition here. I was attaching file when you answered. Uue're probably right. I'll try solution of https://issues.apache.org/jira/browse/SPARK-10914 > Very strange broadcast join behaviour > - > > Key: SPARK-11282 > URL: https://issues.apache.org/jira/browse/SPARK-11282 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 >Reporter: Maciej Bryński >Priority: Critical > Attachments: SPARK-11282.py > > > Hi, > I found very strange broadcast join behaviour. > According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 > I'm using hint for broadcast join. (I patched 1.5.1 with > https://github.com/apache/spark/pull/8801/files ) > I found that working of this feature depends on Executor Memory. > In my case broadcast join is working up to 31G. > Example: > {code} > spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G > debug_broadcast_join.py true > Creating test tables... > Joining tables... > Joined table schema: > root > |-- id: long (nullable = true) > |-- val: long (nullable = true) > |-- id2: long (nullable = true) > |-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=5, val2=5)] > spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py > true > Creating test tables... > Joining tables... > Joined table schema: > root > |-- id: long (nullable = true) > |-- val: long (nullable = true) > |-- id2: long (nullable = true) > |-- val2: long (nullable = true) > Selecting data for id = 5... 
> [Row(id=5, val=5, id2=None, val2=None)] > {code} > Please find example code attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11282) Very strange broadcast join behaviour
[ https://issues.apache.org/jira/browse/SPARK-11282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970931#comment-14970931 ] Maciej Bryński edited comment on SPARK-11282 at 10/23/15 1:07 PM: -- We had race condition here. I was attaching file when you answered. Uue're probably right. I'll try solution of https://issues.apache.org/jira/browse/SPARK-10914 was (Author: maver1ck): We had race condition here. I was attaching file when you answered. I'll try solution of 10914 > Very strange broadcast join behaviour > - > > Key: SPARK-11282 > URL: https://issues.apache.org/jira/browse/SPARK-11282 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.1 >Reporter: Maciej Bryński >Priority: Critical > Attachments: SPARK-11282.py > > > Hi, > I found very strange broadcast join behaviour. > According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 > I'm using hint for broadcast join. (I patched 1.5.1 with > https://github.com/apache/spark/pull/8801/files ) > I found that working of this feature depends on Executor Memory. > In my case broadcast join is working up to 31G. > Example: > {code} > spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G > debug_broadcast_join.py true > Creating test tables... > Joining tables... > Joined table schema: > root > |-- id: long (nullable = true) > |-- val: long (nullable = true) > |-- id2: long (nullable = true) > |-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=5, val2=5)] > spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py > true > Creating test tables... > Joining tables... > Joined table schema: > root > |-- id: long (nullable = true) > |-- val: long (nullable = true) > |-- id2: long (nullable = true) > |-- val2: long (nullable = true) > Selecting data for id = 5... > [Row(id=5, val=5, id2=None, val2=None)] > {code} > Please find example code attached. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11258) Remove quadratic runtime complexity for converting a Spark DataFrame into an R data.frame
[ https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971025#comment-14971025 ] Frank Rosner commented on SPARK-11258: -- Actually I am pretty confused now. Thinking about it, having a for loop and a map should not be accessing every element more than one time. However, it still seems more complex than necessary to me. Let me try to reproduce the fact that we could not load it with the old function but could with the new one. Maybe the {{.toArray}} method is a memory problem, as it first recreates the whole shebang and then copies it to another array? > Remove quadratic runtime complexity for converting a Spark DataFrame into an > R data.frame > - > > Key: SPARK-11258 > URL: https://issues.apache.org/jira/browse/SPARK-11258 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Frank Rosner > > h4. Introduction > We tried to collect a DataFrame with > 1 million rows and a few hundred > columns in SparkR. This took a huge amount of time (much more than in the > Spark REPL). When looking into the code, I found that the > {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method has quadratic run > time complexity (it goes through the complete data set _m_ times, where _m_ > is the number of columns). > h4. Problem > The {{dfToCols}} method is transposing the row-wise representation of the > Spark DataFrame (array of rows) into a column-wise representation (array of > columns) to then be put into a data frame. This is done in a very inefficient > way, causing huge performance (and possibly also memory) problems when > collecting bigger data frames. > h4. Solution > Directly transpose the row-wise representation to the column-wise > representation with one pass through the data. I will create a pull request > for this. > h4. Runtime comparison > On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} > method takes on average 2267 ms to complete. My implementation takes only 554 ms > on average. This effect gets even bigger the more columns you have. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
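As the comment above notes, both shapes touch each element a bounded number of times, so the measured win likely comes from avoiding repeated traversals and intermediate copies rather than a change in asymptotic class. The two shapes can be sketched in plain Python (illustrative only, not the Scala {{dfToCols}}):

```python
def df_to_cols_multi_pass(rows):
    """Old shape: one full traversal of the data per column (m passes)."""
    m = len(rows[0])
    return [[row[j] for row in rows] for j in range(m)]

def df_to_cols_single_pass(rows):
    """New shape: one pass, appending each field of each row to its column."""
    m = len(rows[0])
    cols = [[] for _ in range(m)]
    for row in rows:
        for j, v in enumerate(row):
            cols[j].append(v)
    return cols

rows = [(1, "a", True), (2, "b", False)]
assert df_to_cols_multi_pass(rows) == df_to_cols_single_pass(rows)
print(df_to_cols_single_pass(rows))  # [[1, 2], ['a', 'b'], [True, False]]
```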
[jira] [Created] (SPARK-11282) Very strange broadcast join behaviour
Maciej Bryński created SPARK-11282: -- Summary: Very strange broadcast join behaviour Key: SPARK-11282 URL: https://issues.apache.org/jira/browse/SPARK-11282 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.5.1 Reporter: Maciej Bryński Priority: Critical Hi, I found very strange broadcast join behaviour. According to this Jira https://issues.apache.org/jira/browse/SPARK-10577 I'm using hint for broadcast join. (I patched 1.5.1 with https://github.com/apache/spark/pull/8801/files ) I found that working of this feature depends on Executor Memory. In my case broadcast join is working up to 31G. Example: spark1:~/ab$ ~/spark/bin/spark-submit --executor-memory 31G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... [Row(id=5, val=5, id2=5, val2=5)] spark$ ~/spark/bin/spark-submit --executor-memory 32G debug_broadcast_join.py true Creating test tables... Joining tables... Joined table schema: root |-- id: long (nullable = true) |-- val: long (nullable = true) |-- id2: long (nullable = true) |-- val2: long (nullable = true) Selecting data for id = 5... [Row(id=5, val=5, id2=None, val2=None)] Please find example code attached. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11281) Issue with creating and collecting DataFrame using environments
Maciej Szymkiewicz created SPARK-11281: -- Summary: Issue with creating and collecting DataFrame using environments Key: SPARK-11281 URL: https://issues.apache.org/jira/browse/SPARK-11281 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.6.0 Environment: R 3.2.2, Spark build from master 487d409e71767c76399217a07af8de1bb0da7aa8 Reporter: Maciej Szymkiewicz It is not possible to access a Map field created from an environment. Assuming a local data frame is created as follows: {code} ldf <- data.frame(row.names=1:2) ldf$x <- c(as.environment(list(a=1, b=2)), as.environment(list(c=3))) str(ldf) ## 'data.frame':2 obs. of 1 variable: ## $ x:List of 2 ## ..$ : ## ..$ : get("a", ldf$x[[1]]) ## [1] 1 get("c", ldf$x[[2]]) ## [1] 3 {code} It is possible to create a Spark data frame: {code} sdf <- createDataFrame(sqlContext, ldf) printSchema(sdf) ## root ## |-- x: array (nullable = true) ## ||-- element: map (containsNull = true) ## |||-- key: string ## |||-- value: double (valueContainsNull = true) {code} but it throws: {code} java.lang.IllegalArgumentException: Invalid array type e {code} on collect / head. The problem seems to be specific to environments and cannot be reproduced when the Map comes, for example, from a Cassandra table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-11167) Incorrect type resolution on heterogeneous data structures
[ https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maciej Szymkiewicz updated SPARK-11167: --- Comment: was deleted (was: Related problem: https://issues.apache.org/jira/browse/SPARK-11281 )
> Incorrect type resolution on heterogeneous data structures
> --
>
> Key: SPARK-11167
> URL: https://issues.apache.org/jira/browse/SPARK-11167
> Project: Spark
> Issue Type: Bug
> Components: SparkR
> Affects Versions: 1.6.0
> Reporter: Maciej Szymkiewicz
>
> If a structure contains heterogeneous values, the type of the first encountered element is incorrectly assigned as the type of the whole structure. This problem affects both lists:
> {code}
> SparkR:::infer_type(list(a=1, b="a"))
> ## [1] "array"
> SparkR:::infer_type(list(a="a", b=1))
> ## [1] "array"
> {code}
> and environments:
> {code}
> SparkR:::infer_type(as.environment(list(a=1, b="a")))
> ## [1] "map"
> SparkR:::infer_type(as.environment(list(a="a", b=1)))
> ## [1] "map"
> {code}
> This results in errors during data collection and other operations on DataFrames:
> {code}
> ldf <- data.frame(row.names=1:2)
> ldf$foo <- list(list("1", 2), list(3, 4))
> sdf <- createDataFrame(sqlContext, ldf)
> collect(sdf)
> ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
> ## scala.MatchError: 2.0 (of class java.lang.Double)
> ## ...
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled
[ https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated SPARK-6270: -- Affects Version/s: 1.5.1 > Standalone Master hangs when streaming job completes and event logging is > enabled > - > > Key: SPARK-6270 > URL: https://issues.apache.org/jira/browse/SPARK-6270 > Project: Spark > Issue Type: Bug > Components: Deploy, Streaming >Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.5.1 >Reporter: Tathagata Das >Priority: Critical > > If the event logging is enabled, the Spark Standalone Master tries to > recreate the web UI of a completed Spark application from its event logs. > However if this event log is huge (e.g. for a Spark Streaming application), > then the master hangs in its attempt to read and recreate the web ui. This > hang causes the whole standalone cluster to be unusable. > Workaround is to disable the event logging. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
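[Editor's note] The workaround described above is a single configuration property; assuming a standard Spark deployment, it can be set in conf/spark-defaults.conf:

```properties
# Stops the standalone Master from replaying event logs to rebuild
# the web UI of completed applications (the source of the hang).
spark.eventLog.enabled  false
```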
[jira] [Created] (SPARK-11284) ALS produces predictions as floats and should be double
Dominik Dahlem created SPARK-11284: -- Summary: ALS produces predictions as floats and should be double Key: SPARK-11284 URL: https://issues.apache.org/jira/browse/SPARK-11284 Project: Spark Issue Type: Bug Components: ML Affects Versions: 1.5.1 Environment: All Reporter: Dominik Dahlem
Using pyspark.ml and DataFrames, the ALS recommender cannot be evaluated with the RegressionEvaluator because of a type mismatch between the model transformation and the evaluation APIs. One can work around this by casting the prediction column to double before passing it into the evaluator; however, that does not work with pipelines and cross-validation. Code and traceback below:
{code}
als = ALS(rank=10, maxIter=30, regParam=0.1, userCol='userID', itemCol='movieID', ratingCol='rating')
model = als.fit(training)
predictions = model.transform(validation)
evaluator = RegressionEvaluator(predictionCol='prediction', labelCol='rating')
validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 'rmse'})
{code}
Traceback:
validationRmse = evaluator.evaluate(predictions, {evaluator.metricName: 'rmse'})
  File "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py", line 63, in evaluate
  File "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/pyspark.zip/pyspark/ml/evaluation.py", line 94, in _evaluate
  File "/Users/dominikdahlem/software/spark-1.6.0-SNAPSHOT-bin-custom-spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/Users/dominikdahlem/projects/repositories/spark/python/pyspark/sql/utils.py", line 42, in deco
    raise IllegalArgumentException(s.split(': ', 1)[1])
pyspark.sql.utils.IllegalArgumentException: requirement failed: Column prediction must be of type DoubleType but was actually FloatType.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10562) .partitionBy() creates the metastore partition columns in all lowercase, but persists the data path as MixedCase resulting in an error when the data is later attempted
[ https://issues.apache.org/jira/browse/SPARK-10562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971042#comment-14971042 ] Apache Spark commented on SPARK-10562: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/9251
> .partitionBy() creates the metastore partition columns in all lowercase, but persists the data path as MixedCase resulting in an error when the data is later attempted to query.
> -
>
> Key: SPARK-10562
> URL: https://issues.apache.org/jira/browse/SPARK-10562
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.4.1
> Reporter: Jason Pohl
> Assignee: Wenchen Fan
> Attachments: MixedCasePartitionBy.dbc
>
> When using DataFrame.write.partitionBy().saveAsTable() it creates the partition-by columns in all lowercase in the metastore. However, it writes the data to the filesystem using mixed case. This causes an error when running a select against the table.
> --
> from pyspark.sql import Row
> # Create a data frame with mixed-case column names
> myRDD = sc.parallelize([Row(Name="John Terry", Goals=1, Year=2015),
>                         Row(Name="Frank Lampard", Goals=15, Year=2012)])
> myDF = sqlContext.createDataFrame(myRDD)
> # Write this data out to a parquet file and partition by the Year (which is a mixed-case name)
> myDF.write.partitionBy("Year").saveAsTable("chelsea_goals")
> %sql show create table chelsea_goals;
> --The metastore is showing a partition column name of all-lowercase "year"
> # Verify that the data is written with appropriate partitions
> display(dbutils.fs.ls("/user/hive/warehouse/chelsea_goals"))
> %sql
> --Now try to run a query against this table
> select * from chelsea_goals
> Error in SQL statement: UncheckedExecutionException: java.lang.RuntimeException: Partition column year not found in schema StructType(StructField(Goals,LongType,true), StructField(Name,StringType,true), StructField(Year,LongType,true))
> # Now let's try this again using a lowercase column name
> myRDD2 = sc.parallelize([Row(Name="John Terry", Goals=1, year=2015),
>                          Row(Name="Frank Lampard", Goals=15, year=2012)])
> myDF2 = sqlContext.createDataFrame(myRDD2)
> myDF2.write.partitionBy("year").saveAsTable("chelsea_goals2")
> %sql select * from chelsea_goals2;
> --Now everything works
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
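[Editor's note] The failure quoted above reduces to a case-sensitivity mismatch. The sketch below is plain Python, outside Spark, purely to illustrate why the lowercased metastore name misses the mixed-case schema entry; the variable names are illustrative, not Spark internals:

```python
# Columns as written to the filesystem (mixed case, per the schema
# in the error message above).
schema_fields = ["Goals", "Name", "Year"]

# The metastore lowercases the partition column name on save.
metastore_partition_col = "Year".lower()  # stored as "year"

# Querying looks the stored name up against the mixed-case schema:
found = metastore_partition_col in schema_fields
print(found)  # False -> "Partition column year not found in schema"

# A case-insensitive lookup (what a fix effectively needs) succeeds:
found_ci = metastore_partition_col in [f.lower() for f in schema_fields]
print(found_ci)  # True
```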
[jira] [Commented] (SPARK-10947) With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings
[ https://issues.apache.org/jira/browse/SPARK-10947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970777#comment-14970777 ] Apache Spark commented on SPARK-10947: -- User 'stephend-realitymine' has created a pull request for this issue: https://github.com/apache/spark/pull/9249 > With schema inference from JSON into a Dataframe, add option to infer all > primitive object types as strings > --- > > Key: SPARK-10947 > URL: https://issues.apache.org/jira/browse/SPARK-10947 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Ewan Leith >Priority: Minor > > Currently, when a schema is inferred from a JSON file using > sqlContext.read.json, the primitive object types are inferred as string, > long, boolean, etc. > However, if the inferred type is too specific (JSON obviously does not > enforce types itself), this causes issues with merging dataframe schemas. > Instead, we would like an option in the JSON inferField function to treat all > primitive objects as strings. > We'll create and submit a pull request for this for review. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
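[Editor's note] For intuition, here is a minimal sketch of schema inference with such an option, in plain Python; `infer_type` and its flag are hypothetical stand-ins, not the actual Spark `inferField` code:

```python
def infer_type(value, primitives_as_string=False):
    """Infer a JSON value's type; optionally map every primitive to 'string'."""
    if isinstance(value, dict):
        return {k: infer_type(v, primitives_as_string) for k, v in value.items()}
    if isinstance(value, list):
        return [infer_type(v, primitives_as_string) for v in value]
    if primitives_as_string:
        return "string"            # the requested option: no specific primitive types
    if isinstance(value, bool):    # check bool before int (bool is a subclass of int)
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"

record = {"id": 1, "score": 9.5, "active": True}
print(infer_type(record))
# {'id': 'long', 'score': 'double', 'active': 'boolean'}
print(infer_type(record, primitives_as_string=True))
# {'id': 'string', 'score': 'string', 'active': 'string'}
```

With the flag on, schemas inferred from differently-typed JSON files agree on "string" everywhere, so merging them no longer conflicts.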
[jira] [Commented] (SPARK-7008) An implementation of Factorization Machine (LibFM)
[ https://issues.apache.org/jira/browse/SPARK-7008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970870#comment-14970870 ] Nick Pentreath commented on SPARK-7008: --- Is this now going into 1.6 (as per SPARK-10324)? If so, is there a PR? I cannot find a related one. > An implementation of Factorization Machine (LibFM) > -- > > Key: SPARK-7008 > URL: https://issues.apache.org/jira/browse/SPARK-7008 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: zhengruifeng > Labels: features > Attachments: FM_CR.xlsx, FM_convergence_rate.xlsx, QQ20150421-1.png, > QQ20150421-2.png > > > An implementation of Factorization Machines based on Scala and Spark MLlib. > FM is a kind of machine learning algorithm for multi-linear regression, and > is widely used for recommendation. > FM has performed well in recent years' recommendation competitions. > Ref: > http://libfm.org/ > http://doi.acm.org/10.1145/2168752.2168771 > http://www.inf.uni-konstanz.de/~rendle/pdf/Rendle2010FM.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11278) PageRank fails with unified memory manager
Nishkam Ravi created SPARK-11278: Summary: PageRank fails with unified memory manager Key: SPARK-11278 URL: https://issues.apache.org/jira/browse/SPARK-11278 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.2, 1.6.0 Reporter: Nishkam Ravi PageRank (6-nodes, 32GB input) runs very slow and eventually fails with ExecutorLostFailure. Traced it back to the 'unified memory manager' commit from Oct 13th. Took a quick look at the code and couldn't see the problem (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to spot the problem quickly. Can be reproduced by running PageRank on a large enough input dataset if needed. Sorry for not being of much help here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11277) sort_array throws exception scala.MatchError
[ https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970694#comment-14970694 ] Apache Spark commented on SPARK-11277: -- User 'jliwork' has created a pull request for this issue: https://github.com/apache/spark/pull/9247 > sort_array throws exception scala.MatchError > > > Key: SPARK-11277 > URL: https://issues.apache.org/jira/browse/SPARK-11277 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Linux >Reporter: Jia Li >Priority: Minor > > I was trying out the sort_array function then hit this exception. > I looked into the spark source code. I found the root cause is that > sort_array does not check for an array of NULLs. It's not meaningful to sort > an array of entirely NULLs anyway. > I already have a fix for this issue and I'm going to create a pull request > for it. > scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show() > scala.MatchError: ArrayType(NullType,true) (of class > org.apache.spark.sql.types.ArrayType) > at > org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68) > at > org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67) > at > org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341) > at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440) > at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
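[Editor's note] The guard the reporter describes can be sketched in plain Python; this illustrates the null check only, not the actual Scala fix in `SortArray`:

```python
def sort_array(values):
    """Null-safe sort sketch: an array whose elements are all NULL (None)
    has no meaningful ordering, so return it unchanged instead of
    failing the way the unguarded implementation does."""
    if values is None or all(v is None for v in values):
        return values
    # Sort the non-null elements, then keep the NULLs at the end.
    non_null = sorted(v for v in values if v is not None)
    return non_null + [None] * (len(values) - len(non_null))

print(sort_array([None, None]))  # [None, None] instead of an error
print(sort_array([3, None, 1]))  # [1, 3, None]
```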
[jira] [Assigned] (SPARK-11277) sort_array throws exception scala.MatchError
[ https://issues.apache.org/jira/browse/SPARK-11277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11277: Assignee: (was: Apache Spark) > sort_array throws exception scala.MatchError > > > Key: SPARK-11277 > URL: https://issues.apache.org/jira/browse/SPARK-11277 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Linux >Reporter: Jia Li >Priority: Minor > > I was trying out the sort_array function then hit this exception. > I looked into the spark source code. I found the root cause is that > sort_array does not check for an array of NULLs. It's not meaningful to sort > an array of entirely NULLs anyway. > I already have a fix for this issue and I'm going to create a pull request > for it. > scala> sqlContext.sql("select sort_array(array(null, null)) from t1").show() > scala.MatchError: ArrayType(NullType,true) (of class > org.apache.spark.sql.types.ArrayType) > at > org.apache.spark.sql.catalyst.expressions.SortArray.lt$lzycompute(collectionOperations.scala:68) > at > org.apache.spark.sql.catalyst.expressions.SortArray.lt(collectionOperations.scala:67) > at > org.apache.spark.sql.catalyst.expressions.SortArray.nullSafeEval(collectionOperations.scala:111) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:341) > at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:440) > at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$9$$anonfun$applyOrElse$2.applyOrElse(Optimizer.scala:433) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] patcharee closed SPARK-11087. - Resolution: Not A Problem The predicate is indeed generated and can be found in the executor log > spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate > - > > Key: SPARK-11087 > URL: https://issues.apache.org/jira/browse/SPARK-11087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: orc file version 0.12 with HIVE_8732 > hive version 1.2.1.2.3.0.0-2557 >Reporter: patcharee >Priority: Minor > > I have an external hive table stored as partitioned orc file (see the table > schema below). I tried to query from the table with where clause> > hiveContext.setConf("spark.sql.orc.filterPushdown", "true") > hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = > 117")). > But from the log file with debug logging level on, the ORC pushdown predicate > was not generated. > Unfortunately my table was not sorted when I inserted the data, but I > expected the ORC pushdown predicate should be generated (because of the where > clause) though > Table schema > > hive> describe formatted 4D; > OK > # col_namedata_type comment > > date int > hhint > x int > y int > heightfloat > u float > v float > w float > phfloat > phb float > t float > p float > pbfloat > qvaporfloat > qgraupfloat > qnice float > qnrainfloat > tke_pbl float > el_pblfloat > qcloudfloat > > # Partition Information > # col_namedata_type comment > > zone int > z int > year int > month int > > # Detailed Table Information > Database: default > Owner:patcharee > CreateTime: Thu Jul 09 16:46:54 CEST 2015 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D > > Table Type: EXTERNAL_TABLE > Table Parameters: > EXTERNALTRUE > comment this table is imported from rwf_data/*/wrf/* > last_modified_bypatcharee > last_modified_time 
1439806692 > orc.compressZLIB > transient_lastDdlTime 1439806692 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.ql.io.orc.OrcSerde > InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat > OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat > > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.388 seconds, Fetched: 58 row(s) > > Data was inserted into this table by another spark job> >
[jira] [Commented] (SPARK-9265) Dataframe.limit joined with another dataframe can be non-deterministic
[ https://issues.apache.org/jira/browse/SPARK-9265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970801#comment-14970801 ] Yanbo Liang commented on SPARK-9265: I'm working on it. > Dataframe.limit joined with another dataframe can be non-deterministic > -- > > Key: SPARK-9265 > URL: https://issues.apache.org/jira/browse/SPARK-9265 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.4.1 >Reporter: Tathagata Das >Priority: Critical > > {code} > import org.apache.spark.sql._ > import org.apache.spark.sql.functions._ > val recentFailures = table("failed_suites").cache() > val topRecentFailures = > recentFailures.groupBy('suiteName).agg(count("*").as('failCount)).orderBy('failCount.desc).limit(10) > topRecentFailures.show(100) > val mot = topRecentFailures.as("a").join(recentFailures.as("b"), > $"a.suiteName" === $"b.suiteName") > > (1 to 10).foreach { i => > println(s"$i: " + mot.count()) > } > {code} > This shows. > {code} > ++-+ > | suiteName|failCount| > ++-+ > |org.apache.spark| 85| > |org.apache.spark| 26| > |org.apache.spark| 26| > |org.apache.spark| 17| > |org.apache.spark| 17| > |org.apache.spark| 15| > |org.apache.spark| 13| > |org.apache.spark| 13| > |org.apache.spark| 11| > |org.apache.spark|9| > ++-+ > 1: 174 > 2: 166 > 3: 174 > 4: 106 > 5: 158 > 6: 110 > 7: 174 > 8: 158 > 9: 166 > 10: 106 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
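[Editor's note] The underlying issue is that `limit` without a total ordering is not deterministic across re-evaluations, and the join re-evaluates the limited plan. A plain-Python illustration of that effect follows; the partitions and their arrival order are simulated, nothing here is Spark code:

```python
import random

rows = list(range(100))
partitions = [rows[i::4] for i in range(4)]  # 4 simulated partitions

def limit(n, seed):
    """Take n rows, with partition arrival order varying per evaluation."""
    order = list(range(len(partitions)))
    random.Random(seed).shuffle(order)  # nondeterministic arrival order
    out = []
    for p in order:
        out.extend(partitions[p])
        if len(out) >= n:
            return out[:n]
    return out

# Two evaluations of the same limit(10) can return different row sets,
# which is why joining the limited dataframe back against its source
# gives varying counts from run to run.
print(limit(10, seed=1) == limit(10, seed=2))
```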
[jira] [Updated] (SPARK-11262) Unit test for gradient, loss layers, memory management for multilayer perceptron
[ https://issues.apache.org/jira/browse/SPARK-11262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11262: -- Target Version/s: (was: 1.6.0) Fix Version/s: (was: 1.5.1) [~avulanov] don't set Fix/Target version please > Unit test for gradient, loss layers, memory management for multilayer > perceptron > > > Key: SPARK-11262 > URL: https://issues.apache.org/jira/browse/SPARK-11262 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.5.1 >Reporter: Alexander Ulanov > Original Estimate: 168h > Remaining Estimate: 168h > > Multi-layer perceptron requires more rigorous tests and refactoring of layer > interfaces to accommodate development of new features. > 1)Implement unit test for gradient and loss > 2)Refactor the internal layer interface to extract "loss function" -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11267) NettyRpcEnv and sparkDriver services report the same port in the logs
[ https://issues.apache.org/jira/browse/SPARK-11267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11267: -- Component/s: Spark Core > NettyRpcEnv and sparkDriver services report the same port in the logs > - > > Key: SPARK-11267 > URL: https://issues.apache.org/jira/browse/SPARK-11267 > Project: Spark > Issue Type: Bug > Components: Spark Core > Environment: the version built from today's sources - Spark version > 1.6.0-SNAPSHOT >Reporter: Jacek Laskowski >Priority: Minor > > When starting {{./bin/spark-shell --conf spark.driver.port=}} Spark > reports two services - NettyRpcEnv and sparkDriver - using the same {{}} > port: > {code} > 15/10/22 23:09:32 INFO SparkContext: Running Spark version 1.6.0-SNAPSHOT > 15/10/22 23:09:32 INFO SparkContext: Spark configuration: > spark.app.name=Spark shell > spark.driver.port= > spark.home=/Users/jacek/dev/oss/spark > spark.jars= > spark.logConf=true > spark.master=local[*] > spark.repl.class.uri=http://192.168.1.4:52645 > spark.submit.deployMode=client > ... > 15/10/22 23:09:33 INFO Utils: Successfully started service 'NettyRpcEnv' on > port . > ... > 15/10/22 23:09:33 INFO Utils: Successfully started service 'sparkDriver' on > port . > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11270) Add improved equality testing for TopicAndPartition from the Kafka Streaming API
[ https://issues.apache.org/jira/browse/SPARK-11270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11270: -- Target Version/s: (was: 1.5.1) Fix Version/s: (was: 1.5.1) [~manygrams] have a look at https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Among other things, don't set Target/Fix version. > Add improved equality testing for TopicAndPartition from the Kafka Streaming > API > > > Key: SPARK-11270 > URL: https://issues.apache.org/jira/browse/SPARK-11270 > Project: Spark > Issue Type: Improvement > Components: PySpark, Streaming >Affects Versions: 1.5.1 >Reporter: Nick Evans >Priority: Minor > > Hey, sorry, new to contributing to Spark! Let me know if I'm doing anything > wrong. > This issue is in relation to equality testing of a TopicAndPartition object. > It allows you to test that the topics and partitions of two of these objects > are equal, as opposed to checking that the two objects are the same instance. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
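[Editor's note] Value-based equality of this kind is normally done in Python by overriding `__eq__` and `__hash__`. A minimal sketch, assuming the PySpark class would follow the usual convention (the actual change may differ):

```python
class TopicAndPartition:
    def __init__(self, topic, partition):
        self._topic = topic
        self._partition = partition

    def __eq__(self, other):
        # Compare by topic and partition, not by object identity.
        return (isinstance(other, TopicAndPartition)
                and self._topic == other._topic
                and self._partition == other._partition)

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        # Keep hashing consistent with equality so instances work in sets/dicts.
        return hash((self._topic, self._partition))

a = TopicAndPartition("events", 0)
b = TopicAndPartition("events", 0)
print(a == b)  # True: equal by value
print(a is b)  # False: distinct instances
```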
[jira] [Commented] (SPARK-11087) spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate
[ https://issues.apache.org/jira/browse/SPARK-11087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970786#comment-14970786 ] patcharee commented on SPARK-11087: --- [~zzhan] I found the predicate generated in the executor log for the case using dataframe (not hiveContext.sql). Sorry for my mistake, and thanks for your help! > spark.sql.orc.filterPushdown does not work, No ORC pushdown predicate > - > > Key: SPARK-11087 > URL: https://issues.apache.org/jira/browse/SPARK-11087 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: orc file version 0.12 with HIVE_8732 > hive version 1.2.1.2.3.0.0-2557 >Reporter: patcharee >Priority: Minor > > I have an external hive table stored as partitioned orc file (see the table > schema below). I tried to query from the table with where clause> > hiveContext.setConf("spark.sql.orc.filterPushdown", "true") > hiveContext.sql("select u, v from 4D where zone = 2 and x = 320 and y = > 117")). > But from the log file with debug logging level on, the ORC pushdown predicate > was not generated. 
> Unfortunately my table was not sorted when I inserted the data, but I > expected the ORC pushdown predicate should be generated (because of the where > clause) though > Table schema > > hive> describe formatted 4D; > OK > # col_namedata_type comment > > date int > hhint > x int > y int > heightfloat > u float > v float > w float > phfloat > phb float > t float > p float > pbfloat > qvaporfloat > qgraupfloat > qnice float > qnrainfloat > tke_pbl float > el_pblfloat > qcloudfloat > > # Partition Information > # col_namedata_type comment > > zone int > z int > year int > month int > > # Detailed Table Information > Database: default > Owner:patcharee > CreateTime: Thu Jul 09 16:46:54 CEST 2015 > LastAccessTime: UNKNOWN > Protect Mode: None > Retention:0 > Location: hdfs://helmhdfs/apps/hive/warehouse/wrf_tables/4D > > Table Type: EXTERNAL_TABLE > Table Parameters: > EXTERNALTRUE > comment this table is imported from rwf_data/*/wrf/* > last_modified_bypatcharee > last_modified_time 1439806692 > orc.compressZLIB > transient_lastDdlTime 1439806692 > > # Storage Information > SerDe Library:org.apache.hadoop.hive.ql.io.orc.OrcSerde > InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat > OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat > > Compressed: No > Num Buckets: -1 > Bucket Columns: [] > Sort Columns: [] > Storage Desc Params: > serialization.format1 > Time taken: 0.388 seconds, Fetched: 58 row(s) >
[jira] [Commented] (SPARK-11279) Add DataFrame#toDF in PySpark
[ https://issues.apache.org/jira/browse/SPARK-11279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970705#comment-14970705 ] Apache Spark commented on SPARK-11279: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/9248 > Add DataFrame#toDF in PySpark > - > > Key: SPARK-11279 > URL: https://issues.apache.org/jira/browse/SPARK-11279 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11259) Params.validateParams() should be called automatically
[ https://issues.apache.org/jira/browse/SPARK-11259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11259: Description: Params.validateParams() can not be called automatically currently. Such as the following code snippet will not throw exception which is not as expected. {code} val df = sqlContext.createDataFrame( Seq( (1, Vectors.dense(0.0, 1.0, 4.0), 1.0), (2, Vectors.dense(1.0, 0.0, 4.0), 2.0), (3, Vectors.dense(1.0, 0.0, 5.0), 3.0), (4, Vectors.dense(0.0, 0.0, 5.0), 4.0)) ).toDF("id", "features", "label") val scaler = new MinMaxScaler() .setInputCol("features") .setOutputCol("features_scaled") .setMin(10) .setMax(0) val pipeline = new Pipeline().setStages(Array(scaler)) pipeline.fit(df) {code} validateParams() should be called by PipelineStage(Pipeline/Estimator/Transformer) automatically, so I propose to put it in transformSchema(). was: Params.validateParams() not be called automatically currently. Such as the following code snippet will not throw exception which is not as expected. {code} val df = sqlContext.createDataFrame( Seq( (1, Vectors.dense(0.0, 1.0, 4.0), 1.0), (2, Vectors.dense(1.0, 0.0, 4.0), 2.0), (3, Vectors.dense(1.0, 0.0, 5.0), 3.0), (4, Vectors.dense(0.0, 0.0, 5.0), 4.0)) ).toDF("id", "features", "label") val scaler = new MinMaxScaler() .setInputCol("features") .setOutputCol("features_scaled") .setMin(10) .setMax(0) val pipeline = new Pipeline().setStages(Array(scaler)) pipeline.fit(df) {code} validateParams() should be called by PipelineStage(Pipeline/Estimator/Transformer) automatically, so I propose to put it in transformSchema(). > Params.validateParams() should be called automatically > -- > > Key: SPARK-11259 > URL: https://issues.apache.org/jira/browse/SPARK-11259 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > > Params.validateParams() can not be called automatically currently. 
Such as > the following code snippet will not throw exception which is not as expected. > {code} > val df = sqlContext.createDataFrame( > Seq( > (1, Vectors.dense(0.0, 1.0, 4.0), 1.0), > (2, Vectors.dense(1.0, 0.0, 4.0), 2.0), > (3, Vectors.dense(1.0, 0.0, 5.0), 3.0), > (4, Vectors.dense(0.0, 0.0, 5.0), 4.0)) > ).toDF("id", "features", "label") > val scaler = new MinMaxScaler() > .setInputCol("features") > .setOutputCol("features_scaled") > .setMin(10) > .setMax(0) > val pipeline = new Pipeline().setStages(Array(scaler)) > pipeline.fit(df) > {code} > validateParams() should be called by > PipelineStage(Pipeline/Estimator/Transformer) automatically, so I propose to > put it in transformSchema(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
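[Editor's note] The proposal can be sketched in plain Python: if the base stage calls validateParams() inside transformSchema(), the invalid MinMaxScaler above fails fast when the pipeline computes its schema. Class and method names loosely mirror the ML API; this is an illustration, not the Scala implementation:

```python
class PipelineStage:
    def validate_params(self):
        """Overridden by subclasses to check parameter consistency."""

    def transform_schema(self, schema):
        # Calling validate_params() here means every stage is checked
        # automatically whenever the output schema is computed,
        # e.g. during pipeline fitting.
        self.validate_params()
        return self._transform_schema_impl(schema)

    def _transform_schema_impl(self, schema):
        return schema

class MinMaxScaler(PipelineStage):
    def __init__(self, min_val, max_val):
        self.min_val, self.max_val = min_val, max_val

    def validate_params(self):
        if self.min_val >= self.max_val:
            raise ValueError("min (%s) must be < max (%s)"
                             % (self.min_val, self.max_val))

scaler = MinMaxScaler(min_val=10, max_val=0)  # the invalid setting above
try:
    scaler.transform_schema(schema={})
except ValueError as e:
    print("rejected:", e)  # the bad params now fail fast
```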
[jira] [Commented] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970887#comment-14970887 ] Yanbo Liang commented on SPARK-6724: [~MeethuMathew] I will take over this task and send a PR, welcome to comment on my PR. > Model import/export for FPGrowth > > > Key: SPARK-6724 > URL: https://issues.apache.org/jira/browse/SPARK-6724 > Project: Spark > Issue Type: Sub-task > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11279) Add DataFrame#toDF in PySpark
[ https://issues.apache.org/jira/browse/SPARK-11279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11279: Assignee: (was: Apache Spark) > Add DataFrame#toDF in PySpark > - > > Key: SPARK-11279 > URL: https://issues.apache.org/jira/browse/SPARK-11279 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11279) Add DataFrame#toDF in PySpark
[ https://issues.apache.org/jira/browse/SPARK-11279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11279: Assignee: Apache Spark > Add DataFrame#toDF in PySpark > - > > Key: SPARK-11279 > URL: https://issues.apache.org/jira/browse/SPARK-11279 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Jeff Zhang >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11280) Mesos cluster deployment using only one node
Iulian Dragos created SPARK-11280: - Summary: Mesos cluster deployment using only one node Key: SPARK-11280 URL: https://issues.apache.org/jira/browse/SPARK-11280 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.5.1, 1.6.0 Reporter: Iulian Dragos I submit the SparkPi example in Mesos cluster mode, and I notice that all tasks fail except the ones that run on the same node as the driver. The others fail with {code} sh: 1: /tmp/mesos/slaves/1521e408-d8fe-416d-898b-3801e73a8293-S0/frameworks/1521e408-d8fe-416d-898b-3801e73a8293-0003/executors/driver-20151023113121-0006/runs/2abefd29-7386-4d81-a025-9d794780db23/spark-1.5.0-bin-hadoop2.6/bin/spark-class: not found {code} The path exists only on the machine that launched the driver, and the sandbox of the executor where this task died is completely empty. I launch the task like this: {code} $ spark-submit --deploy-mode cluster --master mesos://sagitarius.local:7077 --conf spark.executor.uri="ftp://sagitarius.local/ftp/spark-1.5.0-bin-hadoop2.6.tgz" --conf spark.mesos.coarse=true --class org.apache.spark.examples.SparkPi ftp://sagitarius.local/ftp/spark-examples-1.5.0-hadoop2.6.0.jar Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/10/23 11:31:21 INFO RestSubmissionClient: Submitting a request to launch an application in mesos://sagitarius.local:7077. 15/10/23 11:31:21 INFO RestSubmissionClient: Submission successfully created as driver-20151023113121-0006. Polling submission state... 15/10/23 11:31:21 INFO RestSubmissionClient: Submitting a request for the status of submission driver-20151023113121-0006 in mesos://sagitarius.local:7077. 15/10/23 11:31:21 INFO RestSubmissionClient: State of driver driver-20151023113121-0006 is now QUEUED. 
15/10/23 11:31:21 INFO RestSubmissionClient: Server responded with CreateSubmissionResponse: { "action" : "CreateSubmissionResponse", "serverSparkVersion" : "1.5.0", "submissionId" : "driver-20151023113121-0006", "success" : true } {code} I can see the driver in the Dispatcher UI and the job succeeds eventually, but running only on the node where the driver was launched (see attachment). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps
[ https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14970783#comment-14970783 ] Sean Owen commented on SPARK-11016: --- NB: the resolution here may be to simply remove usage of roaringbitmaps: https://github.com/apache/spark/pull/9243 > Spark fails when running with a task that requires a more recent version of > RoaringBitmaps > -- > > Key: SPARK-11016 > URL: https://issues.apache.org/jira/browse/SPARK-11016 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Charles Allen > > The following error appears during Kryo init whenever a more recent version > (>0.5.0) of Roaring bitmaps is required by a job. > org/roaringbitmap/RoaringArray$Element was removed in 0.5.0 > {code} > A needed class was not found. This could be due to an error in your runpath. > Missing class: org/roaringbitmap/RoaringArray$Element > java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338) > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237) > at > org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > at > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) > at 
org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.textFile(SparkContext.scala:816) > {code} > See https://issues.apache.org/jira/browse/SPARK-5949 for related info -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-11229) NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0
[ https://issues.apache.org/jira/browse/SPARK-11229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reopened SPARK-11229: --- "Fixed" implies there was a change attached to this JIRA that resolved the issue, and we don't have that here. If it were probably resolved by another JIRA, "duplicate" would be appropriate. Otherwise, *shrug* doesn't really matter but "cannot reproduce" is maybe most accurate. > NPE in JoinedRow.isNullAt when spark.shuffle.memoryFraction=0 > - > > Key: SPARK-11229 > URL: https://issues.apache.org/jira/browse/SPARK-11229 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: 14.04.1-Ubuntu SMP x86_64 GNU/Linux >Reporter: Romi Kuntsman > > Steps to reproduce: > 1. set spark.shuffle.memoryFraction=0 > 2. load dataframe from parquet file > 3. see it's read correctly by calling dataframe.show() > 4. call dataframe.count() > Expected behaviour: > get count of rows in dataframe > OR, if memoryFraction=0 is an invalid setting, get notified about it > Actual behaviour: > CatalystReadSupport doesn't read the schema (even thought there is one) and > then there's a NullPointerException. 
> Driver stacktrace: > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > at > org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1919) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at 
org.apache.spark.rdd.RDD.collect(RDD.scala:904) > at > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:177) > at > org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) > at > org.apache.spark.sql.DataFrame$$anonfun$collect$1.apply(DataFrame.scala:1385) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56) > at > org.apache.spark.sql.DataFrame.withNewExecutionId(DataFrame.scala:1903) > at org.apache.spark.sql.DataFrame.collect(DataFrame.scala:1384) > at org.apache.spark.sql.DataFrame.count(DataFrame.scala:1402) > ... 14 more > Caused by: java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.JoinedRow.isNullAt(JoinedRow.scala:70) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:194) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator$$anonfun$generateProcessRow$1.apply(TungstenAggregationIterator.scala:192) > at > org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.processInputs(TungstenAggregationIterator.scala:368) > at >
[jira] [Commented] (SPARK-4940) Support more evenly distributing cores for Mesos mode
[ https://issues.apache.org/jira/browse/SPARK-4940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971416#comment-14971416 ] Martin Tapp commented on SPARK-4940: No real workaround for now, as we need the round-robin strategy. You can beef up the executors' allocated memory to prevent OOM. Mesos features are catching up with YARN on some fronts, but Mesos offers better Docker support and is more general purpose for maximizing cluster resource utilization. > Support more evenly distributing cores for Mesos mode > - > > Key: SPARK-4940 > URL: https://issues.apache.org/jira/browse/SPARK-4940 > Project: Spark > Issue Type: Improvement > Components: Mesos >Reporter: Timothy Chen > Attachments: mesos-config-difference-3nodes-vs-2nodes.png > > > Currently in Coarse grain mode the spark scheduler simply takes all the > resources it can on each node, but can cause uneven distribution based on > resources available on each slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11286) Make Outbox stopped exception singleton
[ https://issues.apache.org/jira/browse/SPARK-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11286: Assignee: (was: Apache Spark) > Make Outbox stopped exception singleton > --- > > Key: SPARK-11286 > URL: https://issues.apache.org/jira/browse/SPARK-11286 > Project: Spark > Issue Type: Improvement >Reporter: Ted Yu >Priority: Trivial > > In two places in Outbox.scala , new SparkException is created for Outbox > stopped condition. > Create a singleton for Outbox stopped exception and use it instead of > creating exception every time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11286) Make Outbox stopped exception singleton
[ https://issues.apache.org/jira/browse/SPARK-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11286: Assignee: Apache Spark > Make Outbox stopped exception singleton > --- > > Key: SPARK-11286 > URL: https://issues.apache.org/jira/browse/SPARK-11286 > Project: Spark > Issue Type: Improvement >Reporter: Ted Yu >Assignee: Apache Spark >Priority: Trivial > > In two places in Outbox.scala , new SparkException is created for Outbox > stopped condition. > Create a singleton for Outbox stopped exception and use it instead of > creating exception every time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11286) Make Outbox stopped exception singleton
[ https://issues.apache.org/jira/browse/SPARK-11286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971307#comment-14971307 ] Apache Spark commented on SPARK-11286: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/9254 > Make Outbox stopped exception singleton > --- > > Key: SPARK-11286 > URL: https://issues.apache.org/jira/browse/SPARK-11286 > Project: Spark > Issue Type: Improvement >Reporter: Ted Yu >Priority: Trivial > > In two places in Outbox.scala , new SparkException is created for Outbox > stopped condition. > Create a singleton for Outbox stopped exception and use it instead of > creating exception every time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11162) Allow enabling debug logging from the command line
[ https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971361#comment-14971361 ] Ryan Williams commented on SPARK-11162: --- Do you know how I might enable debug logging with -D flags? > Allow enabling debug logging from the command line > -- > > Key: SPARK-11162 > URL: https://issues.apache.org/jira/browse/SPARK-11162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > Per [~vanzin] on [the user > list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html], > it would be nice if debug-logging could be enabled from the command line. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11278) PageRank fails with unified memory manager
[ https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Malak updated SPARK-11278: -- Component/s: GraphX > PageRank fails with unified memory manager > -- > > Key: SPARK-11278 > URL: https://issues.apache.org/jira/browse/SPARK-11278 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.2, 1.6.0 >Reporter: Nishkam Ravi > > PageRank (6-nodes, 32GB input) runs very slow and eventually fails with > ExecutorLostFailure. Traced it back to the 'unified memory manager' commit > from Oct 13th. Took a quick look at the code and couldn't see the problem > (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to > spot the problem quickly. Can be reproduced by running PageRank on a large > enough input dataset if needed. Sorry for not being of much help here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11287) Executing deploy.client TestClient fails with bad class name
Bryan Cutler created SPARK-11287: Summary: Executing deploy.client TestClient fails with bad class name Key: SPARK-11287 URL: https://issues.apache.org/jira/browse/SPARK-11287 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.5.1 Reporter: Bryan Cutler Priority: Trivial Execution of deploy.client.TestClient creates an ApplicationDescription to start a TestExecutor which fails due to a bad class name. Currently it is "spark.deploy.client.TestExecutor" but should be "org.apache.spark.deploy.client.TestExecutor". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11287) Executing deploy.client TestClient fails with bad class name
[ https://issues.apache.org/jira/browse/SPARK-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11287: Assignee: (was: Apache Spark) > Executing deploy.client TestClient fails with bad class name > > > Key: SPARK-11287 > URL: https://issues.apache.org/jira/browse/SPARK-11287 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Bryan Cutler >Priority: Trivial > > Execution of deploy.client.TestClient creates an ApplicationDescription to > start a TestExecutor which fails due to a bad class name. > Currently it is "spark.deploy.client.TestExecutor" but should be > "org.apache.spark.deploy.client.TestExecutor". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11287) Executing deploy.client TestClient fails with bad class name
[ https://issues.apache.org/jira/browse/SPARK-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971454#comment-14971454 ] Apache Spark commented on SPARK-11287: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/9255 > Executing deploy.client TestClient fails with bad class name > > > Key: SPARK-11287 > URL: https://issues.apache.org/jira/browse/SPARK-11287 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Bryan Cutler >Priority: Trivial > > Execution of deploy.client.TestClient creates an ApplicationDescription to > start a TestExecutor which fails due to a bad class name. > Currently it is "spark.deploy.client.TestExecutor" but should be > "org.apache.spark.deploy.client.TestExecutor". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11287) Executing deploy.client TestClient fails with bad class name
[ https://issues.apache.org/jira/browse/SPARK-11287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11287: Assignee: Apache Spark > Executing deploy.client TestClient fails with bad class name > > > Key: SPARK-11287 > URL: https://issues.apache.org/jira/browse/SPARK-11287 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Trivial > > Execution of deploy.client.TestClient creates an ApplicationDescription to > start a TestExecutor which fails due to a bad class name. > Currently it is "spark.deploy.client.TestExecutor" but should be > "org.apache.spark.deploy.client.TestExecutor". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11286) Make Outbox stopped exception singleton
Ted Yu created SPARK-11286: -- Summary: Make Outbox stopped exception singleton Key: SPARK-11286 URL: https://issues.apache.org/jira/browse/SPARK-11286 Project: Spark Issue Type: Improvement Reporter: Ted Yu Priority: Trivial In two places in Outbox.scala , new SparkException is created for Outbox stopped condition. Create a singleton for Outbox stopped exception and use it instead of creating exception every time. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
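The proposal above can be sketched as follows. This is a hypothetical illustration, not Spark's actual Outbox.scala; the `SparkException` here is a local stand-in for `org.apache.spark.SparkException`.

```scala
// Hypothetical sketch of the singleton proposal (not Spark's actual code).
// Stand-in for org.apache.spark.SparkException:
class SparkException(message: String) extends Exception(message)

object OutboxStoppedException {
  // Created once; both call sites in Outbox.scala would throw this instance
  // instead of allocating a fresh SparkException on every failed send.
  val instance: SparkException =
    new SparkException("Message is dropped because Outbox is stopped")
}
```

One caveat with this design: a shared exception instance carries the stack trace of its creation site rather than the throw site, which can obscure debugging; that trade-off is presumably why the issue was ultimately resolved as Won't Fix.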
[jira] [Commented] (SPARK-10975) Shuffle files left behind on Mesos without dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-10975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971268#comment-14971268 ] Iulian Dragos commented on SPARK-10975: --- No, it's not a duplicate, but fixed by the same PR :) > Shuffle files left behind on Mesos without dynamic allocation > - > > Key: SPARK-10975 > URL: https://issues.apache.org/jira/browse/SPARK-10975 > Project: Spark > Issue Type: Bug > Components: Mesos >Affects Versions: 1.5.1 >Reporter: Iulian Dragos >Priority: Blocker > > (from mailing list) > Running on Mesos in coarse-grained mode. No dynamic allocation or shuffle > service. > I see that there are two types of temporary files under /tmp folder > associated with every executor: /tmp/spark- and /tmp/blockmgr-. > When job is finished /tmp/spark- is gone, but blockmgr directory is > left with all gigabytes in it. > The reason is that logic to clean up files is only enabled when the shuffle > service is running, see https://github.com/apache/spark/pull/7820 > The shuffle files should be placed in the Mesos sandbox or under `tmp/spark` > unless the shuffle service is enabled. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11258) Converting a Spark DataFrame into an R data.frame is slow / requires a lot of memory
[ https://issues.apache.org/jira/browse/SPARK-11258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Frank Rosner updated SPARK-11258: - Description: h4. Problem We tried to collect a DataFrame with > 1 million rows and a few hundred columns in SparkR. This took a huge amount of time (much more than in the Spark REPL). When looking into the code, I found that the {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method does some map and then {{.toArray}} which might cause the problem. h4. Solution Directly transpose the row wise representation to the column wise representation with one pass through the data. I will create a pull request for this. h4. Runtime comparison On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} method takes average 2267 ms to complete. My implementation takes only 554 ms on average. This effect might be due to garbage collection, especially if you consider that the old implementation didn't complete on an even bigger data frame. was: h4. Problem We tried to collect a DataFrame with > 1 million rows and a few hundred columns in SparkR. This took a huge amount of time (much more than in the Spark REPL). When looking into the code, I found that the {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method does some map and then {{.toArray}} which might cause the problem. h4. Solution Directly transpose the row wise representation to the column wise representation with one pass through the data. I will create a pull request for this. h4. Runtime comparison On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} method takes average 2267 ms to complete. My implementation takes only 554 ms on average. This effect gets even bigger, the more columns you have. 
> Converting a Spark DataFrame into an R data.frame is slow / requires a lot of > memory > > > Key: SPARK-11258 > URL: https://issues.apache.org/jira/browse/SPARK-11258 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Frank Rosner > > h4. Problem > We tried to collect a DataFrame with > 1 million rows and a few hundred > columns in SparkR. This took a huge amount of time (much more than in the > Spark REPL). When looking into the code, I found that the > {{org.apache.spark.sql.api.r.SQLUtils.dfToCols}} method does some map and > then {{.toArray}} which might cause the problem. > h4. Solution > Directly transpose the row wise representation to the column wise > representation with one pass through the data. I will create a pull request > for this. > h4. Runtime comparison > On a test data frame with 1 million rows and 22 columns, the old {{dfToCols}} > method takes average 2267 ms to complete. My implementation takes only 554 ms > on average. This effect might be due to garbage collection, especially if you > consider that the old implementation didn't complete on an even bigger data > frame. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
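The one-pass transposition described above can be illustrated with plain Scala collections (this is a sketch of the idea, not the actual `SQLUtils.dfToCols` patch, and it uses `Seq[Any]` in place of Spark's `Row`): pre-allocate one array per column and fill all of them while walking the rows once, instead of mapping over the whole data frame once per column and calling `.toArray` each time.

```scala
// Illustrative one-pass row-wise -> column-wise transposition.
// Pre-allocated column arrays avoid the per-column intermediate
// collections built by a map-then-toArray approach.
def rowsToCols(rows: Seq[Seq[Any]], numCols: Int): Array[Array[Any]] = {
  val cols = Array.fill(numCols)(new Array[Any](rows.length))
  var i = 0
  for (row <- rows) {
    var j = 0
    while (j < numCols) {
      cols(j)(i) = row(j) // place each cell directly into its column
      j += 1
    }
    i += 1
  }
  cols
}
```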
[jira] [Commented] (SPARK-11162) Allow enabling debug logging from the command line
[ https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971438#comment-14971438 ] Sean Owen commented on SPARK-11162: --- Hm, can you set logger levels with syntax like {{-Dlog4j.logger.com.foo=WARN}}? That's what I'm thinking of, at least. I know you can specify a config file this way. > Allow enabling debug logging from the command line > -- > > Key: SPARK-11162 > URL: https://issues.apache.org/jira/browse/SPARK-11162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > Per [~vanzin] on [the user > list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html], > it would be nice if debug-logging could be enabled from the command line. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11278) PageRank fails with unified memory manager
[ https://issues.apache.org/jira/browse/SPARK-11278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971442#comment-14971442 ] Andrew Or commented on SPARK-11278: --- are there any exceptions in the executor logs? Does the problem go away if you run it again with `spark.memory.useLegacyMode = true`? > PageRank fails with unified memory manager > -- > > Key: SPARK-11278 > URL: https://issues.apache.org/jira/browse/SPARK-11278 > Project: Spark > Issue Type: Bug > Components: GraphX, Spark Core >Affects Versions: 1.5.2, 1.6.0 >Reporter: Nishkam Ravi > > PageRank (6-nodes, 32GB input) runs very slow and eventually fails with > ExecutorLostFailure. Traced it back to the 'unified memory manager' commit > from Oct 13th. Took a quick look at the code and couldn't see the problem > (changes look pretty good). cc'ing [~andrewor14][~vanzin] who may be able to > spot the problem quickly. Can be reproduced by running PageRank on a large > enough input dataset if needed. Sorry for not being of much help here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11288) Specify the return type for UDF in Scala
Davies Liu created SPARK-11288: -- Summary: Specify the return type for UDF in Scala Key: SPARK-11288 URL: https://issues.apache.org/jira/browse/SPARK-11288 Project: Spark Issue Type: New Feature Reporter: Davies Liu The return type is inferred from the function signature, which may not be what the user wants: for example, the default DecimalType is (38, 18), while the user may want (38, 0). The older, deprecated callUDF could do that; we should figure out a way to support it. cc [~marmbrus] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
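The API gap described above can be sketched with a toy model. All names here are hypothetical stand-ins, not Spark's real `udf`/`DecimalType` API: the point is the contrast between a return type inferred from the Scala signature and one supplied explicitly by the caller, as the deprecated callUDF allowed.

```scala
// Toy model of the feature request; DecimalType and the helpers below are
// simplified stand-ins, not Spark's actual classes.
case class DecimalType(precision: Int, scale: Int)

object UdfModel {
  // Inference from the Scala signature: a BigDecimal result always maps
  // to the default (38, 18), regardless of what the caller needs.
  def inferredType(f: Double => BigDecimal): DecimalType = DecimalType(38, 18)

  // Proposed shape: the caller supplies the return type explicitly.
  def explicitType(f: Double => BigDecimal, dt: DecimalType): DecimalType = dt
}
```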
[jira] [Updated] (SPARK-11294) Improve R doc for read.df, write.df, saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-11294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-11294: -- Assignee: Felix Cheung > Improve R doc for read.df, write.df, saveAsTable > > > Key: SPARK-11294 > URL: https://issues.apache.org/jira/browse/SPARK-11294 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Felix Cheung >Assignee: Felix Cheung >Priority: Minor > Fix For: 1.5.2, 1.6.0 > > > API doc lacks example and has several formatting issues -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11294) Improve R doc for read.df, write.df, saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-11294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-11294. --- Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved by pull request 9261 [https://github.com/apache/spark/pull/9261] > Improve R doc for read.df, write.df, saveAsTable > > > Key: SPARK-11294 > URL: https://issues.apache.org/jira/browse/SPARK-11294 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Felix Cheung >Priority: Minor > Fix For: 1.6.0, 1.5.2 > > > API doc lacks example and has several formatting issues -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11289) Substitute code examples in ML features with include_example
[ https://issues.apache.org/jira/browse/SPARK-11289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972345#comment-14972345 ] Xusen Yin commented on SPARK-11289: --- A feasible way to do this is to create new example files in spark/examples and move those code snippets from the docs there. > Substitute code examples in ML features with include_example > > > Key: SPARK-11289 > URL: https://issues.apache.org/jira/browse/SPARK-11289 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib >Reporter: Xusen Yin >Priority: Minor > > Substitute code examples with include_example. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11294) Improve R doc for read.df, write.df, saveAsTable
Felix Cheung created SPARK-11294: Summary: Improve R doc for read.df, write.df, saveAsTable Key: SPARK-11294 URL: https://issues.apache.org/jira/browse/SPARK-11294 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.5.1 Reporter: Felix Cheung Priority: Minor API doc lacks an example and has several formatting issues
[jira] [Commented] (SPARK-11255) R Test build should run on R 3.1.1
[ https://issues.apache.org/jira/browse/SPARK-11255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972418#comment-14972418 ] Sun Rui commented on SPARK-11255: - +1 for this request. Alternatively, we can update the supported R version to a newer one, but in any case the version used in Jenkins should be the same as the lowest version that is claimed to be supported. > R Test build should run on R 3.1.1 > -- > > Key: SPARK-11255 > URL: https://issues.apache.org/jira/browse/SPARK-11255 > Project: Spark > Issue Type: Bug > Components: SparkR >Reporter: Felix Cheung >Priority: Minor > > Tests should run on R 3.1.1, which is the version listed as supported. > Apparently there are a few R changes that can go undetected, since the Jenkins test > build is running something newer.
[jira] [Assigned] (SPARK-11294) Improve R doc for read.df, write.df, saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-11294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11294: Assignee: (was: Apache Spark) > Improve R doc for read.df, write.df, saveAsTable > > > Key: SPARK-11294 > URL: https://issues.apache.org/jira/browse/SPARK-11294 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Felix Cheung >Priority: Minor > > API doc lacks an example and has several formatting issues
[jira] [Commented] (SPARK-11294) Improve R doc for read.df, write.df, saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-11294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14972330#comment-14972330 ] Apache Spark commented on SPARK-11294: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/9261 > Improve R doc for read.df, write.df, saveAsTable > > > Key: SPARK-11294 > URL: https://issues.apache.org/jira/browse/SPARK-11294 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Felix Cheung >Priority: Minor > > API doc lacks an example and has several formatting issues
[jira] [Resolved] (SPARK-11125) Unreadable exception when running spark-sql without building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set
[ https://issues.apache.org/jira/browse/SPARK-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-11125. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9134 [https://github.com/apache/spark/pull/9134] > Unreadable exception when running spark-sql without building with > -Phive-thriftserver and SPARK_PREPEND_CLASSES is set > -- > > Key: SPARK-11125 > URL: https://issues.apache.org/jira/browse/SPARK-11125 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Priority: Minor > Fix For: 1.6.0 > > > In a development environment, Spark is built without -Phive-thriftserver and > SPARK_PREPEND_CLASSES is set. The following exception is thrown: > SparkSQLCliDriver can be loaded, but the Hive-related code cannot be loaded. > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/hadoop/hive/cli/CliDriver > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:800) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) > at java.net.URLClassLoader.access$100(URLClassLoader.java:71) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:412) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:270) > at org.apache.spark.util.Utils$.classForName(Utils.scala:173) > at > 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.hive.cli.CliDriver > at java.net.URLClassLoader$1.run(URLClassLoader.java:366) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > ... 21 more > {code}
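As a practical note, the exception quoted above stems from org.apache.hadoop.hive.cli.CliDriver being absent from the classpath, which a build that includes the Hive Thrift-server profile avoids. A hedged sketch of such a build invocation, run from a Spark source checkout (the flags beyond the two profile names are illustrative, not taken from this thread):

```
# Build Spark with Hive and the Hive Thrift server so that
# org.apache.hadoop.hive.cli.CliDriver ends up on spark-sql's classpath.
./build/mvn -Phive -Phive-thriftserver -DskipTests clean package
```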
[jira] [Updated] (SPARK-11125) Unreadable exception when running spark-sql without building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set
[ https://issues.apache.org/jira/browse/SPARK-11125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-11125: - Assignee: Jeff Zhang > Unreadable exception when running spark-sql without building with > -Phive-thriftserver and SPARK_PREPEND_CLASSES is set > -- > > Key: SPARK-11125 > URL: https://issues.apache.org/jira/browse/SPARK-11125 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.1 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Minor > Fix For: 1.6.0 > > > In a development environment, Spark is built without -Phive-thriftserver and > SPARK_PREPEND_CLASSES is set. The following exception is thrown: > SparkSQLCliDriver can be loaded, but the Hive-related code cannot be loaded. > {code} > Exception in thread "main" java.lang.NoClassDefFoundError: > org/apache/hadoop/hive/cli/CliDriver > at java.lang.ClassLoader.defineClass1(Native Method) > at java.lang.ClassLoader.defineClass(ClassLoader.java:800) > at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > at java.net.URLClassLoader.defineClass(URLClassLoader.java:449) > at java.net.URLClassLoader.access$100(URLClassLoader.java:71) > at java.net.URLClassLoader$1.run(URLClassLoader.java:361) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:412) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > at java.lang.Class.forName0(Native Method) > at java.lang.Class.forName(Class.java:270) > at org.apache.spark.util.Utils$.classForName(Utils.scala:173) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647) > at > 
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.lang.ClassNotFoundException: > org.apache.hadoop.hive.cli.CliDriver > at java.net.URLClassLoader$1.run(URLClassLoader.java:366) > at java.net.URLClassLoader$1.run(URLClassLoader.java:355) > at java.security.AccessController.doPrivileged(Native Method) > at java.net.URLClassLoader.findClass(URLClassLoader.java:354) > at java.lang.ClassLoader.loadClass(ClassLoader.java:425) > at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) > at java.lang.ClassLoader.loadClass(ClassLoader.java:358) > ... 21 more > {code}
[jira] [Resolved] (SPARK-10971) sparkR: RRunner should allow setting path to Rscript
[ https://issues.apache.org/jira/browse/SPARK-10971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-10971. --- Resolution: Fixed Fix Version/s: 1.5.2 1.6.0 Issue resolved by pull request 9179 [https://github.com/apache/spark/pull/9179] > sparkR: RRunner should allow setting path to Rscript > > > Key: SPARK-10971 > URL: https://issues.apache.org/jira/browse/SPARK-10971 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.1 >Reporter: Thomas Graves > Fix For: 1.6.0, 1.5.2 > > > I'm running Spark on YARN and trying to use R in cluster mode. RRunner seems > to just call Rscript and assumes it's on the path. But in our YARN deployment, > R isn't installed on the nodes, so it needs to be distributed along with the > job, and we need the ability to point to where it gets installed. sparkR in > client mode has the config spark.sparkr.r.command to point to Rscript; > RRunner should have something similar so that it works in cluster mode.
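To illustrate the client-mode knob the report mentions, a hedged spark-submit sketch might look like the following. The Rscript path and script name are hypothetical placeholders; only spark.sparkr.r.command itself comes from the issue, and a cluster-mode equivalent honored by RRunner is exactly what this issue requested.

```
# Hypothetical: point SparkR at a non-default Rscript in yarn-client mode.
# /opt/R-3.1.1/bin/Rscript and analysis.R are placeholder names.
spark-submit \
  --master yarn-client \
  --conf spark.sparkr.r.command=/opt/R-3.1.1/bin/Rscript \
  analysis.R
```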