[jira] [Commented] (SPARK-8279) udf_round_3 test fails
[ https://issues.apache.org/jira/browse/SPARK-8279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585743#comment-14585743 ] Yijie Shen commented on SPARK-8279: --- Seems this has been fixed in master branch? udf_round_3 test fails -- Key: SPARK-8279 URL: https://issues.apache.org/jira/browse/SPARK-8279 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker query {code} select round(cast(negative(pow(2, 31)) as INT)), round(cast((pow(2, 31) - 1) as INT)), round(-32769), round(32768) from src tablesample (1 rows); {code} {code} [info] - udf_round_3 *** FAILED *** (4 seconds, 803 milliseconds) [info] Failed to execute query using catalyst: [info] Error: java.lang.Integer cannot be cast to java.lang.Double [info] java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double [info]at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119) [info]at org.apache.spark.sql.catalyst.expressions.BinaryMathExpression.eval(math.scala:86) [info]at org.apache.spark.sql.hive.HiveInspectors$class.toInspector(HiveInspectors.scala:628) [info]at org.apache.spark.sql.hive.HiveGenericUdf.toInspector(hiveUdfs.scala:148) [info]at org.apache.spark.sql.hive.HiveGenericUdf$$anonfun$argumentInspectors$1.apply(hiveUdfs.scala:160) [info]at org.apache.spark.sql.hive.HiveGenericUdf$$anonfun$argumentInspectors$1.apply(hiveUdfs.scala:160) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.immutable.List.foreach(List.scala:318) [info]at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) [info]at scala.collection.AbstractTraversable.map(Traversable.scala:105) [info]at org.apache.spark.sql.hive.HiveGenericUdf.argumentInspectors$lzycompute(hiveUdfs.scala:160) [info]at org.apache.spark.sql.hive.HiveGenericUdf.argumentInspectors(hiveUdfs.scala:160) [info]at org.apache.spark.sql.hive.HiveGenericUdf.returnInspector$lzycompute(hiveUdfs.scala:164) [info]at org.apache.spark.sql.hive.HiveGenericUdf.returnInspector(hiveUdfs.scala:163) [info]at org.apache.spark.sql.hive.HiveGenericUdf.dataType$lzycompute(hiveUdfs.scala:180) [info]at org.apache.spark.sql.hive.HiveGenericUdf.dataType(hiveUdfs.scala:180) [info]at org.apache.spark.sql.catalyst.expressions.Cast.resolved$lzycompute(Cast.scala:31) [info]at org.apache.spark.sql.catalyst.expressions.Cast.resolved(Cast.scala:31) [info]at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121) [info]at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121) [info]at scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70) [info]at scala.collection.immutable.List.forall(List.scala:84) [info]at org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:121) [info]at org.apache.spark.sql.catalyst.expressions.Expression.resolved$lzycompute(Expression.scala:109) [info]at org.apache.spark.sql.catalyst.expressions.Expression.resolved(Expression.scala:109) [info]at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121) [info]at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$childrenResolved$1.apply(Expression.scala:121) [info]at scala.collection.LinearSeqOptimized$class.forall(LinearSeqOptimized.scala:70) 
[info]at scala.collection.immutable.List.forall(List.scala:84) [info]at org.apache.spark.sql.catalyst.expressions.Expression.childrenResolved(Expression.scala:121) [info]at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$ConvertNaNs$$anonfun$apply$2$$anonfun$applyOrElse$2.applyOrElse(HiveTypeCoercion.scala:138) [info]at org.apache.spark.sql.catalyst.analysis.HiveTypeCoercion$ConvertNaNs$$anonfun$apply$2$$anonfun$applyOrElse$2.applyOrElse(HiveTypeCoercion.scala:136) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222) [info]at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) [info]at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:221) [info]at
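The top frames of the (truncated) trace show BinaryMathExpression.eval unboxing one of its arguments as a Double while the value is actually a boxed Integer, such as the integer literals in pow(2, 31). A minimal, Spark-free sketch of that unboxing failure, using only plain Scala:

{code}
// Illustration of the failure mode in the trace (not Spark code): unboxing a value
// that was boxed as java.lang.Integer via unboxToDouble throws ClassCastException.
object UnboxRepro {
  def main(args: Array[String]): Unit = {
    val arg: Any = 2                  // boxed as java.lang.Integer, like a literal in pow(2, 31)
    val d = arg.asInstanceOf[Double]  // compiles to BoxesRunTime.unboxToDouble(arg) and throws
                                      // java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.Double
    println(d)
  }
}
{code}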
[jira] [Assigned] (SPARK-8283) udf_struct test failure
[ https://issues.apache.org/jira/browse/SPARK-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8283: --- Assignee: Apache Spark udf_struct test failure --- Key: SPARK-8283 URL: https://issues.apache.org/jira/browse/SPARK-8283 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Blocker {code} [info] - udf_struct *** FAILED *** (704 milliseconds) [info] Failed to execute query using catalyst: [info] Error: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression [info] java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$1.apply(complexTypes.scala:64) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.immutable.List.foreach(List.scala:318) [info]at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) [info]at scala.collection.AbstractTraversable.map(Traversable.scala:105) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType$lzycompute(complexTypes.scala:64) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType(complexTypes.scala:61) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType(complexTypes.scala:55) [info]at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(ExtractValue.scala:43) [info]at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:353) [info]at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:340) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) [info]at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) [info]at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:299) [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info]at scala.collection.Iterator$class.foreach(Iterator.scala:727) [info]at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) [info]at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) [info]at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) [info]at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) [info]at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) [info]at scala.collection.AbstractIterator.to(Iterator.scala:1157) [info]at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
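The top frame points at CreateStruct deriving its field names by casting each child to NamedExpression, which fails as soon as a child is a plain Literal. A hypothetical, self-contained sketch of that failure mode and a cast-free alternative; the types below are stand-ins, not Spark's expression classes:

{code}
// Stand-in types only: shows why casting every child to a named expression fails
// on a literal child and how pattern matching with a fallback name avoids it.
sealed trait Expr
case class Literal(value: Any) extends Expr
case class Alias(name: String, child: Expr) extends Expr

def fieldNamesByCast(children: Seq[Expr]): Seq[String] =
  children.map(_.asInstanceOf[Alias].name)      // ClassCastException on a Literal child

def fieldNamesByMatch(children: Seq[Expr]): Seq[String] =
  children.zipWithIndex.map {
    case (Alias(name, _), _) => name
    case (_, i)              => s"col$i"        // fall back to a positional name
  }
{code}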
[jira] [Commented] (SPARK-8283) udf_struct test failure
[ https://issues.apache.org/jira/browse/SPARK-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585661#comment-14585661 ] Apache Spark commented on SPARK-8283: - User 'yijieshen' has created a pull request for this issue: https://github.com/apache/spark/pull/6828 udf_struct test failure --- Key: SPARK-8283 URL: https://issues.apache.org/jira/browse/SPARK-8283 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker {code} [info] - udf_struct *** FAILED *** (704 milliseconds) [info] Failed to execute query using catalyst: [info] Error: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression [info] java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$1.apply(complexTypes.scala:64) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.immutable.List.foreach(List.scala:318) [info]at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) [info]at scala.collection.AbstractTraversable.map(Traversable.scala:105) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType$lzycompute(complexTypes.scala:64) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType(complexTypes.scala:61) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType(complexTypes.scala:55) [info]at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(ExtractValue.scala:43) [info]at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:353) [info]at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:340) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) [info]at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) [info]at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:299) [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info]at scala.collection.Iterator$class.foreach(Iterator.scala:727) [info]at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) [info]at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) [info]at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) [info]at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) [info]at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) [info]at scala.collection.AbstractIterator.to(Iterator.scala:1157) [info]at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8283) udf_struct test failure
[ https://issues.apache.org/jira/browse/SPARK-8283?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8283: --- Assignee: (was: Apache Spark) udf_struct test failure --- Key: SPARK-8283 URL: https://issues.apache.org/jira/browse/SPARK-8283 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker {code} [info] - udf_struct *** FAILED *** (704 milliseconds) [info] Failed to execute query using catalyst: [info] Error: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression [info] java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to org.apache.spark.sql.catalyst.expressions.NamedExpression [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$1.apply(complexTypes.scala:64) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) [info]at scala.collection.immutable.List.foreach(List.scala:318) [info]at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) [info]at scala.collection.AbstractTraversable.map(Traversable.scala:105) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType$lzycompute(complexTypes.scala:64) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType(complexTypes.scala:61) [info]at org.apache.spark.sql.catalyst.expressions.CreateStruct.dataType(complexTypes.scala:55) [info]at org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(ExtractValue.scala:43) [info]at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:353) [info]at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$8$$anonfun$applyOrElse$4.applyOrElse(Analyzer.scala:340) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286) [info]at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) [info]at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:285) [info]at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:299) [info]at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) [info]at scala.collection.Iterator$class.foreach(Iterator.scala:727) [info]at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) [info]at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) [info]at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) [info]at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) [info]at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) [info]at scala.collection.AbstractIterator.to(Iterator.scala:1157) [info]at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8375) BinaryClassificationMetrics in ML Lib has odd API
sam created SPARK-8375: -- Summary: BinaryClassificationMetrics in ML Lib has odd API Key: SPARK-8375 URL: https://issues.apache.org/jira/browse/SPARK-8375 Project: Spark Issue Type: Bug Components: MLlib Reporter: sam According to https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics The constructor takes `RDD[(Double, Double)]` which does not make sense it should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`. In scikit I believe they use the number of unique scores to determine the number of thresholds and the ROC. I assume this is what BinaryClassificationMetrics is doing since it makes no mention of buckets. In a Big Data context this does not make as the number of unique scores may be huge. Rather user should be able to either specify the number of buckets, or the number of data points in each bucket. E.g. `def roc(numPtsPerBucket: Int)` Finally it would then be good if either the ROC output type was changed or another method was added that returned confusion matricies, so that the hard integer values can be obtained. E.g. ``` case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) { // bunch of methods for each of the things in the table here https://en.wikipedia.org/wiki/Receiver_operating_characteristic } ... def confusions(numPtsPerBucket: Int): RDD[Confusion] ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
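As a concrete reading of the proposal, here is a hedged sketch of what a confusions(numPtsPerBucket: Int)-style method could compute from (score, label) pairs. It uses plain Scala collections rather than RDDs and a naive bucketing rule; none of this is existing MLlib API:

{code}
// Sketch only: bucket the sorted scores, take one threshold per bucket, and emit
// the integer confusion counts at each threshold. Labels are assumed to be Booleans.
case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int)

def confusions(scored: Seq[(Double, Boolean)], numPtsPerBucket: Int): Seq[Confusion] = {
  val thresholds = scored.map(_._1).sorted
    .grouped(math.max(1, numPtsPerBucket))
    .map(_.head)                                // one representative threshold per bucket
    .toSeq
  thresholds.map { t =>
    val tp = scored.count { case (s, y) => s >= t && y }
    val fp = scored.count { case (s, y) => s >= t && !y }
    val fn = scored.count { case (s, y) => s < t && y }
    val tn = scored.count { case (s, y) => s < t && !y }
    Confusion(tp, fp, fn, tn)
  }
}

// confusions(Seq((0.9, true), (0.4, false), (0.2, true)), numPtsPerBucket = 1)
// yields one Confusion per distinct bucket threshold.
{code}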
[jira] [Commented] (SPARK-2898) Failed to connect to daemon
[ https://issues.apache.org/jira/browse/SPARK-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585809#comment-14585809 ] Peter Taylor commented on SPARK-2898: - FYI java.io.IOException: Cannot run program python: error=316, Unknown error: 316 I have seen this error to occur on mac because lib/jspawnhelper is missing execute permissions in your jre. Failed to connect to daemon --- Key: SPARK-2898 URL: https://issues.apache.org/jira/browse/SPARK-2898 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Davies Liu Assignee: Davies Liu Fix For: 1.1.0 There is a deadlock in handle_sigchld() because of logging Java options: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer -Dspark.executor.memory=3g -Dspark.locality.wait=6000 Options: SchedulerThroughputTest --num-tasks=1 --num-trials=4 --inter-trial-wait=1 14/08/06 22:09:41 WARN JettyUtils: Failed to create UI on port 4040. Trying again on port 4041. - Failure(java.net.BindException: Address already in use) worker 50114 crashed abruptly with exit status 1 14/08/06 22:10:37 ERROR Executor: Exception in task 1476.0 in stage 1.0 (TID 11476) org.apache.spark.SparkException: Python worker exited unexpectedly (crashed) at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:150) at org.apache.spark.api.python.PythonRDD$$anon$1.init(PythonRDD.scala:154) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:101) ... 
10 more 14/08/06 22:10:37 WARN PythonWorkerFactory: Failed to open socket to Python daemon: java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392) at java.net.Socket.connect(Socket.java:579) at java.net.Socket.connect(Socket.java:528) at java.net.Socket.init(Socket.java:425) at java.net.Socket.init(Socket.java:241) at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:68) at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:82) at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:55) at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:101) at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:66) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 14/08/06 22:10:37 ERROR Executor: Exception in task 1478.0 in stage 1.0 (TID 11478) java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.api.python.PythonWorkerFactory.createSocket$1(PythonWorkerFactory.scala:69) at org.apache.spark.api.python.PythonWorkerFactory.liftedTree1$1(PythonWorkerFactory.scala:83) at
[jira] [Created] (SPARK-8374) Job frequently hangs after YARN preemption
Shay Rojansky created SPARK-8374: Summary: Job frequently hangs after YARN preemption Key: SPARK-8374 URL: https://issues.apache.org/jira/browse/SPARK-8374 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.4.0 Environment: YARN 2.7.0, Spark 1.4.0, Ubuntu 14.04 Reporter: Shay Rojansky Priority: Critical After upgrading to Spark 1.4.0, jobs that get preempted very frequently will not reacquire executors and will therefore hang. To reproduce: 1. I run Spark job A that acquires all grid resources 2. I run Spark job B in a higher-priority queue that acquires all grid resources. Job A is fully preempted. 3. Kill job B, releasing all resources 4. Job A should at this point reacquire all grid resources, but occasionally doesn't. Repeating the preemption scenario makes the bad behavior occur within a few attempts. (see logs at bottom). Note issue SPARK-7451 that was supposed to fix some Spark YARN preemption issues, maybe the work there is related to the new issues. The 1.4.0 preemption situation is considerably worse than 1.3.1 (we've downgraded to 1.3.1 just because of this issue). Logs -- When job B (the preemptor first acquires an application master, the following is logged by job A (the preemptee): {noformat} ERROR YarnScheduler: Lost executor 447 on g023.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 447 from TaskSet 0.0 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:54167] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. WARN TaskSetManager: Lost task 15.0 in stage 0.0 (TID 15, g023.grid.eaglerd.local): ExecutorLostFailure (executor 447 lost) INFO DAGScheduler: Executor lost: 447 (epoch 0) INFO BlockManagerMasterEndpoint: Trying to remove executor 447 from BlockManagerMaster. INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(447, g023.grid.eaglerd.local, 41406) INFO BlockManagerMaster: Removed 447 successfully in removeExecutor {noformat} (It's strange for errors/warnings to be logged for preemption) Later, when job B's AM starts requesting its resources, I get lots of the following in job A: {noformat} ERROR YarnScheduler: Lost executor 415 on g033.grid.eaglerd.local: remote Rpc client disassociated INFO TaskSetManager: Re-queueing tasks for 415 from TaskSet 0.0 WARN TaskSetManager: Lost task 231.0 in stage 0.0 (TID 231, g033.grid.eaglerd.local): ExecutorLostFailure (executor 415 lost) WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkexecu...@g023.grid.eaglerd.local:34357] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. {noformat} Finally, when I kill job B, job A emits lots of the following: {noformat} INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 31 WARN YarnClientSchedulerBackend: Executor to kill 31 does not exist! {noformat} And finally after some time: {noformat} WARN HeartbeatReceiver: Removing executor 466 with no recent heartbeats: 165964 ms exceeds timeout 12 ms ERROR YarnScheduler: Lost an executor 466 (already removed): Executor heartbeat timed out after 165964 ms {noformat} At this point the job never requests/acquires more resources and hangs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8375) BinaryClassificationMetrics in ML Lib has odd API
[ https://issues.apache.org/jira/browse/SPARK-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] sam updated SPARK-8375: --- Description: According to https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics The constructor takes `RDD[(Double, Double)]` which does not make sense it should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`. In scikit I believe they use the number of unique scores to determine the number of thresholds and the ROC. I assume this is what BinaryClassificationMetrics is doing since it makes no mention of buckets. In a Big Data context this does not make sense as the number of unique scores may be huge. Rather user should be able to either specify the number of buckets, or the number of data points in each bucket. E.g. `def roc(numPtsPerBucket: Int)` Finally it would then be good if either the ROC output type was changed or another method was added that returned confusion matricies, so that the hard integer values can be obtained. E.g. ``` case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) { // bunch of methods for each of the things in the table here https://en.wikipedia.org/wiki/Receiver_operating_characteristic } ... def confusions(numPtsPerBucket: Int): RDD[Confusion] ``` was: According to https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics The constructor takes `RDD[(Double, Double)]` which does not make sense it should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`. In scikit I believe they use the number of unique scores to determine the number of thresholds and the ROC. I assume this is what BinaryClassificationMetrics is doing since it makes no mention of buckets. In a Big Data context this does not make as the number of unique scores may be huge. Rather user should be able to either specify the number of buckets, or the number of data points in each bucket. E.g. `def roc(numPtsPerBucket: Int)` Finally it would then be good if either the ROC output type was changed or another method was added that returned confusion matricies, so that the hard integer values can be obtained. E.g. ``` case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) { // bunch of methods for each of the things in the table here https://en.wikipedia.org/wiki/Receiver_operating_characteristic } ... def confusions(numPtsPerBucket: Int): RDD[Confusion] ``` BinaryClassificationMetrics in ML Lib has odd API - Key: SPARK-8375 URL: https://issues.apache.org/jira/browse/SPARK-8375 Project: Spark Issue Type: Bug Components: MLlib Reporter: sam According to https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics The constructor takes `RDD[(Double, Double)]` which does not make sense it should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`. In scikit I believe they use the number of unique scores to determine the number of thresholds and the ROC. I assume this is what BinaryClassificationMetrics is doing since it makes no mention of buckets. In a Big Data context this does not make sense as the number of unique scores may be huge. Rather user should be able to either specify the number of buckets, or the number of data points in each bucket. E.g. 
`def roc(numPtsPerBucket: Int)` Finally it would then be good if either the ROC output type was changed or another method was added that returned confusion matricies, so that the hard integer values can be obtained. E.g. ``` case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) { // bunch of methods for each of the things in the table here https://en.wikipedia.org/wiki/Receiver_operating_characteristic } ... def confusions(numPtsPerBucket: Int): RDD[Confusion] ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8376) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs
Shixiong Zhu created SPARK-8376: --- Summary: Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs Key: SPARK-8376 URL: https://issues.apache.org/jira/browse/SPARK-8376 Project: Spark Issue Type: Bug Components: Documentation Reporter: Shixiong Zhu Priority: Minor Commons Lang 3 was added as one of the dependencies of Spark Flume Sink in https://github.com/apache/spark/pull/5703. However, the docs have not yet been updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
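For reference, the dependency the docs would need to mention is the standard Commons Lang 3 artifact. In sbt it would look roughly like the line below; the version shown is only a placeholder assumption, the correct one is whatever Spark's build pins for the Flume sink:

{code}
// sbt, illustrative only - align the version with the one Spark's Flume sink build uses.
libraryDependencies += "org.apache.commons" % "commons-lang3" % "3.3.2"
{code}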
[jira] [Commented] (SPARK-8373) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD
[ https://issues.apache.org/jira/browse/SPARK-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585884#comment-14585884 ] Sean Owen commented on SPARK-8373: -- Really the same as SPARK-6878 https://github.com/apache/spark/commit/51b306b930cfe03ad21af72a3a6ef31e6e626235 When an RDD has no partition, Python sum will throw Can not reduce() empty RDD Key: SPARK-8373 URL: https://issues.apache.org/jira/browse/SPARK-8373 Project: Spark Issue Type: Bug Components: PySpark Reporter: Shixiong Zhu The issue is because sum uses reduce. Replacing it with fold will fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
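The same distinction exists in the Scala RDD API and makes the suggested fix easy to see; a small sketch, assuming an active SparkContext named sc:

{code}
// reduce has no identity element, so it throws on an RDD with no elements (or no
// partitions); fold starts from the supplied zero value and simply returns it.
val empty = sc.emptyRDD[Int]
// empty.reduce(_ + _)            // java.lang.UnsupportedOperationException: empty collection
val total = empty.fold(0)(_ + _)  // 0 - a sum built on fold handles the empty case
{code}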
[jira] [Commented] (SPARK-6666) org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names
[ https://issues.apache.org/jira/browse/SPARK-6666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585883#comment-14585883 ] Santiago M. Mola commented on SPARK-6666: - I opened SPARK-8377 to track the general case, since I have this problem with other data sources, not just JDBC. org.apache.spark.sql.jdbc.JDBCRDD does not escape/quote column names - Key: SPARK-6666 URL: https://issues.apache.org/jira/browse/SPARK-6666 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0 Environment: Reporter: John Ferguson Priority: Critical Is there a way to have JDBC DataFrames use quoted/escaped column names? Right now, it looks like it sees the names correctly in the schema created but does not escape them in the SQL it creates when they are not compliant: org.apache.spark.sql.jdbc.JDBCRDD private val columnList: String = { val sb = new StringBuilder() columns.foreach(x => sb.append(",").append(x)) if (sb.length == 0) "1" else sb.substring(1) } If you see value in this, I would take a shot at adding the quoting (escaping) of column names here. If you don't do it, some drivers... like postgresql's will simply lower-case all names when parsing the query. As you can see in the TL;DR below, that means they won't match the schema I am given. TL;DR: I am able to connect to a Postgres database in the shell (with driver referenced): val jdbcDf = sqlContext.jdbc("jdbc:postgresql://localhost/sparkdemo?user=dbuser", "sp500") In fact when I run: jdbcDf.registerTempTable("sp500") val avgEPSNamed = sqlContext.sql("SELECT AVG(`Earnings/Share`) as AvgCPI FROM sp500") and val avgEPSProg = jsonDf.agg(avg(jsonDf.col("Earnings/Share"))) The values come back as expected. However, if I try: jdbcDf.show Or if I try val all = sqlContext.sql("SELECT * FROM sp500") all.show I get errors about column names not being found. In fact the error includes a mention of column names all lower cased. For now I will change my schema to be more restrictive. Right now it is, per a Stack Overflow poster, not ANSI compliant, by doing things that are allowed by double quotes in pgsql, MySQL and SQLServer. BTW, our users are giving us tables like this... because various tools they already use support non-compliant names. In fact, this is mild compared to what we've had to support. Currently the schema in question uses mixed-case, quoted names with special characters and spaces: CREATE TABLE sp500 ( "Symbol" text, "Name" text, "Sector" text, "Price" double precision, "Dividend Yield" double precision, "Price/Earnings" double precision, "Earnings/Share" double precision, "Book Value" double precision, "52 week low" double precision, "52 week high" double precision, "Market Cap" double precision, "EBITDA" double precision, "Price/Sales" double precision, "Price/Book" double precision, "SEC Filings" text ) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
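A hedged sketch of the quoting the reporter volunteers to add - illustrative only, not the actual JDBCRDD change, and it hard-codes the double-quote identifier delimiter rather than asking the JDBC dialect:

{code}
// Quote each column name (doubling any embedded quote) before joining them into the SELECT list.
def columnList(columns: Seq[String]): String =
  if (columns.isEmpty) "1"
  else columns.map(c => "\"" + c.replace("\"", "\"\"") + "\"").mkString(",")

// columnList(Seq("Earnings/Share", "52 week low")) produces the SQL text:
//   "Earnings/Share","52 week low"
{code}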
[jira] [Commented] (SPARK-4644) Implement skewed join
[ https://issues.apache.org/jira/browse/SPARK-4644?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586076#comment-14586076 ] Nathan McCarthy commented on SPARK-4644: Something like this to make working with skewed data in Spark easier would be very helpful. Implement skewed join - Key: SPARK-4644 URL: https://issues.apache.org/jira/browse/SPARK-4644 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shixiong Zhu Attachments: Skewed Join Design Doc.pdf Skewed data is not rare. For example, a book recommendation site may have several books which are liked by most of the users. Running ALS on such skewed data will raise an OutOfMemory error if some book has too many users to fit into memory. To solve it, we propose a skewed join implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
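One common mitigation - stated here only as a generic technique, not necessarily what the attached design doc proposes - is to salt the hot keys so one key's rows spread over several partitions, then drop the salt after the join. A plain-Scala sketch over in-memory collections:

{code}
import scala.util.Random

// Left rows get a random salt; right rows are replicated once per salt value so every
// salted left key still finds its matches. Suitable when the right side is small enough
// to replicate a few times.
def saltedJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)], salts: Int): Seq[(K, (V, W))] = {
  val saltedLeft      = left.map { case (k, v) => ((k, Random.nextInt(salts)), v) }
  val replicatedRight = right.flatMap { case (k, w) => (0 until salts).map(s => ((k, s), w)) }
  val rightBySaltedKey = replicatedRight.groupBy(_._1)
  saltedLeft.flatMap { case (saltedKey @ (k, _), v) =>
    rightBySaltedKey.getOrElse(saltedKey, Nil).map { case (_, w) => (k, (v, w)) }
  }
}
{code}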
[jira] [Created] (SPARK-8377) Identifiers caseness information should be available at any time
Santiago M. Mola created SPARK-8377: --- Summary: Identifiers caseness information should be available at any time Key: SPARK-8377 URL: https://issues.apache.org/jira/browse/SPARK-8377 Project: Spark Issue Type: Improvement Components: SQL Reporter: Santiago M. Mola Currently, we have the option of having a case sensitive catalog or not. A case insensitive catalog just lowercases all identifiers. However, when pushing down to a data source, we lose the information about if an identifier should be case insensitive or strictly lowercase. Ideally, we would be able to distinguish a case insensitive identifier from a case sensitive one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
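A hypothetical shape for this - not Spark's API - would be to carry the caseness flag with the identifier itself instead of lowercasing eagerly, so a data source can still decide how to match:

{code}
// Sketch only: an identifier that remembers whether it must be matched exactly.
case class Identifier(name: String, caseSensitive: Boolean) {
  def matches(other: String): Boolean =
    if (caseSensitive) name == other else name.equalsIgnoreCase(other)
}

// Identifier("Earnings", caseSensitive = false).matches("earnings")  // true
// Identifier("Earnings", caseSensitive = true).matches("earnings")   // false
{code}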
[jira] [Resolved] (SPARK-8375) BinaryClassificationMetrics in ML Lib has odd API
[ https://issues.apache.org/jira/browse/SPARK-8375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-8375. -- Resolution: Invalid @sam This is a discussion for the mailing list rather than a JIRA. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark You're looking at an API from 4 versions ago, too. https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics The input are scores and ground-truth labels. I agree with the problem of many distinct values, but, this is part of the newer API. BinaryClassificationMetrics in ML Lib has odd API - Key: SPARK-8375 URL: https://issues.apache.org/jira/browse/SPARK-8375 Project: Spark Issue Type: Bug Components: MLlib Reporter: sam According to https://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics The constructor takes `RDD[(Double, Double)]` which does not make sense it should be `RDD[(Double, T)]` or at least `RDD[(Double, Int)]`. In scikit I believe they use the number of unique scores to determine the number of thresholds and the ROC. I assume this is what BinaryClassificationMetrics is doing since it makes no mention of buckets. In a Big Data context this does not make sense as the number of unique scores may be huge. Rather user should be able to either specify the number of buckets, or the number of data points in each bucket. E.g. `def roc(numPtsPerBucket: Int)` Finally it would then be good if either the ROC output type was changed or another method was added that returned confusion matricies, so that the hard integer values can be obtained. E.g. ``` case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) { // bunch of methods for each of the things in the table here https://en.wikipedia.org/wiki/Receiver_operating_characteristic } ... def confusions(numPtsPerBucket: Int): RDD[Confusion] ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8373) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD
[ https://issues.apache.org/jira/browse/SPARK-8373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-8373: - Priority: Minor (was: Major) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD Key: SPARK-8373 URL: https://issues.apache.org/jira/browse/SPARK-8373 Project: Spark Issue Type: Bug Components: PySpark Reporter: Shixiong Zhu Priority: Minor The issue is because sum uses reduce. Replacing it with fold will fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8253) string function: ltrim
[ https://issues.apache.org/jira/browse/SPARK-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8253: --- Assignee: Apache Spark (was: Cheng Hao) string function: ltrim -- Key: SPARK-8253 URL: https://issues.apache.org/jira/browse/SPARK-8253 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark ltrim(string A): string Returns the string resulting from trimming spaces from the beginning(left hand side) of A. For example, ltrim(' foobar ') results in 'foobar '. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8260) string function: rtrim
[ https://issues.apache.org/jira/browse/SPARK-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585934#comment-14585934 ] Apache Spark commented on SPARK-8260: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6762 string function: rtrim -- Key: SPARK-8260 URL: https://issues.apache.org/jira/browse/SPARK-8260 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao rtrim(string A): string Returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8267) string function: trim
[ https://issues.apache.org/jira/browse/SPARK-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8267: --- Assignee: Apache Spark (was: Cheng Hao) string function: trim - Key: SPARK-8267 URL: https://issues.apache.org/jira/browse/SPARK-8267 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark trim(string A): string Returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
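The semantics described in these ltrim/rtrim/trim sub-tasks are easy to pin down with a small plain-Scala sketch; this is illustrative only, not the Spark SQL implementation:

{code}
// Trim only space characters, matching the examples in the sub-task descriptions.
def ltrim(s: String): String = s.dropWhile(_ == ' ')
def rtrim(s: String): String = s.reverse.dropWhile(_ == ' ').reverse
def trim(s: String): String  = ltrim(rtrim(s))

// ltrim(" foobar ") == "foobar "
// rtrim(" foobar ") == " foobar"
// trim(" foobar ")  == "foobar"
{code}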
[jira] [Assigned] (SPARK-8260) string function: rtrim
[ https://issues.apache.org/jira/browse/SPARK-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8260: --- Assignee: Apache Spark (was: Cheng Hao) string function: rtrim -- Key: SPARK-8260 URL: https://issues.apache.org/jira/browse/SPARK-8260 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark rtrim(string A): string Returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8267) string function: trim
[ https://issues.apache.org/jira/browse/SPARK-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8267: --- Assignee: Cheng Hao (was: Apache Spark) string function: trim - Key: SPARK-8267 URL: https://issues.apache.org/jira/browse/SPARK-8267 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao trim(string A): string Returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8253) string function: ltrim
[ https://issues.apache.org/jira/browse/SPARK-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585933#comment-14585933 ] Apache Spark commented on SPARK-8253: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6762 string function: ltrim -- Key: SPARK-8253 URL: https://issues.apache.org/jira/browse/SPARK-8253 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao ltrim(string A): string Returns the string resulting from trimming spaces from the beginning(left hand side) of A. For example, ltrim(' foobar ') results in 'foobar '. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8253) string function: ltrim
[ https://issues.apache.org/jira/browse/SPARK-8253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8253: --- Assignee: Cheng Hao (was: Apache Spark) string function: ltrim -- Key: SPARK-8253 URL: https://issues.apache.org/jira/browse/SPARK-8253 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao ltrim(string A): string Returns the string resulting from trimming spaces from the beginning(left hand side) of A. For example, ltrim(' foobar ') results in 'foobar '. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8267) string function: trim
[ https://issues.apache.org/jira/browse/SPARK-8267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585935#comment-14585935 ] Apache Spark commented on SPARK-8267: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/6762 string function: trim - Key: SPARK-8267 URL: https://issues.apache.org/jira/browse/SPARK-8267 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao trim(string A): string Returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar' -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8260) string function: rtrim
[ https://issues.apache.org/jira/browse/SPARK-8260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8260: --- Assignee: Cheng Hao (was: Apache Spark) string function: rtrim -- Key: SPARK-8260 URL: https://issues.apache.org/jira/browse/SPARK-8260 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Cheng Hao rtrim(string A): string Returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8376) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs
[ https://issues.apache.org/jira/browse/SPARK-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585833#comment-14585833 ] Apache Spark commented on SPARK-8376: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/6829 Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs Key: SPARK-8376 URL: https://issues.apache.org/jira/browse/SPARK-8376 Project: Spark Issue Type: Bug Components: Documentation Reporter: Shixiong Zhu Priority: Minor Commons Lang 3 is added as one of the dependencies of Spark Flume Sink since https://github.com/apache/spark/pull/5703. However, the docs has not yet updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8376) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs
[ https://issues.apache.org/jira/browse/SPARK-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8376: --- Assignee: Apache Spark Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs Key: SPARK-8376 URL: https://issues.apache.org/jira/browse/SPARK-8376 Project: Spark Issue Type: Bug Components: Documentation Reporter: Shixiong Zhu Assignee: Apache Spark Priority: Minor Commons Lang 3 is added as one of the dependencies of Spark Flume Sink since https://github.com/apache/spark/pull/5703. However, the docs has not yet updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8376) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs
[ https://issues.apache.org/jira/browse/SPARK-8376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8376: --- Assignee: (was: Apache Spark) Commons Lang 3 is one of the required JAR of Spark Flume Sink but is missing in the docs Key: SPARK-8376 URL: https://issues.apache.org/jira/browse/SPARK-8376 Project: Spark Issue Type: Bug Components: Documentation Reporter: Shixiong Zhu Priority: Minor Commons Lang 3 is added as one of the dependencies of Spark Flume Sink since https://github.com/apache/spark/pull/5703. However, the docs has not yet updated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8378) Add Spark Flume Python API
[ https://issues.apache.org/jira/browse/SPARK-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8378: --- Assignee: (was: Apache Spark) Add Spark Flume Python API -- Key: SPARK-8378 URL: https://issues.apache.org/jira/browse/SPARK-8378 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8378) Add Spark Flume Python API
[ https://issues.apache.org/jira/browse/SPARK-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8378: --- Assignee: Apache Spark Add Spark Flume Python API -- Key: SPARK-8378 URL: https://issues.apache.org/jira/browse/SPARK-8378 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu Assignee: Apache Spark -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-8350) R unit tests output should be logged to unit-tests.log
[ https://issues.apache.org/jira/browse/SPARK-8350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-8350. -- Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6807 [https://github.com/apache/spark/pull/6807] R unit tests output should be logged to unit-tests.log Key: SPARK-8350 URL: https://issues.apache.org/jira/browse/SPARK-8350 Project: Spark Issue Type: Bug Components: R Affects Versions: 1.4.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Minor Fix For: 1.5.0 Right now it's logged to R-unit-tests.log. Jenkins currently only archives files named unit-tests.log, and this is what all other modules (e.g. SQL, network, REPL) use. 1. We should be consistent 2. I don't want to reconfigure Jenkins to accept a different file -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5081) Shuffle write increases
[ https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Roi Reshef updated SPARK-5081: -- Comment: was deleted (was: Hi Guys, Was this issue already solved by any chance? I'm using Spark 1.3.1 for training algorithm with an iterative fashion. Since implementing a ranking measure (that ultimately uses sortBy) i'm experiencing similar problems. It seems that my cache explodes after ~100 iterations, and crushes the server with a There is insufficient memory for the Java Runtime Environment to continue message. Note that it isn't supposed to persist the sorted vectors nor to use them in the following iterations. So I wonder why memory consumption keeps growing with each iteration.) Shuffle write increases --- Key: SPARK-5081 URL: https://issues.apache.org/jira/browse/SPARK-5081 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.2.0 Reporter: Kevin Jung Priority: Critical Attachments: Spark_Debug.pdf, diff.txt The size of shuffle write showing in spark web UI is much different when I execute same spark job with same input data in both spark 1.1 and spark 1.2. At sortBy stage, the size of shuffle write is 98.1MB in spark 1.1 but 146.9MB in spark 1.2. I set spark.shuffle.manager option to hash because it's default value is changed but spark 1.2 still writes shuffle output more than spark 1.1. It can increase disk I/O overhead exponentially as the input file gets bigger and it causes the jobs take more time to complete. In the case of about 100GB input, for example, the size of shuffle write is 39.7GB in spark 1.1 but 91.0GB in spark 1.2. spark 1.1 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write|| |9|saveAsTextFile| |1169.4KB| | |12|combineByKey| |1265.4KB|1275.0KB| |6|sortByKey| |1276.5KB| | |8|mapPartitions| |91.0MB|1383.1KB| |4|apply| |89.4MB| | |5|sortBy|155.6MB| |98.1MB| |3|sortBy|155.6MB| | | |1|collect| |2.1MB| | |2|mapValues|155.6MB| |2.2MB| |0|first|184.4KB| | | spark 1.2 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write|| |12|saveAsTextFile| |1170.2KB| | |11|combineByKey| |1264.5KB|1275.0KB| |8|sortByKey| |1273.6KB| | |7|mapPartitions| |134.5MB|1383.1KB| |5|zipWithIndex| |132.5MB| | |4|sortBy|155.6MB| |146.9MB| |3|sortBy|155.6MB| | | |2|collect| |2.0MB| | |1|mapValues|155.6MB| |2.2MB| |0|first|184.4KB| | | -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8378) Add Spark Flume Python API
[ https://issues.apache.org/jira/browse/SPARK-8378?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586144#comment-14586144 ] Apache Spark commented on SPARK-8378: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/6830 Add Spark Flume Python API -- Key: SPARK-8378 URL: https://issues.apache.org/jira/browse/SPARK-8378 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8335) DecisionTreeModel.predict() return type not convenient!
[ https://issues.apache.org/jira/browse/SPARK-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586248#comment-14586248 ] Sebastian Walz commented on SPARK-8335: --- Yeah, I am sure that it is really a scala.Double. I just looked it up again on GitHub. So the problem still exists on the current master branch. DecisionTreeModel.predict() return type not convenient! --- Key: SPARK-8335 URL: https://issues.apache.org/jira/browse/SPARK-8335 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Sebastian Walz Priority: Minor Labels: easyfix, machine_learning Original Estimate: 10m Remaining Estimate: 10m org.apache.spark.mllib.tree.model.DecisionTreeModel has a predict method: def predict(features: JavaRDD[Vector]): JavaRDD[Double] The problem here is the generic type of the return type JavaRDD[Double], because it is a scala.Double and I would expect a java.lang.Double (to be consistent, e.g., with org.apache.spark.mllib.classification.ClassificationModel). I wanted to extend the DecisionTreeModel and use it only for binary classification and wanted to implement the trait org.apache.spark.mllib.classification.ClassificationModel. But it is not possible because the ClassificationModel already defines the predict method, but with a return type JavaRDD[java.lang.Double]. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
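The incompatibility the reporter describes can be shown without Spark at all; a hypothetical sketch in which java.util.List stands in for JavaRDD and the traits are stand-ins, not the MLlib classes:

{code}
// scala.Double and java.lang.Double are different type arguments, and the container is
// invariant, so the inherited predict cannot satisfy the trait's abstract method.
trait ClassificationLike {
  def predict(features: java.util.List[java.lang.Double]): java.util.List[java.lang.Double]
}

class TreeLike {
  def predict(features: java.util.List[java.lang.Double]): java.util.List[scala.Double] = ???
}

// class BinaryTree extends TreeLike with ClassificationLike
// fails to compile: the inherited predict returns java.util.List[scala.Double],
// which does not conform to the java.util.List[java.lang.Double] required by the trait.
{code}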
[jira] [Created] (SPARK-8378) Add Spark Flume Python API
Shixiong Zhu created SPARK-8378: --- Summary: Add Spark Flume Python API Key: SPARK-8378 URL: https://issues.apache.org/jira/browse/SPARK-8378 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution
jeanlyn created SPARK-8379: -- Summary: LeaseExpiredException when using dynamic partition with speculative execution Key: SPARK-8379 URL: https://issues.apache.org/jira/browse/SPARK-8379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0, 1.3.1, 1.3.0 Reporter: jeanlyn When inserting into a table using dynamic partitions with *spark.speculation=true*, skewed data in some partitions can trigger speculative tasks, and the insert then throws an exception like {code} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7104) Support model save/load in Python's Word2Vec
[ https://issues.apache.org/jira/browse/SPARK-7104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585488#comment-14585488 ] Yu Ishikawa commented on SPARK-7104: It would be nice to refactor Python's Word2Vec, though that would fit better in another issue: we can call Scala's model API directly, instead of going through {{Word2VecModelWrapper}}'s API, for better maintainability. Support model save/load in Python's Word2Vec Key: SPARK-7104 URL: https://issues.apache.org/jira/browse/SPARK-7104 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Joseph K. Bradley Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7550) Support setting the right schema serde when writing to Hive metastore
[ https://issues.apache.org/jira/browse/SPARK-7550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585489#comment-14585489 ] Apache Spark commented on SPARK-7550: - User 'chenghao-intel' has created a pull request for this issue: https://github.com/apache/spark/pull/5733 Support setting the right schema serde when writing to Hive metastore --- Key: SPARK-7550 URL: https://issues.apache.org/jira/browse/SPARK-7550 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.4.0 Reporter: Reynold Xin Assignee: Cheng Hao As of 1.4, Spark SQL does not properly set the table schema and serde when writing a table to Hive's metastore. Would be great to do that properly so users can use non-Spark SQL systems to read those tables. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8379: --- Assignee: Apache Spark LeaseExpiredException when using dynamic partition with speculative execution - Key: SPARK-8379 URL: https://issues.apache.org/jira/browse/SPARK-8379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: jeanlyn Assignee: Apache Spark when inserting to table using dynamic partitions with *spark.speculation=true* and there is a skew data of some partitions trigger the speculative tasks ,it will throws the exception like {code} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8379: --- Assignee: (was: Apache Spark) LeaseExpiredException when using dynamic partition with speculative execution - Key: SPARK-8379 URL: https://issues.apache.org/jira/browse/SPARK-8379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: jeanlyn when inserting to table using dynamic partitions with *spark.speculation=true* and there is a skew data of some partitions trigger the speculative tasks ,it will throws the exception like {code} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8379) LeaseExpiredException when using dynamic partition with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-8379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587322#comment-14587322 ] Apache Spark commented on SPARK-8379: - User 'jeanlyn' has created a pull request for this issue: https://github.com/apache/spark/pull/6833 LeaseExpiredException when using dynamic partition with speculative execution - Key: SPARK-8379 URL: https://issues.apache.org/jira/browse/SPARK-8379 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.0, 1.3.1, 1.4.0 Reporter: jeanlyn when inserting to table using dynamic partitions with *spark.speculation=true* and there is a skew data of some partitions trigger the speculative tasks ,it will throws the exception like {code} org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): Lease mismatch on /tmp/hive-jeanlyn/hive_2015-06-15_15-20-44_734_8801220787219172413-1/-ext-1/ds=2015-06-15/type=2/part-00301.lzo owned by DFSClient_attempt_201506031520_0011_m_000189_0_-1513487243_53 but is accessed by DFSClient_attempt_201506031520_0011_m_42_0_-1275047721_57 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8386) DataFrame and JDBC regression
Peter Haumer created SPARK-8386: --- Summary: DataFrame and JDBC regression Key: SPARK-8386 URL: https://issues.apache.org/jira/browse/SPARK-8386 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Environment: RHEL 7.1 Reporter: Peter Haumer Priority: Critical I have an ETL app that appends new results to a JDBC table at each run. In 1.3.1 I did this: testResultsDF.insertIntoJDBC(CONNECTION_URL, TABLE_NAME, false); When I do this now in 1.4 it complains that the object 'TABLE_NAME' already exists; I get this even if I switch the overwrite flag to true. I also tried this now: testResultsDF.write().mode(SaveMode.Append).jdbc(CONNECTION_URL, TABLE_NAME, connectionProperties); and get the same error. It works the first time it is run, creating the new table and adding data successfully, but on the second run the JDBC driver tells me that the table already exists. Even SaveMode.Overwrite gives me the same error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6724) Model import/export for FPGrowth
[ https://issues.apache.org/jira/browse/SPARK-6724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14583139#comment-14583139 ] Hrishikesh edited comment on SPARK-6724 at 6/16/15 4:22 AM: [~josephkb], please assign this ticket to me. was (Author: hrishikesh91): [~josephkb], please assign this ticket to me. Model import/export for FPGrowth Key: SPARK-6724 URL: https://issues.apache.org/jira/browse/SPARK-6724 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley Priority: Minor Note: experimental model API -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6932) A Prototype of Parameter Server
[ https://issues.apache.org/jira/browse/SPARK-6932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587299#comment-14587299 ] zhangyouhua commented on SPARK-6932: @Qiping Li in your idea the PS client runs on the slave nodes, but where will the PS server run or be deployed? A Prototype of Parameter Server --- Key: SPARK-6932 URL: https://issues.apache.org/jira/browse/SPARK-6932 Project: Spark Issue Type: New Feature Components: ML, MLlib, Spark Core Reporter: Qiping Li h2. Introduction As specified in [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590], it would be very helpful to integrate a parameter server into Spark for machine learning algorithms, especially those with ultra-high-dimensional features. After carefully studying the design doc of [Parameter Servers|https://docs.google.com/document/d/1SX3nkmF41wFXAAIr9BgqvrHSS5mW362fJ7roBXJm06o/edit?usp=sharing] and the paper on [Factorbird|http://stanford.edu/~rezab/papers/factorbird.pdf], we propose a prototype of Parameter Server on Spark (Ps-on-Spark), with several key design concerns: * *User friendly interface* We carefully investigated most existing Parameter Server systems (including [petuum|http://petuum.github.io], [parameter server|http://parameterserver.org], [paracel|https://github.com/douban/paracel]) and designed a user-friendly interface by absorbing the essence of all these systems. * *Prototype of distributed array* IndexRDD (see [SPARK-4590|https://issues.apache.org/jira/browse/SPARK-4590]) doesn't seem to be a good option for the distributed array, because in most cases the number of key updates per second is not very high. So we implement a distributed HashMap to store the parameters, which can easily be extended for better performance. * *Minimal code change* Quite a lot of effort is spent on avoiding changes to Spark core. Tasks which need the parameter server are still created and scheduled by Spark's scheduler, and they communicate with the parameter server through a client object, over *akka* or *netty*. With all these concerns we propose the following architecture: h2. Architecture !https://cloud.githubusercontent.com/assets/1285855/7158179/f2d25cc4-e3a9-11e4-835e-89681596c478.jpg! Data is stored in RDDs and is partitioned across workers. During each iteration, each worker gets parameters from the parameter server, then computes new parameters based on the old parameters and the data in its partition. Finally, each worker pushes the updated parameters back to the parameter server. A worker communicates with the parameter server through a parameter server client, which is initialized in the `TaskContext` of that worker. The current implementation is based on YARN cluster mode, but it should not be a problem to transplant it to other modes. h3. Interface We referred to existing parameter server systems (petuum, parameter server, paracel) when designing the interface of the parameter server. *`PSClient` provides the following interface for workers to use:* {code} // get parameter indexed by key from parameter server def get[T](key: String): T // get multiple parameters from parameter server def multiGet[T](keys: Array[String]): Array[T] // add `delta` to the parameter indexed by `key`; // if there are multiple `delta`s to apply to the same parameter, // use `reduceFunc` to reduce these `delta`s first. def update[T](key: String, delta: T, reduceFunc: (T, T) => T): Unit // update multiple parameters at the same time, using the same `reduceFunc`.
def multiUpdate[T](keys: Array[String], delta: Array[T], reduceFunc: (T, T) => T): Unit // advance clock to indicate that the current iteration is finished. def clock(): Unit // block until all workers have reached this line of code. def sync(): Unit {code} *`PSContext` provides the following functions to use on the driver:* {code} // load parameters from an existing RDD. def loadPSModel[T](model: RDD[(String, T)]): Unit // fetch parameters from the parameter server to construct a model. def fetchPSModel[T](keys: Array[String]): Array[T] {code} *A new function has been added to `RDD` to run parameter server tasks:* {code} // run the provided `func` on each partition of this RDD. // This function can use the data of this partition (the first argument) // and a parameter server client (the second argument). // See the following Logistic Regression for an example. def runWithPS[U: ClassTag](func: (Array[T], PSClient) => U): Array[U] {code} h2. Example Here is an example of using our prototype to implement logistic regression: {code:title=LogisticRegression.scala|borderStyle=solid} def train( sc: SparkContext, input: RDD[LabeledPoint], numIterations: Int, stepSize: Double, miniBatchFraction: Double): LogisticRegressionModel = { //
[jira] [Updated] (SPARK-8368) ClassNotFoundException in closure for map
[ https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] CHEN Zhiwei updated SPARK-8368: --- Description: After upgrading the cluster from Spark 1.3.0 to 1.4.0 (rc4), I encountered the following exception: ==begin exception {quote} Exception in thread "main" java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:278) at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at org.apache.spark.SparkContext.clean(SparkContext.scala:1891) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.map(RDD.scala:293) at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210) at com.yhd.ycache.magic.Model$.main(SSExample.scala:239) at com.yhd.ycache.magic.Model.main(SSExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {quote} ===end exception=== I simplified the code that causes this issue to the following: ==begin code== {noformat} object Model extends Serializable{ def main(args: Array[String]) { val Array(sql) = args val sparkConf = new SparkConf().setAppName("Mode Example") val sc = new SparkContext(sparkConf) val hive = new HiveContext(sc) //get data by hive sql val rows = hive.sql(sql) val data = rows.map(r => { val arr = r.toSeq.toArray val label = 1.0 def fmap = ( input: Any ) => 1.0 val feature = arr.map(_ => 1.0) LabeledPoint(label, Vectors.dense(feature)) }) data.count() } } {noformat} =end code=== This code runs fine in spark-shell, but errors out when submitted to a Spark cluster (standalone or local mode). I tried the same code on Spark 1.3.0 (local mode), and no exception is encountered. 
was: After upgraded the cluster from spark 1.3.0 to 1.4.0(rc4), I encountered the following exception: ==begin exception {quote} Exception in thread main java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:278) at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455) at
[jira] [Commented] (SPARK-8275) HistoryServer caches incomplete App UIs
[ https://issues.apache.org/jira/browse/SPARK-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587360#comment-14587360 ] Carson Wang commented on SPARK-8275: This seems to be the same issue as SPARK-7889 HistoryServer caches incomplete App UIs --- Key: SPARK-8275 URL: https://issues.apache.org/jira/browse/SPARK-8275 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.3.1 Reporter: Steve Loughran The history server caches applications retrieved from the {{ApplicationHistoryProvider.getAppUI()}} call for performance: it's expensive to rebuild. However, this cache also includes incomplete applications as well as completed ones, and it never attempts to refresh the incomplete application. As a result, if you do a GET of the history of a running application, even after the application is finished you'll still get the web UI/history as it was when that first GET was issued. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
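A minimal sketch of the caching behavior described above, using a plain mutable map and hypothetical types rather than the actual HistoryServer code:
{code}
import scala.collection.mutable

// Hypothetical stand-ins for the provider and the rendered UI.
case class AppUI(appId: String, complete: Boolean)
trait HistoryProvider { def getAppUI(appId: String): AppUI }

class NaiveUiCache(provider: HistoryProvider) {
  private val cache = mutable.Map.empty[String, AppUI]

  // The first GET builds and caches the UI; later GETs return the cached copy,
  // even if the app was still running (incomplete) when it was first cached.
  def get(appId: String): AppUI =
    cache.getOrElseUpdate(appId, provider.getAppUI(appId))
}
{code}
A fix along the lines the description implies would either avoid caching incomplete applications or invalidate and rebuild their cached entries once the application completes.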
[jira] [Assigned] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
[ https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8387: --- Assignee: (was: Apache Spark) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all - Key: SPARK-8387 URL: https://issues.apache.org/jira/browse/SPARK-8387 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.4.0 Reporter: SuYan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
[ https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587421#comment-14587421 ] Apache Spark commented on SPARK-8387: - User 'suyanNone' has created a pull request for this issue: https://github.com/apache/spark/pull/6834 [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all - Key: SPARK-8387 URL: https://issues.apache.org/jira/browse/SPARK-8387 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.4.0 Reporter: SuYan Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8385) java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation
Peter Haumer created SPARK-8385: --- Summary: java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation Key: SPARK-8385 URL: https://issues.apache.org/jira/browse/SPARK-8385 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 1.4.0 Environment: RHEL 7.1 Reporter: Peter Haumer I used to be able to debug my Spark apps in Eclipse. With Spark 1.3.1 I created a launch and just set the vm var -Dspark.master=local[4]. With 1.4 this stopped working when reading files from the OS filesystem. Running the same apps with spark-submit works fine. Loosing the ability to debug that way has a major impact on the usability of Spark. The following exception is thrown: Exception in thread main java.lang.UnsupportedOperationException: Not implemented by the TFS FileSystem implementation at org.apache.hadoop.fs.FileSystem.getScheme(FileSystem.java:213) at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2401) at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2411) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2428) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:88) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2467) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2449) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:367) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:166) at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:653) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:389) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:362) at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762) at org.apache.spark.SparkContext$$anonfun$28.apply(SparkContext.scala:762) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:219) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:217) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:217) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1535) at org.apache.spark.rdd.RDD.reduce(RDD.scala:900) at 
org.apache.spark.api.java.JavaRDDLike$class.reduce(JavaRDDLike.scala:357) at org.apache.spark.api.java.AbstractJavaRDDLike.reduce(JavaRDDLike.scala:46) at com.databricks.apps.logs.LogAnalyzer.main(LogAnalyzer.java:60) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8280) udf7 failed due to null vs nan semantics
[ https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8280: --- Assignee: Apache Spark udf7 failed due to null vs nan semantics Key: SPARK-8280 URL: https://issues.apache.org/jira/browse/SPARK-8280 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Blocker To execute {code} sbt/sbt -Phive -Dspark.hive.whitelist=udf7.* hive/test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite {code} If we want to be consistent with Hive, we need to special case our log function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8281) udf_asin and udf_acos test failure
[ https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587462#comment-14587462 ] Apache Spark commented on SPARK-8281: - User 'yijieshen' has created a pull request for this issue: https://github.com/apache/spark/pull/6835 udf_asin and udf_acos test failure -- Key: SPARK-8281 URL: https://issues.apache.org/jira/browse/SPARK-8281 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker acos/asin in Hive returns NaN for not a number, whereas we always return null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
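A hedged illustration of the semantic difference, run as SQL from Scala; the {{src}} table is the standard Hive test table used by these compatibility tests, and the results in the comments restate the description rather than a verified run:
{code}
// Per the description: Hive evaluates acos/asin of an out-of-range argument to NaN,
// while Spark SQL currently returns NULL for the same expressions.
val df = sqlContext.sql("SELECT acos(2.0), asin(2.0) FROM src LIMIT 1")
df.show()
// Hive:      NaN   NaN
// Spark SQL: NULL  NULL
{code}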
[jira] [Assigned] (SPARK-8281) udf_asin and udf_acos test failure
[ https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8281: --- Assignee: (was: Apache Spark) udf_asin and udf_acos test failure -- Key: SPARK-8281 URL: https://issues.apache.org/jira/browse/SPARK-8281 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker acos/asin in Hive returns NaN for not a number, whereas we always return null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8280) udf7 failed due to null vs nan semantics
[ https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587461#comment-14587461 ] Apache Spark commented on SPARK-8280: - User 'yijieshen' has created a pull request for this issue: https://github.com/apache/spark/pull/6835 udf7 failed due to null vs nan semantics Key: SPARK-8280 URL: https://issues.apache.org/jira/browse/SPARK-8280 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker To execute {code} sbt/sbt -Phive -Dspark.hive.whitelist=udf7.* hive/test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite {code} If we want to be consistent with Hive, we need to special case our log function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8280) udf7 failed due to null vs nan semantics
[ https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8280: --- Assignee: (was: Apache Spark) udf7 failed due to null vs nan semantics Key: SPARK-8280 URL: https://issues.apache.org/jira/browse/SPARK-8280 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker To execute {code} sbt/sbt -Phive -Dspark.hive.whitelist=udf7.* hive/test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite {code} If we want to be consistent with Hive, we need to special case our log function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8281) udf_asin and udf_acos test failure
[ https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8281: --- Assignee: Apache Spark udf_asin and udf_acos test failure -- Key: SPARK-8281 URL: https://issues.apache.org/jira/browse/SPARK-8281 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: Apache Spark Priority: Blocker acos/asin in Hive returns NaN for not a number, whereas we always return null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8280) udf7 failed due to null vs nan semantics
[ https://issues.apache.org/jira/browse/SPARK-8280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587335#comment-14587335 ] Yijie Shen commented on SPARK-8280: --- I'll take this udf7 failed due to null vs nan semantics Key: SPARK-8280 URL: https://issues.apache.org/jira/browse/SPARK-8280 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker To execute {code} sbt/sbt -Phive -Dspark.hive.whitelist=udf7.* hive/test-only org.apache.spark.sql.hive.execution.HiveCompatibilitySuite {code} If we want to be consistent with Hive, we need to special case our log function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8281) udf_asin and udf_acos test failure
[ https://issues.apache.org/jira/browse/SPARK-8281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587333#comment-14587333 ] Yijie Shen commented on SPARK-8281: --- I'll take this udf_asin and udf_acos test failure -- Key: SPARK-8281 URL: https://issues.apache.org/jira/browse/SPARK-8281 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Priority: Blocker acos/asin in Hive returns NaN for not a number, whereas we always return null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7888) Be able to disable intercept in Linear Regression in ML package
[ https://issues.apache.org/jira/browse/SPARK-7888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-7888: --- Assignee: holdenk Be able to disable intercept in Linear Regression in ML package --- Key: SPARK-7888 URL: https://issues.apache.org/jira/browse/SPARK-7888 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: holdenk -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8206) math function: round
[ https://issues.apache.org/jira/browse/SPARK-8206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587509#comment-14587509 ] Apache Spark commented on SPARK-8206: - User 'zhichao-li' has created a pull request for this issue: https://github.com/apache/spark/pull/6836 math function: round Key: SPARK-8206 URL: https://issues.apache.org/jira/browse/SPARK-8206 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Assignee: zhichao-li round(double a): double Returns the rounded BIGINT value of a. round(double a, INT d): double Returns a rounded to d decimal places. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
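A hedged usage sketch of the two forms of {{round}} being added; the {{src}} table is just a placeholder, and the expected values follow from the Hive-style semantics quoted above:
{code}
// round(double a): round to the nearest integer value.
sqlContext.sql("SELECT round(3.567) FROM src LIMIT 1").show()     // expected: 4

// round(double a, int d): round to d decimal places.
sqlContext.sql("SELECT round(3.567, 2) FROM src LIMIT 1").show()  // expected: 3.57
{code}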
[jira] [Commented] (SPARK-7633) Streaming Logistic Regression- Python bindings
[ https://issues.apache.org/jira/browse/SPARK-7633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587286#comment-14587286 ] Mike Dusenberry commented on SPARK-7633: I can work on this one! Streaming Logistic Regression- Python bindings -- Key: SPARK-7633 URL: https://issues.apache.org/jira/browse/SPARK-7633 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Yanbo Liang Add Python API for StreamingLogisticRegressionWithSGD -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7674) R-like stats for ML models
[ https://issues.apache.org/jira/browse/SPARK-7674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587376#comment-14587376 ] holdenk commented on SPARK-7674: I'd love to help with this if thats cool :) R-like stats for ML models -- Key: SPARK-7674 URL: https://issues.apache.org/jira/browse/SPARK-7674 Project: Spark Issue Type: New Feature Components: ML Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Critical This is an umbrella JIRA for supporting ML model summaries and statistics, following the example of R's summary() and plot() functions. [Design doc|https://docs.google.com/document/d/1oswC_Neqlqn5ElPwodlDY4IkSaHAi0Bx6Guo_LvhHK8/edit?usp=sharing] From the design doc: {quote} R and its well-established packages provide extensive functionality for inspecting a model and its results. This inspection is critical to interpreting, debugging and improving models. R is arguably a gold standard for a statistics/ML library, so this doc largely attempts to imitate it. The challenge we face is supporting similar functionality, but on big (distributed) data. Data size makes both efficient computation and meaningful displays/summaries difficult. R model and result summaries generally take 2 forms: * summary(model): Display text with information about the model and results on data * plot(model): Display plots about the model and results We aim to provide both of these types of information. Visualization for the plottable results will not be supported in MLlib itself, but we can provide results in a form which can be plotted easily with other tools. {quote} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8387) [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all
[ https://issues.apache.org/jira/browse/SPARK-8387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8387: --- Assignee: Apache Spark [SPARK][Web-UI] Only show 4096 bytes content for executor log instead all - Key: SPARK-8387 URL: https://issues.apache.org/jira/browse/SPARK-8387 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.4.0 Reporter: SuYan Assignee: Apache Spark Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8368) ClassNotFoundException in closure for map
[ https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587459#comment-14587459 ] Yin Huai commented on SPARK-8368: - @CHEN Zhiwei How was the application submitted? ClassNotFoundException in closure for map -- Key: SPARK-8368 URL: https://issues.apache.org/jira/browse/SPARK-8368 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Environment: Centos 6.5, java 1.7.0_67, scala 2.10.4. Build the project on Windows 7 and run in a spark standalone cluster(or local) mode on Centos 6.X. Reporter: CHEN Zhiwei After upgraded the cluster from spark 1.3.0 to 1.4.0(rc4), I encountered the following exception: ==begin exception {quote} Exception in thread main java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:278) at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at org.apache.spark.SparkContext.clean(SparkContext.scala:1891) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.map(RDD.scala:293) at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210) at com.yhd.ycache.magic.Model$.main(SSExample.scala:239) at com.yhd.ycache.magic.Model.main(SSExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {quote} ===end exception=== I simplify the code that cause this issue, as following: ==begin code== {noformat} object Model extends Serializable{ def main(args: Array[String]) { val Array(sql) = args val sparkConf = new SparkConf().setAppName(Mode Example) val sc = new SparkContext(sparkConf) val hive = new HiveContext(sc) //get data by hive sql 
val rows = hive.sql(sql) val data = rows.map(r = { val arr = r.toSeq.toArray val label = 1.0 def fmap = ( input: Any ) = 1.0 val feature = arr.map(_=1.0) LabeledPoint(label, Vectors.dense(feature)) }) data.count() } } {noformat} =end code=== This code can run pretty well on spark-shell, but error when submit it to spark cluster (standalone or local mode). I try the same code on spark 1.3.0(local mode), and no exception is encountered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-8368) ClassNotFoundException in closure for map
[ https://issues.apache.org/jira/browse/SPARK-8368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587459#comment-14587459 ] Yin Huai edited comment on SPARK-8368 at 6/16/15 4:35 AM: -- [~zwChan] How was the application submitted? was (Author: yhuai): @CHEN Zhiwei How was the application submitted? ClassNotFoundException in closure for map -- Key: SPARK-8368 URL: https://issues.apache.org/jira/browse/SPARK-8368 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Environment: Centos 6.5, java 1.7.0_67, scala 2.10.4. Build the project on Windows 7 and run in a spark standalone cluster(or local) mode on Centos 6.X. Reporter: CHEN Zhiwei After upgraded the cluster from spark 1.3.0 to 1.4.0(rc4), I encountered the following exception: ==begin exception {quote} Exception in thread main java.lang.ClassNotFoundException: com.yhd.ycache.magic.Model$$anonfun$9$$anonfun$10 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:278) at org.apache.spark.util.InnerClosureFinder$$anon$4.visitMethodInsn(ClosureCleaner.scala:455) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.accept(Unknown Source) at org.apache.spark.util.ClosureCleaner$.getInnerClosureClasses(ClosureCleaner.scala:101) at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:197) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:132) at org.apache.spark.SparkContext.clean(SparkContext.scala:1891) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:294) at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:293) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:148) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:109) at org.apache.spark.rdd.RDD.withScope(RDD.scala:286) at org.apache.spark.rdd.RDD.map(RDD.scala:293) at org.apache.spark.sql.DataFrame.map(DataFrame.scala:1210) at com.yhd.ycache.magic.Model$.main(SSExample.scala:239) at com.yhd.ycache.magic.Model.main(SSExample.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:664) at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:169) at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:192) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:111) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {quote} ===end exception=== I simplify the code that cause this issue, as following: ==begin code== {noformat} object Model extends Serializable{ def main(args: Array[String]) { val Array(sql) = args val sparkConf = new SparkConf().setAppName(Mode Example) 
val sc = new SparkContext(sparkConf) val hive = new HiveContext(sc) //get data by hive sql val rows = hive.sql(sql) val data = rows.map(r = { val arr = r.toSeq.toArray val label = 1.0 def fmap = ( input: Any ) = 1.0 val feature = arr.map(_=1.0) LabeledPoint(label, Vectors.dense(feature)) }) data.count() } } {noformat} =end code=== This code can run pretty well on spark-shell, but error when submit it to spark cluster (standalone or local mode). I try the same code on spark 1.3.0(local mode), and no exception is encountered. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7837) NPE when save as parquet in speculative tasks
[ https://issues.apache.org/jira/browse/SPARK-7837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587491#comment-14587491 ] Yin Huai commented on SPARK-7837: - Seems https://www.mail-archive.com/user@spark.apache.org/msg30327.html is about the same issue. NPE when save as parquet in speculative tasks - Key: SPARK-7837 URL: https://issues.apache.org/jira/browse/SPARK-7837 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Priority: Critical The query is like {{df.orderBy(...).saveAsTable(...)}}. When there is no partitioning columns and there is a skewed key, I found the following exception in speculative tasks. After these failures, seems we could not call {{SparkHadoopMapRedUtil.commitTask}} correctly. {code} java.lang.NullPointerException at parquet.hadoop.InternalParquetRecordWriter.flushRowGroupToStore(InternalParquetRecordWriter.java:146) at parquet.hadoop.InternalParquetRecordWriter.close(InternalParquetRecordWriter.java:112) at parquet.hadoop.ParquetRecordWriter.close(ParquetRecordWriter.java:73) at org.apache.spark.sql.parquet.ParquetOutputWriter.close(newParquet.scala:115) at org.apache.spark.sql.sources.DefaultWriterContainer.abortTask(commands.scala:385) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation.org$apache$spark$sql$sources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:150) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122) at org.apache.spark.sql.sources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:122) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63) at org.apache.spark.scheduler.Task.run(Task.scala:70) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-8381: Description: This method CatalystTypeConverters.convertToCatalyst is slow, so for batch conversion we should be using converter produced by createToCatalystConverter. (was: This method CatalystTypeConverters.convertToCatalyst is slow, and for batch conversion we should be using converter produced by createToCatalystConverter.) reuse typeConvert when convert Seq[Row] to catalyst type Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang This method CatalystTypeConverters.convertToCatalyst is slow, so for batch conversion we should be using converter produced by createToCatalystConverter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
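A sketch of the pattern the description advocates: build one converter per schema up front and reuse it across rows, instead of calling the per-value convertToCatalyst in a loop. The method names follow the description; treat the exact signatures as approximate, since this is an internal API:
{code}
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.CatalystTypeConverters
import org.apache.spark.sql.types.StructType

def convertRows(rows: Seq[Row], schema: StructType): Seq[Any] = {
  // Created once: resolves the converter chain for the schema a single time.
  val toCatalyst = CatalystTypeConverters.createToCatalystConverter(schema)
  // Reused for every row, avoiding the slower per-value convertToCatalyst path.
  rows.map(toCatalyst)
}
{code}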
[jira] [Assigned] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8381: --- Assignee: Apache Spark reuse typeConvert when convert Seq[Row] to catalyst type Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang Assignee: Apache Spark This method CatalystTypeConverters.convertToCatalyst is slow, and for batch conversion we should be using converter produced by createToCatalystConverter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8381: --- Assignee: (was: Apache Spark) reuse typeConvert when convert Seq[Row] to catalyst type Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang This method CatalystTypeConverters.convertToCatalyst is slow, and for batch conversion we should be using converter produced by createToCatalystConverter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8382) Improve Analysis Unit test framework
Michael Armbrust created SPARK-8382: --- Summary: Improve Analysis Unit test framework Key: SPARK-8382 URL: https://issues.apache.org/jira/browse/SPARK-8382 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust We have some nice frameworks for doing various unit tests: {{checkAnswer}}, {{comparePlan}}, {{checkEvaluation}}, etc. However, {{AnalysisSuite}} is kind of sloppy, with each test using assertions in different ways. I'd like a function that looks something like the following: {code} def checkAnalysis( inputPlan: LogicalPlan, expectedPlan: LogicalPlan = null, caseInsensitiveOnly: Boolean = false, expectedErrors: Seq[String] = Nil) {code} This function should construct tests that check that the Analyzer works as expected and provide useful error messages when any failures are encountered. We should then rewrite the existing tests and beef up our coverage here. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
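A hedged sketch of how tests might look with such a helper; only the {{checkAnalysis}} signature comes from the description above, while the test relation, DSL calls, and error strings are hypothetical and follow the style of existing catalyst test suites:
{code}
// Hypothetical usage of the proposed helper.
// Assume: testRelation = LocalRelation('a.int, 'b.string), in catalyst DSL style.

// Resolution succeeds: a case-insensitive reference to 'A should resolve to 'a.
checkAnalysis(
  inputPlan = testRelation.select('A),
  expectedPlan = testRelation.select('a),
  caseInsensitiveOnly = true)

// Resolution fails: the analyzer should report the unresolved attribute,
// and the failure message should contain these substrings.
checkAnalysis(
  inputPlan = testRelation.select('doesNotExist),
  expectedErrors = Seq("cannot resolve", "doesNotExist"))
{code}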
[jira] [Resolved] (SPARK-6583) Support aggregated function in order by
[ https://issues.apache.org/jira/browse/SPARK-6583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6583. - Resolution: Fixed Fix Version/s: 1.5.0 Issue resolved by pull request 6816 [https://github.com/apache/spark/pull/6816] Support aggregated function in order by --- Key: SPARK-6583 URL: https://issues.apache.org/jira/browse/SPARK-6583 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.3.0 Reporter: Yadong Qi Assignee: Yadong Qi Fix For: 1.5.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586276#comment-14586276 ] Daniel LaBar commented on SPARK-6220: - [~nchammas], I also need IAM support and [made a few changes to spark_ec2.py|https://github.com/dnlbrky/spark/commit/5d4a9c65728245dc501c2a7c479ca27b6f685bd8], including an {{--instance-profile-name}} option. These modifications let me successfully create security groups and the master/slaves without specifying an access key and secret, but I'm still having issues getting Hadoop/Yarn setup so it may require further changes. Please let me know if you have suggestions. This would be my first time contributing to an Apache project and I'm new to Spark/Python, so please forgive my greenness... Should I create another JIRA specifically to add instance profile support, or can I reference this JIRA when submitting a pull request? Allow extended EC2 options to be passed through spark-ec2 - Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have been bubbled up here and there to become spark-ec2 options. Examples: * spot prices * placement groups * VPC, subnet, and security group assignments It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through EC2 options they want to use through spark-ec2 in some generic way. Let's add two options: * {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run] * {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances] Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example: {code} spark-ec2 \ ... --ec2-instance-option instance_initiated_shutdown_behavior=terminate \ --ec2-instance-option ebs_optimized=True {code} I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly. {code} ssh -o LogLevel=ERROR -o UserKnowHostsFile=/dev/null ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586330#comment-14586330 ] Apache Spark commented on SPARK-8381: - User 'lianhuiwang' has created a pull request for this issue: https://github.com/apache/spark/pull/6831 reuse typeConvert when convert Seq[Row] to catalyst type Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang This method CatalystTypeConverters.convertToCatalyst is slow, and for batch conversion we should be using converter produced by createToCatalystConverter. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7721) Generate test coverage report from Python
[ https://issues.apache.org/jira/browse/SPARK-7721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586280#comment-14586280 ] Josh Rosen commented on SPARK-7721: --- We now have the Jenkins HTML publisher plugin installed, so we can now easily publish HTML reports from tools from coverage.py (https://wiki.jenkins-ci.org/display/JENKINS/HTML+Publisher+Plugin). I might give this a try on NewSparkPullRequestBuilder today. Generate test coverage report from Python - Key: SPARK-7721 URL: https://issues.apache.org/jira/browse/SPARK-7721 Project: Spark Issue Type: Test Components: PySpark, Tests Reporter: Reynold Xin Would be great to have test coverage report for Python. Compared with Scala, it is tricker to understand the coverage without coverage reports in Python because we employ both docstring tests and unit tests in test files. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-8380) SparkR mis-counts
Rick Moritz created SPARK-8380: -- Summary: SparkR mis-counts Key: SPARK-8380 URL: https://issues.apache.org/jira/browse/SPARK-8380 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Rick Moritz On my dataset of ~9 million rows x 30 columns, queried via Hive, I can perform count operations on the entirety of the dataset and get the correct value, as double-checked against the same code in Scala. When I start to add conditions or even do a simple partial ascending histogram, I get discrepancies. In particular, there are missing values in SparkR, and massively so: a top-6 count of a certain feature in my dataset produces numbers an order of magnitude smaller than I get via Scala. The following logic, which I consider equivalent, is the basis for this report: counts <- summarize(groupBy(df, df$col_name), count = n(tdf$col_name)) head(arrange(counts, desc(counts$count))) versus: val table = sql("SELECT col_name, count(col_name) as value from df group by col_name order by value desc") The first, in particular, is taken directly from the SparkR programming guide. Since summarize isn't documented from what I can see, I'd hope it does what the programming guide indicates. In that case this would be a pretty serious logic bug (no errors are thrown). Otherwise, there's the possibility that a lack of documentation and a badly worded example in the guide are behind my misperception of SparkR's functionality. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-8322) EC2 script not fully updated for 1.4.0 release
[ https://issues.apache.org/jira/browse/SPARK-8322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Smith closed SPARK-8322. - Thanks for making my first PR so painless, guys.

EC2 script not fully updated for 1.4.0 release
Key: SPARK-8322 URL: https://issues.apache.org/jira/browse/SPARK-8322 Project: Spark Issue Type: Bug Components: EC2 Affects Versions: 1.4.0 Reporter: Mark Smith Assignee: Mark Smith Labels: easyfix Fix For: 1.4.1, 1.5.0

In the spark_ec2.py script, the 1.4.0 Spark version hasn't been added to the VALID_SPARK_VERSIONS map or the SPARK_TACHYON_MAP, causing the script to break for the latest release.
[jira] [Commented] (SPARK-8380) SparkR mis-counts
[ https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586262#comment-14586262 ] Rick Moritz commented on SPARK-8380: I will attempt to reproduce this with an alternate dataset ASAP, but getting large-volume datasets into this cluster is difficult.

SparkR mis-counts
Key: SPARK-8380 URL: https://issues.apache.org/jira/browse/SPARK-8380 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Rick Moritz

On my dataset of ~9 million rows x 30 columns, queried via Hive, I can perform count operations on the entirety of the dataset and get the correct value, as double-checked against the same code in Scala. When I start to add conditions or even do a simple partial ascending histogram, I get discrepancies. In particular, values are missing in SparkR, and massively so: a top-6 count of a certain feature in my dataset yields numbers an order of magnitude smaller than those I get via Scala. The following logic, which I consider equivalent, is the basis for this report:

{code}
counts <- summarize(groupBy(df, df$col_name), count = n(df$col_name))
head(arrange(counts, desc(counts$count)))
{code}

versus:

{code}
val table = sql("SELECT col_name, count(col_name) as value FROM df GROUP BY col_name ORDER BY value desc")
{code}

The first, in particular, is taken directly from the SparkR programming guide. Since summarize isn't documented from what I can see, I'd hope it does what the programming guide indicates. In that case this would be a pretty serious logic bug (no errors are thrown). Otherwise, a lack of documentation and a badly worded example in the guide may be behind my misperception of SparkR's functionality.
[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution
[ https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586266#comment-14586266 ] Igor Berman commented on SPARK-4879: I'm experiencing this issue. Sometimes an RDD with 4 partitions is written with only 3 part files, yet the _SUCCESS marker is present.

Missing output partitions after job completes with speculative execution
Key: SPARK-4879 URL: https://issues.apache.org/jira/browse/SPARK-4879 Project: Spark Issue Type: Bug Components: Input/Output, Spark Core Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0 Reporter: Josh Rosen Assignee: Josh Rosen Priority: Critical Labels: backport-needed Fix For: 1.3.0 Attachments: speculation.txt, speculation2.txt

When speculative execution is enabled ({{spark.speculation=true}}), jobs that save output files may report that they have completed successfully even though some output partitions written by speculative tasks may be missing.

h3. Reproduction

This symptom was reported to me by a Spark user and I've been doing my own investigation to try to come up with an in-house reproduction. I'm still working on a reliable local reproduction for this issue, which is a little tricky because Spark won't schedule speculated tasks on the same host as the original task, so you need an actual (or containerized) multi-host cluster to test speculation. Here's a simple reproduction of some of the symptoms on EC2, which can be run in {{spark-shell}} with {{--conf spark.speculation=true}}:

{code}
// Rig a job such that all but one of the tasks complete instantly
// and one task runs for 20 seconds on its first attempt and instantly
// on its second attempt:
val numTasks = 100
sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
  if (ctx.partitionId == 0) {
    // If this is the one task that should run really slowly
    if (ctx.attemptId == 0) {
      // If this is the first attempt, run slow
      Thread.sleep(20 * 1000)
    }
  }
  iter
}.map(x => (x, x)).saveAsTextFile("/test4")
{code}

When I run this, I end up with a job that completes quickly (due to speculation) but reports failures from the speculated task:

{code}
[...]
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal (100/100)
14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at <console>:22) finished in 0.856 s
14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at <console>:22, took 0.885438374 s
14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event for 70.1 in stage 3.0 because task 70 has already completed successfully
scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): java.io.IOException: Failed to save output of task: attempt_201412110141_0003_m_49_413
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

One interesting thing to note about this stack trace: if we look at {{FileOutputCommitter.java:160}} ([link|http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-core/2.5.0-mr1-cdh5.2.0/org/apache/hadoop/mapred/FileOutputCommitter.java#160]), this point in the execution seems to correspond to a case where a task completes, attempts to commit its output, fails for some reason, then deletes the destination file, tries again, and fails:

{code}
if (fs.isFile(taskOutput)) {
  Path finalOutputPath = getFinalPath(jobOutputDir, taskOutput,
{code}
[jira] [Commented] (SPARK-8380) SparkR mis-counts
[ https://issues.apache.org/jira/browse/SPARK-8380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586277#comment-14586277 ] Shivaram Venkataraman commented on SPARK-8380: -- [~RPCMoritz] A couple of things would be interesting to see: 1. Does the `sql` command in SparkR work correctly? 2. Can you try the DataFrame statements in Scala and see what results you get? cc [~rxin]

SparkR mis-counts
Key: SPARK-8380 URL: https://issues.apache.org/jira/browse/SPARK-8380 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Reporter: Rick Moritz

On my dataset of ~9 million rows x 30 columns, queried via Hive, I can perform count operations on the entirety of the dataset and get the correct value, as double-checked against the same code in Scala. When I start to add conditions or even do a simple partial ascending histogram, I get discrepancies. In particular, values are missing in SparkR, and massively so: a top-6 count of a certain feature in my dataset yields numbers an order of magnitude smaller than those I get via Scala. The following logic, which I consider equivalent, is the basis for this report:

{code}
counts <- summarize(groupBy(df, df$col_name), count = n(df$col_name))
head(arrange(counts, desc(counts$count)))
{code}

versus:

{code}
val table = sql("SELECT col_name, count(col_name) as value FROM df GROUP BY col_name ORDER BY value desc")
{code}

The first, in particular, is taken directly from the SparkR programming guide. Since summarize isn't documented from what I can see, I'd hope it does what the programming guide indicates. In that case this would be a pretty serious logic bug (no errors are thrown). Otherwise, a lack of documentation and a badly worded example in the guide may be behind my misperception of SparkR's functionality.
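For reference, a Scala DataFrame equivalent of the SparkR snippet, roughly what the suggested cross-check could look like (df and col_name are taken from the report; written against the 1.4-era functions API):

{code}
import org.apache.spark.sql.functions.{count, desc}

// df is the DataFrame from the report; group, count per group, show the top 6.
val counts = df.groupBy("col_name").agg(count("col_name").as("count"))
counts.orderBy(desc("count")).show(6)
{code}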
[jira] [Created] (SPARK-8381) reuse-typeConvert when convert Seq[Row] to CatalystType
Lianhui Wang created SPARK-8381: --- Summary: reuse-typeConvert when convert Seq[Row] to CatalystType
Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang

The method CatalystTypeConverters.convertToCatalyst is slow; for batch conversion we should instead use the converter produced by createToCatalystConverter.
[jira] [Updated] (SPARK-8381) reuse typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-8381: Summary: reuse typeConvert when convert Seq[Row] to catalyst type (was: reuse-typeConvert when convert Seq[Row] to catalyst type)

reuse typeConvert when convert Seq[Row] to catalyst type
Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang

The method CatalystTypeConverters.convertToCatalyst is slow; for batch conversion we should instead use the converter produced by createToCatalystConverter.
[jira] [Updated] (SPARK-8381) reuse-typeConvert when convert Seq[Row] to catalyst type
[ https://issues.apache.org/jira/browse/SPARK-8381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lianhui Wang updated SPARK-8381: Summary: reuse-typeConvert when convert Seq[Row] to catalyst type (was: reuse-typeConvert when convert Seq[Row] to CatalystType)

reuse-typeConvert when convert Seq[Row] to catalyst type
Key: SPARK-8381 URL: https://issues.apache.org/jira/browse/SPARK-8381 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Lianhui Wang

The method CatalystTypeConverters.convertToCatalyst is slow; for batch conversion we should instead use the converter produced by createToCatalystConverter.
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586334#comment-14586334 ] Nicholas Chammas commented on SPARK-6220: - "please forgive my greenness" No need. Greenness is not a crime around these parts. :) I suggest creating a new JIRA for that specific feature. In the JIRA you can reference this issue here as related. By the way, I took a look at your commit. If I understood correctly, your change associates launched instances with an IAM profile (allowing the launched cluster to, for example, access S3 without credentials), but the machine you are running spark-ec2 from still needs AWS keys to launch them. That seems fine to me, but it doesn't sound exactly like what you intended from your comment.

Allow extended EC2 options to be passed through spark-ec2
Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor

There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have bubbled up here and there to become spark-ec2 options. Examples:
* spot prices
* placement groups
* VPC, subnet, and security group assignments

It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through any EC2 options they want through spark-ec2 in some generic way. Let's add two options:
* {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
* {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example:

{code}
spark-ec2 \
  ...
  --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
  --ec2-instance-option ebs_optimized=True
{code}

I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly:

{code}
ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
{code}
[jira] [Commented] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
[ https://issues.apache.org/jira/browse/SPARK-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586648#comment-14586648 ] Irina Easterling commented on SPARK-8383: -

Steps to reproduce:
1. Install Spark through the Ambari Wizard.
2. After installation, run the SparkPi example.
3. Navigate to your Spark directory and submit the job:
{code}
baron1:~ # cd /usr/hdp/current/spark-client/
baron1:/usr/hdp/current/spark-client # su spark
spark@baron1:/usr/hdp/current/spark-client> spark-submit --verbose --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
{code}
4. Wait for the job to complete.
5. Open the Spark History Server UI from Ambari.
6. Click on the 'Show incomplete applications' link.
7. View the result for the completed job.
Result: the Last Updated column shows the date/time as 1969/12/31 19:00:00 (screenshot attached).
8. Verify that the Spark job completed in YARN (screenshot attached).

There is also a discrepancy between the Spark History Server web UI and the YARN ResourceManager web UI: the Spark job completed and is shown as such in the YARN ResourceManager web UI, but the Spark History Server web UI shows it as incomplete. See attached screenshots.

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
Key: SPARK-8383 URL: https://issues.apache.org/jira/browse/SPARK-8383 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.3.1 Environment: Spark 1.3.1.2.3 Reporter: Irina Easterling Attachments: Spark_WrongLastUpdatedDate.png, YARN_SparkJobCompleted.PNG

Spark History Server shows Last Updated as 1969/12/31 when the SparkPi application completed and the Started date is 2015/06/10.
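For what it's worth, the reported value is exactly what an unset timestamp of 0 ms since the Unix epoch looks like when rendered in a UTC-5 timezone; a small Scala illustration (the timezone is an assumption, not something stated in the report):

{code}
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// 0 ms since the Unix epoch, displayed in US Eastern time (UTC-5 in winter),
// prints as the evening of the day before the epoch.
val fmt = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss")
fmt.setTimeZone(TimeZone.getTimeZone("America/New_York"))
println(fmt.format(new Date(0L)))  // 1969/12/31 19:00:00
{code}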
[jira] [Created] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
Irina Easterling created SPARK-8383: --- Summary: Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
Key: SPARK-8383 URL: https://issues.apache.org/jira/browse/SPARK-8383 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.3.1 Environment: Spark 1.3.1.2.3 Reporter: Irina Easterling

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed and Started Date is 2015/06/10
[jira] [Updated] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
[ https://issues.apache.org/jira/browse/SPARK-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Irina Easterling updated SPARK-8383: Attachment: Spark_WrongLastUpdatedDate.png

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
Key: SPARK-8383 URL: https://issues.apache.org/jira/browse/SPARK-8383 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.3.1 Environment: Spark1.3.1.2.3 Reporter: Irina Easterling Attachments: Spark_WrongLastUpdatedDate.png

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed and Started Date is 2015/06/10
[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2
[ https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586616#comment-14586616 ] Daniel LaBar commented on SPARK-6220: - Ok, I'll create a new JIRA with a reference to this one. Thanks for checking the commit. Our IT security team only gives us AWS keys for a service account, but we don't have access to EC2, EMR, S3, etc. from this account. In order to do anything useful we have to switch roles using the service account credentials and MFA. But the spark-ec2 script doesn't seem to work with anything other than the AWS key/secret. So I use the service account credentials to create an EC2 instance with an IAM profile that can do useful things, SSH into that EC2 instance, and then launch the Spark EC2 cluster from there using the modified spark_ec2.py script.

Allow extended EC2 options to be passed through spark-ec2
Key: SPARK-6220 URL: https://issues.apache.org/jira/browse/SPARK-6220 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor

There are many EC2 options exposed by the boto library that spark-ec2 uses. Over time, many of these EC2 options have bubbled up here and there to become spark-ec2 options. Examples:
* spot prices
* placement groups
* VPC, subnet, and security group assignments

It's likely that more and more EC2 options will trickle up like this to become spark-ec2 options. While major options are well suited to this type of promotion, we should probably allow users to pass through any EC2 options they want through spark-ec2 in some generic way. Let's add two options:
* {{--ec2-instance-option}} - [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
* {{--ec2-spot-instance-option}} - [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]

Each option can be specified multiple times and is simply passed directly to the underlying boto call. For example:

{code}
spark-ec2 \
  ...
  --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
  --ec2-instance-option ebs_optimized=True
{code}

I'm not sure about the exact syntax of the extended options, but something like this will do the trick as long as it can be made to pass the options correctly to boto in most cases. I followed the example of {{ssh}}, which supports multiple extended options similarly:

{code}
ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
{code}
[jira] [Updated] (SPARK-8383) Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
[ https://issues.apache.org/jira/browse/SPARK-8383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Irina Easterling updated SPARK-8383: Attachment: YARN_SparkJobCompleted.PNG

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed
Key: SPARK-8383 URL: https://issues.apache.org/jira/browse/SPARK-8383 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.3.1 Environment: Spark1.3.1.2.3 Reporter: Irina Easterling Attachments: Spark_WrongLastUpdatedDate.png, YARN_SparkJobCompleted.PNG

Spark History Server shows Last Updated as 1969/12/31 when SparkPI application completed and Started Date is 2015/06/10
[jira] [Commented] (SPARK-5680) Sum function on all null values, should return zero
[ https://issues.apache.org/jira/browse/SPARK-5680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587473#comment-14587473 ] Venkata Ramana G commented on SPARK-5680: - Holman, you are right that a column with all NULL values should return NULL. My motivation was to fix udaf_number_format.q: select sum('a') from src returns 0 in Hive and MySQL, while select cast('a' as double) from src returns NULL in Hive. I wrongly analysed this as "a sum of all NULLs returns 0", and that introduced the problem. I apologize for this and will submit a patch to revert that fix. Why select sum('a') from src returns 0 in Hive and MySQL is still not clear to me.

Sum function on all null values, should return zero
Key: SPARK-5680 URL: https://issues.apache.org/jira/browse/SPARK-5680 Project: Spark Issue Type: Bug Components: SQL Reporter: Venkata Ramana G Assignee: Venkata Ramana G Priority: Minor Fix For: 1.3.1, 1.4.0

SELECT sum('a'), avg('a'), variance('a'), std('a') FROM src;
Current output: NULL NULL NULL NULL
Expected output: 0.0 NULL NULL NULL
This fixes hive udaf_number_format.q
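A small illustration of the behavior under discussion (table name from the ticket; the Hive/MySQL results are those reported above, not re-verified here):

{code}
// In spark-shell, assuming sqlContext is a HiveContext and the src table exists (as in the ticket).
// cast('a' as double) is NULL for every row, so the aggregate input is all NULLs.
val result = sqlContext.sql("SELECT sum(cast('a' AS double)) FROM src")
result.show()  // after the revert discussed above, this should print NULL rather than 0.0
{code}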
[jira] [Commented] (SPARK-8335) DecisionTreeModel.predict() return type not convenient!
[ https://issues.apache.org/jira/browse/SPARK-8335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14587013#comment-14587013 ] Sean Owen commented on SPARK-8335: -- Go ahead and propose a PR. The sticky issue here is whether it's OK to change an experimental API at this point. I think so.

DecisionTreeModel.predict() return type not convenient!
Key: SPARK-8335 URL: https://issues.apache.org/jira/browse/SPARK-8335 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.3.1 Reporter: Sebastian Walz Priority: Minor Labels: easyfix, machine_learning Original Estimate: 10m Remaining Estimate: 10m

org.apache.spark.mllib.tree.model.DecisionTreeModel has a predict method:

{code}
def predict(features: JavaRDD[Vector]): JavaRDD[Double]
{code}

The problem here is the generic type of the return type, JavaRDD[Double], because it is a scala.Double where I would expect a java.lang.Double (to be consistent with, e.g., org.apache.spark.mllib.classification.ClassificationModel). I wanted to extend DecisionTreeModel, use it only for binary classification, and implement the trait org.apache.spark.mllib.classification.ClassificationModel. But that is not possible, because ClassificationModel already defines the predict method, with a return type of JavaRDD[java.lang.Double].
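A minimal sketch of the clash the reporter describes, using simplified stand-in traits rather than the real MLlib classes:

{code}
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg.Vector

// Stand-ins for the two signatures in question:
trait TreeLike   { def predict(features: JavaRDD[Vector]): JavaRDD[Double] }           // scala.Double
trait Classifier { def predict(features: JavaRDD[Vector]): JavaRDD[java.lang.Double] } // boxed Double

// A single class cannot mix in both: the two predict methods take the same
// parameters but differ in return type, so neither can override the other.
// class Both extends TreeLike with Classifier  // does not compile
{code}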
[jira] [Created] (SPARK-8370) Add API for data sources to register databases
Santiago M. Mola created SPARK-8370: --- Summary: Add API for data sources to register databases
Key: SPARK-8370 URL: https://issues.apache.org/jira/browse/SPARK-8370 Project: Spark Issue Type: New Feature Reporter: Santiago M. Mola

This API would allow registering a database with a data source, instead of just a table. Registering a data source database would register all of its tables and keep the catalog updated. The catalog could delegate to the data source lookups of tables in a database registered with this API.
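To make the proposal concrete, a purely hypothetical sketch of what such a provider interface might look like; every name below is invented for illustration and is not an actual Spark API:

{code}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.sources.BaseRelation

// Hypothetical: a data source implementing this could register a whole database,
// letting the catalog delegate table lookups to it.
trait DatabaseProvider {
  /** Names of the tables currently present in the external database. */
  def tableNames(sqlContext: SQLContext): Seq[String]

  /** Resolve a table by name, or None if the database does not contain it. */
  def lookupTable(sqlContext: SQLContext, table: String): Option[BaseRelation]
}
{code}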
[jira] [Created] (SPARK-8371) improve unit test for MaxOf and MinOf
Wenchen Fan created SPARK-8371: -- Summary: improve unit test for MaxOf and MinOf
Key: SPARK-8371 URL: https://issues.apache.org/jira/browse/SPARK-8371 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan
[jira] [Updated] (SPARK-8370) Add API for data sources to register databases
[ https://issues.apache.org/jira/browse/SPARK-8370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola updated SPARK-8370: Component/s: SQL

Add API for data sources to register databases
Key: SPARK-8370 URL: https://issues.apache.org/jira/browse/SPARK-8370 Project: Spark Issue Type: New Feature Components: SQL Reporter: Santiago M. Mola

This API would allow registering a database with a data source, instead of just a table. Registering a data source database would register all of its tables and keep the catalog updated. The catalog could delegate to the data source lookups of tables in a database registered with this API.
[jira] [Commented] (SPARK-8348) Add in operator to DataFrame Column
[ https://issues.apache.org/jira/browse/SPARK-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585613#comment-14585613 ] Apache Spark commented on SPARK-8348: - User 'yu-iskw' has created a pull request for this issue: https://github.com/apache/spark/pull/6824

Add in operator to DataFrame Column
Key: SPARK-8348 URL: https://issues.apache.org/jira/browse/SPARK-8348 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Xiangrui Meng

It would be convenient to add an in operator to Column, so we can filter on values in a set.

{code}
df.filter(col("brand").in("dell", "sony"))
{code}

In R, the operator should be `%in%`.
[jira] [Assigned] (SPARK-8348) Add in operator to DataFrame Column
[ https://issues.apache.org/jira/browse/SPARK-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8348: --- Assignee: (was: Apache Spark)

Add in operator to DataFrame Column
Key: SPARK-8348 URL: https://issues.apache.org/jira/browse/SPARK-8348 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Xiangrui Meng

It would be convenient to add an in operator to Column, so we can filter on values in a set.

{code}
df.filter(col("brand").in("dell", "sony"))
{code}

In R, the operator should be `%in%`.
[jira] [Assigned] (SPARK-8348) Add in operator to DataFrame Column
[ https://issues.apache.org/jira/browse/SPARK-8348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8348: --- Assignee: Apache Spark

Add in operator to DataFrame Column
Key: SPARK-8348 URL: https://issues.apache.org/jira/browse/SPARK-8348 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Xiangrui Meng Assignee: Apache Spark

It would be convenient to add an in operator to Column, so we can filter on values in a set.

{code}
df.filter(col("brand").in("dell", "sony"))
{code}

In R, the operator should be `%in%`.
[jira] [Assigned] (SPARK-8371) improve unit test for MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8371: --- Assignee: Apache Spark

improve unit test for MaxOf and MinOf
Key: SPARK-8371 URL: https://issues.apache.org/jira/browse/SPARK-8371 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan Assignee: Apache Spark
[jira] [Commented] (SPARK-8371) improve unit test for MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14585615#comment-14585615 ] Apache Spark commented on SPARK-8371: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/6825

improve unit test for MaxOf and MinOf
Key: SPARK-8371 URL: https://issues.apache.org/jira/browse/SPARK-8371 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan
[jira] [Assigned] (SPARK-8371) improve unit test for MaxOf and MinOf
[ https://issues.apache.org/jira/browse/SPARK-8371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-8371: --- Assignee: (was: Apache Spark)

improve unit test for MaxOf and MinOf
Key: SPARK-8371 URL: https://issues.apache.org/jira/browse/SPARK-8371 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan
[jira] [Created] (SPARK-8372) History server shows incorrect information for application not started
Carson Wang created SPARK-8372: -- Summary: History server shows incorrect information for application not started
Key: SPARK-8372 URL: https://issues.apache.org/jira/browse/SPARK-8372 Project: Spark Issue Type: Bug Components: Deploy, Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Priority: Minor

The history server may show an incorrect App ID, like "App ID.inprogress", for an incomplete application. This app info never disappears, even after the app has completed.
[jira] [Created] (SPARK-8373) When an RDD has no partition, Python sum will throw Can not reduce() empty RDD
Shixiong Zhu created SPARK-8373: --- Summary: When an RDD has no partition, Python sum will throw "Can not reduce() empty RDD"
Key: SPARK-8373 URL: https://issues.apache.org/jira/browse/SPARK-8373 Project: Spark Issue Type: Bug Components: PySpark Reporter: Shixiong Zhu

The issue is that sum is implemented with reduce, which fails on an empty RDD. Replacing it with fold will fix it.
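The fix belongs in PySpark's rdd.py, but the semantics are easiest to see in the Scala RDD API, which mirrors them; an analogy for spark-shell, not the patch itself:

{code}
val empty = sc.emptyRDD[Int]          // an RDD with no partitions

// empty.reduce(_ + _)                // throws UnsupportedOperationException: empty collection

// fold starts from the zero value even when there is nothing to combine,
// so a sum built on fold returns 0 instead of failing:
val total = empty.fold(0)(_ + _)      // 0
{code}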
[jira] [Updated] (SPARK-8372) History server shows incorrect information for application not started
[ https://issues.apache.org/jira/browse/SPARK-8372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Carson Wang updated SPARK-8372: --- Attachment: IncorrectAppInfo.png

History server shows incorrect information for application not started
Key: SPARK-8372 URL: https://issues.apache.org/jira/browse/SPARK-8372 Project: Spark Issue Type: Bug Components: Deploy, Web UI Affects Versions: 1.4.0 Reporter: Carson Wang Priority: Minor Attachments: IncorrectAppInfo.png

The history server may show an incorrect App ID, like "App ID.inprogress", for an incomplete application. This app info never disappears, even after the app has completed.