[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516852#comment-14516852 ] Sean Owen commented on SPARK-7189: -- Hm, I'd swear we had discussed this already and there was a good reason for it from [~vanzin], but I can't find the PR or JIRA now. I remember a PR changing the >= to > and the result was that it was on purpose. Not sure if this was a helpful comment, but I do remember something like this.

History server will always reload the same file even when no log file is updated
Key: SPARK-7189
URL: https://issues.apache.org/jira/browse/SPARK-7189
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zhang, Liye
Priority: Minor

The history server checks every log file by its modification time. It reloads a file if the file's modification time is later than or equal to the latest modification time it remembers, so it periodically reloads the file(s) with the latest modification time even when nothing has changed. This is not necessary.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
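The trade-off between the two comparisons can be sketched in plain Python (a toy model, not Spark's actual HistoryServer code; names are illustrative):

```python
class LogChecker:
    """Toy model of the history server's reload decision."""

    def __init__(self):
        self.last_seen_mtime = -1

    def should_reload(self, mtime, strict=False):
        # With strict=False (>=), a file whose mtime equals the newest one we
        # remember is re-read on every scan, even if nothing changed -- the
        # behavior this issue reports.
        # With strict=True (>), an update that lands within the same timestamp
        # granularity as the remembered mtime would be missed -- a plausible
        # reason the non-strict comparison was kept on purpose.
        if strict:
            return mtime > self.last_seen_mtime
        return mtime >= self.last_seen_mtime
```
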
[jira] [Commented] (SPARK-7193) Spark on Mesos may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516801#comment-14516801 ] Littlestar commented on SPARK-7193: ---
{noformat}
15/04/28 18:45:53 INFO spark.SparkContext: Running Spark version 1.3.1
Spark context available as sc.
15/04/28 18:45:57 INFO repl.SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.

scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> distData.reduce(_+_)
---
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 17, hpblade06): ExecutorLostFailure (executor 20150427-165835-1214949568-5050-6-S0 lost)
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{noformat}

Spark on Mesos may need more tests for spark 1.3.1 release
Key: SPARK-7193
URL: https://issues.apache.org/jira/browse/SPARK-7193
Project: Spark
Issue Type: Bug
Components: Mesos
Affects Versions: 1.3.1
Reporter: Littlestar

Spark on Mesos may need more tests for the spark 1.3.1 release. http://spark.apache.org/docs/latest/running-on-mesos.html
I tested mesos 0.21.1/0.22.0/0.22.1 RC4. It works well with ./bin/spark-shell --master mesos://host:5050, but any task that needs more than one node throws the following exception.
{noformat}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 0.0 failed 4 times, most recent failure: Lost task 10.3 in stage 0.0 (TID 127, hpblade05): java.lang.IllegalStateException: unread block data
  at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2393)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1963)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1887)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:368)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
  at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:679)
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/04/28 15:33:18 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
  at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
  at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
{noformat}
[jira] [Comment Edited] (SPARK-7193) Spark on Mesos may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516801#comment-14516801 ] Littlestar edited comment on SPARK-7193 at 4/28/15 10:51 AM, adding the cluster layout "1 master + 7 nodes (spark 1.3.1 + mesos 0.22.0/0.22.1)" above the same spark-shell log and driver stack trace quoted in the comment above.
[jira] [Comment Edited] (SPARK-7193) Spark on Mesos may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516801#comment-14516801 ] Littlestar edited comment on SPARK-7193 at 4/28/15 10:53 AM: - 1 master + 7 nodes (spark 1.3.1 + mesos 0.22.0/0.22.1)
{noformat}
./spark-shell --master mesos://hpblade02:5050
{noformat}
followed by the same spark-shell log and driver stack trace quoted in the comment above.
[jira] [Commented] (SPARK-7193) Spark on Mesos may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516807#comment-14516807 ] Littlestar commented on SPARK-7193: --- Exception from a Mesos worker node log:
{noformat}
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/executor/MesosExecutorBackend
Caused by: java.lang.ClassNotFoundException: org.apache.spark.executor.MesosExecutorBackend
  at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: org.apache.spark.executor.MesosExecutorBackend. Program will exit.
{noformat}
[jira] [Commented] (SPARK-7133) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python
[ https://issues.apache.org/jira/browse/SPARK-7133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516815#comment-14516815 ] Apache Spark commented on SPARK-7133: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/5744

Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python
Key: SPARK-7133
URL: https://issues.apache.org/jira/browse/SPARK-7133
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Labels: starter

Typing
{code}
df.col[1]
{code}
and
{code}
df.col['field']
{code}
is so much easier than
{code}
df.col.getField('field')
df.col.getItem(1)
{code}
This would require us to define (in Column) an apply function in Scala, and a __getitem__ function in Python.
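The proposed Python dispatch can be illustrated with a toy Column class (a sketch only, not PySpark's actual implementation; the expression strings are placeholders):

```python
class Column:
    """Toy model of the proposed accessor API."""

    def __init__(self, name):
        self.name = name

    def getItem(self, key):
        # stand-in for extracting an element from an array/map column
        return Column("%s[%r]" % (self.name, key))

    def getField(self, name):
        # stand-in for extracting a named field from a struct column
        return Column("%s.%s" % (self.name, name))

    def __getitem__(self, key):
        # df.col['field'] routes to getField, df.col[1] routes to getItem
        if isinstance(key, str):
            return self.getField(key)
        return self.getItem(key)
```
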
[jira] [Assigned] (SPARK-7133) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python
[ https://issues.apache.org/jira/browse/SPARK-7133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7133: --- Assignee: Apache Spark
[jira] [Updated] (SPARK-7161) Provide REST api to download event logs from History Server
[ https://issues.apache.org/jira/browse/SPARK-7161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kostas Sakellis updated SPARK-7161: --- Component/s: (was: Streaming) Spark Core

Provide REST api to download event logs from History Server
Key: SPARK-7161
URL: https://issues.apache.org/jira/browse/SPARK-7161
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.3.1
Reporter: Hari Shreedharan
Priority: Minor

The idea is to tar up the logs and return the tar.gz file through a REST API. This can be used for debugging even after the app is done. I am planning to take a look at this.
[jira] [Created] (SPARK-7203) Python API for local linear algebra
Joseph K. Bradley created SPARK-7203: Summary: Python API for local linear algebra
Key: SPARK-7203
URL: https://issues.apache.org/jira/browse/SPARK-7203
Project: Spark
Issue Type: Umbrella
Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

This is an umbrella JIRA for the Python API for local linear algebra, including:
* Vector, Matrix, and their subclasses
* helper methods and utilities
* interactions with numpy, scipy
[jira] [Updated] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7202: - Issue Type: Sub-task (was: New Feature) Parent: SPARK-7203

Add SparseMatrixPickler to SerDe
Key: SPARK-7202
URL: https://issues.apache.org/jira/browse/SPARK-7202
Project: Spark
Issue Type: Sub-task
Components: MLlib, PySpark
Reporter: Manoj Kumar
Priority: Minor

We need a SparseMatrixPickler similar to the existing DenseMatrixPickler.
[jira] [Commented] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517905#comment-14517905 ] Joseph K. Bradley commented on SPARK-7202: -- @MechCoder I just made an umbrella JIRA for Python local linear algebra. Please ping me if you find/make other JIRAs which should go there. Thanks!
[jira] [Comment Edited] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517905#comment-14517905 ] Joseph K. Bradley edited comment on SPARK-7202 at 4/28/15 8:01 PM, changing "@MechCoder" to "[~MechCoder]" in the otherwise identical comment above.
[jira] [Comment Edited] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517858#comment-14517858 ] Chris Fregly edited comment on SPARK-7178 at 4/28/15 8:07 PM: -- added these to the forums
AND and OR: https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html
Nested Map Columns in DataFrames: https://forums.databricks.com/questions/764/how-do-i-create-a-dataframe-with-nested-map-column.html
Casting columns of DataFrames: https://forums.databricks.com/questions/767/how-do-i-cast-within-a-dataframe.html
was (Author: cfregly): added this to the forums to address the AND and OR: https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html

Improve DataFrame documentation and code samples
Key: SPARK-7178
URL: https://issues.apache.org/jira/browse/SPARK-7178
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
Labels: dataframe

AND and OR are not straightforward when using the new DataFrame API. The current convention, accepted by Pandas users, is to use the bitwise & and | instead of AND and OR. When using these, however, you need to wrap each expression in parentheses to keep the bitwise operator from binding too tightly. Also, working with StructTypes is a bit confusing. The following link: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema (Python tab) implies that you can work with tuples directly when creating a DataFrame; however, the following code errors out unless we explicitly use Rows:
{code}
from pyspark.sql import Row
from pyspark.sql.types import *

# The schema is encoded in a string.
schemaString = "a"

fields = [StructField(field_name, MapType(StringType(), IntegerType())) for field_name in schemaString.split()]
schema = StructType(fields)

df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
{code}
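The parenthesization pitfall described above can be demonstrated without Spark, using plain Python ints; DataFrame columns overload & and | the same way, so the same precedence rule applies to column expressions:

```python
a, b = 1, 2

# & binds more tightly than ==, so this parses as a == (1 & b) == 2,
# i.e. 1 == 0 -> False, not (a == 1) AND (b == 2).
unparenthesized = a == 1 & b == 2

# Wrapping each comparison gives the intended conjunction.
parenthesized = (a == 1) & (b == 2)
```
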
[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API
[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518305#comment-14518305 ] Apache Spark commented on SPARK-5182: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/5526

Partitioning support for tables created by the data source API
Key: SPARK-5182
URL: https://issues.apache.org/jira/browse/SPARK-5182
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Blocker
[jira] [Created] (SPARK-7215) Make repartition and coalesce a part of the query plan
Burak Yavuz created SPARK-7215: -- Summary: Make repartition and coalesce a part of the query plan
Key: SPARK-7215
URL: https://issues.apache.org/jira/browse/SPARK-7215
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Burak Yavuz
Priority: Critical
[jira] [Created] (SPARK-7217) Add configuration to disable stopping of SparkContext when StreamingContext.stop()
Tathagata Das created SPARK-7217: Summary: Add configuration to disable stopping of SparkContext when StreamingContext.stop()
Key: SPARK-7217
URL: https://issues.apache.org/jira/browse/SPARK-7217
Project: Spark
Issue Type: Improvement
Components: Streaming
Affects Versions: 1.3.1
Reporter: Tathagata Das
Assignee: Tathagata Das

In environments like notebooks, the SparkContext is managed by the underlying infrastructure and it is expected that the SparkContext will not be stopped. However, StreamingContext.stop() calls SparkContext.stop() as a non-intuitive side-effect. This JIRA is to add a configuration in SparkConf that sets the default StreamingContext stop behavior. It should be such that the existing behavior does not change for existing users.
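The "config sets the default, explicit argument wins, existing behavior preserved" pattern can be sketched in plain Python. This is a toy model; the config key name is an assumption, not necessarily the one Spark adopted.

```python
class StreamingContext:
    """Toy model of the proposed stop behavior (not Spark's actual class)."""

    # assumed config key name, for illustration only
    CONF_KEY = "spark.streaming.stopSparkContextByDefault"

    def __init__(self, conf):
        self.conf = conf
        self.spark_context_stopped = False

    def stop(self, stop_spark_context=None):
        # An explicit argument always wins; otherwise fall back to the
        # configured default, which is True when unset so that existing
        # users see no behavior change.
        if stop_spark_context is None:
            stop_spark_context = self.conf.get(self.CONF_KEY, True)
        if stop_spark_context:
            self.spark_context_stopped = True
```
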
[jira] [Resolved] (SPARK-7138) Add method to BlockGenerator to add multiple records to BlockGenerator with single callback
[ https://issues.apache.org/jira/browse/SPARK-7138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-7138. -- Resolution: Fixed Fix Version/s: 1.4.0

Add method to BlockGenerator to add multiple records to BlockGenerator with single callback
Key: SPARK-7138
URL: https://issues.apache.org/jira/browse/SPARK-7138
Project: Spark
Issue Type: Improvement
Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Minor
Fix For: 1.4.0

This is so that receivers that receive data in small batches (like Kinesis) can add a whole batch of records with the callback function invoked only once per batch. This is for internal use only, for an improvement to the Kinesis receiver that we are planning to do.
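The "many records, one callback" idea can be sketched in plain Python. This is a toy model; the method name and callback signature are assumptions, not Spark's actual BlockGenerator API.

```python
import threading

class BlockGenerator:
    """Toy model of buffering records with a single per-batch callback."""

    def __init__(self):
        self._lock = threading.Lock()
        self._buffer = []

    def add_multiple(self, records, on_added):
        # Append the whole batch atomically, then fire the callback once,
        # instead of once per record as a per-record add() would.
        with self._lock:
            self._buffer.extend(records)
        on_added(len(records))
```
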
[jira] [Assigned] (SPARK-7216) Show driver details in Mesos cluster UI
[ https://issues.apache.org/jira/browse/SPARK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7216: --- Assignee: (was: Apache Spark)

Show driver details in Mesos cluster UI
Key: SPARK-7216
URL: https://issues.apache.org/jira/browse/SPARK-7216
Project: Spark
Issue Type: Improvement
Components: Mesos
Reporter: Timothy Chen

Show driver details in Mesos cluster UI
[jira] [Resolved] (SPARK-7193) Spark on Mesos may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Littlestar resolved SPARK-7193. --- Resolution: Invalid I think the official documentation is missing some notes about Spark on Mesos. The following worked for me: extract spark-1.3.1-bin-hadoop2.4.tgz, modify conf\spark-env.sh, repack it into a new spark-1.3.1-bin-hadoop2.4.tgz, and then put it on HDFS. spark-env.sh sets JAVA_HOME, HADOOP_CONF_DIR, HADOOP_HOME Spark on Mesos may need more tests for spark 1.3.1 release Key: SPARK-7193 URL: https://issues.apache.org/jira/browse/SPARK-7193 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.3.1 Reporter: Littlestar Spark on Mesos may need more tests for spark 1.3.1 release http://spark.apache.org/docs/latest/running-on-mesos.html I tested mesos 0.21.1/0.22.0/0.22.1 RC4. It works well with ./bin/spark-shell --master mesos://host:5050, but any task that needs more than one node throws the following exceptions. {noformat} Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 0.0 failed 4 times, most recent failure: Lost task 10.3 in stage 0.0 (TID 127, hpblade05): java.lang.IllegalStateException: unread block data at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2393) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1963) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1887) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:368) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:679) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 15/04/28 15:33:18 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at 
org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144) at org.apache.spark.scheduler.EventLoggingListener.onStageCompleted(EventLoggingListener.scala:165) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at
[jira] [Comment Edited] (SPARK-7193) Spark on Mesos may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518610#comment-14518610 ] Littlestar edited comment on SPARK-7193 at 4/29/15 2:40 AM: I think the official documentation is missing some notes about Spark on Mesos. The following worked for me: extract spark-1.3.1-bin-hadoop2.4.tgz, modify conf\spark-env.sh, repack it into a new spark-1.3.1-bin-hadoop2.4.tgz, and then put it on HDFS. spark-env.sh sets JAVA_HOME, HADOOP_CONF_DIR, HADOOP_HOME was (Author: cnstar9988): I think official document missing some notes about Spark on Mesos I worked well with following: extract spark-1.3.1-bin-hadoop2.4.tgz, and modify conf\spark-env.sh and repack with new spark-1.3.1-bin-hadoop2.4.tgz, and then put to hdfs spark-env.sh set JAVA_HOME, HADOO_CONF_DIR, HADOO_HOME Spark on Mesos may need more tests for spark 1.3.1 release Key: SPARK-7193 URL: https://issues.apache.org/jira/browse/SPARK-7193 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.3.1 Reporter: Littlestar Spark on Mesos may need more tests for spark 1.3.1 release http://spark.apache.org/docs/latest/running-on-mesos.html I tested mesos 0.21.1/0.22.0/0.22.1 RC4. It works well with ./bin/spark-shell --master mesos://host:5050, but any task that needs more than one node throws the following exceptions.
{noformat} Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 0.0 failed 4 times, most recent failure: Lost task 10.3 in stage 0.0 (TID 127, hpblade05): java.lang.IllegalStateException: unread block data at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2393) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1963) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1887) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:368) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:679) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 15/04/28 15:33:18 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at scala.Option.foreach(Option.scala:236) at
[jira] [Resolved] (SPARK-6965) StringIndexer should convert input to Strings
[ https://issues.apache.org/jira/browse/SPARK-6965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-6965. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5753 [https://github.com/apache/spark/pull/5753] StringIndexer should convert input to Strings - Key: SPARK-6965 URL: https://issues.apache.org/jira/browse/SPARK-6965 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Xiangrui Meng Priority: Minor Fix For: 1.4.0 StringIndexer should convert non-String input types to String. That way, it can handle any basic types such as Int, Double, etc. It can convert any input type to strings first and store the string labels (instead of an arbitrary type). That will simplify model export/import. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
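The conversion described above can be sketched in a few lines. This is a plain-Python illustration of the intended behavior, not the MLlib implementation: every input is cast to a string first, and indices are then assigned by descending label frequency (the ordering StringIndexer uses); the alphabetical tie-break here is just an assumption to keep the sketch deterministic.

```python
from collections import Counter

def fit_string_indexer(values):
    """Cast any basic input type (Int, Double, ...) to String, then map
    each string label to an index, most frequent label first. Storing
    string labels keeps model export/import simple."""
    labels = [str(v) for v in values]
    freq = Counter(labels)
    # Descending frequency; ties broken alphabetically (an assumption).
    ordered = sorted(freq, key=lambda lbl: (-freq[lbl], lbl))
    return {label: idx for idx, label in enumerate(ordered)}

# Int input is indexed via its string form:
index = fit_string_indexer([10, 20, 10, 10, 20, 30])
assert index == {"10": 0, "20": 1, "30": 2}
```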
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518602#comment-14518602 ] Guoqiang Li commented on SPARK-5556: I put the latest LDA code in [Zen|https://github.com/witgo/zen/tree/master/ml/src/main/scala/com/github/cloudml/zen/ml/clustering] The test results [here|https://issues.apache.org/jira/secure/attachment/12729030/LDA_test.xlsx] (72 cores, 216G ram, 6 servers, Gigabit Ethernet) Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518286#comment-14518286 ] Sandy Ryza commented on SPARK-3655: --- My opinion is that a secondary sort operator in core Spark would definitely be useful. Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
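The idea of a secondary sort can be illustrated without Spark. A minimal plain-Python sketch, assuming the usual composite-key approach (in Spark this is typically built on the sort-based shuffle via repartitionAndSortWithinPartitions, with a partitioner that looks only at the primary key):

```python
from itertools import groupby
from operator import itemgetter

def sorted_values_per_key(pairs):
    """Sort once by the composite (key, value) key, then a single pass
    yields each key's values already in order -- no per-key sort, which
    is what makes the shuffle-based secondary sort scale."""
    ordered = sorted(pairs, key=itemgetter(0, 1))
    return {k: [v for _, v in grp]
            for k, grp in groupby(ordered, key=itemgetter(0))}

result = sorted_values_per_key([("a", 3), ("b", 1), ("a", 1), ("a", 2)])
assert result == {"a": [1, 2, 3], "b": [1]}
```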
[jira] [Assigned] (SPARK-7215) Make repartition and coalesce a part of the query plan
[ https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7215: --- Assignee: Apache Spark Make repartition and coalesce a part of the query plan -- Key: SPARK-7215 URL: https://issues.apache.org/jira/browse/SPARK-7215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Burak Yavuz Assignee: Apache Spark Priority: Critical
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518378#comment-14518378 ] Pedro Rodriguez commented on SPARK-5556: I will start working on it again then. It would be great for that research project to result in Gibbs being added. The refactoring ended up roadblocking that quite a bit. I think [~gq] was working on something called LightLDA. I don't know the specifics of the algorithm, but the sampler scales theoretically O(1) with topics. My implementation has something which in the testing I did looks like in practice it is O(1) or very near it. To get Gibbs merged in (or as a candidate implementation), how does this look: 1. Refactor code to fit the PR that you just merged 2. Use the testing harness you used for the EM LDA to test with the same conditions. This should be fairly easy since you already did all the work to get things pipelining correctly. 3. If it scales well, then merge or consider other applications 4. Code review somewhere in there. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7215) Make repartition and coalesce a part of the query plan
[ https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518379#comment-14518379 ] Apache Spark commented on SPARK-7215: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/5762 Make repartition and coalesce a part of the query plan -- Key: SPARK-7215 URL: https://issues.apache.org/jira/browse/SPARK-7215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Burak Yavuz Priority: Critical
[jira] [Assigned] (SPARK-7215) Make repartition and coalesce a part of the query plan
[ https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7215: --- Assignee: (was: Apache Spark) Make repartition and coalesce a part of the query plan -- Key: SPARK-7215 URL: https://issues.apache.org/jira/browse/SPARK-7215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Burak Yavuz Priority: Critical
[jira] [Assigned] (SPARK-7216) Show driver details in Mesos cluster UI
[ https://issues.apache.org/jira/browse/SPARK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7216: --- Assignee: Apache Spark Show driver details in Mesos cluster UI --- Key: SPARK-7216 URL: https://issues.apache.org/jira/browse/SPARK-7216 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Assignee: Apache Spark Show driver details in Mesos cluster UI
[jira] [Created] (SPARK-7216) Show driver details in Mesos cluster UI
Timothy Chen created SPARK-7216: --- Summary: Show driver details in Mesos cluster UI Key: SPARK-7216 URL: https://issues.apache.org/jira/browse/SPARK-7216 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Show driver details in Mesos cluster UI
[jira] [Commented] (SPARK-7216) Show driver details in Mesos cluster UI
[ https://issues.apache.org/jira/browse/SPARK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518447#comment-14518447 ] Apache Spark commented on SPARK-7216: - User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/5763 Show driver details in Mesos cluster UI --- Key: SPARK-7216 URL: https://issues.apache.org/jira/browse/SPARK-7216 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Show driver details in Mesos cluster UI
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518601#comment-14518601 ] Pedro Rodriguez commented on SPARK-5556: [~gq] is the LDAGibbs line what I implemented or something else? In any case, the optimization on sampling shouldn't change the results, so it looks like LightLDA converges to a better perplexity. Do you have any performance graphs? Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7156) Add randomSplit method to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518248#comment-14518248 ] Apache Spark commented on SPARK-7156: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/5761 Add randomSplit method to DataFrame --- Key: SPARK-7156 URL: https://issues.apache.org/jira/browse/SPARK-7156 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Joseph K. Bradley Priority: Minor
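The requested randomSplit can be sketched as weight-proportional Bernoulli assignment, mirroring the semantics of the existing RDD.randomSplit. This is a hypothetical plain-Python helper, not the DataFrame API; as with the RDD version, split sizes are only approximately proportional to the weights.

```python
import random

def random_split(rows, weights, seed=42):
    """Assign each row independently to a split, with probability
    proportional to the split's weight (hypothetical helper)."""
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)           # cumulative probability boundaries
    rng = random.Random(seed)        # seeded for reproducible splits
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, b in enumerate(bounds):
            if x < b:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)   # guard against float rounding
    return splits

train, test = random_split(list(range(1000)), [0.8, 0.2])
assert len(train) + len(test) == 1000   # every row lands in exactly one split
assert 700 < len(train) < 900           # roughly an 80/20 split
```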
[jira] [Created] (SPARK-7214) Unrolling never evicts blocks when MemoryStore is nearly full
Charles Reiss created SPARK-7214: Summary: Unrolling never evicts blocks when MemoryStore is nearly full Key: SPARK-7214 URL: https://issues.apache.org/jira/browse/SPARK-7214 Project: Spark Issue Type: Bug Components: Block Manager Reporter: Charles Reiss Priority: Minor When less than spark.storage.unrollMemoryThreshold (default 1MB) is left in the MemoryStore, new blocks that are computed with unrollSafely (e.g. any cached RDD split) will always fail to unroll, even if old blocks could be dropped to accommodate them.
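The reported condition can be illustrated with a toy model. This is a simplified sketch of the behavior described in the report, not the MemoryStore code; the function and its arguments are hypothetical names. The point: if the initial unroll-memory reservation only considers currently free memory, then once free memory drops below the threshold every unroll fails, even when evictable cached blocks could make room.

```python
UNROLL_THRESHOLD = 1 * 1024 * 1024  # spark.storage.unrollMemoryThreshold default

def can_start_unroll(free_bytes, evictable_bytes, consider_eviction):
    """Toy model of reserving the initial unroll memory."""
    if consider_eviction:
        # Counting droppable old blocks lets the unroll proceed.
        return free_bytes + evictable_bytes >= UNROLL_THRESHOLD
    # Behavior described in this issue: only free memory is considered.
    return free_bytes >= UNROLL_THRESHOLD

# 512 KB free but 100 MB of droppable old blocks:
assert can_start_unroll(512 * 1024, 100 * 2**20, consider_eviction=False) is False
assert can_start_unroll(512 * 1024, 100 * 2**20, consider_eviction=True) is True
```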
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518400#comment-14518400 ] Joseph K. Bradley commented on SPARK-5556: -- That plan sounds good. I haven't yet been able to look into LightLDA, but it would be good to understand if it's (a) a modification which could be added to Gibbs later on or (b) an algorithm which belongs as a separate algorithm. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4721) Improve first thread to put block failed
[ https://issues.apache.org/jira/browse/SPARK-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4721: --- Assignee: (was: Apache Spark) Improve first thread to put block failed Key: SPARK-4721 URL: https://issues.apache.org/jira/browse/SPARK-4721 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 1.0.0 Reporter: SuYan In the current code, when multiple threads try to put a block with the same blockID into the BlockManager, the thread that first puts its info into blockinfos performs the put, and the others wait until that put fails or succeeds. This is fine when the put succeeds, but on failure there are problems: 1. the failed thread removes its info from blockinfos. 2. the other threads wake up and use the old info.synchronized to retry the put. 3. if a retry succeeds, marking success fails because the info is no longer in pending status, and all remaining threads repeat the same cycle of acquiring info.synchronized and marking success or failure, even after one of them has succeeded. First, I don't understand why the info is removed from blockinfos while other threads are still waiting. The comment says this lets other threads create a new block info, but a block info is just an ID and a storage level, so reusing the old one makes no difference when threads are waiting. Second, if the first thread fails, the other waiting threads could retry the put one at a time (and likely fewer than all of them would need to), or, more simply, all other threads could log a warning and return after waking up.
[jira] [Assigned] (SPARK-4721) Improve first thread to put block failed
[ https://issues.apache.org/jira/browse/SPARK-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4721: --- Assignee: Apache Spark Improve first thread to put block failed Key: SPARK-4721 URL: https://issues.apache.org/jira/browse/SPARK-4721 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 1.0.0 Reporter: SuYan Assignee: Apache Spark In the current code, when multiple threads try to put a block with the same blockID into the BlockManager, the thread that first puts its info into blockinfos performs the put, and the others wait until that put fails or succeeds. This is fine when the put succeeds, but on failure there are problems: 1. the failed thread removes its info from blockinfos. 2. the other threads wake up and use the old info.synchronized to retry the put. 3. if a retry succeeds, marking success fails because the info is no longer in pending status, and all remaining threads repeat the same cycle of acquiring info.synchronized and marking success or failure, even after one of them has succeeded. First, I don't understand why the info is removed from blockinfos while other threads are still waiting. The comment says this lets other threads create a new block info, but a block info is just an ID and a storage level, so reusing the old one makes no difference when threads are waiting. Second, if the first thread fails, the other waiting threads could retry the put one at a time (and likely fewer than all of them would need to), or, more simply, all other threads could log a warning and return after waking up.
[jira] [Assigned] (SPARK-7133) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python
[ https://issues.apache.org/jira/browse/SPARK-7133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7133: --- Assignee: (was: Apache Spark) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python -- Key: SPARK-7133 URL: https://issues.apache.org/jira/browse/SPARK-7133 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Typing {code} df.col[1] {code} and {code} df.col['field'] {code} is so much easier than {code} df.col.getField('field') df.col.getItem(1) {code} This would require us to define (in Column) an apply function in Scala, and a __getitem__ function in Python.
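The proposed sugar amounts to a __getitem__ that dispatches on the key type. A toy Python sketch of that dispatch (class and method names are illustrative, not the real Column internals):

```python
class ColumnSketch:
    """Toy stand-in for a DataFrame column: string keys route to the
    struct-field accessor, integer keys to the positional accessor,
    so col['field'] and col[1] both work."""

    def __init__(self, data):
        self.data = data  # a dict (struct-like) or list (array-like)

    def get_field(self, name):
        return self.data[name]

    def get_item(self, i):
        return self.data[i]

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.get_field(key)   # df.col['field']
        return self.get_item(key)        # df.col[1]

struct_col = ColumnSketch({"field": "x"})
array_col = ColumnSketch(["a", "b"])
assert struct_col["field"] == "x"
assert array_col[1] == "b"
```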
[jira] [Resolved] (SPARK-7168) Update plugin versions in Maven build and centralize versions
[ https://issues.apache.org/jira/browse/SPARK-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7168. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5720 [https://github.com/apache/spark/pull/5720] Update plugin versions in Maven build and centralize versions - Key: SPARK-7168 URL: https://issues.apache.org/jira/browse/SPARK-7168 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Trivial Fix For: 1.4.0 A minor cleanup before the next release: let's update the versions of build plugins used to the latest version while also pulling version management up into the parent, centrally. This only affects plugins and not the build result. Hopefully we'll pick up some tiny fixes along the way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6435: - Assignee: Masayoshi TSUZUKI spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Assignee: Masayoshi TSUZUKI Fix For: 1.4.0 Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error: {code} scala import com.google.common.base.Strings console:19: error: object Strings is not a member of package com.google.common.base import com.google.common.base.Strings ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6435. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5227 [https://github.com/apache/spark/pull/5227] spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Fix For: 1.4.0 Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error: {code} scala import com.google.common.base.Strings console:19: error: object Strings is not a member of package com.google.common.base import com.google.common.base.Strings ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master
[ https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516854#comment-14516854 ] Sean Owen commented on SPARK-5189: -- [~jackli066519] You don't need to have this assigned to you, but I would work with [~nchammas] to understand first whether this is still relevant or what he's done. Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master --- Key: SPARK-5189 URL: https://issues.apache.org/jira/browse/SPARK-5189 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, then setting up all the slaves together. This includes broadcasting files from the lonely master to potentially hundreds of slaves. There are 2 main problems with this approach: # Broadcasting files from the master to all slaves using [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] (e.g. during [ephemeral-hdfs init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36], or during [Spark setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3]) takes a long time. This time increases as the number of slaves increases. I did some testing in {{us-east-1}}. This is, concretely, what the problem looks like: || number of slaves ({{m3.large}}) || launch time (best of 6 tries) || | 1 | 8m 44s | | 10 | 13m 45s | | 25 | 22m 50s | | 50 | 37m 30s | | 75 | 51m 30s | | 99 | 1h 5m 30s | Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but I think the point is clear enough. # It's more complicated to add slaves to an existing cluster (a la [SPARK-2008]), since slaves are only configured through the master during the setup of the master itself. 
Logically, the operations we want to implement are: * Provision a Spark node * Join a node to a cluster (including an empty cluster) as either a master or a slave * Remove a node from a cluster We need our scripts to roughly be organized to match the above operations. The goals would be: # When launching a cluster, enable all cluster nodes to be provisioned in parallel, removing the master-to-slave file broadcast bottleneck. # Facilitate cluster modifications like adding or removing nodes. # Enable exploration of infrastructure tools like [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} internals and perhaps even allow us to build [one tool that launches Spark clusters on several different cloud platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw]. More concretely, the modifications we need to make are: * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with equivalent, slave-side operations. * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it fully creates a node that can be used as either a master or slave. * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, configures it as a master or slave, and joins it to a cluster. * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete that script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets
[ https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516971#comment-14516971 ] Peter Marsh commented on SPARK-4414: I managed to get this to work by re-installing Spark. Initially I had installed Spark from source and built it locally; after removing that and installing spark-1.3.0-bin-hadoop2.4 (prebuilt) I was able to use wholeTextFiles(...). SparkContext.wholeTextFiles Doesn't work with S3 Buckets Key: SPARK-4414 URL: https://issues.apache.org/jira/browse/SPARK-4414 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Pedro Rodriguez Priority: Critical SparkContext.wholeTextFiles does not read files which SparkContext.textFile can read. Below are general steps to reproduce; my specific case follows them in a git repo. Steps to reproduce: 1. Create an Amazon S3 bucket, make it public, with multiple files 2. Attempt to read the bucket with sc.wholeTextFiles("s3n://mybucket/myfile.txt") 3. Spark returns the following error, even if the file exists: Exception in thread "main" java.io.FileNotFoundException: File does not exist: /myfile.txt at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489) 4. Change the call to sc.textFile("s3n://mybucket/myfile.txt") and there is no error message; the application runs fine. There is a question on StackOverflow about this as well: http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist This is a link to the repo/lines of code. 
The uncommented call doesn't work; the commented call works as expected: https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19 It would be easy to use textFile with a multifile argument, but this should work correctly for S3 bucket files as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4721) Improve first thread to put block failed
[ https://issues.apache.org/jira/browse/SPARK-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4721. -- Resolution: Won't Fix Improve first thread to put block failed Key: SPARK-4721 URL: https://issues.apache.org/jira/browse/SPARK-4721 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 1.0.0 Reporter: SuYan The current code assumes that when multiple threads try to put a block with the same blockID into the BlockManager, the thread that first puts its info into blockinfos performs the put, while the others wait until that put fails or succeeds. That is fine when the put succeeds, but a failure causes problems: 1. the failed thread removes its info from blockinfos; 2. the other threads wake up and retry the put under the old info's lock; 3. if one of them succeeds, the block is no longer in pending status, so its "mark success" fails, and all remaining threads still repeat the same cycle (acquire the info's lock, then mark success or failure) even though one has already succeeded. First, I can't understand why the info is removed from blockinfos while other threads are still waiting. The comment says this is so other threads can create a new block info, but a block info is just an ID and a storage level, so for the waiting threads it makes no difference whether the old one or a new one is used. Second, if the first thread fails, the waiting threads could retry one by one (and fewer than all of them would need to), or they could simply log a warning and return after waking up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7100) GradientBoostTrees leaks a persisted RDD
[ https://issues.apache.org/jira/browse/SPARK-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7100. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5669 [https://github.com/apache/spark/pull/5669] GradientBoostTrees leaks a persisted RDD Key: SPARK-7100 URL: https://issues.apache.org/jira/browse/SPARK-7100 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.2, 1.3.1 Reporter: Jim Carroll Priority: Minor Fix For: 1.4.0 It appears GradientBoostedTrees.scala can call 'persist' on an RDD and never unpersist it. In the master branch it's in GradientBoostedTrees.boost method. It persists the input RDD if it's not already persisted but doesn't unpersist it. I'll be submitting a PR with a fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
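The fix follows a standard acquire/release discipline: persist the input only if the caller has not already done so, and unpersist exactly when this method was the one that persisted. Below is a hedged, self-contained sketch of that pattern against a stand-in type; the real change would consult rdd.getStorageLevel and call rdd.persist()/rdd.unpersist() inside GradientBoostedTrees.boost, and the names `CachedInput` and `boost` here are illustrative, not Spark API:

```scala
// Stand-in for an RDD's persistence state; a real RDD would expose
// getStorageLevel, persist() and unpersist() instead of a plain flag.
final class CachedInput(var persisted: Boolean)

def boost(input: CachedInput): Unit = {
  // Persist only when the caller did not, and remember that we did.
  val persistedHere = !input.persisted
  if (persistedHere) input.persisted = true
  try {
    // ... the boosting iterations over `input` would run here ...
  } finally {
    // Release only what this method itself acquired.
    if (persistedHere) input.persisted = false
  }
}
```

Putting the release in `finally` also covers the case where an iteration throws, so the input is never left persisted by accident.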
[jira] [Updated] (SPARK-7100) GradientBoostTrees leaks a persisted RDD
[ https://issues.apache.org/jira/browse/SPARK-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7100: - Assignee: Jim Carroll GradientBoostTrees leaks a persisted RDD Key: SPARK-7100 URL: https://issues.apache.org/jira/browse/SPARK-7100 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.2, 1.3.1 Reporter: Jim Carroll Assignee: Jim Carroll Priority: Minor Fix For: 1.4.0 It appears GradientBoostedTrees.scala can call 'persist' on an RDD and never unpersist it. In the master branch it's in GradientBoostedTrees.boost method. It persists the input RDD if it's not already persisted but doesn't unpersist it. I'll be submitting a PR with a fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6627) Clean up of shuffle code and interfaces
[ https://issues.apache.org/jira/browse/SPARK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518402#comment-14518402 ] Apache Spark commented on SPARK-6627: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/5764 Clean up of shuffle code and interfaces --- Key: SPARK-6627 URL: https://issues.apache.org/jira/browse/SPARK-6627 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical Fix For: 1.4.0 The shuffle code in Spark is somewhat messy and could use some interface clean-up, especially with some larger changes outstanding. This is a catch all for what may be some small improvements in a few different PR's. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7169) Allow to specify metrics configuration more flexibly
[ https://issues.apache.org/jira/browse/SPARK-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518508#comment-14518508 ] Saisai Shao commented on SPARK-7169: Hi [~jlewandowski], regarding your second problem, I think you don't have to copy the metrics configuration file manually to every machine one by one; you could use spark-submit --files path/to/your/metrics_properties to ship your configuration to each executor/container. And for the first problem, is it a big problem that all the configuration files need to be in the same directory? Many Spark as well as Hadoop conf files have such a requirement. But you can configure the driver and executors with different parameters in the conf file, since MetricsSystem supports such features. Yes, I think the current metrics configuration may not be so easy to use; any improvement is greatly appreciated :). Allow to specify metrics configuration more flexibly Key: SPARK-7169 URL: https://issues.apache.org/jira/browse/SPARK-7169 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.2, 1.3.1 Reporter: Jacek Lewandowski Priority: Minor Metrics are configured in the {{metrics.properties}} file. The path to this file is specified in {{SparkConf}} under the key {{spark.metrics.conf}}. The property is read when {{MetricsSystem}} is created, which means during {{SparkEnv}} initialisation. h5.Problem When the user runs an application, there is no way to provide the metrics configuration for executors. Although one can specify the path to the metrics configuration file, (1) the path is common for all the nodes and the client machine, so there is an implicit assumption that all the machines have the same file in the same location, and (2) the user actually needs to copy the file manually to the worker nodes because the file is read before the user files are populated to the executor local directories. All of this makes it very difficult to play with the metrics configuration. h5. 
Proposed solution I think the easiest and most consistent solution would be to move the configuration from a separate file directly into {{SparkConf}}. We could prefix all the settings from the metrics configuration with, say, {{spark.metrics.props}}. For backward compatibility, these properties would still be loaded from the specified file as it works now. Such a solution doesn't change the API, so maybe it could even be included in a patch release of Spark 1.2 and Spark 1.3. Appreciate any feedback. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
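The proposal boils down to namespacing metrics settings inside {{SparkConf}} and peeling the prefix off when the metrics system starts. A minimal sketch of that extraction step, assuming the {{spark.metrics.props}} prefix suggested in the ticket (a proposed key, not an existing Spark setting):

```scala
// Lift metrics settings out of a flat conf map by stripping the agreed prefix;
// everything else in the conf is left untouched.
def extractMetricsProps(conf: Map[String, String]): Map[String, String] = {
  val prefix = "spark.metrics.props."
  conf.collect {
    case (key, value) if key.startsWith(prefix) => key.stripPrefix(prefix) -> value
  }
}
```

MetricsSystem could then consume the extracted map exactly as it consumes a parsed metrics.properties today, which is what keeps the change backward compatible.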
[jira] [Created] (SPARK-7218) Create a real iterator with open/close for Spark SQL
Reynold Xin created SPARK-7218: -- Summary: Create a real iterator with open/close for Spark SQL Key: SPARK-7218 URL: https://issues.apache.org/jira/browse/SPARK-7218 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-5556: --- Attachment: LDA_test.xlsx Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517136#comment-14517136 ] Zhang, Liye commented on SPARK-7189: Yes, I think the current solution is a tradeoff; we cannot simply change the {{>=}} to {{>}}, which would cause other problems. Anyway, I haven't thought up any other solution yet; maybe others have some novel/nice ideas. History server will always reload the same file even when no log file is updated Key: SPARK-7189 URL: https://issues.apache.org/jira/browse/SPARK-7189 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zhang, Liye Priority: Minor The history server checks every log file by its modification time. It reloads a file if the file's modification time is later than or equal to the latest modification time it remembers. So it will periodically reload the file(s) with the latest modification time even if nothing has changed. This is not necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517289#comment-14517289 ] Marcelo Vanzin commented on SPARK-7189: --- Changing the {{>=}} causes problems. If you want to fix this, you need to keep track of the log files that were loaded at the last timestamp, and ignore them if they still have that same timestamp when you re-list the log directory. History server will always reload the same file even when no log file is updated Key: SPARK-7189 URL: https://issues.apache.org/jira/browse/SPARK-7189 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zhang, Liye Priority: Minor The history server checks every log file by its modification time. It reloads a file if the file's modification time is later than or equal to the latest modification time it remembers. So it will periodically reload the file(s) with the latest modification time even if nothing has changed. This is not necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
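The bookkeeping Marcelo describes can be sketched independently of the history server code: alongside the newest modification time seen, remember which files were already loaded at that time, and reload a file only when it is strictly newer, or is a not-yet-loaded file at that same timestamp. The names below (`shouldReload`, `markLoaded`) are illustrative, not the actual provider API:

```scala
// Newest modification time seen so far, and the files already loaded at it.
var lastScanTime = -1L
var loadedAtLastScan = Set.empty[String]

// Reload if strictly newer, or exactly as new but not yet loaded.
def shouldReload(path: String, mtime: Long): Boolean =
  mtime > lastScanTime || (mtime == lastScanTime && !loadedAtLastScan(path))

// After loading, fold the file into the bookkeeping for the next scan.
def markLoaded(path: String, mtime: Long): Unit =
  if (mtime > lastScanTime) { lastScanTime = mtime; loadedAtLastScan = Set(path) }
  else if (mtime == lastScanTime) loadedAtLastScan += path
```

This keeps the inclusive {{>=}} comparison (so two files sharing the newest timestamp are both picked up) while still skipping files that were already loaded and have not changed.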
[jira] [Created] (SPARK-7194) Vectors factors method for sparse vectors should accept the output of zipWithIndex
Juliet Hougland created SPARK-7194: -- Summary: Vectors factors method for sparse vectors should accept the output of zipWithIndex Key: SPARK-7194 URL: https://issues.apache.org/jira/browse/SPARK-7194 Project: Spark Issue Type: Improvement Reporter: Juliet Hougland Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD of sparse vectors, we currently have to: arr_doubles.map{ array => val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1)) Vectors.sparse(array.length, indexElem) } Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
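The tuple flip the report complains about is easy to see in plain Scala; `Vectors.sparse` is left out so the snippet stands alone, but it would consume `indexElem` directly:

```scala
// An array with explicit zeros, as in the report.
val arr: Array[Double] = Array(0.0, 0.0, 3.2, 0.0)

// zipWithIndex yields (value, index) pairs; Vectors.sparse wants
// (index, value), hence the extra map to swap the tuple elements.
val indexElem: Seq[(Int, Double)] =
  arr.zipWithIndex.filter(t => t._1 != 0.0).map(t => (t._2, t._1)).toSeq
```

A factory method accepting (value, index) pairs directly, as the ticket proposes, would make that final map step unnecessary.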
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517280#comment-14517280 ] Sean Owen commented on SPARK-5529: -- [~arov] CDH always has the latest upstream minor release in minor releases, and back-ports maintenance release fixes into maintenance releases. This is on about the same 3-4 month cycle as Spark, so it's about as fast as one could expect; CDH 5.4 = 1.3.x already. This change isn't even in a Spark release yet, so yes, you want it to be back-ported to 1.3, probably. That has to precede ending up in CDH though. BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Assignee: Hong Shen Fix For: 1.4.0 Attachments: SPARK-5529.patch When I run a Spark job, one executor hangs; after 120s its blockManager is removed by the driver, but it is half an hour before the executor itself is removed by the driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 120000ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 
15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517281#comment-14517281 ] Alex Rovner commented on SPARK-5529: Applied patch to 1.3: https://github.com/apache/spark/pull/5745 BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Assignee: Hong Shen Fix For: 1.4.0 Attachments: SPARK-5529.patch When I run a spark job, one executor is hold, after 120s, blockManager is removed by driver, but after half an hour before the executor is remove by driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7194) Vectors factors method for sparse vectors should accept the output of zipWithIndex
[ https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliet Hougland updated SPARK-7194: --- Description: Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD of sparse vectors, we currently have to: arr_doubles.map{ array => val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1)) Vectors.sparse(array.length, indexElem) } Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. was: Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD of sparse vectors, we currently have to: arr_doubles.map{ array => val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1)) Vectors.sparse(array.length, indexElem) } Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. Vectors factors method for sparse vectors should accept the output of zipWithIndex -- Key: SPARK-7194 URL: https://issues.apache.org/jira/browse/SPARK-7194 Project: Spark Issue Type: Improvement Reporter: Juliet Hougland Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0...). 
If we want to transform this into an RDD of sparse vectors, we currently have to: arr_doubles.map{ array => val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1)) Vectors.sparse(array.length, indexElem) } Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7195) Can't start spark shell or pyspark in Windows 7
Mark Smiley created SPARK-7195: -- Summary: Can't start spark shell or pyspark in Windows 7 Key: SPARK-7195 URL: https://issues.apache.org/jira/browse/SPARK-7195 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell Affects Versions: 1.3.1 Environment: Windows 7, Java 8 (1.8.0_31) or Java 7 (1.7.0_79), Scala 2.11.6, Python 2.7 Reporter: Mark Smiley cd\spark\bin dir spark-shell yields following error: find: 'version': No such file or directory else was unexpected at this time Same error with spark-shell2.cmd PyShell starts but with errors and doesn't work properly once started (e.g., can't find sc). Can send screenshot of errors on request. Using Spark 1.3.1 for Hadoop 2.6 binary Note: Hadoop not installed on machine. Scala works by itself, Python works by itself Java works fine (I use it all the time) Based on another comment, tried Java 7 (1.7.0_79), but it made no difference (same error). JAVA_HOME = C:\jdk1.8.0\bin C:\jdk1.8.0\bin\;C:\Program Files (x86)\scala\bin;C:\Python27;c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin;C:\Oracle\product64\12.1.0\client_1\bin;C:\Oracle\product\12.1.0\client_1\bin;C:\ProgramData\Oracle\Java\javapath;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\Dell\Dell Data Protection\Access\Advanced\Wave\Gemalto\Access Client\v5\;C:\Program Files (x86)\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\Program Files\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\MiKTeX 2.9\miktex\bin\x64\;C:\Program Files 
(x86)\ActivIdentity\ActivClient\;C:\Program Files\ActivIdentity\ActivClient\ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517285#comment-14517285 ] Apache Spark commented on SPARK-5529: - User 'alexrovner' has created a pull request for this issue: https://github.com/apache/spark/pull/5745 BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Assignee: Hong Shen Fix For: 1.4.0 Attachments: SPARK-5529.patch When I run a spark job, one executor is hold, after 120s, blockManager is removed by driver, but after half an hour before the executor is remove by driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 
15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517257#comment-14517257 ] Alex Rovner commented on SPARK-5529: CDH is usually somewhat slow on picking up the latest changes though. Would it be possible to backport this fix into 1.3? BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Assignee: Hong Shen Fix For: 1.4.0 Attachments: SPARK-5529.patch When I run a spark job, one executor is hold, after 120s, blockManager is removed by driver, but after half an hour before the executor is remove by driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 
15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6756) Add compress() to Vector
[ https://issues.apache.org/jira/browse/SPARK-6756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6756. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5756 [https://github.com/apache/spark/pull/5756] Add compress() to Vector Key: SPARK-6756 URL: https://issues.apache.org/jira/browse/SPARK-6756 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 Add compress to Vector that automatically converts the underlying vector to dense or sparse based on the number of non-zeros. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
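A compress() decision can be sketched as a storage-size comparison between the two representations. The byte estimates below are rough assumptions of mine (8 bytes per element for dense, about 12 bytes per nonzero plus overhead for sparse), not necessarily the exact heuristic the merged PR uses:

```scala
// Rough per-representation storage estimates, in bytes.
def denseSize(n: Int): Long = 8L * n + 8
def sparseSize(nnz: Int): Long = 12L * nnz + 20

// compress() would keep whichever representation is smaller:
// sparse pays extra per nonzero (indices), dense pays for every slot.
def preferSparse(values: Array[Double]): Boolean = {
  val nnz = values.count(_ != 0.0)
  sparseSize(nnz) < denseSize(values.length)
}
```

Under these estimates a mostly-zero vector compresses to sparse, while a short or mostly-nonzero vector stays dense.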
[jira] [Commented] (SPARK-7220) Check whether moving shared params is a compatible change
[ https://issues.apache.org/jira/browse/SPARK-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518747#comment-14518747 ] Xiangrui Meng commented on SPARK-7220: -- I compiled an example app that calls LinearRegression with elasticNetParam, then I moved the methods under HasElasticNetParam to LinearRegressionParams. Without re-compiling, the app jar works with the new Spark assembly jar. So we can treat shared params as implementation details and we don't need to worry about where the methods get declared. Check whether moving shared params is a compatible change - Key: SPARK-7220 URL: https://issues.apache.org/jira/browse/SPARK-7220 Project: Spark Issue Type: Task Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.4.0 Shared params are private, and their usage is treated as an implementation detail. But we need to make sure moving params from shared to a concrete class is a compatible change. Otherwise, we shouldn't use shared params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7194) Vectors factory method for sparse vectors should accept the output of zipWithIndex
[ https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7194: --- Assignee: Apache Spark Vectors factory method for sparse vectors should accept the output of zipWithIndex -- Key: SPARK-7194 URL: https://issues.apache.org/jira/browse/SPARK-7194 Project: Spark Issue Type: Improvement Reporter: Juliet Hougland Assignee: Apache Spark Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD of sparse vectors, we currently have to:
{noformat}
arr_doubles.map { array =>
  val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
  Vectors.sparse(array.length, indexElem)
}
{noformat}
Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
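For illustration, the flip the ticket complains about can be mirrored in plain Python (`to_sparse_pairs` is a hypothetical helper, not a real MLlib API; Scala's `zipWithIndex` yields (value, index) pairs, which is why the extra map is needed):

```python
def to_sparse_pairs(values):
    """Build the (index, value) pairs that Vectors.sparse expects from
    a dense array, mirroring the Scala snippet in the ticket."""
    pairs = [(v, i) for i, v in enumerate(values)]   # like array.zipWithIndex: (value, index)
    nonzero = [p for p in pairs if p[0] != 0.0]      # drop the explicitly recorded zeros
    return [(i, v) for v, i in nonzero]              # the extra flip the ticket wants to avoid
```

The proposed factory method would accept the (value, index) tuples directly, making the last line unnecessary.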
[jira] [Created] (SPARK-7220) Check whether moving shared params is a compatible change
Xiangrui Meng created SPARK-7220: Summary: Check whether moving shared params is a compatible change Key: SPARK-7220 URL: https://issues.apache.org/jira/browse/SPARK-7220 Project: Spark Issue Type: Task Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Shared params are private, and their usage is treated as an implementation detail. But we need to make sure moving params from shared to a concrete class is a compatible change. Otherwise, we shouldn't use shared params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-5556: --- Attachment: spark-summit.pptx Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7202: --- Priority: Major (was: Minor) Add SparseMatrixPickler to SerDe Key: SPARK-7202 URL: https://issues.apache.org/jira/browse/SPARK-7202 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar We need a SparseMatrixPickler similar to the DenseMatrixPickler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518721#comment-14518721 ] Zhang, Liye commented on SPARK-7189: Hi [~vanzin], I think using a timestamp is not that precise. This method is very similar to the one using modification time. There will always be situations where several operations finish within a very short time (say less than 1 millisecond, or even shorter). So the timestamp and modification time cannot be trusted. The target is to detect status changes of the files, including content changes (write operations) and permission changes (rename operations). `Inotify` can detect the change, but it's not available in HDFS before version 2.7. One way to tell the change is to set a flag after each operation and reset the flag after reloading the file. But this would make the code really ugly, so it's a bad option. History server will always reload the same file even when no log file is updated Key: SPARK-7189 URL: https://issues.apache.org/jira/browse/SPARK-7189 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zhang, Liye Priority: Minor The history server checks every log file by its modification time. It reloads a file if the file's modification time is later than or equal to the latest modification time it remembers. So it will periodically reload the file(s) with the latest modification time even if nothing has changed. This is not necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
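The strict-comparison alternative being debated in SPARK-7189 can be sketched as follows (`files_to_reload` is a hypothetical helper, not the history server's actual code):

```python
def files_to_reload(file_mtimes, last_scan_mtime):
    """Reload only files whose modification time is strictly newer
    than the newest time seen at the previous scan.

    The strict '>' stops the newest file from being re-read on every
    scan, but -- as discussed in the comments -- it can miss a second
    write that lands within the same modification-time tick.
    """
    return sorted(f for f, t in file_mtimes.items() if t > last_scan_mtime)
```

With `>=` in place of `>`, the file carrying the latest mtime would be returned on every scan even when nothing changed, which is exactly the reported behavior.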
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518618#comment-14518618 ] Guoqiang Li commented on SPARK-5556: LDA_Gibbs combines the advantages of the AliasLDA, FastLDA and SparseLDA algorithms. The corresponding code is https://github.com/witgo/spark/tree/lda_Gibbs or https://github.com/witgo/zen/blob/master/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L553. Yes, LightLDA converges faster, but it takes up more memory. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518621#comment-14518621 ] Guoqiang Li commented on SPARK-5556: [spark-summit.pptx|https://issues.apache.org/jira/secure/attachment/12729035/spark-summit.pptx] introduces the relevant algorithm. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7219) HashingTF should output ML attributes
Xiangrui Meng created SPARK-7219: Summary: HashingTF should output ML attributes Key: SPARK-7219 URL: https://issues.apache.org/jira/browse/SPARK-7219 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng HashingTF knows the output feature dimension, which should be in the output ML attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
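For context, the hashing trick behind HashingTF can be sketched in a few lines (Python's built-in `hash()` stands in for Spark's term hashing; the function and bucket layout are illustrative, not Spark's implementation). Because `num_features` is fixed up front, it is exactly the output dimension SPARK-7219 proposes to record as ML attributes:

```python
def hashing_tf(terms, num_features=16):
    """Count each term in the bucket hash(term) % num_features,
    producing a fixed-length term-frequency vector."""
    counts = [0.0] * num_features
    for term in terms:
        counts[hash(term) % num_features] += 1.0
    return counts
```

Downstream stages can then read the vector length (and hence the attribute group size) without inspecting any rows.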
[jira] [Updated] (SPARK-7219) HashingTF should output ML attributes
[ https://issues.apache.org/jira/browse/SPARK-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7219: - Priority: Trivial (was: Major) HashingTF should output ML attributes - Key: SPARK-7219 URL: https://issues.apache.org/jira/browse/SPARK-7219 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Trivial HashingTF knows the output feature dimension, which should be in the output ML attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7208. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5759 [https://github.com/apache/spark/pull/5759] Add Matrix, SparseMatrix to __all__ list in linalg.py - Key: SPARK-7208 URL: https://issues.apache.org/jira/browse/SPARK-7208 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users
Saisai Shao created SPARK-7221: -- Summary: Expose the current processed file name of FileInputDStream to the users Key: SPARK-7221 URL: https://issues.apache.org/jira/browse/SPARK-7221 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Saisai Shao Priority: Minor This is a feature requested on the Spark user list (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html). Currently there's no API to get the processed file name from FileInputDStream; it would be useful if we could expose this to the users. The major problem is how to expose this to the users in an elegant way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users
[ https://issues.apache.org/jira/browse/SPARK-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-7221: --- Issue Type: Wish (was: New Feature) Expose the current processed file name of FileInputDStream to the users --- Key: SPARK-7221 URL: https://issues.apache.org/jira/browse/SPARK-7221 Project: Spark Issue Type: Wish Components: Streaming Reporter: Saisai Shao Priority: Minor This is a feature requested on the Spark user list (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html). Currently there's no API to get the processed file name from FileInputDStream; it would be useful if we could expose this to the users. The major problem is how to expose this to the users in an elegant way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7220) Check whether moving shared params is a compatible change
[ https://issues.apache.org/jira/browse/SPARK-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-7220. Resolution: Done Fix Version/s: 1.4.0 Check whether moving shared params is a compatible change - Key: SPARK-7220 URL: https://issues.apache.org/jira/browse/SPARK-7220 Project: Spark Issue Type: Task Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.4.0 Shared params are private, and their usage is treated as an implementation detail. But we need to make sure moving params from shared to a concrete class is a compatible change. Otherwise, we shouldn't use shared params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master
[ https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517301#comment-14517301 ] Nicholas Chammas commented on SPARK-5189: - Yeah, as Sean said you can just start working on this whenever you want. Just let us know over here in a comment and that way others can know that someone is already working on this. This issue is still relevant, but unfortunately, solving it requires redesigning the whole of spark-ec2 to be able to provision nodes in parallel. This means changing the Bash scripts in the mesos/spark-ec2 repo to act on 1 node at a time, and changing the main spark-ec2 script itself to be multi-threaded (or somehow otherwise asynchronous) to be able to manage several nodes in parallel. It's probably a major effort, but you can definitely take it on if you are interested. Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master --- Key: SPARK-5189 URL: https://issues.apache.org/jira/browse/SPARK-5189 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, then setting up all the slaves together. This includes broadcasting files from the lonely master to potentially hundreds of slaves. There are 2 main problems with this approach:
# Broadcasting files from the master to all slaves using [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] (e.g. during [ephemeral-hdfs init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36], or during [Spark setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3]) takes a long time. This time increases as the number of slaves increases. I did some testing in {{us-east-1}}. This is, concretely, what the problem looks like:
|| number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
| 1 | 8m 44s |
| 10 | 13m 45s |
| 25 | 22m 50s |
| 50 | 37m 30s |
| 75 | 51m 30s |
| 99 | 1h 5m 30s |
Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but I think the point is clear enough.
# It's more complicated to add slaves to an existing cluster (a la [SPARK-2008]), since slaves are only configured through the master during the setup of the master itself.
Logically, the operations we want to implement are:
* Provision a Spark node
* Join a node to a cluster (including an empty cluster) as either a master or a slave
* Remove a node from a cluster
We need our scripts to roughly be organized to match the above operations. The goals would be:
# When launching a cluster, enable all cluster nodes to be provisioned in parallel, removing the master-to-slave file broadcast bottleneck.
# Facilitate cluster modifications like adding or removing nodes.
# Enable exploration of infrastructure tools like [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} internals and perhaps even allow us to build [one tool that launches Spark clusters on several different cloud platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].
More concretely, the modifications we need to make are:
* Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with equivalent, slave-side operations.
* Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it fully creates a node that can be used as either a master or slave.
* Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, configures it as a master or slave, and joins it to a cluster.
* Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete that script.
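The launch times in the SPARK-5189 table grow roughly linearly with cluster size. A quick least-squares fit (a back-of-the-envelope sketch, not part of spark-ec2) puts the serialized per-slave cost at roughly 0.58 minutes, i.e. about 35 seconds per additional slave:

```python
# Launch times from the table above, as (slaves, minutes).
data = [(1, 8 + 44 / 60.0), (10, 13.75), (25, 22 + 50 / 60.0),
        (50, 37.5), (75, 51.5), (99, 65.5)]

def per_slave_minutes(points):
    """Ordinary least-squares slope: extra launch minutes added by
    each additional slave while setup is serialized via the master."""
    n = float(len(points))
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den
```

A near-constant per-slave increment is consistent with the per-node work being serialized through the master, which is the bottleneck this ticket wants to remove.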
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5253) LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
[ https://issues.apache.org/jira/browse/SPARK-5253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5253. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4259 [https://github.com/apache/spark/pull/4259] LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package --- Key: SPARK-5253 URL: https://issues.apache.org/jira/browse/SPARK-5253 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7198) VectorAssembler should carry ML metadata
Xiangrui Meng created SPARK-7198: Summary: VectorAssembler should carry ML metadata Key: SPARK-7198 URL: https://issues.apache.org/jira/browse/SPARK-7198 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Now it only outputs assembled vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7195) Can't start spark shell or pyspark in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7195. -- Resolution: Duplicate Have a look around JIRA first. Can't start spark shell or pyspark in Windows 7 --- Key: SPARK-7195 URL: https://issues.apache.org/jira/browse/SPARK-7195 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell Affects Versions: 1.3.1 Environment: Windows 7, Java 8 (1.8.0_31) or Java 7 (1.7.0_79), Scala 2.11.6, Python 2.7 Reporter: Mark Smiley
{noformat}
cd\spark\bin
dir
spark-shell
{noformat}
yields the following error:
{noformat}
find: 'version': No such file or directory
else was unexpected at this time
{noformat}
Same error with spark-shell2.cmd. The PySpark shell starts but with errors and doesn't work properly once started (e.g., can't find sc). Can send screenshot of errors on request. Using the Spark 1.3.1 for Hadoop 2.6 binary. Note: Hadoop is not installed on the machine. Scala works by itself, Python works by itself, and Java works fine (I use it all the time). Based on another comment, tried Java 7 (1.7.0_79), but it made no difference (same error).
JAVA_HOME = C:\jdk1.8.0\bin C:\jdk1.8.0\bin\;C:\Program Files (x86)\scala\bin;C:\Python27;c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin;C:\Oracle\product64\12.1.0\client_1\bin;C:\Oracle\product\12.1.0\client_1\bin;C:\ProgramData\Oracle\Java\javapath;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\Dell\Dell Data Protection\Access\Advanced\Wave\Gemalto\Access Client\v5\;C:\Program Files (x86)\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\Program Files\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\MiKTeX 2.9\miktex\bin\x64\;C:\Program Files (x86)\ActivIdentity\ActivClient\;C:\Program Files\ActivIdentity\ActivClient\ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517298#comment-14517298 ] Alex Rovner commented on SPARK-5529: Sorry, I pulled the trigger too quickly... Need to resolve some compilation errors. BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Assignee: Hong Shen Fix For: 1.4.0 Attachments: SPARK-5529.patch When I run a spark job, one executor hangs; after 120s, its blockManager is removed by the driver, but it takes another half an hour before the executor is removed by the driver. Here is the log:
{code}
15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 120000ms
15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated
15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost)
15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1
15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster.
15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7197) Join with DataFrame Python API not working properly with more than 1 column
Ali Bajwa created SPARK-7197: Summary: Join with DataFrame Python API not working properly with more than 1 column Key: SPARK-7197 URL: https://issues.apache.org/jira/browse/SPARK-7197 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.1 Reporter: Ali Bajwa It looks like join with the DataFrames API in Python does not return correct results if using 2 or more columns. The example in the documentation only shows a single column. Here is an example to show the problem:
Example code:
{noformat}
import pandas as pd
from pyspark.sql import SQLContext
hc = SQLContext(sc)
A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5', '12', '12'], 'value': [100, 200, 300]})
a = hc.createDataFrame(A)
B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'], 'value': [101, 102]})
b = hc.createDataFrame(B)

# try with Pandas
print "Pandas"
print A
print B
print pd.merge(A, B, on=['year', 'month'], how='inner')

print "Spark"
print a.toPandas()
print b.toPandas()
print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
{noformat}
Output:
{noformat}
Pandas
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994
  month  value  year
0    12    101  1993
1    12    102  1993
Empty DataFrame
Columns: [month, value_x, year, value_y]
Index: []
Spark
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994
  month  value  year
0    12    101  1993
1    12    102  1993
  month  value  year month  value  year
0    12    200  2005    12    102  1993
1    12    200  2005    12    101  1993
2    12    300  1994    12    102  1993
3    12    300  1994    12    101  1993
{noformat}
It looks like Spark returns some results where an inner join should return nothing. Confirmed on the user mailing list as an issue with Ayan Guha. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
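The Spark output in SPARK-7197 is exactly what a join on month alone would produce, which points at the join condition rather than the join itself. A plausible root cause, sketched with a stand-in Column class (an illustration of Python semantics, not PySpark's implementation): Python's `and` cannot be overloaded, so with a truthy left operand it simply returns the right operand, silently dropping the year test; the overloadable `&` operator keeps both predicates.

```python
class Column(object):
    """Stand-in for a Spark SQL Column: an ordinary (truthy)
    expression object."""
    def __init__(self, expr):
        self.expr = expr
    def __and__(self, other):
        # the operator PySpark actually overloads for `&`
        return Column("(%s AND %s)" % (self.expr, other.expr))

year_eq = Column("a.year = b.year")
month_eq = Column("a.month = b.month")

# `x and y` returns y when x is truthy: the year test vanishes.
cond = year_eq and month_eq
assert cond is month_eq

# `&` combines both predicates, which is why PySpark requires it.
both = year_eq & month_eq
assert both.expr == "(a.year = b.year AND a.month = b.month)"
```

This is consistent with the reported result: every pair of rows with month 12 is joined, regardless of year.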
[jira] [Created] (SPARK-7196) decimal precision lost when loading DataFrame from JDBC
Ken Geis created SPARK-7196: --- Summary: decimal precision lost when loading DataFrame from JDBC Key: SPARK-7196 URL: https://issues.apache.org/jira/browse/SPARK-7196 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Ken Geis I have a decimal database field that is defined as 10.2 (i.e. ##.##). When I load it into Spark via sqlContext.jdbc(..), the type of the corresponding field in the DataFrame is DecimalType, with precisionInfo None. Because of that loss of precision information, SPARK-4176 is triggered when I try to .saveAsTable(..). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7140) Do not scan all values in Vector.hashCode
[ https://issues.apache.org/jira/browse/SPARK-7140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7140. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 Issue resolved by pull request 5697 [https://github.com/apache/spark/pull/5697] Do not scan all values in Vector.hashCode - Key: SPARK-7140 URL: https://issues.apache.org/jira/browse/SPARK-7140 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.2, 1.4.0 It makes hashCode really expensive. The Pyrolite version we are using in Spark calls it in serialization. Scanning the first few nonzeros should be sufficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
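The idea in SPARK-7140 can be sketched as follows (`vector_hash`, the mixing constants, and the 16-nonzero cutoff are illustrative assumptions, not Spark's actual hashCode):

```python
def vector_hash(values, max_nnz=16):
    """Fold at most max_nnz leading non-zeros -- index and value --
    into the hash, instead of scanning the whole vector."""
    h = 17
    seen = 0
    for i, v in enumerate(values):
        if v != 0.0:
            h = (31 * h + i) % 2 ** 32
            h = (31 * h + hash(v)) % 2 ** 32
            seen += 1
            if seen == max_nnz:
                break
    return h
```

Vectors that agree on their leading non-zeros hash alike, which weakens the hash slightly but makes it O(1)-ish for the serialization path mentioned in the ticket.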
[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517946#comment-14517946 ] koert kuipers edited comment on SPARK-3655 at 4/28/15 8:18 PM: --- since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regards to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: a sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]:
{noformat}
def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)]
{noformat}
(note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works.) 2) don't do anything. the functionality this jira targets is already available in the small smart-sorted library which is available on spark-packages, and that's good enough.
was (Author: koert): since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regards to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: an sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]: def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)] (note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works. 2) don't to anything. the functionality this jira targets is already available in the small smart-sorted library which is available on spark-packages, and that's good enough. Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
There are some use cases where getting a sorted iterator of values per key is helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517946#comment-14517946 ] koert kuipers edited comment on SPARK-3655 at 4/28/15 8:19 PM: --- since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regards to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: a sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]:
{noformat}
def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)]
{noformat}
(note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works.) 2) don't do anything. the functionality this jira targets is already available in the small spark-sorted library which is available on spark-packages, and that's good enough.
was (Author: koert): since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regard to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: a sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]: {noformat} def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)] {noformat} (note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works.) 2) don't do anything. the functionality this jira targets is already available in the small smart-sorted library which is available on spark-packages, and that's good enough. Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
There are some use cases where getting a sorted iterator of values per key is helpful.
[jira] [Assigned] (SPARK-7205) Support local ivy cache in --packages
[ https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7205: --- Assignee: (was: Apache Spark) Support local ivy cache in --packages - Key: SPARK-7205 URL: https://issues.apache.org/jira/browse/SPARK-7205 Project: Spark Issue Type: Bug Components: Spark Submit Reporter: Burak Yavuz Priority: Critical Fix For: 1.4.0
[jira] [Commented] (SPARK-7205) Support local ivy cache in --packages
[ https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517988#comment-14517988 ] Apache Spark commented on SPARK-7205: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/5755 Support local ivy cache in --packages - Key: SPARK-7205 URL: https://issues.apache.org/jira/browse/SPARK-7205 Project: Spark Issue Type: Bug Components: Spark Submit Reporter: Burak Yavuz Priority: Critical Fix For: 1.4.0
[jira] [Assigned] (SPARK-7205) Support local ivy cache in --packages
[ https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7205: --- Assignee: Apache Spark Support local ivy cache in --packages - Key: SPARK-7205 URL: https://issues.apache.org/jira/browse/SPARK-7205 Project: Spark Issue Type: Bug Components: Spark Submit Reporter: Burak Yavuz Assignee: Apache Spark Priority: Critical Fix For: 1.4.0
[jira] [Created] (SPARK-7204) Call sites in UI are not accurate for DataFrame operations
Patrick Wendell created SPARK-7204: -- Summary: Call sites in UI are not accurate for DataFrame operations Key: SPARK-7204 URL: https://issues.apache.org/jira/browse/SPARK-7204 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical Spark core computes call sites by climbing up the stack until we reach the stack frame at the boundary of user code and Spark code. The way we compute whether a given frame is internal (Spark) or user code does not work correctly with the new DataFrame API. Once the scope work goes in, we'll have a nicer way to express units of operator scope, but until then there is a simple fix where we just make sure the SQL internal classes are also skipped as we climb up the stack.
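The frame-climbing logic described above can be sketched in a few lines of Python. This is a simplified illustration, not Spark's actual implementation (which lives in Scala in `Utils.getCallSite`); the package-prefix lists and class names below are invented for the example.

```python
def first_user_frame(frames, internal_prefixes):
    """Walk the stack from innermost frame outward and return the first
    frame not matching an internal prefix: the call site shown in the UI."""
    for qualified_name in frames:
        # str.startswith accepts a tuple of prefixes
        if not qualified_name.startswith(internal_prefixes):
            return qualified_name
    return "<unknown>"

# A hypothetical stack for a DataFrame action.
stack = [
    "org.apache.spark.sql.DataFrame.collect",  # Spark SQL internals
    "com.example.MyJob.run",                   # actual user code
]

# Before the fix: the SQL packages are not recognized as internal,
# so a Spark-internal frame is reported as the call site.
before = first_user_frame(stack, ("org.apache.spark.rdd.", "org.apache.spark.scheduler."))
# After the fix: the SQL internal classes are also skipped.
after = first_user_frame(
    stack,
    ("org.apache.spark.rdd.", "org.apache.spark.scheduler.", "org.apache.spark.sql."),
)
print(before)  # org.apache.spark.sql.DataFrame.collect
print(after)   # com.example.MyJob.run
```

The "simple fix" the ticket mentions is then just extending the internal-prefix set, which is exactly the difference between the two calls above.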
[jira] [Updated] (SPARK-5338) Support cluster mode with Mesos
[ https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5338: - Affects Version/s: 1.0.0 Support cluster mode with Mesos --- Key: SPARK-5338 URL: https://issues.apache.org/jira/browse/SPARK-5338 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 1.0.0 Reporter: Timothy Chen Fix For: 1.4.0 Currently, when using Spark with Mesos, the only supported deployment is client mode. It is also useful to have a cluster mode deployment that can be shared and long running.
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517946#comment-14517946 ] koert kuipers commented on SPARK-3655: -- since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regard to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: a sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]: {noformat} def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)] {noformat} (note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works.) 2) don't do anything. the functionality this jira targets is already available in the small smart-sorted library which is available on spark-packages, and that's good enough. Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful.
[jira] [Closed] (SPARK-5338) Support cluster mode with Mesos
[ https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5338. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Timothy Chen Target Version/s: 1.4.0 Support cluster mode with Mesos --- Key: SPARK-5338 URL: https://issues.apache.org/jira/browse/SPARK-5338 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 1.0.0 Reporter: Timothy Chen Assignee: Timothy Chen Fix For: 1.4.0 Currently, when using Spark with Mesos, the only supported deployment is client mode. It is also useful to have a cluster mode deployment that can be shared and long running.
[jira] [Commented] (SPARK-6943) Graphically show RDD's included in a stage
[ https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518084#comment-14518084 ] Andrew Or commented on SPARK-6943: -- Yeah ideally we will have the job graph that magnifies into the stage graph. I'll see what I can do. Graphically show RDD's included in a stage -- Key: SPARK-6943 URL: https://issues.apache.org/jira/browse/SPARK-6943 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Patrick Wendell Assignee: Andrew Or Attachments: DAGvisualizationintheSparkWebUI.pdf, with-closures.png, with-stack-trace.png
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518141#comment-14518141 ] Joseph K. Bradley commented on SPARK-5556: -- Great! I'm not aware of blockers. As far as other active implementations, the only ones I know of are those referenced by [~gq] above. Please do ping him on your work and see if there are ideas which can be merged. We can help with the coordination and discussions as well. Thanks! Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez
[jira] [Issue Comment Deleted] (SPARK-5014) GaussianMixture (GMM) improvements
[ https://issues.apache.org/jira/browse/SPARK-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5014: - Comment: was deleted (was: No need for umbrella JIRA) GaussianMixture (GMM) improvements -- Key: SPARK-5014 URL: https://issues.apache.org/jira/browse/SPARK-5014 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley This is an umbrella JIRA for improvements to Gaussian Mixture Models (GMMs).
[jira] [Assigned] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7208: --- Assignee: Apache Spark (was: Joseph K. Bradley) Add Matrix, SparseMatrix to __all__ list in linalg.py - Key: SPARK-7208 URL: https://issues.apache.org/jira/browse/SPARK-7208 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Trivial
[jira] [Assigned] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7208: --- Assignee: Joseph K. Bradley (was: Apache Spark) Add Matrix, SparseMatrix to __all__ list in linalg.py - Key: SPARK-7208 URL: https://issues.apache.org/jira/browse/SPARK-7208 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial
[jira] [Created] (SPARK-7209) Adding new Manning book Spark in Action to the official Spark Webpage
Aleksandar Dragosavljevic created SPARK-7209: Summary: Adding new Manning book Spark in Action to the official Spark Webpage Key: SPARK-7209 URL: https://issues.apache.org/jira/browse/SPARK-7209 Project: Spark Issue Type: Task Components: Documentation Reporter: Aleksandar Dragosavljevic Priority: Minor Manning Publications is developing a book Spark in Action written by Marko Bonaci and Petar Zecevic (http://www.manning.com/bonaci), and it would be great if the book could be added to the list of books at the official Spark Webpage (https://spark.apache.org/documentation.html). This book teaches readers to use Spark for stream and batch data processing. It starts with an introduction to the Spark architecture and ecosystem, followed by a taste of Spark's command-line interface. Readers then discover the most fundamental concepts and abstractions of Spark, particularly Resilient Distributed Datasets (RDDs) and the basic data transformations that RDDs provide. The first part of the book also introduces you to writing Spark applications using the core APIs. Next, you learn about different Spark components: how to work with structured data using Spark SQL, how to process near-real-time data with Spark Streaming, how to apply machine learning algorithms with Spark MLlib, how to apply graph algorithms on graph-shaped data using Spark GraphX, and a clear introduction to Spark clustering. The book is already available to the public as part of our Manning Early Access Program (MEAP), where we deliver chapters to the public as soon as they are written. We believe it will offer significant support to Spark users and the community.
[jira] [Commented] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518075#comment-14518075 ] Apache Spark commented on SPARK-7208: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/5759 Add Matrix, SparseMatrix to __all__ list in linalg.py - Key: SPARK-7208 URL: https://issues.apache.org/jira/browse/SPARK-7208 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial
[jira] [Commented] (SPARK-7213) Exception while copying Hadoop config files due to permission issues
[ https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518201#comment-14518201 ] Apache Spark commented on SPARK-7213: - User 'nishkamravi2' has created a pull request for this issue: https://github.com/apache/spark/pull/5760 Exception while copying Hadoop config files due to permission issues Key: SPARK-7213 URL: https://issues.apache.org/jira/browse/SPARK-7213 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi
[jira] [Assigned] (SPARK-7213) Exception while copying Hadoop config files due to permission issues
[ https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7213: --- Assignee: Apache Spark Exception while copying Hadoop config files due to permission issues Key: SPARK-7213 URL: https://issues.apache.org/jira/browse/SPARK-7213 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi Assignee: Apache Spark
[jira] [Updated] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7208: - Summary: Add Matrix, SparseMatrix to __all__ list in linalg.py (was: Add SparseMatrix to __all__ list in linalg.py) Add Matrix, SparseMatrix to __all__ list in linalg.py - Key: SPARK-7208 URL: https://issues.apache.org/jira/browse/SPARK-7208 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial