[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516852#comment-14516852 ] Sean Owen commented on SPARK-7189: -- Hm, I'd swear we had discussed this already and there was a good reason for it from [~vanzin], but I can't find the PR or JIRA now. I remember a PR changing the >= to > and the result was that it was on purpose. Not sure if this was a helpful comment, but I do remember something like this.

History server will always reload the same file even when no log file is updated
Key: SPARK-7189
URL: https://issues.apache.org/jira/browse/SPARK-7189
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 1.3.0
Reporter: Zhang, Liye
Priority: Minor

The history server checks every log file by its modification time. It reloads a file if the file's modification time is later than or equal to the latest modification time it remembers, so it periodically reloads the file(s) with the latest modification time even when nothing has changed. This is not necessary.

-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
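The trade-off between the two comparisons can be sketched in plain Python (a toy model, not Spark's actual HistoryServer code; names are illustrative):

```python
class LogChecker:
    """Toy model of the history server's reload decision."""

    def __init__(self):
        self.last_seen_mtime = -1

    def should_reload(self, mtime, strict=False):
        # With strict=False (>=), a file whose mtime equals the newest one we
        # remember is re-read on every scan, even if nothing changed -- the
        # behavior this issue reports.
        # With strict=True (>), an update that lands within the same timestamp
        # granularity as the remembered mtime would be missed -- a plausible
        # reason the non-strict comparison was kept on purpose.
        if strict:
            return mtime > self.last_seen_mtime
        return mtime >= self.last_seen_mtime
```
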
[jira] [Commented] (SPARK-7193) Spark on Mesos may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516801#comment-14516801 ] Littlestar commented on SPARK-7193: ---
{noformat}
15/04/28 18:45:53 INFO spark.SparkContext: Running Spark version 1.3.1
Spark context available as sc.
15/04/28 18:45:57 INFO repl.SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.

scala> val data = Array(1, 2, 3, 4, 5)
data: Array[Int] = Array(1, 2, 3, 4, 5)

scala> val distData = sc.parallelize(data)
distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:23

scala> distData.reduce(_+_)
---
org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 17, hpblade06): ExecutorLostFailure (executor 20150427-165835-1214949568-5050-6-S0 lost)
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
{noformat}

Spark on Mesos may need more tests for spark 1.3.1 release
Key: SPARK-7193
URL: https://issues.apache.org/jira/browse/SPARK-7193
Project: Spark
Issue Type: Bug
Components: Mesos
Affects Versions: 1.3.1
Reporter: Littlestar

Spark on Mesos may need more tests for the spark 1.3.1 release. http://spark.apache.org/docs/latest/running-on-mesos.html
I tested mesos 0.21.1/0.22.0/0.22.1 RC4. It works well with ./bin/spark-shell --master mesos://host:5050, but any task that needs more than one node throws the following exception.
{noformat}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 0.0 failed 4 times, most recent failure: Lost task 10.3 in stage 0.0 (TID 127, hpblade05): java.lang.IllegalStateException: unread block data
  at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2393)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378)
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1963)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1887)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
  at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346)
  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:368)
  at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68)
  at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
  at java.lang.Thread.run(Thread.java:679)
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
  at scala.Option.foreach(Option.scala:236)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/04/28 15:33:18 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception
java.lang.reflect.InvocationTargetException
  at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
  at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
  at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144)
{noformat}
[jira] [Comment Edited] (SPARK-7193) Spark on Mesos may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516801#comment-14516801 ] Littlestar edited comment on SPARK-7193 at 4/28/15 10:51 AM, adding the cluster layout "1 master + 7 nodes (spark 1.3.1 + mesos 0.22.0/0.22.1)" above the same spark-shell log and driver stack trace quoted in the comment above.
[jira] [Comment Edited] (SPARK-7193) Spark on Mesos may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516801#comment-14516801 ] Littlestar edited comment on SPARK-7193 at 4/28/15 10:53 AM: - 1 master + 7 nodes (spark 1.3.1 + mesos 0.22.0/0.22.1)
{noformat}
./spark-shell --master mesos://hpblade02:5050
{noformat}
followed by the same spark-shell log and driver stack trace quoted in the comment above.
[jira] [Commented] (SPARK-7193) Spark on Mesos may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516807#comment-14516807 ] Littlestar commented on SPARK-7193: --- Exception from a Mesos worker node log:
{noformat}
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/executor/MesosExecutorBackend
Caused by: java.lang.ClassNotFoundException: org.apache.spark.executor.MesosExecutorBackend
  at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: org.apache.spark.executor.MesosExecutorBackend. Program will exit.
{noformat}
[jira] [Commented] (SPARK-7133) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python
[ https://issues.apache.org/jira/browse/SPARK-7133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14516815#comment-14516815 ] Apache Spark commented on SPARK-7133: - User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/5744

Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python
Key: SPARK-7133
URL: https://issues.apache.org/jira/browse/SPARK-7133
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Reynold Xin
Labels: starter

Typing
{code}
df.col[1]
{code}
and
{code}
df.col['field']
{code}
is so much easier than
{code}
df.col.getField('field')
df.col.getItem(1)
{code}
This would require us to define (in Column) an apply function in Scala, and a __getitem__ function in Python.
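The proposed Python dispatch can be illustrated with a toy Column class (a sketch only, not PySpark's actual implementation; the expression strings are placeholders):

```python
class Column:
    """Toy model of the proposed accessor API."""

    def __init__(self, name):
        self.name = name

    def getItem(self, key):
        # stand-in for extracting an element from an array/map column
        return Column("%s[%r]" % (self.name, key))

    def getField(self, name):
        # stand-in for extracting a named field from a struct column
        return Column("%s.%s" % (self.name, name))

    def __getitem__(self, key):
        # df.col['field'] routes to getField, df.col[1] routes to getItem
        if isinstance(key, str):
            return self.getField(key)
        return self.getItem(key)
```
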
[jira] [Assigned] (SPARK-7133) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python
[ https://issues.apache.org/jira/browse/SPARK-7133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7133: --- Assignee: Apache Spark
[jira] [Updated] (SPARK-7161) Provide REST api to download event logs from History Server
[ https://issues.apache.org/jira/browse/SPARK-7161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kostas Sakellis updated SPARK-7161: --- Component/s: (was: Streaming) Spark Core

Provide REST api to download event logs from History Server
Key: SPARK-7161
URL: https://issues.apache.org/jira/browse/SPARK-7161
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 1.3.1
Reporter: Hari Shreedharan
Priority: Minor

The idea is to tar up the logs and return the tar.gz file through a REST API. This can be used for debugging even after the app is done. I am planning to take a look at this.
[jira] [Created] (SPARK-7203) Python API for local linear algebra
Joseph K. Bradley created SPARK-7203: Summary: Python API for local linear algebra
Key: SPARK-7203
URL: https://issues.apache.org/jira/browse/SPARK-7203
Project: Spark
Issue Type: Umbrella
Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

This is an umbrella JIRA for the Python API for local linear algebra, including:
* Vector, Matrix, and their subclasses
* helper methods and utilities
* interactions with numpy, scipy
[jira] [Updated] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7202: - Issue Type: Sub-task (was: New Feature) Parent: SPARK-7203

Add SparseMatrixPickler to SerDe
Key: SPARK-7202
URL: https://issues.apache.org/jira/browse/SPARK-7202
Project: Spark
Issue Type: Sub-task
Components: MLlib, PySpark
Reporter: Manoj Kumar
Priority: Minor

We need a SparseMatrixPickler similar to the existing DenseMatrixPickler.
[jira] [Commented] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517905#comment-14517905 ] Joseph K. Bradley commented on SPARK-7202: -- @MechCoder I just made an umbrella JIRA for Python local linear algebra. Please ping me if you find/make other JIRAs which should go there. Thanks!
[jira] [Comment Edited] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517905#comment-14517905 ] Joseph K. Bradley edited comment on SPARK-7202 at 4/28/15 8:01 PM, changing "@MechCoder" to "[~MechCoder]" in the otherwise identical comment above.
[jira] [Comment Edited] (SPARK-7178) Improve DataFrame documentation and code samples
[ https://issues.apache.org/jira/browse/SPARK-7178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517858#comment-14517858 ] Chris Fregly edited comment on SPARK-7178 at 4/28/15 8:07 PM: -- added these to the forums
AND and OR: https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html
Nested Map Columns in DataFrames: https://forums.databricks.com/questions/764/how-do-i-create-a-dataframe-with-nested-map-column.html
Casting columns of DataFrames: https://forums.databricks.com/questions/767/how-do-i-cast-within-a-dataframe.html
was (Author: cfregly): added this to the forums to address the AND and OR: https://forums.databricks.com/questions/758/how-do-i-use-and-and-or-within-my-dataframe-operat.html

Improve DataFrame documentation and code samples
Key: SPARK-7178
URL: https://issues.apache.org/jira/browse/SPARK-7178
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.3.1
Reporter: Chris Fregly
Labels: dataframe

AND and OR are not straightforward when using the new DataFrame API. The current convention, accepted by Pandas users, is to use the bitwise & and | instead of AND and OR. When using these, however, you need to wrap each expression in parentheses to keep the bitwise operator from binding too tightly. Also, working with StructTypes is a bit confusing. The following link: https://spark.apache.org/docs/latest/sql-programming-guide.html#programmatically-specifying-the-schema (Python tab) implies that you can work with tuples directly when creating a DataFrame; however, the following code errors out unless we explicitly use Rows:
{code}
from pyspark.sql import Row
from pyspark.sql.types import *

# The schema is encoded in a string.
schemaString = "a"

fields = [StructField(field_name, MapType(StringType(), IntegerType())) for field_name in schemaString.split()]
schema = StructType(fields)

df = sqlContext.createDataFrame([Row(a={'b': 1})], schema)
{code}
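The parenthesization pitfall described above can be demonstrated without Spark, using plain Python ints; DataFrame columns overload & and | the same way, so the same precedence rule applies to column expressions:

```python
a, b = 1, 2

# & binds more tightly than ==, so this parses as a == (1 & b) == 2,
# i.e. 1 == 0 -> False, not (a == 1) AND (b == 2).
unparenthesized = a == 1 & b == 2

# Wrapping each comparison gives the intended conjunction.
parenthesized = (a == 1) & (b == 2)
```
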
[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API
[ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518305#comment-14518305 ] Apache Spark commented on SPARK-5182: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/5526

Partitioning support for tables created by the data source API
Key: SPARK-5182
URL: https://issues.apache.org/jira/browse/SPARK-5182
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Blocker
[jira] [Created] (SPARK-7215) Make repartition and coalesce a part of the query plan
Burak Yavuz created SPARK-7215: -- Summary: Make repartition and coalesce a part of the query plan
Key: SPARK-7215
URL: https://issues.apache.org/jira/browse/SPARK-7215
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Burak Yavuz
Priority: Critical
[jira] [Created] (SPARK-7217) Add configuration to disable stopping of SparkContext when StreamingContext.stop()
Tathagata Das created SPARK-7217: Summary: Add configuration to disable stopping of SparkContext when StreamingContext.stop()
Key: SPARK-7217
URL: https://issues.apache.org/jira/browse/SPARK-7217
Project: Spark
Issue Type: Improvement
Components: Streaming
Affects Versions: 1.3.1
Reporter: Tathagata Das
Assignee: Tathagata Das

In environments like notebooks, the SparkContext is managed by the underlying infrastructure and it is expected that the SparkContext will not be stopped. However, StreamingContext.stop() calls SparkContext.stop() as a non-intuitive side-effect. This JIRA is to add a configuration in SparkConf that sets the default StreamingContext stop behavior. It should be such that the existing behavior does not change for existing users.
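The "config sets the default, explicit argument wins, existing behavior preserved" pattern can be sketched in plain Python. This is a toy model; the config key name is an assumption, not necessarily the one Spark adopted.

```python
class StreamingContext:
    """Toy model of the proposed stop behavior (not Spark's actual class)."""

    # assumed config key name, for illustration only
    CONF_KEY = "spark.streaming.stopSparkContextByDefault"

    def __init__(self, conf):
        self.conf = conf
        self.spark_context_stopped = False

    def stop(self, stop_spark_context=None):
        # An explicit argument always wins; otherwise fall back to the
        # configured default, which is True when unset so that existing
        # users see no behavior change.
        if stop_spark_context is None:
            stop_spark_context = self.conf.get(self.CONF_KEY, True)
        if stop_spark_context:
            self.spark_context_stopped = True
```
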
[jira] [Resolved] (SPARK-7138) Add method to BlockGenerator to add multiple records to BlockGenerator with single callback
[ https://issues.apache.org/jira/browse/SPARK-7138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-7138. -- Resolution: Fixed Fix Version/s: 1.4.0

Add method to BlockGenerator to add multiple records to BlockGenerator with single callback
Key: SPARK-7138
URL: https://issues.apache.org/jira/browse/SPARK-7138
Project: Spark
Issue Type: Improvement
Components: Streaming
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Minor
Fix For: 1.4.0

This is so that receivers that receive data in small batches (like Kinesis) can add a whole batch of records with the callback function invoked only once per batch. This is for internal use only, for an improvement to the Kinesis receiver that we are planning to do.
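The "many records, one callback" idea can be sketched in plain Python. This is a toy model; the method name and callback signature are assumptions, not Spark's actual BlockGenerator API.

```python
import threading

class BlockGenerator:
    """Toy model of buffering records with a single per-batch callback."""

    def __init__(self):
        self._lock = threading.Lock()
        self._buffer = []

    def add_multiple(self, records, on_added):
        # Append the whole batch atomically, then fire the callback once,
        # instead of once per record as a per-record add() would.
        with self._lock:
            self._buffer.extend(records)
        on_added(len(records))
```
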
[jira] [Assigned] (SPARK-7216) Show driver details in Mesos cluster UI
[ https://issues.apache.org/jira/browse/SPARK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7216: --- Assignee: (was: Apache Spark)

Show driver details in Mesos cluster UI
Key: SPARK-7216
URL: https://issues.apache.org/jira/browse/SPARK-7216
Project: Spark
Issue Type: Improvement
Components: Mesos
Reporter: Timothy Chen

Show driver details in Mesos cluster UI
[jira] [Resolved] (SPARK-7193) Spark on Mesos may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Littlestar resolved SPARK-7193. --- Resolution: Invalid I think the official documentation is missing some notes about Spark on Mesos. The following worked for me: extract spark-1.3.1-bin-hadoop2.4.tgz, modify conf\spark-env.sh, repack it into a new spark-1.3.1-bin-hadoop2.4.tgz, and then put it on HDFS. spark-env.sh sets JAVA_HOME, HADOOP_CONF_DIR, HADOOP_HOME Spark on Mesos may need more tests for spark 1.3.1 release Key: SPARK-7193 URL: https://issues.apache.org/jira/browse/SPARK-7193 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.3.1 Reporter: Littlestar Spark on Mesos may need more tests for spark 1.3.1 release http://spark.apache.org/docs/latest/running-on-mesos.html I tested mesos 0.21.1/0.22.0/0.22.1 RC4. It works well with ./bin/spark-shell --master mesos://host:5050, but any task that needs more than one node throws the following exceptions. {noformat} Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 0.0 failed 4 times, most recent failure: Lost task 10.3 in stage 0.0 (TID 127, hpblade05): java.lang.IllegalStateException: unread block data at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2393) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1963) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1887) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:368) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:679) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 15/04/28 15:33:18 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at 
org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:144) at org.apache.spark.scheduler.EventLoggingListener.onStageCompleted(EventLoggingListener.scala:165) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:32) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at
[jira] [Comment Edited] (SPARK-7193) Spark on Mesos may need more tests for spark 1.3.1 release
[ https://issues.apache.org/jira/browse/SPARK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518610#comment-14518610 ] Littlestar edited comment on SPARK-7193 at 4/29/15 2:40 AM: I think the official documentation is missing some notes about Spark on Mesos. The following worked for me: extract spark-1.3.1-bin-hadoop2.4.tgz, modify conf\spark-env.sh, repack it into a new spark-1.3.1-bin-hadoop2.4.tgz, and then put it on HDFS. spark-env.sh sets JAVA_HOME, HADOOP_CONF_DIR, HADOOP_HOME was (Author: cnstar9988): I think official document missing some notes about Spark on Mesos I worked well with following: extract spark-1.3.1-bin-hadoop2.4.tgz, and modify conf\spark-env.sh and repack with new spark-1.3.1-bin-hadoop2.4.tgz, and then put to hdfs spark-env.sh set JAVA_HOME, HADOO_CONF_DIR, HADOO_HOME Spark on Mesos may need more tests for spark 1.3.1 release Key: SPARK-7193 URL: https://issues.apache.org/jira/browse/SPARK-7193 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.3.1 Reporter: Littlestar Spark on Mesos may need more tests for spark 1.3.1 release http://spark.apache.org/docs/latest/running-on-mesos.html I tested mesos 0.21.1/0.22.0/0.22.1 RC4. It works well with ./bin/spark-shell --master mesos://host:5050, but any task that needs more than one node throws the following exceptions.
{noformat} Exception in thread main org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 0.0 failed 4 times, most recent failure: Lost task 10.3 in stage 0.0 (TID 127, hpblade05): java.lang.IllegalStateException: unread block data at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2393) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1378) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1963) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1887) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1346) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:368) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:68) at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:94) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:185) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:679) Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) 15/04/28 15:33:18 ERROR scheduler.LiveListenerBus: Listener EventLoggingListener threw an exception java.lang.reflect.InvocationTargetException at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at org.apache.spark.scheduler.EventLoggingListener$$anonfun$logEvent$3.apply(EventLoggingListener.scala:144) at scala.Option.foreach(Option.scala:236) at
[jira] [Resolved] (SPARK-6965) StringIndexer should convert input to Strings
[ https://issues.apache.org/jira/browse/SPARK-6965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-6965. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5753 [https://github.com/apache/spark/pull/5753] StringIndexer should convert input to Strings - Key: SPARK-6965 URL: https://issues.apache.org/jira/browse/SPARK-6965 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Xiangrui Meng Priority: Minor Fix For: 1.4.0 StringIndexer should convert non-String input types to String. That way, it can handle any basic types such as Int, Double, etc. It can convert any input type to strings first and store the string labels (instead of an arbitrary type). That will simplify model export/import. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
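The conversion described above can be sketched in a few lines. This is a plain-Python illustration of the intended behavior, not the MLlib implementation: every input is cast to a string first, and indices are then assigned by descending label frequency (the ordering StringIndexer uses); the alphabetical tie-break here is just an assumption to keep the sketch deterministic.

```python
from collections import Counter

def fit_string_indexer(values):
    """Cast any basic input type (Int, Double, ...) to String, then map
    each string label to an index, most frequent label first. Storing
    string labels keeps model export/import simple."""
    labels = [str(v) for v in values]
    freq = Counter(labels)
    # Descending frequency; ties broken alphabetically (an assumption).
    ordered = sorted(freq, key=lambda lbl: (-freq[lbl], lbl))
    return {label: idx for idx, label in enumerate(ordered)}

# Int input is indexed via its string form:
index = fit_string_indexer([10, 20, 10, 10, 20, 30])
assert index == {"10": 0, "20": 1, "30": 2}
```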
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518602#comment-14518602 ] Guoqiang Li commented on SPARK-5556: I put the latest LDA code in [Zen|https://github.com/witgo/zen/tree/master/ml/src/main/scala/com/github/cloudml/zen/ml/clustering] The test results [here|https://issues.apache.org/jira/secure/attachment/12729030/LDA_test.xlsx] (72 cores, 216G ram, 6 servers, Gigabit Ethernet) Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518286#comment-14518286 ] Sandy Ryza commented on SPARK-3655: --- My opinion is that a secondary sort operator in core Spark would definitely be useful. Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
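The idea of a secondary sort can be illustrated without Spark. A minimal plain-Python sketch, assuming the usual composite-key approach (in Spark this is typically built on the sort-based shuffle via repartitionAndSortWithinPartitions, with a partitioner that looks only at the primary key):

```python
from itertools import groupby
from operator import itemgetter

def sorted_values_per_key(pairs):
    """Sort once by the composite (key, value) key, then a single pass
    yields each key's values already in order -- no per-key sort, which
    is what makes the shuffle-based secondary sort scale."""
    ordered = sorted(pairs, key=itemgetter(0, 1))
    return {k: [v for _, v in grp]
            for k, grp in groupby(ordered, key=itemgetter(0))}

result = sorted_values_per_key([("a", 3), ("b", 1), ("a", 1), ("a", 2)])
assert result == {"a": [1, 2, 3], "b": [1]}
```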
[jira] [Assigned] (SPARK-7215) Make repartition and coalesce a part of the query plan
[ https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7215: --- Assignee: Apache Spark Make repartition and coalesce a part of the query plan -- Key: SPARK-7215 URL: https://issues.apache.org/jira/browse/SPARK-7215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Burak Yavuz Assignee: Apache Spark Priority: Critical
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518378#comment-14518378 ] Pedro Rodriguez commented on SPARK-5556: I will start working on it again then. It would be great for that research project to result in Gibbs being added. The refactoring ended up roadblocking that quite a bit. I think [~gq] was working on something called LightLDA. I don't know the specifics of the algorithm, but the sampler scales theoretically O(1) with topics. My implementation has something which in the testing I did looks like in practice it is O(1) or very near it. To get Gibbs merged in (or as a candidate implementation), how does this look: 1. Refactor code to fit the PR that you just merged 2. Use the testing harness you used for the EM LDA to test with the same conditions. This should be fairly easy since you already did all the work to get things pipelining correctly. 3. If it scales well, then merge or consider other applications 4. Code review somewhere in there. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7215) Make repartition and coalesce a part of the query plan
[ https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518379#comment-14518379 ] Apache Spark commented on SPARK-7215: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/5762 Make repartition and coalesce a part of the query plan -- Key: SPARK-7215 URL: https://issues.apache.org/jira/browse/SPARK-7215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Burak Yavuz Priority: Critical
[jira] [Assigned] (SPARK-7215) Make repartition and coalesce a part of the query plan
[ https://issues.apache.org/jira/browse/SPARK-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7215: --- Assignee: (was: Apache Spark) Make repartition and coalesce a part of the query plan -- Key: SPARK-7215 URL: https://issues.apache.org/jira/browse/SPARK-7215 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Burak Yavuz Priority: Critical
[jira] [Assigned] (SPARK-7216) Show driver details in Mesos cluster UI
[ https://issues.apache.org/jira/browse/SPARK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7216: --- Assignee: Apache Spark Show driver details in Mesos cluster UI --- Key: SPARK-7216 URL: https://issues.apache.org/jira/browse/SPARK-7216 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Assignee: Apache Spark Show driver details in Mesos cluster UI
[jira] [Created] (SPARK-7216) Show driver details in Mesos cluster UI
Timothy Chen created SPARK-7216: --- Summary: Show driver details in Mesos cluster UI Key: SPARK-7216 URL: https://issues.apache.org/jira/browse/SPARK-7216 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Show driver details in Mesos cluster UI
[jira] [Commented] (SPARK-7216) Show driver details in Mesos cluster UI
[ https://issues.apache.org/jira/browse/SPARK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518447#comment-14518447 ] Apache Spark commented on SPARK-7216: - User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/5763 Show driver details in Mesos cluster UI --- Key: SPARK-7216 URL: https://issues.apache.org/jira/browse/SPARK-7216 Project: Spark Issue Type: Improvement Components: Mesos Reporter: Timothy Chen Show driver details in Mesos cluster UI
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518601#comment-14518601 ] Pedro Rodriguez commented on SPARK-5556: [~gq] is the LDAGibbs line what I implemented or something else? In any case, the optimization on sampling shouldn't change the results, so it looks like LightLDA converges to a better perplexity. Do you have any performance graphs? Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7156) Add randomSplit method to DataFrame
[ https://issues.apache.org/jira/browse/SPARK-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518248#comment-14518248 ] Apache Spark commented on SPARK-7156: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/5761 Add randomSplit method to DataFrame --- Key: SPARK-7156 URL: https://issues.apache.org/jira/browse/SPARK-7156 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Joseph K. Bradley Priority: Minor
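The requested randomSplit can be sketched as weight-proportional Bernoulli assignment, mirroring the semantics of the existing RDD.randomSplit. This is a hypothetical plain-Python helper, not the DataFrame API; as with the RDD version, split sizes are only approximately proportional to the weights.

```python
import random

def random_split(rows, weights, seed=42):
    """Assign each row independently to a split, with probability
    proportional to the split's weight (hypothetical helper)."""
    total = float(sum(weights))
    bounds, acc = [], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)           # cumulative probability boundaries
    rng = random.Random(seed)        # seeded for reproducible splits
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()
        for i, b in enumerate(bounds):
            if x < b:
                splits[i].append(row)
                break
        else:
            splits[-1].append(row)   # guard against float rounding
    return splits

train, test = random_split(list(range(1000)), [0.8, 0.2])
assert len(train) + len(test) == 1000   # every row lands in exactly one split
assert 700 < len(train) < 900           # roughly an 80/20 split
```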
[jira] [Created] (SPARK-7214) Unrolling never evicts blocks when MemoryStore is nearly full
Charles Reiss created SPARK-7214: Summary: Unrolling never evicts blocks when MemoryStore is nearly full Key: SPARK-7214 URL: https://issues.apache.org/jira/browse/SPARK-7214 Project: Spark Issue Type: Bug Components: Block Manager Reporter: Charles Reiss Priority: Minor When less than spark.storage.unrollMemoryThreshold (default 1MB) is left in the MemoryStore, new blocks that are computed with unrollSafely (e.g. any cached RDD split) will always fail to unroll, even if old blocks could be dropped to accommodate them.
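The reported condition can be illustrated with a toy model. This is a simplified sketch of the behavior described in the report, not the MemoryStore code; the function and its arguments are hypothetical names. The point: if the initial unroll-memory reservation only considers currently free memory, then once free memory drops below the threshold every unroll fails, even when evictable cached blocks could make room.

```python
UNROLL_THRESHOLD = 1 * 1024 * 1024  # spark.storage.unrollMemoryThreshold default

def can_start_unroll(free_bytes, evictable_bytes, consider_eviction):
    """Toy model of reserving the initial unroll memory."""
    if consider_eviction:
        # Counting droppable old blocks lets the unroll proceed.
        return free_bytes + evictable_bytes >= UNROLL_THRESHOLD
    # Behavior described in this issue: only free memory is considered.
    return free_bytes >= UNROLL_THRESHOLD

# 512 KB free but 100 MB of droppable old blocks:
assert can_start_unroll(512 * 1024, 100 * 2**20, consider_eviction=False) is False
assert can_start_unroll(512 * 1024, 100 * 2**20, consider_eviction=True) is True
```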
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518400#comment-14518400 ] Joseph K. Bradley commented on SPARK-5556: -- That plan sounds good. I haven't yet been able to look into LightLDA, but it would be good to understand if it's (a) a modification which could be added to Gibbs later on or (b) an algorithm which belongs as a separate algorithm. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-4721) Improve first thread to put block failed
[ https://issues.apache.org/jira/browse/SPARK-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4721: --- Assignee: (was: Apache Spark) Improve first thread to put block failed Key: SPARK-4721 URL: https://issues.apache.org/jira/browse/SPARK-4721 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 1.0.0 Reporter: SuYan In the current code, when multiple threads try to put a block with the same blockID into the BlockManager, the thread that first puts its info into blockinfos performs the put, and the others wait until that put fails or succeeds. This is fine when the put succeeds, but on failure there are problems: 1. the failed thread removes its info from blockinfos. 2. the other threads wake up and use the old info.synchronized to retry the put. 3. if a retry succeeds, marking success fails because the info is no longer in pending status, and all remaining threads repeat the same cycle of acquiring info.synchronized and marking success or failure, even after one of them has succeeded. First, I don't understand why the info is removed from blockinfos while other threads are still waiting. The comment says this lets other threads create a new block info, but a block info is just an ID and a storage level, so reusing the old one makes no difference when threads are waiting. Second, if the first thread fails, the other waiting threads could retry the put one at a time (and likely fewer than all of them would need to), or, more simply, all other threads could log a warning and return after waking up.
[jira] [Assigned] (SPARK-4721) Improve first thread to put block failed
[ https://issues.apache.org/jira/browse/SPARK-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-4721: --- Assignee: Apache Spark Improve first thread to put block failed Key: SPARK-4721 URL: https://issues.apache.org/jira/browse/SPARK-4721 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 1.0.0 Reporter: SuYan Assignee: Apache Spark In the current code, when multiple threads try to put a block with the same blockID into the BlockManager, the thread that first puts its info into blockinfos performs the put, and the others wait until that put fails or succeeds. This is fine when the put succeeds, but on failure there are problems: 1. the failed thread removes its info from blockinfos. 2. the other threads wake up and use the old info.synchronized to retry the put. 3. if a retry succeeds, marking success fails because the info is no longer in pending status, and all remaining threads repeat the same cycle of acquiring info.synchronized and marking success or failure, even after one of them has succeeded. First, I don't understand why the info is removed from blockinfos while other threads are still waiting. The comment says this lets other threads create a new block info, but a block info is just an ID and a storage level, so reusing the old one makes no difference when threads are waiting. Second, if the first thread fails, the other waiting threads could retry the put one at a time (and likely fewer than all of them would need to), or, more simply, all other threads could log a warning and return after waking up.
[jira] [Assigned] (SPARK-7133) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python
[ https://issues.apache.org/jira/browse/SPARK-7133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7133: --- Assignee: (was: Apache Spark) Implement struct, array, and map field accessor using apply in Scala and __getitem__ in Python -- Key: SPARK-7133 URL: https://issues.apache.org/jira/browse/SPARK-7133 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin Labels: starter Typing {code} df.col[1] {code} and {code} df.col['field'] {code} is so much easier than {code} df.col.getField('field') df.col.getItem(1) {code} This would require us to define (in Column) an apply function in Scala, and a __getitem__ function in Python.
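The proposed sugar amounts to a __getitem__ that dispatches on the key type. A toy Python sketch of that dispatch (class and method names are illustrative, not the real Column internals):

```python
class ColumnSketch:
    """Toy stand-in for a DataFrame column: string keys route to the
    struct-field accessor, integer keys to the positional accessor,
    so col['field'] and col[1] both work."""

    def __init__(self, data):
        self.data = data  # a dict (struct-like) or list (array-like)

    def get_field(self, name):
        return self.data[name]

    def get_item(self, i):
        return self.data[i]

    def __getitem__(self, key):
        if isinstance(key, str):
            return self.get_field(key)   # df.col['field']
        return self.get_item(key)        # df.col[1]

struct_col = ColumnSketch({"field": "x"})
array_col = ColumnSketch(["a", "b"])
assert struct_col["field"] == "x"
assert array_col[1] == "b"
```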
[jira] [Resolved] (SPARK-7168) Update plugin versions in Maven build and centralize versions
[ https://issues.apache.org/jira/browse/SPARK-7168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7168. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5720 [https://github.com/apache/spark/pull/5720] Update plugin versions in Maven build and centralize versions - Key: SPARK-7168 URL: https://issues.apache.org/jira/browse/SPARK-7168 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 1.3.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Trivial Fix For: 1.4.0 A minor cleanup before the next release: let's update the versions of build plugins used to the latest version while also pulling version management up into the parent, centrally. This only affects plugins and not the build result. Hopefully we'll pick up some tiny fixes along the way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6435: - Assignee: Masayoshi TSUZUKI spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Assignee: Masayoshi TSUZUKI Fix For: 1.4.0 Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error: {code} scala import com.google.common.base.Strings console:19: error: object Strings is not a member of package com.google.common.base import com.google.common.base.Strings ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6435) spark-shell --jars option does not add all jars to classpath
[ https://issues.apache.org/jira/browse/SPARK-6435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6435. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5227 [https://github.com/apache/spark/pull/5227] spark-shell --jars option does not add all jars to classpath Key: SPARK-6435 URL: https://issues.apache.org/jira/browse/SPARK-6435 Project: Spark Issue Type: Bug Components: Spark Shell, Windows Affects Versions: 1.3.0 Environment: Win64 Reporter: vijay Fix For: 1.4.0 Not all jars supplied via the --jars option will be added to the driver (and presumably executor) classpath. The first jar(s) will be added, but not all. To reproduce this, just add a few jars (I tested 5) to the --jars option, and then try to import a class from the last jar. This fails. A simple reproducer: Create a bunch of dummy jars: jar cfM jar1.jar log.txt jar cfM jar2.jar log.txt jar cfM jar3.jar log.txt jar cfM jar4.jar log.txt Start the spark-shell with the dummy jars and guava at the end: %SPARK_HOME%\bin\spark-shell --master local --jars jar1.jar,jar2.jar,jar3.jar,jar4.jar,c:\code\lib\guava-14.0.1.jar In the shell, try importing from guava; you'll get an error: {code} scala import com.google.common.base.Strings console:19: error: object Strings is not a member of package com.google.common.base import com.google.common.base.Strings ^ {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master
[ https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516854#comment-14516854 ] Sean Owen commented on SPARK-5189: -- [~jackli066519] You don't need to have this assigned to you, but I would work with [~nchammas] to understand first whether this is still relevant or what he's done. Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master --- Key: SPARK-5189 URL: https://issues.apache.org/jira/browse/SPARK-5189 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, then setting up all the slaves together. This includes broadcasting files from the lonely master to potentially hundreds of slaves. There are 2 main problems with this approach: # Broadcasting files from the master to all slaves using [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] (e.g. during [ephemeral-hdfs init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36], or during [Spark setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3]) takes a long time. This time increases as the number of slaves increases. I did some testing in {{us-east-1}}. This is, concretely, what the problem looks like: || number of slaves ({{m3.large}}) || launch time (best of 6 tries) || | 1 | 8m 44s | | 10 | 13m 45s | | 25 | 22m 50s | | 50 | 37m 30s | | 75 | 51m 30s | | 99 | 1h 5m 30s | Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but I think the point is clear enough. # It's more complicated to add slaves to an existing cluster (a la [SPARK-2008]), since slaves are only configured through the master during the setup of the master itself. 
Logically, the operations we want to implement are: * Provision a Spark node * Join a node to a cluster (including an empty cluster) as either a master or a slave * Remove a node from a cluster We need our scripts to roughly be organized to match the above operations. The goals would be: # When launching a cluster, enable all cluster nodes to be provisioned in parallel, removing the master-to-slave file broadcast bottleneck. # Facilitate cluster modifications like adding or removing nodes. # Enable exploration of infrastructure tools like [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} internals and perhaps even allow us to build [one tool that launches Spark clusters on several different cloud platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw]. More concretely, the modifications we need to make are: * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with equivalent, slave-side operations. * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it fully creates a node that can be used as either a master or slave. * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, configures it as a master or slave, and joins it to a cluster. * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete that script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets
[ https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14516971#comment-14516971 ] Peter Marsh commented on SPARK-4414: I managed to get this to work by re-installing Spark. Initially I had installed Spark from source and built it locally; after removing that and installing spark-1.3.0-bin-hadoop2.4 (prebuilt) I was able to use wholeTextFiles(...). SparkContext.wholeTextFiles Doesn't work with S3 Buckets Key: SPARK-4414 URL: https://issues.apache.org/jira/browse/SPARK-4414 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: Pedro Rodriguez Priority: Critical SparkContext.wholeTextFiles does not read files which SparkContext.textFile can read. Below are general steps to reproduce; my specific case follows them in a git repo. Steps to reproduce: 1. Create an Amazon S3 bucket, make it public, with multiple files 2. Attempt to read the bucket with sc.wholeTextFiles("s3n://mybucket/myfile.txt") 3. Spark returns the following error, even if the file exists: Exception in thread "main" java.io.FileNotFoundException: File does not exist: /myfile.txt at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.<init>(CombineFileInputFormat.java:489) 4. Change the call to sc.textFile("s3n://mybucket/myfile.txt") and there is no error message; the application runs fine. There is a question on StackOverflow about this as well: http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist This is a link to the repo/lines of code. 
The uncommented call doesn't work; the commented call works as expected: https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19 It would be easy to use textFile with a multifile argument, but this should work correctly for S3 bucket files as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-4721) Improve first thread to put block failed
[ https://issues.apache.org/jira/browse/SPARK-4721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-4721. -- Resolution: Won't Fix Improve first thread to put block failed Key: SPARK-4721 URL: https://issues.apache.org/jira/browse/SPARK-4721 Project: Spark Issue Type: Improvement Components: Block Manager Affects Versions: 1.0.0 Reporter: SuYan The current code assumes that when multiple threads try to put a block with the same blockID into the BlockManager, the thread that first puts its info into blockinfos performs the put, while the others wait until that put fails or succeeds. That is fine when the put succeeds, but a failure causes problems: 1. the failed thread removes its info from blockinfos; 2. the other threads wake up and retry the put under the old info's lock; 3. if one of them succeeds, the block is no longer in pending status, so its "mark success" fails, and all remaining threads still repeat the same cycle (acquire the info's lock, then mark success or failure) even though one has already succeeded. First, I can't understand why the info is removed from blockinfos while other threads are still waiting. The comment says this is so other threads can create a new block info, but a block info is just an ID and a storage level, so for the waiting threads it makes no difference whether the old one or a new one is used. Second, if the first thread fails, the waiting threads could retry one by one (and fewer than all of them would need to), or they could simply log a warning and return after waking up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7100) GradientBoostTrees leaks a persisted RDD
[ https://issues.apache.org/jira/browse/SPARK-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7100. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5669 [https://github.com/apache/spark/pull/5669] GradientBoostTrees leaks a persisted RDD Key: SPARK-7100 URL: https://issues.apache.org/jira/browse/SPARK-7100 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.2, 1.3.1 Reporter: Jim Carroll Priority: Minor Fix For: 1.4.0 It appears GradientBoostedTrees.scala can call 'persist' on an RDD and never unpersist it. In the master branch it's in GradientBoostedTrees.boost method. It persists the input RDD if it's not already persisted but doesn't unpersist it. I'll be submitting a PR with a fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
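The fix follows a standard acquire/release discipline: persist the input only if the caller has not already done so, and unpersist exactly when this method was the one that persisted. Below is a hedged, self-contained sketch of that pattern against a stand-in type; the real change would consult rdd.getStorageLevel and call rdd.persist()/rdd.unpersist() inside GradientBoostedTrees.boost, and the names `CachedInput` and `boost` here are illustrative, not Spark API:

```scala
// Stand-in for an RDD's persistence state; a real RDD would expose
// getStorageLevel, persist() and unpersist() instead of a plain flag.
final class CachedInput(var persisted: Boolean)

def boost(input: CachedInput): Unit = {
  // Persist only when the caller did not, and remember that we did.
  val persistedHere = !input.persisted
  if (persistedHere) input.persisted = true
  try {
    // ... the boosting iterations over `input` would run here ...
  } finally {
    // Release only what this method itself acquired.
    if (persistedHere) input.persisted = false
  }
}
```

Putting the release in `finally` also covers the case where an iteration throws, so the input is never left persisted by accident.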
[jira] [Updated] (SPARK-7100) GradientBoostTrees leaks a persisted RDD
[ https://issues.apache.org/jira/browse/SPARK-7100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-7100: - Assignee: Jim Carroll GradientBoostTrees leaks a persisted RDD Key: SPARK-7100 URL: https://issues.apache.org/jira/browse/SPARK-7100 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.2, 1.3.1 Reporter: Jim Carroll Assignee: Jim Carroll Priority: Minor Fix For: 1.4.0 It appears GradientBoostedTrees.scala can call 'persist' on an RDD and never unpersist it. In the master branch it's in GradientBoostedTrees.boost method. It persists the input RDD if it's not already persisted but doesn't unpersist it. I'll be submitting a PR with a fix. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6627) Clean up of shuffle code and interfaces
[ https://issues.apache.org/jira/browse/SPARK-6627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518402#comment-14518402 ] Apache Spark commented on SPARK-6627: - User 'kayousterhout' has created a pull request for this issue: https://github.com/apache/spark/pull/5764 Clean up of shuffle code and interfaces --- Key: SPARK-6627 URL: https://issues.apache.org/jira/browse/SPARK-6627 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical Fix For: 1.4.0 The shuffle code in Spark is somewhat messy and could use some interface clean-up, especially with some larger changes outstanding. This is a catch all for what may be some small improvements in a few different PR's. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7169) Allow to specify metrics configuration more flexibly
[ https://issues.apache.org/jira/browse/SPARK-7169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518508#comment-14518508 ] Saisai Shao commented on SPARK-7169: Hi [~jlewandowski], regarding your second problem, I think you don't have to copy the metrics configuration file manually to every machine one by one; you could use spark-submit --files path/to/your/metrics_properties to ship your configuration to each executor/container. And for the first problem, is it a big problem that all the configuration files need to be in the same directory? Many Spark as well as Hadoop conf files have such a requirement. But you can configure the driver and executors with different parameters in the conf file, since MetricsSystem supports such features. Yes, I think the current metrics configuration may not be so easy to use; any improvement is greatly appreciated :). Allow to specify metrics configuration more flexibly Key: SPARK-7169 URL: https://issues.apache.org/jira/browse/SPARK-7169 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.2, 1.3.1 Reporter: Jacek Lewandowski Priority: Minor Metrics are configured in the {{metrics.properties}} file. The path to this file is specified in {{SparkConf}} under the key {{spark.metrics.conf}}. The property is read when {{MetricsSystem}} is created, which means during {{SparkEnv}} initialisation. h5.Problem When the user runs an application, there is no way to provide the metrics configuration for executors. Although one can specify the path to the metrics configuration file, (1) the path is common for all the nodes and the client machine, so there is an implicit assumption that all the machines have the same file in the same location, and (2) the user actually needs to copy the file manually to the worker nodes because the file is read before the user files are populated to the executor local directories. All of this makes it very difficult to play with the metrics configuration. h5. 
Proposed solution I think the easiest and most consistent solution would be to move the configuration from a separate file directly into {{SparkConf}}. We could prefix all the settings from the metrics configuration with, say, {{spark.metrics.props}}. For backward compatibility, these properties would still be loaded from the specified file as it works now. Such a solution doesn't change the API, so maybe it could even be included in a patch release of Spark 1.2 and Spark 1.3. Appreciate any feedback. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
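The proposal boils down to namespacing metrics settings inside {{SparkConf}} and peeling the prefix off when the metrics system starts. A minimal sketch of that extraction step, assuming the {{spark.metrics.props}} prefix suggested in the ticket (a proposed key, not an existing Spark setting):

```scala
// Lift metrics settings out of a flat conf map by stripping the agreed prefix;
// everything else in the conf is left untouched.
def extractMetricsProps(conf: Map[String, String]): Map[String, String] = {
  val prefix = "spark.metrics.props."
  conf.collect {
    case (key, value) if key.startsWith(prefix) => key.stripPrefix(prefix) -> value
  }
}
```

MetricsSystem could then consume the extracted map exactly as it consumes a parsed metrics.properties today, which is what keeps the change backward compatible.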
[jira] [Created] (SPARK-7218) Create a real iterator with open/close for Spark SQL
Reynold Xin created SPARK-7218: -- Summary: Create a real iterator with open/close for Spark SQL Key: SPARK-7218 URL: https://issues.apache.org/jira/browse/SPARK-7218 Project: Spark Issue Type: New Feature Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-5556: --- Attachment: LDA_test.xlsx Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517136#comment-14517136 ] Zhang, Liye commented on SPARK-7189: Yes, I think the current solution is a tradeoff; we cannot simply change the {{>=}} to {{>}}, which would cause other problems. Anyway, I haven't thought up any other solution yet; maybe others have some novel/nice ideas. History server will always reload the same file even when no log file is updated Key: SPARK-7189 URL: https://issues.apache.org/jira/browse/SPARK-7189 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zhang, Liye Priority: Minor The history server checks every log file by its modification time. It reloads a file if the file's modification time is later than or equal to the latest modification time it remembers. So it will periodically reload the file(s) with the latest modification time even if nothing has changed. This is not necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517289#comment-14517289 ] Marcelo Vanzin commented on SPARK-7189: --- Changing the {{>=}} causes problems. If you want to fix this, you need to keep track of the log files that were loaded at the last timestamp, and ignore them if they still have that same timestamp when you re-list the log directory. History server will always reload the same file even when no log file is updated Key: SPARK-7189 URL: https://issues.apache.org/jira/browse/SPARK-7189 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zhang, Liye Priority: Minor The history server checks every log file by its modification time. It reloads a file if the file's modification time is later than or equal to the latest modification time it remembers. So it will periodically reload the file(s) with the latest modification time even if nothing has changed. This is not necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
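The bookkeeping Marcelo describes can be sketched independently of the history server code: alongside the newest modification time seen, remember which files were already loaded at that time, and reload a file only when it is strictly newer, or is a not-yet-loaded file at that same timestamp. The names below (`shouldReload`, `markLoaded`) are illustrative, not the actual provider API:

```scala
// Newest modification time seen so far, and the files already loaded at it.
var lastScanTime = -1L
var loadedAtLastScan = Set.empty[String]

// Reload if strictly newer, or exactly as new but not yet loaded.
def shouldReload(path: String, mtime: Long): Boolean =
  mtime > lastScanTime || (mtime == lastScanTime && !loadedAtLastScan(path))

// After loading, fold the file into the bookkeeping for the next scan.
def markLoaded(path: String, mtime: Long): Unit =
  if (mtime > lastScanTime) { lastScanTime = mtime; loadedAtLastScan = Set(path) }
  else if (mtime == lastScanTime) loadedAtLastScan += path
```

This keeps the inclusive {{>=}} comparison (so two files sharing the newest timestamp are both picked up) while still skipping files that were already loaded and have not changed.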
[jira] [Created] (SPARK-7194) Vectors factors method for sparse vectors should accept the output of zipWithIndex
Juliet Hougland created SPARK-7194: -- Summary: Vectors factors method for sparse vectors should accept the output of zipWithIndex Key: SPARK-7194 URL: https://issues.apache.org/jira/browse/SPARK-7194 Project: Spark Issue Type: Improvement Reporter: Juliet Hougland Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD of sparse vectors, we currently have to: arr_doubles.map{ array => val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1)) Vectors.sparse(array.length, indexElem) } Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
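The tuple flip the report complains about is easy to see in plain Scala; `Vectors.sparse` is left out so the snippet stands alone, but it would consume `indexElem` directly:

```scala
// An array with explicit zeros, as in the report.
val arr: Array[Double] = Array(0.0, 0.0, 3.2, 0.0)

// zipWithIndex yields (value, index) pairs; Vectors.sparse wants
// (index, value), hence the extra map to swap the tuple elements.
val indexElem: Seq[(Int, Double)] =
  arr.zipWithIndex.filter(t => t._1 != 0.0).map(t => (t._2, t._1)).toSeq
```

A factory method accepting (value, index) pairs directly, as the ticket proposes, would make that final map step unnecessary.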
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517280#comment-14517280 ] Sean Owen commented on SPARK-5529: -- [~arov] CDH always has the latest upstream minor release in minor releases, and back-ports maintenance release fixes into maintenance releases. This is on about the same 3-4 month cycle as Spark, so it's about as fast as one could expect; CDH 5.4 = 1.3.x already. This change isn't even in a Spark release yet, so yes, you want it to be back-ported to 1.3, probably. That has to precede ending up in CDH though. BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Assignee: Hong Shen Fix For: 1.4.0 Attachments: SPARK-5529.patch When I run a Spark job, one executor hangs; after 120s its blockManager is removed by the driver, but it is half an hour before the executor itself is removed by the driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 120000ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 
15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517281#comment-14517281 ] Alex Rovner commented on SPARK-5529: Applied patch to 1.3: https://github.com/apache/spark/pull/5745 BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Assignee: Hong Shen Fix For: 1.4.0 Attachments: SPARK-5529.patch When I run a spark job, one executor is hold, after 120s, blockManager is removed by driver, but after half an hour before the executor is remove by driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7194) Vectors factors method for sparse vectors should accept the output of zipWithIndex
[ https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Juliet Hougland updated SPARK-7194: --- Description: Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD of sparse vectors, we currently have to: arr_doubles.map{ array => val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1)) Vectors.sparse(array.length, indexElem) } Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. was: Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD of sparse vectors, we currently have to: arr_doubles.map{ array => val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1)) Vectors.sparse(array.length, indexElem) } Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. Vectors factors method for sparse vectors should accept the output of zipWithIndex -- Key: SPARK-7194 URL: https://issues.apache.org/jira/browse/SPARK-7194 Project: Spark Issue Type: Improvement Reporter: Juliet Hougland Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0...). 
If we want to transform this into an RDD of sparse vectors, we currently have to: arr_doubles.map{ array => val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1)) Vectors.sparse(array.length, indexElem) } Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7195) Can't start spark shell or pyspark in Windows 7
Mark Smiley created SPARK-7195: -- Summary: Can't start spark shell or pyspark in Windows 7 Key: SPARK-7195 URL: https://issues.apache.org/jira/browse/SPARK-7195 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell Affects Versions: 1.3.1 Environment: Windows 7, Java 8 (1.8.0_31) or Java 7 (1.7.0_79), Scala 2.11.6, Python 2.7 Reporter: Mark Smiley cd\spark\bin dir spark-shell yields following error: find: 'version': No such file or directory else was unexpected at this time Same error with spark-shell2.cmd PyShell starts but with errors and doesn't work properly once started (e.g., can't find sc). Can send screenshot of errors on request. Using Spark 1.3.1 for Hadoop 2.6 binary Note: Hadoop not installed on machine. Scala works by itself, Python works by itself Java works fine (I use it all the time) Based on another comment, tried Java 7 (1.7.0_79), but it made no difference (same error). JAVA_HOME = C:\jdk1.8.0\bin C:\jdk1.8.0\bin\;C:\Program Files (x86)\scala\bin;C:\Python27;c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin;C:\Oracle\product64\12.1.0\client_1\bin;C:\Oracle\product\12.1.0\client_1\bin;C:\ProgramData\Oracle\Java\javapath;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\Dell\Dell Data Protection\Access\Advanced\Wave\Gemalto\Access Client\v5\;C:\Program Files (x86)\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\Program Files\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\MiKTeX 2.9\miktex\bin\x64\;C:\Program Files 
(x86)\ActivIdentity\ActivClient\;C:\Program Files\ActivIdentity\ActivClient\ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517285#comment-14517285 ] Apache Spark commented on SPARK-5529: - User 'alexrovner' has created a pull request for this issue: https://github.com/apache/spark/pull/5745 BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Assignee: Hong Shen Fix For: 1.4.0 Attachments: SPARK-5529.patch When I run a spark job, one executor is hold, after 120s, blockManager is removed by driver, but after half an hour before the executor is remove by driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 
15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517257#comment-14517257 ] Alex Rovner commented on SPARK-5529: CDH is usually somewhat slow on picking up the latest changes though. Would it be possible to backport this fix into 1.3? BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Assignee: Hong Shen Fix For: 1.4.0 Attachments: SPARK-5529.patch When I run a spark job, one executor is hold, after 120s, blockManager is removed by driver, but after half an hour before the executor is remove by driver. Here is the log: {code} 15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 12ms 15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated 15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0 15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost) 15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1 15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0) 15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster. 
15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6756) Add compress() to Vector
[ https://issues.apache.org/jira/browse/SPARK-6756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-6756. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5756 [https://github.com/apache/spark/pull/5756] Add compress() to Vector Key: SPARK-6756 URL: https://issues.apache.org/jira/browse/SPARK-6756 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.4.0 Add compress to Vector that automatically converts the underlying vector to dense or sparse based on the number of non-zeros. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
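A compress() decision can be sketched as a storage-size comparison between the two representations. The byte estimates below are rough assumptions of mine (8 bytes per element for dense, about 12 bytes per nonzero plus overhead for sparse), not necessarily the exact heuristic the merged PR uses:

```scala
// Rough per-representation storage estimates, in bytes.
def denseSize(n: Int): Long = 8L * n + 8
def sparseSize(nnz: Int): Long = 12L * nnz + 20

// compress() would keep whichever representation is smaller:
// sparse pays extra per nonzero (indices), dense pays for every slot.
def preferSparse(values: Array[Double]): Boolean = {
  val nnz = values.count(_ != 0.0)
  sparseSize(nnz) < denseSize(values.length)
}
```

Under these estimates a mostly-zero vector compresses to sparse, while a short or mostly-nonzero vector stays dense.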
[jira] [Commented] (SPARK-7220) Check whether moving shared params is a compatible change
[ https://issues.apache.org/jira/browse/SPARK-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518747#comment-14518747 ] Xiangrui Meng commented on SPARK-7220: -- I compiled an example app that calls LinearRegression with elasticNetParam, then I moved the methods under HasElasticNetParam to LinearRegressionParams. Without re-compiling, the app jar works with the new Spark assembly jar. So we can treat shared params as implementation details and we don't need to worry about where the methods get declared. Check whether moving shared params is a compatible change - Key: SPARK-7220 URL: https://issues.apache.org/jira/browse/SPARK-7220 Project: Spark Issue Type: Task Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.4.0 Shared params are private, and their usage is treated as an implementation detail. But we need to make sure moving params from shared to a concrete class is a compatible change. Otherwise, we shouldn't use shared params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7194) Vectors factory method for sparse vectors should accept the output of zipWithIndex
[ https://issues.apache.org/jira/browse/SPARK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7194: --- Assignee: Apache Spark Vectors factory method for sparse vectors should accept the output of zipWithIndex -- Key: SPARK-7194 URL: https://issues.apache.org/jira/browse/SPARK-7194 Project: Spark Issue Type: Improvement Reporter: Juliet Hougland Assignee: Apache Spark Let's say we have an RDD of Array[Double] where zero values are explicitly recorded, i.e. (0.0, 0.0, 3.2, 0.0...). If we want to transform this into an RDD of sparse vectors, we currently have to:
{noformat}
arr_doubles.map { array =>
  val indexElem: Seq[(Int, Double)] = array.zipWithIndex.filter(tuple => tuple._1 != 0.0).map(tuple => (tuple._2, tuple._1))
  Vectors.sparse(array.length, indexElem)
}
{noformat}
Notice that there is a map step at the end to switch the order of the index and the element value after .zipWithIndex. There should be a factory method on the Vectors class that allows you to avoid this flipping of tuple elements when using zipWithIndex. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
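For illustration, the flip the ticket complains about can be mirrored in plain Python (`to_sparse_pairs` is a hypothetical helper, not a real MLlib API; Scala's `zipWithIndex` yields (value, index) pairs, which is why the extra map is needed):

```python
def to_sparse_pairs(values):
    """Build the (index, value) pairs that Vectors.sparse expects from
    a dense array, mirroring the Scala snippet in the ticket."""
    pairs = [(v, i) for i, v in enumerate(values)]   # like array.zipWithIndex: (value, index)
    nonzero = [p for p in pairs if p[0] != 0.0]      # drop the explicitly recorded zeros
    return [(i, v) for v, i in nonzero]              # the extra flip the ticket wants to avoid
```

The proposed factory method would accept the (value, index) tuples directly, making the last line unnecessary.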
[jira] [Created] (SPARK-7220) Check whether moving shared params is a compatible change
Xiangrui Meng created SPARK-7220: Summary: Check whether moving shared params is a compatible change Key: SPARK-7220 URL: https://issues.apache.org/jira/browse/SPARK-7220 Project: Spark Issue Type: Task Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Shared params are private, and their usage is treated as an implementation detail. But we need to make sure moving params from shared to a concrete class is a compatible change. Otherwise, we shouldn't use shared params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Guoqiang Li updated SPARK-5556: --- Attachment: spark-summit.pptx Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7202) Add SparseMatrixPickler to SerDe
[ https://issues.apache.org/jira/browse/SPARK-7202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Manoj Kumar updated SPARK-7202: --- Priority: Major (was: Minor) Add SparseMatrixPickler to SerDe Key: SPARK-7202 URL: https://issues.apache.org/jira/browse/SPARK-7202 Project: Spark Issue Type: Sub-task Components: MLlib, PySpark Reporter: Manoj Kumar We need a SparseMatrixPickler similar to the DenseMatrixPickler. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7189) History server will always reload the same file even when no log file is updated
[ https://issues.apache.org/jira/browse/SPARK-7189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518721#comment-14518721 ] Zhang, Liye commented on SPARK-7189: Hi [~vanzin], I think using a timestamp is not that precise. This method is very similar to the one using modification time. There will always be situations where several operations finish within a very short time (say less than 1 millisecond, or even shorter). So the timestamp and modification time cannot be trusted. The target is to detect status changes of the files, including content changes (write operations) and permission changes (rename operations). `Inotify` can detect the change, but it's not available in HDFS before version 2.7. One way to tell the change is to set a flag after each operation and reset the flag after reloading the file. But this would make the code really ugly, so it's a bad option. History server will always reload the same file even when no log file is updated Key: SPARK-7189 URL: https://issues.apache.org/jira/browse/SPARK-7189 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.3.0 Reporter: Zhang, Liye Priority: Minor The history server checks every log file by its modification time. It reloads a file if the file's modification time is later than or equal to the latest modification time it remembers. So it will periodically reload the file(s) with the latest modification time even if nothing has changed. This is not necessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
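The strict-comparison alternative being debated in SPARK-7189 can be sketched as follows (`files_to_reload` is a hypothetical helper, not the history server's actual code):

```python
def files_to_reload(file_mtimes, last_scan_mtime):
    """Reload only files whose modification time is strictly newer
    than the newest time seen at the previous scan.

    The strict '>' stops the newest file from being re-read on every
    scan, but -- as discussed in the comments -- it can miss a second
    write that lands within the same modification-time tick.
    """
    return sorted(f for f, t in file_mtimes.items() if t > last_scan_mtime)
```

With `>=` in place of `>`, the file carrying the latest mtime would be returned on every scan even when nothing changed, which is exactly the reported behavior.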
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518618#comment-14518618 ] Guoqiang Li commented on SPARK-5556: LDA_Gibbs combines the advantages of the AliasLDA, FastLDA and SparseLDA algorithms. The corresponding code is https://github.com/witgo/spark/tree/lda_Gibbs or https://github.com/witgo/zen/blob/master/ml/src/main/scala/com/github/cloudml/zen/ml/clustering/LDA.scala#L553. Yes, LightLDA converges faster, but it takes up more memory. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14518621#comment-14518621 ] Guoqiang Li commented on SPARK-5556: [spark-summit.pptx|https://issues.apache.org/jira/secure/attachment/12729035/spark-summit.pptx] introduces the relevant algorithm. Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez Attachments: LDA_test.xlsx, spark-summit.pptx -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7219) HashingTF should output ML attributes
Xiangrui Meng created SPARK-7219: Summary: HashingTF should output ML attributes Key: SPARK-7219 URL: https://issues.apache.org/jira/browse/SPARK-7219 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng HashingTF knows the output feature dimension, which should be in the output ML attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
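For context, the hashing trick behind HashingTF can be sketched in a few lines (Python's built-in `hash()` stands in for Spark's term hashing; the function and bucket layout are illustrative, not Spark's implementation). Because `num_features` is fixed up front, it is exactly the output dimension SPARK-7219 proposes to record as ML attributes:

```python
def hashing_tf(terms, num_features=16):
    """Count each term in the bucket hash(term) % num_features,
    producing a fixed-length term-frequency vector."""
    counts = [0.0] * num_features
    for term in terms:
        counts[hash(term) % num_features] += 1.0
    return counts
```

Downstream stages can then read the vector length (and hence the attribute group size) without inspecting any rows.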
[jira] [Updated] (SPARK-7219) HashingTF should output ML attributes
[ https://issues.apache.org/jira/browse/SPARK-7219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-7219: - Priority: Trivial (was: Major) HashingTF should output ML attributes - Key: SPARK-7219 URL: https://issues.apache.org/jira/browse/SPARK-7219 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Trivial HashingTF knows the output feature dimension, which should be in the output ML attributes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7208. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5759 [https://github.com/apache/spark/pull/5759] Add Matrix, SparseMatrix to __all__ list in linalg.py - Key: SPARK-7208 URL: https://issues.apache.org/jira/browse/SPARK-7208 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users
Saisai Shao created SPARK-7221: -- Summary: Expose the current processed file name of FileInputDStream to the users Key: SPARK-7221 URL: https://issues.apache.org/jira/browse/SPARK-7221 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Saisai Shao Priority: Minor This is a feature requested on the Spark user list (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html). Currently there's no API to get the processed file name from FileInputDStream; it would be useful if we could expose this to the users. The major problem is how to expose this to the users in an elegant way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-7221) Expose the current processed file name of FileInputDStream to the users
[ https://issues.apache.org/jira/browse/SPARK-7221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-7221: --- Issue Type: Wish (was: New Feature) Expose the current processed file name of FileInputDStream to the users --- Key: SPARK-7221 URL: https://issues.apache.org/jira/browse/SPARK-7221 Project: Spark Issue Type: Wish Components: Streaming Reporter: Saisai Shao Priority: Minor This is a feature requested on the Spark user list (http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-textFileStream-fileStream-Get-file-name-tt22692.html). Currently there's no API to get the processed file name from FileInputDStream; it would be useful if we could expose this to the users. The major problem is how to expose this to the users in an elegant way. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-7220) Check whether moving shared params is a compatible change
[ https://issues.apache.org/jira/browse/SPARK-7220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-7220. Resolution: Done Fix Version/s: 1.4.0 Check whether moving shared params is a compatible change - Key: SPARK-7220 URL: https://issues.apache.org/jira/browse/SPARK-7220 Project: Spark Issue Type: Task Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.4.0 Shared params are private, and their usage is treated as an implementation detail. But we need to make sure moving params from shared to a concrete class is a compatible change. Otherwise, we shouldn't use shared params. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master
[ https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517301#comment-14517301 ] Nicholas Chammas commented on SPARK-5189: - Yeah, as Sean said you can just start working on this whenever you want. Just let us know over here in a comment and that way others can know that someone is already working on this. This issue is still relevant, but unfortunately, solving it requires redesigning the whole of spark-ec2 to be able to provision nodes in parallel. This means changing the Bash scripts in the mesos/spark-ec2 repo to act on 1 node at a time, and changing the main spark-ec2 script itself to be multi-threaded (or somehow otherwise asynchronous) to be able to manage several nodes in parallel. It's probably a major effort, but you can definitely take it on if you are interested. Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master --- Key: SPARK-5189 URL: https://issues.apache.org/jira/browse/SPARK-5189 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, then setting up all the slaves together. This includes broadcasting files from the lonely master to potentially hundreds of slaves. There are 2 main problems with this approach:
# Broadcasting files from the master to all slaves using [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] (e.g. during [ephemeral-hdfs init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36], or during [Spark setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3]) takes a long time. This time increases as the number of slaves increases. I did some testing in {{us-east-1}}. This is, concretely, what the problem looks like:
|| number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
| 1 | 8m 44s |
| 10 | 13m 45s |
| 25 | 22m 50s |
| 50 | 37m 30s |
| 75 | 51m 30s |
| 99 | 1h 5m 30s |
Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but I think the point is clear enough.
# It's more complicated to add slaves to an existing cluster (a la [SPARK-2008]), since slaves are only configured through the master during the setup of the master itself.
Logically, the operations we want to implement are:
* Provision a Spark node
* Join a node to a cluster (including an empty cluster) as either a master or a slave
* Remove a node from a cluster
We need our scripts to roughly be organized to match the above operations. The goals would be:
# When launching a cluster, enable all cluster nodes to be provisioned in parallel, removing the master-to-slave file broadcast bottleneck.
# Facilitate cluster modifications like adding or removing nodes.
# Enable exploration of infrastructure tools like [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} internals and perhaps even allow us to build [one tool that launches Spark clusters on several different cloud platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].
More concretely, the modifications we need to make are:
* Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with equivalent, slave-side operations.
* Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it fully creates a node that can be used as either a master or slave.
* Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, configures it as a master or slave, and joins it to a cluster.
* Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete that script.
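The launch times in the SPARK-5189 table grow roughly linearly with cluster size. A quick least-squares fit (a back-of-the-envelope sketch, not part of spark-ec2) puts the serialized per-slave cost at roughly 0.58 minutes, i.e. about 35 seconds per additional slave:

```python
# Launch times from the table above, as (slaves, minutes).
data = [(1, 8 + 44 / 60.0), (10, 13.75), (25, 22 + 50 / 60.0),
        (50, 37.5), (75, 51.5), (99, 65.5)]

def per_slave_minutes(points):
    """Ordinary least-squares slope: extra launch minutes added by
    each additional slave while setup is serialized via the master."""
    n = float(len(points))
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den
```

A near-constant per-slave increment is consistent with the per-node work being serialized through the master, which is the bottleneck this ticket wants to remove.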
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5253) LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
[ https://issues.apache.org/jira/browse/SPARK-5253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5253. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4259 [https://github.com/apache/spark/pull/4259] LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package --- Key: SPARK-5253 URL: https://issues.apache.org/jira/browse/SPARK-5253 Project: Spark Issue Type: New Feature Components: ML Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7198) VectorAssembler should carry ML metadata
Xiangrui Meng created SPARK-7198: Summary: VectorAssembler should carry ML metadata Key: SPARK-7198 URL: https://issues.apache.org/jira/browse/SPARK-7198 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.4.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Now it only outputs assembled vectors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7195) Can't start spark shell or pyspark in Windows 7
[ https://issues.apache.org/jira/browse/SPARK-7195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-7195. -- Resolution: Duplicate Have a look around JIRA first. Can't start spark shell or pyspark in Windows 7 --- Key: SPARK-7195 URL: https://issues.apache.org/jira/browse/SPARK-7195 Project: Spark Issue Type: Bug Components: PySpark, Spark Shell Affects Versions: 1.3.1 Environment: Windows 7, Java 8 (1.8.0_31) or Java 7 (1.7.0_79), Scala 2.11.6, Python 2.7 Reporter: Mark Smiley
{noformat}
cd\spark\bin
dir
spark-shell
{noformat}
yields the following error:
{noformat}
find: 'version': No such file or directory
else was unexpected at this time
{noformat}
Same error with spark-shell2.cmd. The PySpark shell starts but with errors and doesn't work properly once started (e.g., can't find sc). Can send screenshot of errors on request. Using the Spark 1.3.1 for Hadoop 2.6 binary. Note: Hadoop is not installed on the machine. Scala works by itself, Python works by itself, and Java works fine (I use it all the time). Based on another comment, tried Java 7 (1.7.0_79), but it made no difference (same error).
JAVA_HOME = C:\jdk1.8.0\bin C:\jdk1.8.0\bin\;C:\Program Files (x86)\scala\bin;C:\Python27;c:\Rtools\bin;c:\Rtools\gcc-4.6.3\bin;C:\Oracle\product64\12.1.0\client_1\bin;C:\Oracle\product\12.1.0\client_1\bin;C:\ProgramData\Oracle\Java\javapath;C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\Dell\Dell Data Protection\Access\Advanced\Wave\Gemalto\Access Client\v5\;C:\Program Files (x86)\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\Program Files\NTRU Cryptosystems\NTRU TCG Software Stack\bin\;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x86;C:\Program Files (x86)\Intel\OpenCL SDK\2.0\bin\x64;C:\Program Files\MiKTeX 2.9\miktex\bin\x64\;C:\Program Files (x86)\ActivIdentity\ActivClient\;C:\Program Files\ActivIdentity\ActivClient\ -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5529) BlockManager heartbeat expiration does not kill executor
[ https://issues.apache.org/jira/browse/SPARK-5529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517298#comment-14517298 ] Alex Rovner commented on SPARK-5529: Sorry, I pulled the trigger too quickly... Need to resolve some compilation errors. BlockManager heartbeat expiration does not kill executor Key: SPARK-5529 URL: https://issues.apache.org/jira/browse/SPARK-5529 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.2.0 Reporter: Hong Shen Assignee: Hong Shen Fix For: 1.4.0 Attachments: SPARK-5529.patch When I run a spark job, one executor hangs; after 120s, its blockManager is removed by the driver, but it takes another half an hour before the executor is removed by the driver. Here is the log:
{code}
15/02/02 14:58:43 WARN BlockManagerMasterActor: Removing BlockManager BlockManagerId(1, 10.215.143.14, 47234) with no recent heart beats: 147198ms exceeds 120000ms
15/02/02 15:26:55 ERROR YarnClientClusterScheduler: Lost executor 1 on 10.215.143.14: remote Akka client disassociated
15/02/02 15:26:55 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@10.215.143.14:46182] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].
15/02/02 15:26:55 INFO TaskSetManager: Re-queueing tasks for 1 from TaskSet 0.0
15/02/02 15:26:55 WARN TaskSetManager: Lost task 3.0 in stage 0.0 (TID 3, 10.215.143.14): ExecutorLostFailure (executor 1 lost)
15/02/02 15:26:55 ERROR YarnClientSchedulerBackend: Asked to remove non-existent executor 1
15/02/02 15:26:55 INFO DAGScheduler: Executor lost: 1 (epoch 0)
15/02/02 15:26:55 INFO BlockManagerMasterActor: Trying to remove executor 1 from BlockManagerMaster.
15/02/02 15:26:55 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-7197) Join with DataFrame Python API not working properly with more than 1 column
Ali Bajwa created SPARK-7197: Summary: Join with DataFrame Python API not working properly with more than 1 column Key: SPARK-7197 URL: https://issues.apache.org/jira/browse/SPARK-7197 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.3.1 Reporter: Ali Bajwa It looks like join with the DataFrames API in Python does not return correct results if using 2 or more columns. The example in the documentation only shows a single column. Here is an example to show the problem:
Example code:
{noformat}
import pandas as pd
from pyspark.sql import SQLContext
hc = SQLContext(sc)
A = pd.DataFrame({'year': ['1993', '2005', '1994'], 'month': ['5', '12', '12'], 'value': [100, 200, 300]})
a = hc.createDataFrame(A)
B = pd.DataFrame({'year': ['1993', '1993'], 'month': ['12', '12'], 'value': [101, 102]})
b = hc.createDataFrame(B)

# try with Pandas
print "Pandas"
print A
print B
print pd.merge(A, B, on=['year', 'month'], how='inner')

print "Spark"
print a.toPandas()
print b.toPandas()
print a.join(b, a.year==b.year and a.month==b.month, 'inner').toPandas()
{noformat}
Output:
{noformat}
Pandas
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994
  month  value  year
0    12    101  1993
1    12    102  1993
Empty DataFrame
Columns: [month, value_x, year, value_y]
Index: []
Spark
  month  value  year
0     5    100  1993
1    12    200  2005
2    12    300  1994
  month  value  year
0    12    101  1993
1    12    102  1993
  month  value  year month  value  year
0    12    200  2005    12    102  1993
1    12    200  2005    12    101  1993
2    12    300  1994    12    102  1993
3    12    300  1994    12    101  1993
{noformat}
It looks like Spark returns some results where an inner join should return nothing. Confirmed on the user mailing list as an issue with Ayan Guha. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
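The Spark output in SPARK-7197 is exactly what a join on month alone would produce, which points at the join condition rather than the join itself. A plausible root cause, sketched with a stand-in Column class (an illustration of Python semantics, not PySpark's implementation): Python's `and` cannot be overloaded, so with a truthy left operand it simply returns the right operand, silently dropping the year test; the overloadable `&` operator keeps both predicates.

```python
class Column(object):
    """Stand-in for a Spark SQL Column: an ordinary (truthy)
    expression object."""
    def __init__(self, expr):
        self.expr = expr
    def __and__(self, other):
        # the operator PySpark actually overloads for `&`
        return Column("(%s AND %s)" % (self.expr, other.expr))

year_eq = Column("a.year = b.year")
month_eq = Column("a.month = b.month")

# `x and y` returns y when x is truthy: the year test vanishes.
cond = year_eq and month_eq
assert cond is month_eq

# `&` combines both predicates, which is why PySpark requires it.
both = year_eq & month_eq
assert both.expr == "(a.year = b.year AND a.month = b.month)"
```

This is consistent with the reported result: every pair of rows with month 12 is joined, regardless of year.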
[jira] [Created] (SPARK-7196) decimal precision lost when loading DataFrame from JDBC
Ken Geis created SPARK-7196: --- Summary: decimal precision lost when loading DataFrame from JDBC Key: SPARK-7196 URL: https://issues.apache.org/jira/browse/SPARK-7196 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Ken Geis I have a decimal database field that is defined as 10.2 (i.e. ##.##). When I load it into Spark via sqlContext.jdbc(..), the type of the corresponding field in the DataFrame is DecimalType, with precisionInfo None. Because of that loss of precision information, SPARK-4176 is triggered when I try to .saveAsTable(..). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7140) Do not scan all values in Vector.hashCode
[ https://issues.apache.org/jira/browse/SPARK-7140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7140. -- Resolution: Fixed Fix Version/s: 1.4.0 1.3.2 Issue resolved by pull request 5697 [https://github.com/apache/spark/pull/5697] Do not scan all values in Vector.hashCode - Key: SPARK-7140 URL: https://issues.apache.org/jira/browse/SPARK-7140 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.3.2, 1.4.0 It makes hashCode really expensive. The Pyrolite version we are using in Spark calls it in serialization. Scanning the first few nonzeros should be sufficient. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
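The idea in SPARK-7140 can be sketched as follows (`vector_hash`, the mixing constants, and the 16-nonzero cutoff are illustrative assumptions, not Spark's actual hashCode):

```python
def vector_hash(values, max_nnz=16):
    """Fold at most max_nnz leading non-zeros -- index and value --
    into the hash, instead of scanning the whole vector."""
    h = 17
    seen = 0
    for i, v in enumerate(values):
        if v != 0.0:
            h = (31 * h + i) % 2 ** 32
            h = (31 * h + hash(v)) % 2 ** 32
            seen += 1
            if seen == max_nnz:
                break
    return h
```

Vectors that agree on their leading non-zeros hash alike, which weakens the hash slightly but makes it O(1)-ish for the serialization path mentioned in the ticket.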
[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517946#comment-14517946 ] koert kuipers edited comment on SPARK-3655 at 4/28/15 8:18 PM: --- since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regards to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: a sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]:
{noformat}
def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)]
{noformat}
(note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works.) 2) don't do anything. the functionality this jira targets is already available in the small smart-sorted library which is available on spark-packages, and that's good enough.
was (Author: koert): since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regards to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: an sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]: def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)] (note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works. 2) don't to anything. the functionality this jira targets is already available in the small smart-sorted library which is available on spark-packages, and that's good enough. Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
There are some use cases where getting a sorted iterator of values per key is helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517946#comment-14517946 ] koert kuipers edited comment on SPARK-3655 at 4/28/15 8:19 PM: --- since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regards to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: a sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]:
{noformat}
def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)]
{noformat}
(note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works.) 2) don't do anything. the functionality this jira targets is already available in the small spark-sorted library which is available on spark-packages, and that's good enough.
was (Author: koert): since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regard to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: a sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]: {noformat} def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)] {noformat} (note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works.) 2) don't do anything. the functionality this jira targets is already available in the small smart-sorted library which is available on spark-packages, and that's good enough. Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Now that spark has a sort based shuffle, can we expect a secondary sort soon? 
There are some use cases where getting a sorted iterator of values per key is helpful.
[jira] [Assigned] (SPARK-7205) Support local ivy cache in --packages
[ https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7205: --- Assignee: (was: Apache Spark) Support local ivy cache in --packages - Key: SPARK-7205 URL: https://issues.apache.org/jira/browse/SPARK-7205 Project: Spark Issue Type: Bug Components: Spark Submit Reporter: Burak Yavuz Priority: Critical Fix For: 1.4.0
[jira] [Commented] (SPARK-7205) Support local ivy cache in --packages
[ https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517988#comment-14517988 ] Apache Spark commented on SPARK-7205: - User 'brkyvz' has created a pull request for this issue: https://github.com/apache/spark/pull/5755 Support local ivy cache in --packages - Key: SPARK-7205 URL: https://issues.apache.org/jira/browse/SPARK-7205 Project: Spark Issue Type: Bug Components: Spark Submit Reporter: Burak Yavuz Priority: Critical Fix For: 1.4.0
[jira] [Assigned] (SPARK-7205) Support local ivy cache in --packages
[ https://issues.apache.org/jira/browse/SPARK-7205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7205: --- Assignee: Apache Spark Support local ivy cache in --packages - Key: SPARK-7205 URL: https://issues.apache.org/jira/browse/SPARK-7205 Project: Spark Issue Type: Bug Components: Spark Submit Reporter: Burak Yavuz Assignee: Apache Spark Priority: Critical Fix For: 1.4.0
[jira] [Created] (SPARK-7204) Call sites in UI are not accurate for DataFrame operations
Patrick Wendell created SPARK-7204: -- Summary: Call sites in UI are not accurate for DataFrame operations Key: SPARK-7204 URL: https://issues.apache.org/jira/browse/SPARK-7204 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.3.1 Reporter: Patrick Wendell Assignee: Patrick Wendell Priority: Critical Spark core computes call sites by climbing up the stack until we reach the stack frame at the boundary of user code and Spark code. The way we compute whether a given frame is internal (Spark) or user code does not work correctly with the new DataFrame API. Once the scope work goes in, we'll have a nicer way to express units of operator scope, but until then there is a simple fix where we just make sure the SQL internal classes are also skipped as we climb up the stack.
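The frame-climbing logic described above can be sketched in a few lines of Python. This is a simplified illustration, not Spark's actual implementation (which lives in Scala in `Utils.getCallSite`); the package-prefix lists and class names below are invented for the example.

```python
def first_user_frame(frames, internal_prefixes):
    """Walk the stack from innermost frame outward and return the first
    frame not matching an internal prefix: the call site shown in the UI."""
    for qualified_name in frames:
        # str.startswith accepts a tuple of prefixes
        if not qualified_name.startswith(internal_prefixes):
            return qualified_name
    return "<unknown>"

# A hypothetical stack for a DataFrame action.
stack = [
    "org.apache.spark.sql.DataFrame.collect",  # Spark SQL internals
    "com.example.MyJob.run",                   # actual user code
]

# Before the fix: the SQL packages are not recognized as internal,
# so a Spark-internal frame is reported as the call site.
before = first_user_frame(stack, ("org.apache.spark.rdd.", "org.apache.spark.scheduler."))
# After the fix: the SQL internal classes are also skipped.
after = first_user_frame(
    stack,
    ("org.apache.spark.rdd.", "org.apache.spark.scheduler.", "org.apache.spark.sql."),
)
print(before)  # org.apache.spark.sql.DataFrame.collect
print(after)   # com.example.MyJob.run
```

The "simple fix" the ticket mentions is then just extending the internal-prefix set, which is exactly the difference between the two calls above.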
[jira] [Updated] (SPARK-5338) Support cluster mode with Mesos
[ https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5338: - Affects Version/s: 1.0.0 Support cluster mode with Mesos --- Key: SPARK-5338 URL: https://issues.apache.org/jira/browse/SPARK-5338 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 1.0.0 Reporter: Timothy Chen Fix For: 1.4.0 Currently, when using Spark with Mesos, the only supported deployment is client mode. It is also useful to have a cluster mode deployment that can be shared and long running.
[jira] [Commented] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14517946#comment-14517946 ] koert kuipers commented on SPARK-3655: -- since the last pullreq for this ticket i created spark-sorted (based on suggestions from imran), a small library for spark that supports the target features of this ticket, but without the burden of having to be fully compatible with the current spark api conventions (with regard to ordering being implicit). i also got a chance to catch up with sandy at spark summit east and we exchanged some emails afterward about this jira ticket and possible design choices. so based on those experiences i think there are better alternatives than the current pullreq (https://github.com/apache/spark/pull/3632), and i will close it. the pullreq does bring secondary sort to spark, but only in memory, which is a very limited feature (since if the values can be stored in memory then sorting after the shuffle isn't really that hard, just wasteful). instead of the current pullreq i see 2 alternatives: 1) a new pullreq that introduces the mapStream api, which is very similar to the reduce operation as we know it in hadoop: a sorted streaming reduce. Its signature would be something like this on RDD[(K, V)]: {noformat} def mapStreamByKey[W](partitioner: Partitioner, f: Iterator[V] => Iterator[W])(implicit o1: Ordering[K], o2: Ordering[V]): RDD[(K, W)] {noformat} (note that the implicits would not actually be on the method as shown here, but on a class conversion, similar to how PairRDDFunctions works.) 2) don't do anything. the functionality this jira targets is already available in the small smart-sorted library which is available on spark-packages, and that's good enough. Support sorting of values in addition to keys (i.e. secondary sort) --- Key: SPARK-3655 URL: https://issues.apache.org/jira/browse/SPARK-3655 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.1.0, 1.2.0 Reporter: koert kuipers Assignee: Koert Kuipers Now that spark has a sort based shuffle, can we expect a secondary sort soon? There are some use cases where getting a sorted iterator of values per key is helpful.
[jira] [Closed] (SPARK-5338) Support cluster mode with Mesos
[ https://issues.apache.org/jira/browse/SPARK-5338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5338. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Timothy Chen Target Version/s: 1.4.0 Support cluster mode with Mesos --- Key: SPARK-5338 URL: https://issues.apache.org/jira/browse/SPARK-5338 Project: Spark Issue Type: Improvement Components: Mesos Affects Versions: 1.0.0 Reporter: Timothy Chen Assignee: Timothy Chen Fix For: 1.4.0 Currently, when using Spark with Mesos, the only supported deployment is client mode. It is also useful to have a cluster mode deployment that can be shared and long running.
[jira] [Commented] (SPARK-6943) Graphically show RDD's included in a stage
[ https://issues.apache.org/jira/browse/SPARK-6943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518084#comment-14518084 ] Andrew Or commented on SPARK-6943: -- Yeah ideally we will have the job graph that magnifies into the stage graph. I'll see what I can do. Graphically show RDD's included in a stage -- Key: SPARK-6943 URL: https://issues.apache.org/jira/browse/SPARK-6943 Project: Spark Issue Type: Sub-task Components: Spark Core, SQL Reporter: Patrick Wendell Assignee: Andrew Or Attachments: DAGvisualizationintheSparkWebUI.pdf, with-closures.png, with-stack-trace.png
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518141#comment-14518141 ] Joseph K. Bradley commented on SPARK-5556: -- Great! I'm not aware of blockers. As far as other active implementations, the only ones I know of are those referenced by [~gq] above. Please do ping him on your work and see if there are ideas which can be merged. We can help with the coordination and discussions as well. Thanks! Latent Dirichlet Allocation (LDA) using Gibbs sampler -- Key: SPARK-5556 URL: https://issues.apache.org/jira/browse/SPARK-5556 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Guoqiang Li Assignee: Pedro Rodriguez
[jira] [Issue Comment Deleted] (SPARK-5014) GaussianMixture (GMM) improvements
[ https://issues.apache.org/jira/browse/SPARK-5014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5014: - Comment: was deleted (was: No need for umbrella JIRA) GaussianMixture (GMM) improvements -- Key: SPARK-5014 URL: https://issues.apache.org/jira/browse/SPARK-5014 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.2.0 Reporter: Joseph K. Bradley This is an umbrella JIRA for improvements to Gaussian Mixture Models (GMMs).
[jira] [Assigned] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7208: --- Assignee: Apache Spark (was: Joseph K. Bradley) Add Matrix, SparseMatrix to __all__ list in linalg.py - Key: SPARK-7208 URL: https://issues.apache.org/jira/browse/SPARK-7208 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Apache Spark Priority: Trivial
[jira] [Assigned] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7208: --- Assignee: Joseph K. Bradley (was: Apache Spark) Add Matrix, SparseMatrix to __all__ list in linalg.py - Key: SPARK-7208 URL: https://issues.apache.org/jira/browse/SPARK-7208 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial
[jira] [Created] (SPARK-7209) Adding new Manning book Spark in Action to the official Spark Webpage
Aleksandar Dragosavljevic created SPARK-7209: Summary: Adding new Manning book Spark in Action to the official Spark Webpage Key: SPARK-7209 URL: https://issues.apache.org/jira/browse/SPARK-7209 Project: Spark Issue Type: Task Components: Documentation Reporter: Aleksandar Dragosavljevic Priority: Minor Manning Publications is developing a book Spark in Action written by Marko Bonaci and Petar Zecevic (http://www.manning.com/bonaci), and it would be great if the book could be added to the list of books at the official Spark Webpage (https://spark.apache.org/documentation.html). This book teaches readers to use Spark for stream and batch data processing. It starts with an introduction to the Spark architecture and ecosystem, followed by a taste of Spark's command-line interface. Readers then discover the most fundamental concepts and abstractions of Spark, particularly Resilient Distributed Datasets (RDDs) and the basic data transformations that RDDs provide. The first part of the book also introduces you to writing Spark applications using the core APIs. Next, you learn about different Spark components: how to work with structured data using Spark SQL, how to process near-real-time data with Spark Streaming, how to apply machine learning algorithms with Spark MLlib, how to apply graph algorithms on graph-shaped data using Spark GraphX, and a clear introduction to Spark clustering. The book is already available to the public as part of our Manning Early Access Program (MEAP), where we deliver chapters to the public as soon as they are written. We believe it will offer significant support to Spark users and the community.
[jira] [Commented] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518075#comment-14518075 ] Apache Spark commented on SPARK-7208: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/5759 Add Matrix, SparseMatrix to __all__ list in linalg.py - Key: SPARK-7208 URL: https://issues.apache.org/jira/browse/SPARK-7208 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial
[jira] [Commented] (SPARK-7213) Exception while copying Hadoop config files due to permission issues
[ https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14518201#comment-14518201 ] Apache Spark commented on SPARK-7213: - User 'nishkamravi2' has created a pull request for this issue: https://github.com/apache/spark/pull/5760 Exception while copying Hadoop config files due to permission issues Key: SPARK-7213 URL: https://issues.apache.org/jira/browse/SPARK-7213 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi
[jira] [Assigned] (SPARK-7213) Exception while copying Hadoop config files due to permission issues
[ https://issues.apache.org/jira/browse/SPARK-7213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-7213: --- Assignee: Apache Spark Exception while copying Hadoop config files due to permission issues Key: SPARK-7213 URL: https://issues.apache.org/jira/browse/SPARK-7213 Project: Spark Issue Type: Bug Reporter: Nishkam Ravi Assignee: Apache Spark
[jira] [Updated] (SPARK-7208) Add Matrix, SparseMatrix to __all__ list in linalg.py
[ https://issues.apache.org/jira/browse/SPARK-7208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7208: - Summary: Add Matrix, SparseMatrix to __all__ list in linalg.py (was: Add SparseMatrix to __all__ list in linalg.py) Add Matrix, SparseMatrix to __all__ list in linalg.py - Key: SPARK-7208 URL: https://issues.apache.org/jira/browse/SPARK-7208 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 1.4.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Priority: Trivial