[jira] [Resolved] (SPARK-958) When the number of ALS iterations increases to 10 in local mode, Spark throws a StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-958. - Resolution: Duplicate When the number of ALS iterations increases to 10 in local mode, Spark throws a StackOverflowError - Key: SPARK-958 URL: https://issues.apache.org/jira/browse/SPARK-958 Project: Spark Issue Type: Bug Reporter: Qiuzhuang Lian I tried using the ml-100k data to test ALS in local mode in the mllib project. If I specify fewer than 10 iterations, it works well. However, when the iteration count is increased to more than 10, Spark throws a StackOverflowError. Attached is the log file. -- This message was sent by Atlassian JIRA (v6.2#6252)
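[Editor's note] This class of StackOverflowError is typically caused by the RDD lineage growing with every iteration. A minimal sketch of the usual remedy, periodic checkpointing, on a stand-in iterative job (the loop body and the interval of 5 are illustrative assumptions, not the actual ALS fix):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // implicit RDD[Double] operations in 1.x-era Spark

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("checkpoint-demo"))
    sc.setCheckpointDir("/tmp/spark-checkpoints") // use a reliable directory (HDFS on a cluster)
    var v = sc.parallelize(1 to 1000).map(_.toDouble)
    for (iter <- 1 to 50) {
      v = v.map(_ * 0.99)   // stand-in for one ALS iteration; each pass deepens the lineage
      if (iter % 5 == 0) {
        v.checkpoint()      // truncate the lineage so task serialization stays shallow
        v.count()           // force evaluation so the checkpoint is actually written
      }
    }
    println(v.sum())
    sc.stop()
  }
}
{code}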
[jira] [Created] (SPARK-1410) Class not found exception with application launched from sbt 0.13.x
Xiangrui Meng created SPARK-1410: Summary: Class not found exception with application launched from sbt 0.13.x Key: SPARK-1410 URL: https://issues.apache.org/jira/browse/SPARK-1410 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Xiangrui Meng sbt 0.13.x uses its own class loader, which is not available on the worker side: org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: sbt.classpath.ClasspathFilter@47ed40d A workaround is to switch to sbt 0.12.4. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1410) Class not found exception with application launched from sbt 0.13.x
[ https://issues.apache.org/jira/browse/SPARK-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13959363#comment-13959363 ] Xiangrui Meng commented on SPARK-1410: -- The code is available at https://github.com/mengxr/mllib-grid-search If sbt.version is changed to 0.13.0 or 0.13.1, `sbt/sbt run ...` throws the following exception:
~~~
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: sbt.classpath.ClasspathFilter@45307fb
org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: sbt.classpath.ClasspathFilter@45307fb
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1010)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1008)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1008)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:595)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:595)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:595)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:146)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
~~~
Class not found exception with application launched from sbt 0.13.x --- Key: SPARK-1410 URL: https://issues.apache.org/jira/browse/SPARK-1410 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Xiangrui Meng sbt 0.13.x uses its own class loader, which is not available on the worker side: org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: sbt.classpath.ClasspathFilter@47ed40d A workaround is to switch to sbt 0.12.4. -- This message was sent by Atlassian JIRA (v6.2#6252)
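[Editor's note] The workaround mentioned above amounts to pinning the launcher version in the standard sbt location (path per sbt convention; not quoted from the issue):
{code}
# project/build.properties
sbt.version=0.12.4
{code}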
[jira] [Updated] (SPARK-1434) Make labelParser Java friendly.
[ https://issues.apache.org/jira/browse/SPARK-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1434: - Component/s: MLlib Make labelParser Java friendly. --- Key: SPARK-1434 URL: https://issues.apache.org/jira/browse/SPARK-1434 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor Fix For: 1.0.0 MLUtils#loadLibSVMData uses an anonymous function for the label parser, which Java users won't like. So I made LabelParser a trait and provided two implementations: binary and multiclass. -- This message was sent by Atlassian JIRA (v6.2#6252)
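[Editor's note] A minimal sketch of the trait-based design described above (names follow the description; the exact committed signatures may differ):
{code}
trait LabelParser extends Serializable {
  def parse(labelString: String): Double
}

object BinaryLabelParser extends LabelParser {
  // Treat any positive value as the positive class.
  override def parse(labelString: String): Double =
    if (labelString.toDouble > 0) 1.0 else 0.0
}

object MulticlassLabelParser extends LabelParser {
  // Keep the numeric label as-is.
  override def parse(labelString: String): Double = labelString.toDouble
}
{code}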
[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLlib
[ https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13962048#comment-13962048 ] Xiangrui Meng commented on SPARK-1406: -- I think we should support PMML import/export in MLlib. PMML also covers feature transformations, for which MLlib has very limited support at this time. The questions are 1) how we can leverage existing PMML packages, and 2) how many people will volunteer. Sean, it would be super helpful if you could share some experience from Oryx's PMML support, since I'm not sure whether this is the right time to start. PMML model evaluation support via MLlib -- Key: SPARK-1406 URL: https://issues.apache.org/jira/browse/SPARK-1406 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Thomas Darimont It would be useful if Spark provided support for evaluating PMML models (http://www.dmg.org/v4-2/GeneralStructure.html). This would allow analytical models created with statistical modeling tools like R, SAS, SPSS, etc. to be used with Spark (MLlib), which would perform the actual model evaluation for a given input tuple. The PMML model would then just contain the parameterization of the analytical model. Other projects like JPMML-Evaluator do a similar thing: https://github.com/jpmml/jpmml/tree/master/pmml-evaluator -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1218) Minibatch SGD with random sampling
[ https://issues.apache.org/jira/browse/SPARK-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1218. -- Resolution: Fixed Fix Version/s: 0.9.0 Fixed in 0.9.0 or an earlier version. Minibatch SGD with random sampling -- Key: SPARK-1218 URL: https://issues.apache.org/jira/browse/SPARK-1218 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ameet Talwalkar Assignee: Shivaram Venkataraman Fix For: 0.9.0 Takes a gradient function as input. At each iteration, we run stochastic gradient descent locally on each worker with a fraction of the data points selected randomly and with replacement (i.e., sampled points may overlap across iterations). -- This message was sent by Atlassian JIRA (v6.2#6252)
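[Editor's note] A sketch of one such minibatch step under stated assumptions (dense Array[Double] features and a caller-supplied gradient; not the committed GradientDescent code):
{code}
import org.apache.spark.rdd.RDD

// One minibatch SGD step: sample a fraction of points with replacement,
// average their gradients, and take a gradient step.
def sgdStep(
    data: RDD[Array[Double]],
    w: Array[Double],
    miniBatchFraction: Double,
    stepSize: Double,
    seed: Long,
    gradient: (Array[Double], Array[Double]) => Array[Double]): Array[Double] = {
  val (gradSum, count) = data
    .sample(true, miniBatchFraction, seed) // with replacement; samples may overlap across iterations
    .map(x => (gradient(w, x), 1L))
    .reduce { case ((g1, c1), (g2, c2)) =>
      (g1.zip(g2).map { case (a, b) => a + b }, c1 + c2)
    }
  w.zip(gradSum).map { case (wi, gi) => wi - stepSize * gi / count }
}
{code}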
[jira] [Resolved] (SPARK-1217) Add proximal gradient updater.
[ https://issues.apache.org/jira/browse/SPARK-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1217. -- Resolution: Fixed Fix Version/s: 0.9.0 Add proximal gradient updater. -- Key: SPARK-1217 URL: https://issues.apache.org/jira/browse/SPARK-1217 Project: Spark Issue Type: Bug Components: MLlib Reporter: Ameet Talwalkar Fix For: 0.9.0 Add proximal gradient updater, in particular for L1 regularization. -- This message was sent by Atlassian JIRA (v6.2#6252)
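[Editor's note] For reference, the proximal operator of the L1 norm is soft-thresholding; a minimal sketch of one proximal gradient step over plain arrays (not the committed Updater API):
{code}
// prox_{t * lambda * ||.||_1}(v)_i = sign(v_i) * max(|v_i| - t * lambda, 0)
def softThreshold(v: Array[Double], shrinkage: Double): Array[Double] =
  v.map(vi => math.signum(vi) * math.max(math.abs(vi) - shrinkage, 0.0))

// Gradient step on the smooth loss, then the prox step on the L1 penalty.
def proximalStep(w: Array[Double], grad: Array[Double],
                 stepSize: Double, lambda: Double): Array[Double] = {
  val shifted = w.zip(grad).map { case (wi, gi) => wi - stepSize * gi }
  softThreshold(shifted, stepSize * lambda)
}
{code}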
[jira] [Resolved] (SPARK-1219) Minibatch SGD with disjoint partitions
[ https://issues.apache.org/jira/browse/SPARK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1219. -- Resolution: Fixed Implemented in 0.9.0 or an earlier version. Minibatch SGD with disjoint partitions -- Key: SPARK-1219 URL: https://issues.apache.org/jira/browse/SPARK-1219 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Ameet Talwalkar Takes a gradient function as input. At each iteration, we run stochastic gradient descent locally on each worker with a fraction (alpha) of the data points selected randomly and disjointly (i.e., we ensure that we touch all data points after at most 1/alpha iterations). -- This message was sent by Atlassian JIRA (v6.2#6252)
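[Editor's note] A sketch of how disjoint minibatches could be prepared with randomSplit (an illustration of the sampling scheme, not the proposed implementation):
{code}
import org.apache.spark.rdd.RDD

// Split the data into ceil(1/alpha) disjoint minibatches up front; cycling through
// them guarantees every point is touched after at most 1/alpha iterations.
def disjointMinibatches[T](data: RDD[T], alpha: Double, seed: Long): Array[RDD[T]] = {
  val numBatches = math.ceil(1.0 / alpha).toInt
  data.randomSplit(Array.fill(numBatches)(1.0 / numBatches), seed)
}
{code}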
[jira] [Commented] (SPARK-1357) [MLLIB] Annotate developer and experimental API's
[ https://issues.apache.org/jira/browse/SPARK-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964405#comment-13964405 ] Xiangrui Meng commented on SPARK-1357: -- Hi Sean, Actually, you came in just in time. This was only the first pass, and we are still accepting API visibility/annotation patches during the QA period. MLlib is still a beta component of Spark, so 1.0 doesn't mean it is stable. And we still accept additions (JIRAs submitted before April 1) to MLlib, as Patrick announced on the dev mailing list. (I do want to mark all of MLlib experimental to reserve the right to change it in the future, but we need to find a balance point here.) I agree that it is future-proof to switch the id type from Int to Long in ALS. The extra storage requirement is 8 bytes per rating. Inside ALS, we also re-partition the ratings, which needs extra storage. We need to consider whether we want to switch to Long completely or provide an option to use Long ids. Could you submit a patch, either marking ALS experimental or allowing Long ids? I don't think a String id type is necessary, because we can always create a map between String ids and Long ids, and a String id usually costs more than a Long id. For the same reason, classification uses Double for labels. Please also submit a patch for any APIs you don't feel comfortable calling stable, or that I marked experimental/developer but you see the other way. It would be great to keep the discussion going. Thanks! Best, Xiangrui [MLLIB] Annotate developer and experimental API's - Key: SPARK-1357 URL: https://issues.apache.org/jira/browse/SPARK-1357 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Xiangrui Meng Fix For: 1.0.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
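[Editor's note] A sketch of the String-to-Long id mapping mentioned above, using zipWithUniqueId (the record type and field names are illustrative assumptions):
{code}
import org.apache.spark.SparkContext._ // pair-RDD operations (joins) in 1.x-era Spark
import org.apache.spark.rdd.RDD

// Illustrative record type; field names are assumptions.
case class RawRating(user: String, product: String, rating: Double)

// Assign Long ids to String ids, then translate the ratings to numeric ids for ALS.
def toNumericIds(raw: RDD[RawRating]): RDD[(Long, Long, Double)] = {
  val userIds = raw.map(_.user).distinct().zipWithUniqueId()       // RDD[(String, Long)]
  val productIds = raw.map(_.product).distinct().zipWithUniqueId()
  raw.map(r => (r.user, (r.product, r.rating)))
    .join(userIds)
    .map { case (_, ((product, rating), uid)) => (product, (uid, rating)) }
    .join(productIds)
    .map { case (_, ((uid, rating), pid)) => (uid, pid, rating) }
}
{code}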
[jira] [Commented] (SPARK-1215) Clustering: Index out of bounds error
[ https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969718#comment-13969718 ] Xiangrui Meng commented on SPARK-1215: -- The error was due to a small number of points and a large k. The k-means|| initialization doesn't collect more than k candidates. This is very unlikely to appear in practice because k is usually much smaller than the number of points. I will re-visit this issue once we implement better weighted sampling algorithms. Clustering: Index out of bounds error - Key: SPARK-1215 URL: https://issues.apache.org/jira/browse/SPARK-1215 Project: Spark Issue Type: Bug Components: MLlib Reporter: dewshick Assignee: Xiangrui Meng Priority: Minor code:
import org.apache.spark.mllib.clustering._
val test = sc.makeRDD(Array(4,4,4,4,4).map(e => Array(e.toDouble)))
val kmeans = new KMeans().setK(4)
kmeans.run(test)
fails with a java.lang.ArrayIndexOutOfBoundsException:
14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at KMeans.scala:243) finished in 0.047 s
14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at KMeans.scala:243, took 16.389537116 s
Exception in thread "main" java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at com.simontuffs.onejar.Boot.run(Boot.java:340)
        at com.simontuffs.onejar.Boot.main(Boot.java:166)
Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47)
        at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247)
        at org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
        at scala.collection.immutable.Range.foreach(Range.scala:81)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
        at scala.collection.immutable.Range.map(Range.scala:46)
        at org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244)
        at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124)
        at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21)
        at Clustering$$anonfun$1.apply(Clustering.scala:19)
        at Clustering$$anonfun$1.apply(Clustering.scala:19)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
        at scala.collection.immutable.Range.foreach(Range.scala:78)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
        at scala.collection.immutable.Range.map(Range.scala:46)
        at Clustering$.main(Clustering.scala:19)
        at Clustering.main(Clustering.scala)
        ... 6 more
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1503) Implement Nesterov's accelerated first-order method
Xiangrui Meng created SPARK-1503: Summary: Implement Nesterov's accelerated first-order method Key: SPARK-1503 URL: https://issues.apache.org/jira/browse/SPARK-1503 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Nesterov's accelerated first-order method is a drop-in replacement for steepest descent but it converges much faster. We should implement this method and compare its performance with existing algorithms, including SGD and L-BFGS. TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's method and its variants on composite objectives. -- This message was sent by Atlassian JIRA (v6.2#6252)
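[Editor's note] A scalar sketch of the accelerated update with the standard (k-1)/(k+2) momentum coefficient, on a toy quadratic (illustration only, not an MLlib implementation):
{code}
// Minimize a smooth f given its gradient g, starting at x0 with step size t.
def nesterov(g: Double => Double, x0: Double, t: Double, iters: Int): Double = {
  var x = x0 // main iterate
  var y = x0 // extrapolated point
  for (k <- 1 to iters) {
    val xNext = y - t * g(y)                        // gradient step at the extrapolated point
    y = xNext + (k - 1.0) / (k + 2.0) * (xNext - x) // momentum extrapolation
    x = xNext
  }
  x
}

println(nesterov(x => x, x0 = 10.0, t = 0.5, iters = 30)) // f(x) = x^2 / 2; approaches 0
{code}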
[jira] [Created] (SPARK-1506) Documentation improvements for MLlib 1.0
Xiangrui Meng created SPARK-1506: Summary: Documentation improvements for MLlib 1.0 Key: SPARK-1506 URL: https://issues.apache.org/jira/browse/SPARK-1506 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Blocker Proposed TOC:
Linear algebra
* vector and matrix
* distributed matrix
Classification and regression
* generalized linear models
* support vector machine
* decision tree
* naive Bayes
Collaborative filtering
* alternating least squares (ALS)
Clustering
* k-means
Dimension reduction
* principal component analysis (PCA)
Optimization
* stochastic gradient descent
* limited-memory BFGS
-- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1520) Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973291#comment-13973291 ] Xiangrui Meng commented on SPARK-1520: -- I'm using the Java 6 JDK located at /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home on a Mac. It can create a jar with more than 65536 files. I also found this JIRA: https://bugs.openjdk.java.net/browse/JDK-4828461 (Support Zip files with more than 64k entries), which was fixed in version 6. Note that this is for OpenJDK. I'm going to check the headers of assembly jars created by Java 6 and 7. Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6 - Key: SPARK-1520 URL: https://issues.apache.org/jira/browse/SPARK-1520 Project: Spark Issue Type: Bug Components: MLlib, Spark Core Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 This is a real doozie - when compiling a Spark assembly with JDK7, the produced jar does not work well with JRE6. I confirmed the byte code being produced is JDK 6 compatible (major version 50). What happens is that, silently, the JRE will not load any class files from the assembled jar.
{code}
$ sbt/sbt assembly/assembly
$ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator
usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] [FIFO|FAIR]
$ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ui/UIWorkloadGenerator
Caused by: java.lang.ClassNotFoundException: org.apache.spark.ui.UIWorkloadGenerator
        at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. Program will exit.
{code}
I also noticed that if the jar is unzipped, and the classpath set to the current directory, it just works. Finally, if the assembly jar is compiled with JDK6, it also works. The error is seen with any class, not just the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only in master. *Isolation* -I ran a git bisection and this appeared after the MLLib sparse vector patch was merged:- https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534 SPARK-1212 -I narrowed this down specifically to the inclusion of the breeze library. Just adding breeze to an older (unaffected) build triggered the issue.- I've found that if I just unpack and re-pack the jar (using `jar` from java 6 or 7) it always works:
{code}
$ cd assembly/target/scala-2.10/
$ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # fails
$ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
$ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar *
$ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # succeeds
{code}
I also noticed something of note. The Breeze package contains single directories that have huge numbers of files in them (e.g. 2000+ class files in one directory). It's possible we are hitting some weird bugs/corner cases with compatibility of the internal storage format of the jar itself. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1520) Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973306#comment-13973306 ] Xiangrui Meng commented on SPARK-1520: -- When I try to use the Java 6 `jar` tool to list the contents of the assembly jar created by Java 7:
~~~
java.util.zip.ZipException: invalid CEN header (bad signature)
        at java.util.zip.ZipFile.open(Native Method)
        at java.util.zip.ZipFile.init(ZipFile.java:128)
        at java.util.zip.ZipFile.init(ZipFile.java:89)
        at sun.tools.jar.Main.list(Main.java:977)
        at sun.tools.jar.Main.run(Main.java:222)
        at sun.tools.jar.Main.main(Main.java:1147)
~~~
7z shows:
~~~
Path = spark-assembly-1.6.jar
Type = zip
Physical Size = 119682511

Path = spark-assembly-1.7.jar
Type = zip
64-bit = +
Physical Size = 119682587
~~~
I think the limit on the number of files was already increased in Java 6 (at least in the latest update), but Java 7 uses the zip64 format for more than 64k files, and that format cannot be recognized by Java 6. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1520) Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973326#comment-13973326 ] Xiangrui Meng edited comment on SPARK-1520 at 4/17/14 7:59 PM: --- The quick fix may be removing fastutil, so Java 7 still generates the assembly jar in zip format instead of zip64. In RDD#countApproxDistinct, we use HyperLogLog from com.clearspring.analytics:stream, which depends on fastutil. If this is the only place that introduces the fastutil dependency, we should implement HyperLogLog and remove fastutil completely from Spark's dependencies. was (Author: mengxr): The quick fix may be removing fastutil. In RDD#countApproxDistinct, we use HyperLogLog from com.clearspring.analytics:stream, which depends on fastutil. If this is the only place that introduces the fastutil dependency, we should implement HyperLogLog and remove fastutil completely from Spark's dependencies. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1520) Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973326#comment-13973326 ] Xiangrui Meng commented on SPARK-1520: -- The quick fix may be removing fastutil. In RDD#countApproxDistinct, we use HyperLogLog from com.clearspring.analytics:stream, which depends on fastutil. If this is the only place that introduces the fastutil dependency, we should implement HyperLogLog and remove fastutil completely from Spark's dependencies. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1520) Assembly Jar with more than 65536 files won't work when compiled on JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973463#comment-13973463 ] Xiangrui Meng commented on SPARK-1520: -- It seems HyperLogLog doesn't need fastutil, so we can exclude fastutil directly. Will send a patch. Assembly Jar with more than 65536 files won't work when compiled on JDK7 and run on JDK6 - Key: SPARK-1520 URL: https://issues.apache.org/jira/browse/SPARK-1520 Project: Spark Issue Type: Bug Components: MLlib, Spark Core Reporter: Patrick Wendell Priority: Blocker Fix For: 1.0.0 This is a real doozie - when compiling a Spark assembly with JDK7, the produced jar does not work well with JRE6. I confirmed the byte code being produced is JDK 6 compatible (major version 50). What happens is that, silently, the JRE will not load any class files from the assembled jar.
{code}
$ sbt/sbt assembly/assembly
$ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator
usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] [FIFO|FAIR]
$ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ui/UIWorkloadGenerator
Caused by: java.lang.ClassNotFoundException: org.apache.spark.ui.UIWorkloadGenerator
        at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. Program will exit.
{code}
I also noticed that if the jar is unzipped, and the classpath set to the current directory, it just works. Finally, if the assembly jar is compiled with JDK6, it also works. The error is seen with any class, not just the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only in master. h1. Isolation and Cause The package-time behavior of Java 6 and 7 differs with respect to the format used for jar files:
||Number of entries||JDK 6||JDK 7||
|<= 65536|zip|zip|
|> 65536|zip*|zip64|
zip* is a workaround for the original zip format [described in JDK-4828461|https://bugs.openjdk.java.net/browse/JDK-4828461] that allows some versions of Java 6 to support larger assembly jars. The Scala libraries we depend on have added a large number of classes, which bumped us over the limit. This causes the Java 7 packaging to not work with Java 6. We can probably go back under the limit by clearing out some accidental inclusion of FastUtil, but eventually we'll go over again. The real answer is to force people to build with JDK 6 if they want to run Spark on JRE 6. -I've found that if I just unpack and re-pack the jar (using `jar`) it always works:-
{code}
$ cd assembly/target/scala-2.10/
$ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # fails
$ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
$ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar *
$ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # succeeds
{code}
-I also noticed something of note. The Breeze package contains single directories that have huge numbers of files in them (e.g. 2000+ class files in one directory). It's possible we are hitting some weird bugs/corner cases with compatibility of the internal storage format of the jar itself.- -I narrowed this down specifically to the inclusion of the breeze library. Just adding breeze to an older (unaffected) build triggered the issue.- -I ran a git bisection and this appeared after the MLLib sparse vector patch was merged:- https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534 SPARK-1212 -- This message was sent by Atlassian JIRA (v6.2#6252)
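[Editor's note] A hedged sketch of how such an exclusion might look in an sbt build definition (the artifact version and the exclusion rule are assumptions for illustration, not the committed change):
{code}
// Hypothetical sbt dependency exclusion; fastutil lives under organization "it.unimi.dsi".
libraryDependencies += "com.clearspring.analytics" % "stream" % "2.5.1" excludeAll (
  ExclusionRule(organization = "it.unimi.dsi") // drop fastutil, which HyperLogLog doesn't need
)
{code}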
[jira] [Resolved] (SPARK-1464) Update MLLib Examples to Use Breeze
[ https://issues.apache.org/jira/browse/SPARK-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1464. -- Resolution: Duplicate Update MLLib Examples to Use Breeze --- Key: SPARK-1464 URL: https://issues.apache.org/jira/browse/SPARK-1464 Project: Spark Issue Type: Task Components: MLlib Reporter: Patrick Wendell Assignee: Xiangrui Meng Priority: Blocker Fix For: 1.0.0 If we want to deprecate the vector class, we need to update all of the examples to use Breeze. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-1520) Assembly Jar with more than 65536 files won't work when compiled on JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-1520: Assignee: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1533) The (kill) button in the web UI is visible to everyone.
Xiangrui Meng created SPARK-1533: Summary: The (kill) button in the web UI is visible to everyone. Key: SPARK-1533 URL: https://issues.apache.org/jira/browse/SPARK-1533 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Xiangrui Meng We can kill jobs from the web UI now, which is great. But there is no authentication in standalone mode, e.g., on clusters created by spark-ec2, so anyone can visit a standalone server and kill jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
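[Editor's note] Where authentication isn't available, one hedged mitigation is hiding the button altogether, assuming the `spark.ui.killEnabled` flag available in Spark 1.x:
{code}
# conf/spark-defaults.conf
spark.ui.killEnabled false
{code}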
[jira] [Updated] (SPARK-1485) Implement AllReduce
[ https://issues.apache.org/jira/browse/SPARK-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1485: - Affects Version/s: (was: 1.0.0) Implement AllReduce --- Key: SPARK-1485 URL: https://issues.apache.org/jira/browse/SPARK-1485 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng The current implementations of machine learning algorithms rely on the driver for some computation and data broadcasting. This will create a bottleneck at the driver for both computation and communication, especially in multi-model training. An efficient implementation of AllReduce (or AllAggregate) can help free the driver: allReduce(RDD[T], (T, T) => T): RDD[T] This JIRA is created for discussing how to implement AllReduce efficiently and possible alternatives. -- This message was sent by Atlassian JIRA (v6.2#6252)
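[Editor's note] As a baseline for that discussion, a sketch of the proposed signature implemented the naive way, through the driver; this is exactly the bottleneck the proposal wants to remove:
{code}
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Naive AllReduce: all partial results funnel through the driver, then fan back
// out via broadcast. One copy of the result is emitted per partition.
def allReduce[T: ClassTag](rdd: RDD[T], op: (T, T) => T): RDD[T] = {
  val total = rdd.reduce(op)                 // driver-side aggregation (the bottleneck)
  val bc = rdd.sparkContext.broadcast(total) // driver-side redistribution
  rdd.mapPartitions(_ => Iterator.single(bc.value))
}
{code}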
[jira] [Created] (SPARK-1561) sbt/sbt assembly generates too many local files
Xiangrui Meng created SPARK-1561: Summary: sbt/sbt assembly generates too many local files Key: SPARK-1561 URL: https://issues.apache.org/jira/browse/SPARK-1561 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Xiangrui Meng Running `find ./ | wc -l` after `sbt/sbt assembly` returned 564365. This hits the default inode limit of an 8GB ext filesystem (the default volume size for an EC2 instance), which means you can do nothing after `sbt/sbt assembly` on such a partition. Most of the small files are under assembly/target/streams and the same folder under examples/. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1561) sbt/sbt assembly generates too many local files
[ https://issues.apache.org/jira/browse/SPARK-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1561: - Description: Running `find ./ | wc -l` after `sbt/sbt assembly` returned This hits the default limit of #INode of an 8GB EXT FS (the default volume size for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` on such a partition. Most of the small files are under assembly/target/streams and the same folder under examples/. was: Running `find ./ | wc -l` after `sbt/sbt assembly` returned 564365 This hits the default limit of #INode of an 8GB EXT FS (the default volume size for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` on such a partition. Most of the small files are under assembly/target/streams and the same folder under examples/. sbt/sbt assembly generates too many local files --- Key: SPARK-1561 URL: https://issues.apache.org/jira/browse/SPARK-1561 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Xiangrui Meng Running `find ./ | wc -l` after `sbt/sbt assembly` returned This hits the default limit of #INode of an 8GB EXT FS (the default volume size for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` on such a partition. Most of the small files are under assembly/target/streams and the same folder under examples/. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1561) sbt/sbt assembly generates too many local files
[ https://issues.apache.org/jira/browse/SPARK-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13976487#comment-13976487 ] Xiangrui Meng commented on SPARK-1561: -- Tried adding
{code}
assemblyOption in assembly ~= { _.copy(cacheUnzip = false) },
assemblyOption in assembly ~= { _.copy(cacheOutput = false) },
{code}
to SparkBuild.scala, but it didn't help reduce the number of files. sbt/sbt assembly generates too many local files --- Key: SPARK-1561 URL: https://issues.apache.org/jira/browse/SPARK-1561 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Xiangrui Meng Running `find ./ | wc -l` after `sbt/sbt assembly` returned 564365. This hits the default limit of #INode of an 8GB EXT FS (the default volume size for an EC2 instance), which means you can do nothing after `sbt/sbt assembly` on such a partition. Most of the small files are under assembly/target/streams and the same folder under examples/. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1561) sbt/sbt assembly generates too many local files
[ https://issues.apache.org/jira/browse/SPARK-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1561: - Description: Running `find ./ | wc -l` after `sbt/sbt assembly` returned 564365 This hits the default limit of #INode of an 8GB EXT FS (the default volume size for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` on such a partition. Most of the small files are under assembly/target/streams and the same folder under examples/. was: Running `find ./ | wc -l` after `sbt/sbt assembly` returned This hits the default limit of #INode of an 8GB EXT FS (the default volume size for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` on such a partition. Most of the small files are under assembly/target/streams and the same folder under examples/. sbt/sbt assembly generates too many local files --- Key: SPARK-1561 URL: https://issues.apache.org/jira/browse/SPARK-1561 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0 Reporter: Xiangrui Meng Running `find ./ | wc -l` after `sbt/sbt assembly` returned 564365 This hits the default limit of #INode of an 8GB EXT FS (the default volume size for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` on such a partition. Most of the small files are under assembly/target/streams and the same folder under examples/. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1595) Remove VectorRDDs
[ https://issues.apache.org/jira/browse/SPARK-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1595: - Affects Version/s: 1.0.0 Remove VectorRDDs - Key: SPARK-1595 URL: https://issues.apache.org/jira/browse/SPARK-1595 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng This is basically an RDD#map with Vectors.dense; it may be useful for Java users, but it is simple to write one. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1599) Allow using an intercept in Ridge and Lasso
Xiangrui Meng created SPARK-1599: Summary: Allow using an intercept in Ridge and Lasso Key: SPARK-1599 URL: https://issues.apache.org/jira/browse/SPARK-1599 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng AddIntercept in Ridge and Lasso was disabled in 0.9.1 as a quick fix. For 1.0, we should remove this restriction. Penalizing the intercept variable may not be the right thing to do. We can solve this problem by adding weighted regularization later. -- This message was sent by Atlassian JIRA (v6.2#6252)
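[Editor's note] For reference, a sketch in standard notation of what weighted regularization would mean for ridge (symbols are assumed, not from the issue):
{code}
\min_{w,\, b} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - w^\top x_i - b \right)^2
  + \lambda \sum_{j=1}^{d} c_j\, w_j^2, \qquad c_j \ge 0
{code}
Setting c_j = 1 for ordinary features recovers ridge, while treating the intercept as an appended bias column with weight c = 0 leaves it unpenalized.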
[jira] [Updated] (SPARK-1634) Java API docs contain test cases
[ https://issues.apache.org/jira/browse/SPARK-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1634: - Summary: Java API docs contain test cases (was: JavaDoc contains test cases) Java API docs contain test cases Key: SPARK-1634 URL: https://issues.apache.org/jira/browse/SPARK-1634 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Blocker The generated Java API docs contains all test cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1634) Java API docs contain test cases
[ https://issues.apache.org/jira/browse/SPARK-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1634: - Description: The generated Java API docs contain all test cases. (was: The generated Java API docs contains all test cases.) Java API docs contain test cases Key: SPARK-1634 URL: https://issues.apache.org/jira/browse/SPARK-1634 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Blocker The generated Java API docs contain all test cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1635) Java API docs do not show annotation.
Xiangrui Meng created SPARK-1635: Summary: Java API docs do not show annotation. Key: SPARK-1635 URL: https://issues.apache.org/jira/browse/SPARK-1635 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Xiangrui Meng The generated Java API docs do not contain the Developer/Experimental annotations, although the ':: Developer/Experimental ::' tag does appear in the generated doc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1636) Move main methods to examples
Xiangrui Meng created SPARK-1636: Summary: Move main methods to examples Key: SPARK-1636 URL: https://issues.apache.org/jira/browse/SPARK-1636 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Move the main methods to examples and make them compatible with spark-submit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1598) Mark main methods experimental
[ https://issues.apache.org/jira/browse/SPARK-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1598. -- Resolution: Duplicate We will move main methods to examples instead. Mark main methods experimental -- Key: SPARK-1598 URL: https://issues.apache.org/jira/browse/SPARK-1598 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We should treat the parameters in the main methods as part of our APIs. They are not quite consistent at this time, so we should mark them experimental and look for a unified solution in the next sprint. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-1634) Java API docs contain test cases
[ https://issues.apache.org/jira/browse/SPARK-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-1634. Resolution: Not a Problem Fix Version/s: 1.0.0 Assignee: Xiangrui Meng Re-tried with `sbt/sbt clean`. The generated docs for test cases were gone. Java API docs contain test cases Key: SPARK-1634 URL: https://issues.apache.org/jira/browse/SPARK-1634 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Blocker Fix For: 1.0.0 The generated Java API docs contain all test cases. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1668) Add implicit preference as an option to examples/MovieLensALS
Xiangrui Meng created SPARK-1668: Summary: Add implicit preference as an option to examples/MovieLensALS Key: SPARK-1668 URL: https://issues.apache.org/jira/browse/SPARK-1668 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Priority: Minor Add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/. For evaluation, we should map ratings to the range [0, 1] and compare them with predictions. It would be better if we also added unobserved ratings (assumed negatives) to the evaluation. -- This message was sent by Atlassian JIRA (v6.2#6252)
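[Editor's note] A sketch of the rating mapping described above, assuming the MovieLens 1-5 star scale (the linear map is an illustrative assumption, not the committed example code):
{code}
import org.apache.spark.mllib.recommendation.Rating

// Map explicit 1-5 star ratings linearly onto [0, 1] before comparing
// with predictions from ALS trained on implicit preferences.
def scaleRating(r: Rating): Rating =
  Rating(r.user, r.product, (r.rating - 1.0) / 4.0)
{code}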
[jira] [Created] (SPARK-1674) Interrupted system call error in pyspark's RDD.pipe
Xiangrui Meng created SPARK-1674: Summary: Interrupted system call error in pyspark's RDD.pipe Key: SPARK-1674 URL: https://issues.apache.org/jira/browse/SPARK-1674 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng RDD.pipe's doctest throws an interrupted system call exception on Mac. It can be fixed by wrapping pipe.stdout.readline in an iterator. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1674) Interrupted system call error in pyspark's RDD.pipe
[ https://issues.apache.org/jira/browse/SPARK-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1674. -- Resolution: Fixed Fix Version/s: 1.0.0 Interrupted system call error in pyspark's RDD.pipe --- Key: SPARK-1674 URL: https://issues.apache.org/jira/browse/SPARK-1674 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0 RDD.pipe's doctest throws an interrupted system call exception on Mac. It can be fixed by wrapping pipe.stdout.readline in an iterator. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1520) Assembly Jar with more than 65536 files won't work when compiled on JDK7 and run on JDK6
[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13989206#comment-13989206 ] Xiangrui Meng commented on SPARK-1520: -- Koert, which JDK6 did you use? This problem was fixed in a later version of openjdk-6. If you are using openjdk-6, you can try upgrading it to the latest version and see whether the problem still exists. -- This message was sent by Atlassian JIRA (v6.2#6252)
-I've found that if I just unpack and re-pack the jar (using `jar`) it always works:- {code} $ cd assembly/target/scala-2.10/ $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # fails $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar * $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # succeeds {code} -I also noticed something of note. The Breeze package contains single directories that have huge numbers of files in them (e.g. 2000+ class files in one directory). It's possible we are hitting some weird bugs/corner cases with compatibility of the internal storage format of the jar itself.- -I narrowed this down specifically to the inclusion of the breeze library. Just adding breeze to an older (unaffected) build triggered the issue.- -I ran a git bisection and this appeared after the MLLib sparse vector patch was merged:- https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534 SPARK-1212 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1723) Add saveAsLibSVMFile
Xiangrui Meng created SPARK-1723: Summary: Add saveAsLibSVMFile Key: SPARK-1723 URL: https://issues.apache.org/jira/browse/SPARK-1723 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Provide a method to save labeled data in LIBSVM format. -- This message was sent by Atlassian JIRA (v6.2#6252)
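For reference, a hedged sketch of the target format (the helper name is made up for this sketch): a LIBSVM line is the label followed by 1-based index:value pairs.
{code}
def to_libsvm_line(label, indices, values):
    # LIBSVM text format: "<label> <i1>:<v1> <i2>:<v2> ..." with
    # 1-based, strictly increasing feature indices.
    pairs = " ".join("%d:%s" % (i + 1, v) for i, v in zip(indices, values))
    return "%s %s" % (label, pairs)

print(to_libsvm_line(1.0, [0, 2], [1.5, 3.0]))  # "1.0 1:1.5 3:3.0"
{code}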
[jira] [Created] (SPARK-1724) Add appendBias
Xiangrui Meng created SPARK-1724: Summary: Add appendBias Key: SPARK-1724 URL: https://issues.apache.org/jira/browse/SPARK-1724 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Add `appendBias` to MLUtils to avoid adding the bias term in computation. -- This message was sent by Atlassian JIRA (v6.2#6252)
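A minimal NumPy sketch of the idea, not MLlib's actual implementation: appending a constant 1.0 lets the intercept be learned as just another weight.
{code}
import numpy as np

def append_bias(features):
    # Append a constant 1.0 so the intercept folds into the weights.
    return np.append(np.asarray(features), 1.0)

print(append_bias([0.5, 2.0]))  # values: 0.5, 2.0, 1.0
{code}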
[jira] [Resolved] (SPARK-1595) Remove VectorRDDs
[ https://issues.apache.org/jira/browse/SPARK-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1595. -- Resolution: Fixed Fix Version/s: 1.0.0 https://github.com/apache/spark/pull/524 Remove VectorRDDs - Key: SPARK-1595 URL: https://issues.apache.org/jira/browse/SPARK-1595 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0 This is basically an RDD#map with Vectors.dense, maybe useful for Java users but it is simple to write one. -- This message was sent by Atlassian JIRA (v6.2#6252)
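As a point of reference, the removed helper amounts to a one-line map; a PySpark analogue, assuming an existing SparkContext `sc`:
{code}
from pyspark.mllib.linalg import Vectors

# VectorRDDs was essentially this map over an RDD of arrays.
dense_vectors = sc.parallelize([[1.0, 2.0], [3.0, 4.0]]).map(Vectors.dense)
{code}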
[jira] [Resolved] (SPARK-1723) Add saveAsLibSVMFile
[ https://issues.apache.org/jira/browse/SPARK-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1723. -- Resolution: Fixed Fix Version/s: 1.0.0 https://github.com/apache/spark/pull/524 Add saveAsLibSVMFile Key: SPARK-1723 URL: https://issues.apache.org/jira/browse/SPARK-1723 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0 Provide a method to save labeled data in LIBSVM format. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1596) Re-arrange public methods in evaluation.
[ https://issues.apache.org/jira/browse/SPARK-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1596. -- Resolution: Fixed Fix Version/s: 1.0.0 https://github.com/apache/spark/pull/524 Re-arrange public methods in evaluation. Key: SPARK-1596 URL: https://issues.apache.org/jira/browse/SPARK-1596 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0 binary.BinaryClassificationMetrics is the only public API under evaluation. We should move it one level up, and possibly clean the names. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1724) Add appendBias
[ https://issues.apache.org/jira/browse/SPARK-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1724. -- Resolution: Fixed Fix Version/s: 1.0.0 https://github.com/apache/spark/pull/524 Add appendBias -- Key: SPARK-1724 URL: https://issues.apache.org/jira/browse/SPARK-1724 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0 Add `appendBias` to MLUtils to avoid adding the bias term in computation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1599) Allow to use intercept in Ridge and Lasso
[ https://issues.apache.org/jira/browse/SPARK-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1599. -- Resolution: Fixed Fix Version/s: 1.0.0 https://github.com/apache/spark/pull/524 Allow to use intercept in Ridge and Lasso - Key: SPARK-1599 URL: https://issues.apache.org/jira/browse/SPARK-1599 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.0.0 AddIntercept in Ridge and Lasso was disabled in 0.9.1 for a quick fix. For 1.0, we should remove this restriction. Penalizing the intercept variable may not be the right thing to do. We can solve this problem by adding weighted regularization later. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1741) Add predict(JavaRDD) to predictive models
Xiangrui Meng created SPARK-1741: Summary: Add predict(JavaRDD) to predictive models Key: SPARK-1741 URL: https://issues.apache.org/jira/browse/SPARK-1741 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng `model.predict` returns an RDD of a Scala primitive type (Int/Double), which is recognized as Object in Java. Adding predict(JavaRDD) could make life easier for Java users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1743) Add mllib.util.MLUtils.{loadLibSVMFile, saveAsLibSVMFile} to pyspark
Xiangrui Meng created SPARK-1743: Summary: Add mllib.util.MLUtils.{loadLibSVMFile, saveAsLibSVMFile} to pyspark Key: SPARK-1743 URL: https://issues.apache.org/jira/browse/SPARK-1743 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Make loading/saving labeled data easier for pyspark users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1743) Add mllib.util.MLUtils.{loadLibSVMFile, saveAsLibSVMFile} to pyspark
[ https://issues.apache.org/jira/browse/SPARK-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13991373#comment-13991373 ] Xiangrui Meng commented on SPARK-1743: -- PR: https://github.com/apache/spark/pull/672 Add mllib.util.MLUtils.{loadLibSVMFile, saveAsLibSVMFile} to pyspark Key: SPARK-1743 URL: https://issues.apache.org/jira/browse/SPARK-1743 Project: Spark Issue Type: New Feature Components: MLlib, PySpark Reporter: Xiangrui Meng Assignee: Xiangrui Meng Make loading/saving labeled data easier for pyspark users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1741) Add predict(JavaRDD) to predictive models
[ https://issues.apache.org/jira/browse/SPARK-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13991378#comment-13991378 ] Xiangrui Meng commented on SPARK-1741: -- https://github.com/apache/spark/pull/670 Add predict(JavaRDD) to predictive models - Key: SPARK-1741 URL: https://issues.apache.org/jira/browse/SPARK-1741 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng `model.predict` returns an RDD of a Scala primitive type (Int/Double), which is recognized as Object in Java. Adding predict(JavaRDD) could make life easier for Java users. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1783) Title contains html code in MLlib guide
Xiangrui Meng created SPARK-1783: Summary: Title contains html code in MLlib guide Key: SPARK-1783 URL: https://issues.apache.org/jira/browse/SPARK-1783 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Minor We use --- layout: global title: <a href="mllib-guide.html">MLlib</a> - Clustering --- to create a link in the title to the main page of MLlib's guide. However, the generated title contains raw html code, which shows up in the tab or title bar of the browser. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1675) Make clear whether computePrincipalComponents centers data
[ https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998103#comment-13998103 ] Xiangrui Meng commented on SPARK-1675: -- Centering in PCA should be the standard practice. Make clear whether computePrincipalComponents centers data -- Key: SPARK-1675 URL: https://issues.apache.org/jira/browse/SPARK-1675 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Sandy Ryza Assignee: Sandy Ryza -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1668) Add implicit preference as an option to examples/MovieLensALS
[ https://issues.apache.org/jira/browse/SPARK-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1668. -- Resolution: Implemented Fix Version/s: 1.0.0 Add implicit preference as an option to examples/MovieLensALS - Key: SPARK-1668 URL: https://issues.apache.org/jira/browse/SPARK-1668 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Sandeep Singh Priority: Minor Fix For: 1.0.0 Add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/. For evaluation, we should map ratings to the range [0, 1] and compare them with predictions. It would be better if we added unobserved ratings (assuming negatives) to the evaluation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1635) Java API docs do not show annotation.
[ https://issues.apache.org/jira/browse/SPARK-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1635: - Priority: Minor (was: Major) Java API docs do not show annotation. - Key: SPARK-1635 URL: https://issues.apache.org/jira/browse/SPARK-1635 Project: Spark Issue Type: Bug Components: Documentation Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Minor The generated Java API docs do not contain Developer/Experimental annotations. The :: Developer/Experimental :: tag is in the generated doc. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1696) RowMatrix.dspr is not using parameter alpha for DenseVector
[ https://issues.apache.org/jira/browse/SPARK-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998127#comment-13998127 ] Xiangrui Meng commented on SPARK-1696: -- Thanks! I sent a PR: https://github.com/apache/spark/pull/778 RowMatrix.dspr is not using parameter alpha for DenseVector --- Key: SPARK-1696 URL: https://issues.apache.org/jira/browse/SPARK-1696 Project: Spark Issue Type: Bug Components: MLlib Reporter: Anish Patel Assignee: Xiangrui Meng Priority: Minor In the master branch, method dspr of RowMatrix takes parameter alpha, but does not use it when given a DenseVector. This probably slid by because when method computeGramianMatrix calls dspr, it provides an alpha value of 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1696) RowMatrix.dspr is not using parameter alpha for DenseVector
[ https://issues.apache.org/jira/browse/SPARK-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1696. -- Resolution: Fixed Fix Version/s: 1.0.0 RowMatrix.dspr is not using parameter alpha for DenseVector --- Key: SPARK-1696 URL: https://issues.apache.org/jira/browse/SPARK-1696 Project: Spark Issue Type: Bug Components: MLlib Reporter: Anish Patel Assignee: Xiangrui Meng Priority: Minor Fix For: 1.0.0 In the master branch, method dspr of RowMatrix takes parameter alpha, but does not use it when given a DenseVector. This probably slid by because when method computeGramianMatrix calls dspr, it provides an alpha value of 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1605) Improve mllib.linalg.Vector
[ https://issues.apache.org/jira/browse/SPARK-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998106#comment-13998106 ] Xiangrui Meng commented on SPARK-1605: -- `toBreeze` exposes a breeze type. We might want to mark it DeveloperApi and make it public, but I'm not sure whether we should do that in v1.0. Given a `mllib.linalg.Vector`, you can call `toArray` to get the values or operate directly on DenseVector/SparseVector. Improve mllib.linalg.Vector --- Key: SPARK-1605 URL: https://issues.apache.org/jira/browse/SPARK-1605 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Sandeep Singh Can we make the current Vector a wrapper around breeze.linalg.Vector? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1646) ALS micro-optimisation
[ https://issues.apache.org/jira/browse/SPARK-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1646. -- Resolution: Implemented Fix Version/s: 1.0.0 PR: https://github.com/apache/spark/pull/568 ALS micro-optimisation -- Key: SPARK-1646 URL: https://issues.apache.org/jira/browse/SPARK-1646 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Tor Myklebust Assignee: Tor Myklebust Priority: Trivial Fix For: 1.0.0 Scala for loop bodies turn into methods and the loops themselves into repeated invocations of the body method. This may make Hotspot make poor optimisation decisions. (Xiangrui mentioned that there was a speed improvement from doing similar transformations elsewhere.) The loops on i and p in the ALS training code are prime candidates for this transformation, as is the foreach loop doing regularisation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1359) SGD implementation is not efficient
[ https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1359: - Affects Version/s: 1.0.0 SGD implementation is not efficient --- Key: SPARK-1359 URL: https://issues.apache.org/jira/browse/SPARK-1359 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 0.9.0, 1.0.0 Reporter: Xiangrui Meng The SGD implementation samples a mini-batch to compute the stochastic gradient. This is not efficient because examples are provided via an iterator interface. We have to scan all of them to obtain a sample. -- This message was sent by Atlassian JIRA (v6.2#6252)
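To make the inefficiency concrete, a plain-Python sketch of Bernoulli sampling over an iterator; every example must be visited just to decide membership:
{code}
import random

def sample_mini_batch(examples, fraction, seed=42):
    # Even a 1% mini-batch costs a full scan of the partition,
    # because membership is decided per example.
    rng = random.Random(seed)
    return [x for x in examples if rng.random() < fraction]
{code}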
[jira] [Updated] (SPARK-1485) Implement AllReduce
[ https://issues.apache.org/jira/browse/SPARK-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1485: - Priority: Critical (was: Major) Implement AllReduce --- Key: SPARK-1485 URL: https://issues.apache.org/jira/browse/SPARK-1485 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.1.0 The current implementations of machine learning algorithms rely on the driver for some computation and data broadcasting. This will create a bottleneck at the driver for both computation and communication, especially in multi-model training. An efficient implementation of AllReduce (or AllAggregate) can help free the driver: allReduce(RDD[T], (T, T) => T): RDD[T] This JIRA is created for discussing how to implement AllReduce efficiently and possible alternatives. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999499#comment-13999499 ] Xiangrui Meng commented on SPARK-1782: -- Btw, this approach only gives us \Sigma and V. If we compute U via A V \Sigma^{-1} (the current implementation in MLlib), we very likely lose orthogonality. For sparse SVD, the best package is PROPACK, which implements Lanczos bidiagonalization with partial reorthogonalization (http://sun.stanford.edu/~rmunk/PROPACK/). But let us use ARPACK for now since we can call it from Breeze. svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not be, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be proportional to the number of non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply a square matrix to be decomposed with a dense vector provided by ARPACK, and returns the resulting dense vector to ARPACK. Inside, ARPACK uses an Implicitly Restarted Lanczos Method for symmetric matrices. Outside, what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can easily fit into the memory of the master machine. The overall model is that the master machine runs ARPACK and distributes the matrix-vector multiplications onto worker executors in each iteration. I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On the Spark/mllib side, we just need to implement the sparse-matrix-vector multiplication. It might take some time to optimize and fully test this implementation, so I set the workload estimate to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252)
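A small NumPy sketch of the U = A V \Sigma^{-1} reconstruction mentioned above (illustrative only):
{code}
import numpy as np

def reconstruct_u(A, V, sigma):
    # U = A * V * Sigma^{-1}: divide column j of A.dot(V) by sigma[j].
    # In floating point, the columns of U computed this way gradually
    # lose orthogonality, which is the concern raised above.
    return A.dot(V) / sigma
{code}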
[jira] [Updated] (SPARK-1752) Standardize input/output format for vectors and labeled points
[ https://issues.apache.org/jira/browse/SPARK-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1752: - Fix Version/s: 1.1.0 Standardize input/output format for vectors and labeled points -- Key: SPARK-1752 URL: https://issues.apache.org/jira/browse/SPARK-1752 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Fix For: 1.1.0 We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following: 1. dense vector: [v0,v1,..] 2. sparse vector: (size,[i0,i1],[v0,v1]) 3. labeled point: (label,vector) where (..) indicates a tuple and [...] indicates an array. Those are compatible with Python's syntax and can be easily parsed using `eval`. -- This message was sent by Atlassian JIRA (v6.2#6252)
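Because the proposed formats are valid Python literals, parsing can be sketched with ast.literal_eval (the safe subset of the `eval` mentioned above); this is an illustration, not the final API:
{code}
from ast import literal_eval

def parse_labeled_point(line):
    # "(label,[v0,v1,...])" for dense vectors,
    # "(label,(size,[i0,i1],[v0,v1]))" for sparse vectors.
    label, vector = literal_eval(line)
    return float(label), vector

print(parse_labeled_point("(1.0,[0.5,0.25])"))           # dense
print(parse_labeled_point("(0.0,(4,[1,3],[2.0,3.0]))"))  # sparse
{code}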
[jira] [Updated] (SPARK-1553) Support alternating nonnegative least-squares
[ https://issues.apache.org/jira/browse/SPARK-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1553: - Fix Version/s: 1.1.0 Support alternating nonnegative least-squares - Key: SPARK-1553 URL: https://issues.apache.org/jira/browse/SPARK-1553 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 0.9.0 Reporter: Tor Myklebust Assignee: Tor Myklebust Fix For: 1.1.0 There's already an ALS implementation. It can be tweaked to support nonnegative least-squares by conditionally running a nonnegative least-squares solve instead of a least-squares solver. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1585) Not robust Lasso causes Infinity on weights and losses
[ https://issues.apache.org/jira/browse/SPARK-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999123#comment-13999123 ] Xiangrui Meng commented on SPARK-1585: -- I think the gradient should pull the weights back. If I'm wrong, could you create an example code to demonstrate the problem? -Xiangrui Not robust Lasso causes Infinity on weights and losses -- Key: SPARK-1585 URL: https://issues.apache.org/jira/browse/SPARK-1585 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.1 Reporter: Xusen Yin Assignee: Xusen Yin Fix For: 1.1.0 Lasso uses LeastSquaresGradient and L1Updater, but diff = brzWeights.dot(brzData) - label in LeastSquaresGradient can produce a very large diff, which then affects the L1Updater and increases the weights exponentially. A small shrinkage value then cannot lasso the weights back to zero. Finally, the weights and losses reach Infinity. For example, with data = (0.5 repeated 10k times) and weights = (0.6 repeated 10k times), data.dot(weights) is about 3,000, so the diff is about 3,000. L1Updater then sets the weights to a similarly large magnitude, and each subsequent iteration amplifies them further. -- This message was sent by Atlassian JIRA (v6.2#6252)
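A quick numeric check of the example above:
{code}
import numpy as np

data = np.full(10000, 0.5)
weights = np.full(10000, 0.6)
# The prediction alone is already ~3,000, so diff = prediction - label
# is huge for any reasonably scaled label.
print(data.dot(weights))  # 3000.0
{code}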
[jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing
Xiangrui Meng created SPARK-1855: Summary: Provide memory-and-local-disk RDD checkpointing Key: SPARK-1855 URL: https://issues.apache.org/jira/browse/SPARK-1855 Project: Spark Issue Type: New Feature Components: MLlib, Spark Core Affects Versions: 1.0.0 Reporter: Xiangrui Meng Checkpointing is used to cut long lineage while maintaining fault tolerance. The current implementation is HDFS-based. Using the BlockRDD we can create in-memory-and-local-disk (with replication) checkpoints that are not as reliable as the HDFS-based solution but faster. It can help applications that require many iterations. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1485) Implement AllReduce
[ https://issues.apache.org/jira/browse/SPARK-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1485: - Fix Version/s: 1.1.0 Implement AllReduce --- Key: SPARK-1485 URL: https://issues.apache.org/jira/browse/SPARK-1485 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.1.0 The current implementations of machine learning algorithms rely on the driver for some computation and data broadcasting. This will create a bottleneck at the driver for both computation and communication, especially in multi-model training. An efficient implementation of AllReduce (or AllAggregate) can help free the driver: allReduce(RDD[T], (T, T) => T): RDD[T] This JIRA is created for discussing how to implement AllReduce efficiently and possible alternatives. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999493#comment-13999493 ] Xiangrui Meng commented on SPARK-1782: -- This sounds good to me. Let's assume that A is m-by-n where n is small (< 1e6). Note that (A^T A) x = (A^T A)^T x can be computed in a single pass, which is sum_i (a_i^T x) a_i. So we don't need to implement A^T y, which simplifies the task. svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not be, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be proportional to the number of non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply a square matrix to be decomposed with a dense vector provided by ARPACK, and returns the resulting dense vector to ARPACK. Inside, ARPACK uses an Implicitly Restarted Lanczos Method for symmetric matrices. Outside, what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can easily fit into the memory of the master machine. The overall model is that the master machine runs ARPACK and distributes the matrix-vector multiplications onto worker executors in each iteration. I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On the Spark/mllib side, we just need to implement the sparse-matrix-vector multiplication. It might take some time to optimize and fully test this implementation, so I set the workload estimate to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252)
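A NumPy sketch of the single-pass identity above (illustrative; MLlib would do this as a distributed aggregate over the rows):
{code}
import numpy as np

def gram_matvec(rows, x):
    # One pass over the rows a_i of A computes
    #   (A^T A) x = sum_i (a_i^T x) a_i,
    # so A^T y never has to be implemented separately.
    result = np.zeros_like(x, dtype=float)
    for a in rows:
        result += a.dot(x) * a
    return result

A = np.array([[1.0, 2.0], [3.0, 4.0]])
x = np.array([1.0, 1.0])
print(gram_matvec(A, x))  # [24. 34.]
print(A.T.dot(A).dot(x))  # same result
{code}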
[jira] [Updated] (SPARK-1580) ALS: Estimate communication and computation costs given a partitioner
[ https://issues.apache.org/jira/browse/SPARK-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1580: - Fix Version/s: 1.1.0 ALS: Estimate communication and computation costs given a partitioner - Key: SPARK-1580 URL: https://issues.apache.org/jira/browse/SPARK-1580 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Tor Myklebust Priority: Minor Fix For: 1.1.0 It would be nice to be able to estimate the amount of work needed to solve an ALS problem. The chief components of this work are computation time---time spent forming and solving the least squares problems---and communication cost---the number of bytes sent across the network. Communication cost depends heavily on how the users and products are partitioned. We currently do not try to cluster users or products so that fewer feature vectors need to be communicated. This is intended as a first step toward that end---we ought to be able to tell whether one partitioning is better than another. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1856) Standardize MLlib interfaces
Xiangrui Meng created SPARK-1856: Summary: Standardize MLlib interfaces Key: SPARK-1856 URL: https://issues.apache.org/jira/browse/SPARK-1856 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Priority: Critical Fix For: 1.1.0 Instead of expanding MLlib based on the current class naming scheme (ProblemWithAlgorithm), we should standardize MLlib's interfaces that clearly separate datasets, formulations, algorithms, parameter sets, and models. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1861) ArrayIndexOutOfBoundsException when reading bzip2 files
Xiangrui Meng created SPARK-1861: Summary: ArrayIndexOutOfBoundsException when reading bzip2 files Key: SPARK-1861 URL: https://issues.apache.org/jira/browse/SPARK-1861 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: Xiangrui Meng Hadoop uses CBZip2InputStream to decode bzip2 files. However, the implementation is not threadsafe and Spark may run multiple tasks in the same JVM, which leads to this error. This is not a problem for Hadoop MapReduce because Hadoop runs each task in a separate JVM. A workaround is to set `SPARK_WORKER_CORES=1` in spark-env.sh for a standalone cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1486: - Priority: Critical (was: Major) Support multi-model training in MLlib - Key: SPARK-1486 URL: https://issues.apache.org/jira/browse/SPARK-1486 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Critical Fix For: 1.1.0 It is rare in practice to train just one model with a given set of parameters. Usually, this is done by training multiple models with different sets of parameters and then selecting the best based on their performance on the validation set. MLlib should provide native support for multi-model training/scoring. It requires decoupling of concepts like problem, formulation, algorithm, parameter set, and model, which are missing in MLlib now. MLI implements similar concepts, which we can borrow. There are different approaches for multi-model training: 0) Keep one copy of the data, and train models one after another (or maybe in parallel, depending on the scheduler). 1) Keep one copy of the data, and train multiple models at the same time (similar to `runs` in KMeans). 2) Make multiple copies of the data (still stored distributively), and use more cores to distribute the work. 3) Collect the data, make the entire dataset available on workers, and train one or more models on each worker. Users should be able to choose which execution mode they want to use. Note that 3) could cover many use cases in practice when the training data is not huge, e.g., 1GB. This task will be divided into sub-tasks and this JIRA is created to discuss the design and track the overall progress. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1553) Support alternating nonnegative least-squares
[ https://issues.apache.org/jira/browse/SPARK-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1553: - Priority: Major (was: Minor) Support alternating nonnegative least-squares - Key: SPARK-1553 URL: https://issues.apache.org/jira/browse/SPARK-1553 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 0.9.0 Reporter: Tor Myklebust Assignee: Tor Myklebust Fix For: 1.1.0 There's already an ALS implementation. It can be tweaked to support nonnegative least-squares by conditionally running a nonnegative least-squares solve instead of a least-squares solver. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK
[ https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000620#comment-14000620 ] Xiangrui Meng commented on SPARK-1782: -- If you need the latest Breeze to use eigs, I would prefer calling ARPACK directly. svd for sparse matrix using ARPACK -- Key: SPARK-1782 URL: https://issues.apache.org/jira/browse/SPARK-1782 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Li Pu Original Estimate: 672h Remaining Estimate: 672h Currently the svd implementation in mllib calls the dense matrix svd in breeze, which has a limitation of fitting n^2 Gram matrix entries in memory (n is the number of rows or number of columns of the matrix, whichever is smaller). In many use cases, the original matrix is sparse but the Gram matrix might not be, and we often need only the largest k singular values/vectors. To make svd really scalable, the memory usage must be proportional to the number of non-zero entries in the matrix. One solution is to call the de facto standard eigen-decomposition package ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a reverse communication interface. The user provides a function to multiply a square matrix to be decomposed with a dense vector provided by ARPACK, and returns the resulting dense vector to ARPACK. Inside, ARPACK uses an Implicitly Restarted Lanczos Method for symmetric matrices. Outside, what we need to provide are two matrix-vector multiplications, first M*x then M^t*x. These multiplications can be done in Spark in a distributed manner. The working memory used by ARPACK is O(n*k). When k (the number of desired singular values) is small, it can easily fit into the memory of the master machine. The overall model is that the master machine runs ARPACK and distributes the matrix-vector multiplications onto worker executors in each iteration. I made a PR to breeze with an ARPACK-backed svds interface (https://github.com/scalanlp/breeze/pull/240). The interface takes anything that can be multiplied by a DenseVector. On the Spark/mllib side, we just need to implement the sparse-matrix-vector multiplication. It might take some time to optimize and fully test this implementation, so I set the workload estimate to 4 weeks. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1861) ArrayIndexOutOfBoundsException when reading bzip2 files
[ https://issues.apache.org/jira/browse/SPARK-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000677#comment-14000677 ] Xiangrui Meng commented on SPARK-1861: -- Patch available at https://issues.apache.org/jira/browse/HADOOP-10614 . ArrayIndexOutOfBoundsException when reading bzip2 files --- Key: SPARK-1861 URL: https://issues.apache.org/jira/browse/SPARK-1861 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Hadoop uses CBZip2InputStream to decode bzip2 files. However, the implementation is not threadsafe and Spark may run multiple tasks in the same JVM, which leads to this error. This is not a problem for Hadoop MapReduce because Hadoop runs each task in a separate JVM. A workaround is to set `SPARK_WORKER_CORES=1` in spark-env.sh for a standalone cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results
[ https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001136#comment-14001136 ] Xiangrui Meng edited comment on SPARK-1859 at 5/18/14 5:27 PM: --- The step size should be smaller than 1 over the Lipschitz constant L. Your example contains a term 0.5 * (1500 * w - 2400)^2, whose Hessian is 1500 * 1500. To make it converge, you need to set the step size smaller than (1.0/1500/1500). Yes, it looks like a simple problem, but it is actually ill-conditioned. scikit-learn may use line search or directly solve the least squares problem, while we didn't implement line search in LinearRegressionWithSGD. You can try LBFGS in the current master, which should work for your example. was (Author: mengxr): The step size should be smaller than the Lipschitz constant L. Your example contains a term 0.5 * (1500 * w - 2400)^2, whose Hessian is 1500 * 1500. To make it converge, you need to set the step size smaller than (1.0/1500/1500). Yes, it looks like a simple problem, but it is actually ill-conditioned. scikit-learn may use line search or directly solve the least squares problem, while we didn't implement line search in LinearRegressionWithSGD. You can try LBFGS in the current master, which should work for your example. Linear, Ridge and Lasso Regressions with SGD yield unexpected results - Key: SPARK-1859 URL: https://issues.apache.org/jira/browse/SPARK-1859 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.1 Environment: OS: Ubuntu Server 12.04 x64 PySpark Reporter: Vlad Frolov Labels: algorithm, machine_learning, regression Issue: Linear Regression with SGD doesn't work as expected on any data but lpsa.dat (the example one). Ridge Regression with SGD *sometimes* works ok. Lasso Regression with SGD *sometimes* works ok. Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 : {code:title=regression_example.py} parsedData = sc.parallelize([ array([2400., 1500.]), array([240., 150.]), array([24., 15.]), array([2.4, 1.5]), array([0.24, 0.15]) ]) # Build the model model = LinearRegressionWithSGD.train(parsedData) print model._coeffs {code} So we have a line ({{f(X) = 1.6 * X}}) here. Fortunately, {{f(X) = X}} works! :) The resulting model has nan coeffs: {{array([ nan])}}. Furthermore, if you comment out the records line by line you will get: * [-1.55897475e+296] coeff (the first record is commented), * [-8.62115396e+104] coeff (the first two records are commented), * etc It looks like the implemented regression algorithms diverge somehow. I get almost the same results on Ridge and Lasso. I've also tested these inputs in scikit-learn and it works as expected there. However, I'm still not sure whether it's a bug or an SGD 'feature'. Should I preprocess my datasets somehow? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results
[ https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001136#comment-14001136 ] Xiangrui Meng commented on SPARK-1859: -- The step size should be smaller than the Lipschitz constant L. Your example contains a term 0.5 * (1500 * w - 2400)^2, whose Hessian is 1500 * 1500. To make it converge, you need to set the step size smaller than (1.0/1500/1500). Yes, it looks like a simple problem, but it is actually ill-conditioned. scikit-learn may use line search or directly solve the least squares problem, while we didn't implement line search in LinearRegressionWithSGD. You can try LBFGS in the current master, which should work for your example. Linear, Ridge and Lasso Regressions with SGD yield unexpected results - Key: SPARK-1859 URL: https://issues.apache.org/jira/browse/SPARK-1859 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 0.9.1 Environment: OS: Ubuntu Server 12.04 x64 PySpark Reporter: Vlad Frolov Labels: algorithm, machine_learning, regression Issue: Linear Regression with SGD doesn't work as expected on any data but lpsa.dat (the example one). Ridge Regression with SGD *sometimes* works ok. Lasso Regression with SGD *sometimes* works ok. Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 : {code:title=regression_example.py} parsedData = sc.parallelize([ array([2400., 1500.]), array([240., 150.]), array([24., 15.]), array([2.4, 1.5]), array([0.24, 0.15]) ]) # Build the model model = LinearRegressionWithSGD.train(parsedData) print model._coeffs {code} So we have a line ({{f(X) = 1.6 * X}}) here. Fortunately, {{f(X) = X}} works! :) The resulting model has nan coeffs: {{array([ nan])}}. Furthermore, if you comment out the records line by line you will get: * [-1.55897475e+296] coeff (the first record is commented), * [-8.62115396e+104] coeff (the first two records are commented), * etc It looks like the implemented regression algorithms diverge somehow. I get almost the same results on Ridge and Lasso. I've also tested these inputs in scikit-learn and it works as expected there. However, I'm still not sure whether it's a bug or an SGD 'feature'. Should I preprocess my datasets somehow? -- This message was sent by Atlassian JIRA (v6.2#6252)
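A small plain-Python check of the step-size argument (not MLlib code): gradient descent on 0.5 * (1500*w - 2400)^2 converges only when the step size is at most about 1/L with L = 1500^2.
{code}
def gd(step, iters=20):
    # f(w) = 0.5 * (1500*w - 2400)^2; f''(w) = 1500^2 is the
    # Lipschitz constant L of the gradient.
    w = 0.0
    for _ in range(iters):
        w -= step * 1500 * (1500 * w - 2400)
    return w

print(gd(1.0 / 1500 / 1500))  # converges to 2400/1500 = 1.6
print(gd(1e-5))               # step > 2/L: the iterates blow up
{code}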
[jira] [Commented] (SPARK-1783) Title contains html code in MLlib guide
[ https://issues.apache.org/jira/browse/SPARK-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001183#comment-14001183 ] Xiangrui Meng commented on SPARK-1783: -- Added a `displayTitle` variable to the global layout. If it is defined, it is used instead of `title` for the page title in `h1`. Title contains html code in MLlib guide --- Key: SPARK-1783 URL: https://issues.apache.org/jira/browse/SPARK-1783 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor We use --- layout: global title: <a href="mllib-guide.html">MLlib</a> - Clustering --- to create a link in the title to the main page of MLlib's guide. However, the generated title contains raw html code, which shows up in the tab or title bar of the browser. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1871) Improve MLlib guide for v1.0
[ https://issues.apache.org/jira/browse/SPARK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1871: - Summary: Improve MLlib guide for v1.0 (was: Improve MLlib guide) Improve MLlib guide for v1.0 Key: SPARK-1871 URL: https://issues.apache.org/jira/browse/SPARK-1871 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1872) Update api links for unidoc
Xiangrui Meng created SPARK-1872: Summary: Update api links for unidoc Key: SPARK-1872 URL: https://issues.apache.org/jira/browse/SPARK-1872 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Reporter: Xiangrui Meng Should use unidoc for API links. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1783) Title contains html code in MLlib guide
[ https://issues.apache.org/jira/browse/SPARK-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1783: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-1871 Title contains html code in MLlib guide --- Key: SPARK-1783 URL: https://issues.apache.org/jira/browse/SPARK-1783 Project: Spark Issue Type: Sub-task Components: Documentation, MLlib Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Priority: Minor We use --- layout: global title: <a href="mllib-guide.html">MLlib</a> - Clustering --- to create a link in the title to the main page of MLlib's guide. However, the generated title contains raw html code, which shows up in the tab or title bar of the browser. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1871) Improve MLlib guide for v1.0
[ https://issues.apache.org/jira/browse/SPARK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1871: - Description: More improvements to MLlib guide. Improve MLlib guide for v1.0 Key: SPARK-1871 URL: https://issues.apache.org/jira/browse/SPARK-1871 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Reporter: Xiangrui Meng More improvements to MLlib guide. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN
[ https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001195#comment-14001195 ] Xiangrui Meng commented on SPARK-1870: -- I specified the jar via `--jars` and added it with `sc.addJar` explicitly. In the Web UI, I see: {code} /mnt/yarn/nm/usercache/ubuntu/appcache/application_1398708946838_0152/container_1398708946838_0152_01_01/hello_2.10.jar System Classpath http://10.45.133.8:43576/jars/hello_2.10.jar Added By User {code} So it is in the distributed cache as well as served by the master via http. However, I still got ClassNotFoundException. Jars specified via --jars in spark-submit are not added to executor classpath for YARN -- Key: SPARK-1870 URL: https://issues.apache.org/jira/browse/SPARK-1870 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Critical With `spark-submit`, jars specified via `--jars` are added to the distributed cache in `yarn-cluster` mode. The executor should add cached jars to the classpath. However, {code} sc.parallelize(0 to 10, 10).map { i => System.getProperty("java.class.path") }.collect().foreach(println) {code} shows only system jars, `app.jar`, and `spark.jar` but not other jars in the distributed cache. The workaround is using an assembly jar. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN
[ https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001195#comment-14001195 ] Xiangrui Meng edited comment on SPARK-1870 at 5/18/14 8:36 PM: --- I specified the jar via `--jars` and added it with `sc.addJar` explicitly. In the Web UI, I see: {code} /mnt/yarn/nm/usercache/ubuntu/appcache/application_1398708946838_0152/ container_1398708946838_0152_01_01/hello_2.10.jar System Classpath http://10.45.133.8:43576/jars/hello_2.10.jar Added By User {code} So it is in the distributed cache as well as served by the master via http. However, I still got ClassNotFoundException. was (Author: mengxr): I specified the jar via `--jars` and added it with `sc.addJar` explicitly. In the Web UI, I see: {code} /mnt/yarn/nm/usercache/ubuntu/appcache/application_1398708946838_0152/container_1398708946838_0152_01_01/hello_2.10.jar System Classpath http://10.45.133.8:43576/jars/hello_2.10.jar Added By User {code} So it is in the distributed cache as well as served by the master via http. However, I still got ClassNotFoundException. Jars specified via --jars in spark-submit are not added to executor classpath for YARN -- Key: SPARK-1870 URL: https://issues.apache.org/jira/browse/SPARK-1870 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Critical With `spark-submit`, jars specified via `--jars` are added to the distributed cache in `yarn-cluster` mode. The executor should add cached jars to the classpath. However, {code} sc.parallelize(0 to 10, 10).map { i => System.getProperty("java.class.path") }.collect().foreach(println) {code} shows only system jars, `app.jar`, and `spark.jar` but not other jars in the distributed cache. The workaround is using an assembly jar. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1871) Improve MLlib guide
[ https://issues.apache.org/jira/browse/SPARK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1871: - Component/s: Documentation Improve MLlib guide --- Key: SPARK-1871 URL: https://issues.apache.org/jira/browse/SPARK-1871 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1871) Improve MLlib guide
Xiangrui Meng created SPARK-1871: Summary: Improve MLlib guide Key: SPARK-1871 URL: https://issues.apache.org/jira/browse/SPARK-1871 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1874) Clean up MLlib sample data
[ https://issues.apache.org/jira/browse/SPARK-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001254#comment-14001254 ] Xiangrui Meng commented on SPARK-1874: -- Is `data/mllib` a better place than `mllib/data`? Clean up MLlib sample data -- Key: SPARK-1874 URL: https://issues.apache.org/jira/browse/SPARK-1874 Project: Spark Issue Type: Bug Components: MLlib Reporter: Matei Zaharia Fix For: 1.0.0 - Replace logistic regression example data with linear to make mllib.LinearRegression example easier to run - Move files from mllib/data into data/mllib to make them easier to find - Add a simple MovieLens data file -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1874) Clean up MLlib sample data
[ https://issues.apache.org/jira/browse/SPARK-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002535#comment-14002535 ] Xiangrui Meng commented on SPARK-1874: -- There are three files under `data/`: `kmeans_data.txt`, `lr_data.txt`, and `pagerank_data.txt`, while there are more files under `mllib/data`. It feels more natural to me to keep the sample data under `mllib/data`. Anyway, I will create sample data first. Clean up MLlib sample data -- Key: SPARK-1874 URL: https://issues.apache.org/jira/browse/SPARK-1874 Project: Spark Issue Type: Bug Components: MLlib Reporter: Matei Zaharia Fix For: 1.0.0 - Replace logistic regression example data with linear to make mllib.LinearRegression example easier to run - Move files from mllib/data into data/mllib to make them easier to find - Add a simple MovieLens data file -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN
[ https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002613#comment-14002613 ] Xiangrui Meng commented on SPARK-1870: -- I tested it on a Spark 1.0RC standalone cluster but didn't find the same problem. Classes are loaded properly. Jars specified via --jars in spark-submit are not added to executor classpath for YARN -- Key: SPARK-1870 URL: https://issues.apache.org/jira/browse/SPARK-1870 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Critical With `spark-submit`, jars specified via `--jars` are added to the distributed cache in `yarn-cluster` mode. The executor should add cached jars to the classpath. However, {code} sc.parallelize(0 to 10, 10).map { i => System.getProperty("java.class.path") }.collect().foreach(println) {code} shows only system jars, `app.jar`, and `spark.jar` but not other jars in the distributed cache. The workaround is using an assembly jar. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1861) ArrayIndexOutOfBoundsException when reading bzip2 files
[ https://issues.apache.org/jira/browse/SPARK-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1861. -- Resolution: Implemented The patch will be included in the next Hadoop releases (1.3.0, 2.5.0). ArrayIndexOutOfBoundsException when reading bzip2 files --- Key: SPARK-1861 URL: https://issues.apache.org/jira/browse/SPARK-1861 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.0, 1.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Hadoop uses CBZip2InputStream to decode bzip2 files. However, the implementation is not threadsafe and Spark may run multiple tasks in the same JVM, which leads to this error. This is not a problem for Hadoop MapReduce because Hadoop runs each task in a separate JVM. A workaround is to set `SPARK_WORKER_CORES=1` in spark-env.sh for a standalone cluster. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN
[ https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1870: - Issue Type: Sub-task (was: Bug) Parent: SPARK-1905 Jars specified via --jars in spark-submit are not added to executor classpath for YARN -- Key: SPARK-1870 URL: https://issues.apache.org/jira/browse/SPARK-1870 Project: Spark Issue Type: Sub-task Components: YARN Reporter: Xiangrui Meng Priority: Critical Fix For: 1.0.0 With `spark-submit`, jars specified via `--jars` are added to the distributed cache in `yarn-cluster` mode. The executor should add cached jars to the classpath. However, {code} sc.parallelize(0 to 10, 10).map { i => System.getProperty("java.class.path") }.collect().foreach(println) {code} shows only system jars, `app.jar`, and `spark.jar` but not other jars in the distributed cache. The workaround is using an assembly jar. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1906) spark-submit doesn't send master URL to Driver in standalone cluster mode
Xiangrui Meng created SPARK-1906: Summary: spark-submit doesn't send master URL to Driver in standalone cluster mode Key: SPARK-1906 URL: https://issues.apache.org/jira/browse/SPARK-1906 Project: Spark Issue Type: Sub-task Components: Deploy Reporter: Xiangrui Meng Priority: Minor `spark-submit` doesn't send `spark.master` to DriverWrapper. So the latter creates an empty SparkConf and throws an exception: {code} A master URL must be set in your configuration {code} The workaround is setting master explicitly in the user application. -- This message was sent by Atlassian JIRA (v6.2#6252)
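A sketch of the workaround in PySpark terms; the app name and master URL are placeholders:
{code}
from pyspark import SparkConf, SparkContext

# Workaround: set the master explicitly in the application instead of
# relying on spark-submit to forward spark.master to the driver.
conf = SparkConf().setAppName("my-app").setMaster("spark://host:7077")
sc = SparkContext(conf=conf)
{code}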
[jira] [Created] (SPARK-1908) Support local app jar in standalone cluster mode
Xiangrui Meng created SPARK-1908: Summary: Support local app jar in standalone cluster mode Key: SPARK-1908 URL: https://issues.apache.org/jira/browse/SPARK-1908 Project: Spark Issue Type: Sub-task Components: Deploy Reporter: Xiangrui Meng Priority: Minor Standalone cluster mode only supports an app jar with a URL that is accessible from all cluster nodes, e.g., a jar on HDFS or a local file that is available on each cluster node. It would be nice to support an app jar that is only available on the client. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-1909) --jars is not supported in standalone cluster mode
Xiangrui Meng created SPARK-1909: Summary: --jars is not supported in standalone cluster mode Key: SPARK-1909 URL: https://issues.apache.org/jira/browse/SPARK-1909 Project: Spark Issue Type: Sub-task Components: Deploy Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Minor --jars is not processed in `spark-submit` for standalone cluster mode. The workaround is building an assembly app jar. It might be easy to support user jars that are accessible from a cluster node by setting the `spark.jars` property correctly and passing it to `DriverWrapper`. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1900) Fix running PySpark files on YARN
[ https://issues.apache.org/jira/browse/SPARK-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1900: - Issue Type: Sub-task (was: Bug) Parent: SPARK-1652 Fix running PySpark files on YARN -- Key: SPARK-1900 URL: https://issues.apache.org/jira/browse/SPARK-1900 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Fix For: 1.0.0 If I run the following on a YARN cluster {code} bin/spark-submit sheep.py --master yarn-client {code} it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file: {code} bin/spark-submit file:/path/to/sheep.py --master yarn-client {code} However, this also fails. This time it is because python does not understand URI schemes. This PR fixes this by automatically resolving all paths passed as command-line arguments to `spark-submit` properly. This has the added benefit of keeping file and jar paths consistent across different cluster modes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1900) Fix running PySpark files on YARN
[ https://issues.apache.org/jira/browse/SPARK-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1900: - Issue Type: Bug (was: Sub-task) Parent: (was: SPARK-1905) Fix running PySpark files on YARN -- Key: SPARK-1900 URL: https://issues.apache.org/jira/browse/SPARK-1900 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Fix For: 1.0.0 If I run the following on a YARN cluster {code} bin/spark-submit sheep.py --master yarn-client {code} it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file: {code} bin/spark-submit file:/path/to/sheep.py --master yarn-client {code} However, this also fails. This time it is because python does not understand URI schemes. This PR fixes this by automatically resolving all paths passed as command-line arguments to `spark-submit` properly. This has the added benefit of keeping file and jar paths consistent across different cluster modes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN
[ https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1870: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-1652 Jars specified via --jars in spark-submit are not added to executor classpath for YARN -- Key: SPARK-1870 URL: https://issues.apache.org/jira/browse/SPARK-1870 Project: Spark Issue Type: Sub-task Components: YARN Reporter: Xiangrui Meng Priority: Critical Fix For: 1.0.0 With `spark-submit`, jars specified via `--jars` are added to the distributed cache in `yarn-cluster` mode. The executor should add the cached jars to its classpath. However, {code} sc.parallelize(0 to 10, 10).map { i => System.getProperty("java.class.path") }.collect().foreach(println) {code} shows only system jars, `app.jar`, and `spark.jar`, but not the other jars in the distributed cache. The workaround is using an assembly jar. -- This message was sent by Atlassian JIRA (v6.2#6252)
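A quick way to confirm the symptom from a Spark shell (class and jar names are hypothetical): try to load a class shipped via `--jars` inside the executors; under this bug the load fails even though the jar sits in the distributed cache:
{code}
// Hypothetical check: com.example.Helper lives in a jar passed via --jars.
// On an affected cluster every task prints "missing", because the cached
// jar was never added to the executor classpath.
sc.parallelize(0 until 4, 4).map { _ =>
  try { Class.forName("com.example.Helper"); "loaded" }
  catch { case _: ClassNotFoundException => "missing" }
}.collect().foreach(println)
{code}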
[jira] [Updated] (SPARK-1906) spark-submit doesn't send master URL to Driver in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1906: - Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-1905) spark-submit doesn't send master URL to Driver in standalone cluster mode - Key: SPARK-1906 URL: https://issues.apache.org/jira/browse/SPARK-1906 Project: Spark Issue Type: Improvement Components: Deploy Reporter: Xiangrui Meng Priority: Minor `spark-submit` doesn't send `spark.master` to `DriverWrapper`, so the latter creates an empty `SparkConf` and throws an exception: {code} A master URL must be set in your configuration {code} The workaround is setting the master URL explicitly in the user application. -- This message was sent by Atlassian JIRA (v6.2#6252)
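A minimal sketch of that workaround (the master URL and app name are placeholders): hard-code the master in the application's `SparkConf` so the driver no longer depends on `spark-submit` propagating `spark.master`:
{code}
import org.apache.spark.{SparkConf, SparkContext}

object App {
  def main(args: Array[String]): Unit = {
    // Workaround for SPARK-1906: set the master explicitly instead of
    // relying on spark-submit to pass spark.master to DriverWrapper.
    val conf = new SparkConf()
      .setAppName("app")
      .setMaster("spark://master:7077") // placeholder master URL
    val sc = new SparkContext(conf)
    // ... job logic ...
    sc.stop()
  }
}
{code}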
[jira] [Updated] (SPARK-1909) --jars is not supported in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1909: - Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-1905) --jars is not supported in standalone cluster mode Key: SPARK-1909 URL: https://issues.apache.org/jira/browse/SPARK-1909 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 1.0.0 Reporter: Xiangrui Meng Priority: Minor --jars is not processed by `spark-submit` in standalone cluster mode. The workaround is building an assembly app jar. It might be easy to support user jars that are accessible from the cluster nodes by setting the `spark.jars` property correctly and passing it to `DriverWrapper`. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1908) Support local app jar in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1908: - Issue Type: Bug (was: Sub-task) Parent: (was: SPARK-1905) Support local app jar in standalone cluster mode Key: SPARK-1908 URL: https://issues.apache.org/jira/browse/SPARK-1908 Project: Spark Issue Type: Bug Components: Deploy Reporter: Xiangrui Meng Priority: Minor Standalone cluster mode only supports an app jar with a URL that is accessible from all cluster nodes, e.g., a jar on HDFS or a local file that is available on every cluster node. It would be nice to support an app jar that is only available on the client. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1908) Support local app jar in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1908: - Issue Type: Sub-task (was: Bug) Parent: SPARK-1652 Support local app jar in standalone cluster mode Key: SPARK-1908 URL: https://issues.apache.org/jira/browse/SPARK-1908 Project: Spark Issue Type: Sub-task Components: Deploy Reporter: Xiangrui Meng Priority: Minor Standalone cluster mode only supports an app jar with a URL that is accessible from all cluster nodes, e.g., a jar on HDFS or a local file that is available on every cluster node. It would be nice to support an app jar that is only available on the client. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1906) spark-submit doesn't send master URL to Driver in standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-1906: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-1652 spark-submit doesn't send master URL to Driver in standalone cluster mode - Key: SPARK-1906 URL: https://issues.apache.org/jira/browse/SPARK-1906 Project: Spark Issue Type: Sub-task Components: Deploy Reporter: Xiangrui Meng Priority: Minor `spark-submit` doesn't send `spark.master` to `DriverWrapper`, so the latter creates an empty `SparkConf` and throws an exception: {code} A master URL must be set in your configuration {code} The workaround is setting the master URL explicitly in the user application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-1905) Issues with `spark-submit`
[ https://issues.apache.org/jira/browse/SPARK-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-1905. -- Resolution: Duplicate Issues with `spark-submit` -- Key: SPARK-1905 URL: https://issues.apache.org/jira/browse/SPARK-1905 Project: Spark Issue Type: Task Components: Deploy Reporter: Xiangrui Meng Main JIRA for tracking the issues with `spark-submit` in different deploy modes. For deploy modes, `client` means running the driver on the client side (outside the cluster), while `cluster` means running the driver inside the cluster. Ideally, we want `spark-submit` to support all deploy modes in a unified way, which includes:
* documented
** local client
** standalone client
** standalone cluster
** yarn client
** yarn cluster
** mesos client
* undocumented (for testing only)
** local cluster
* not supported
** mesos cluster
In each deploy mode, we also need to test launching Python jobs. -- This message was sent by Atlassian JIRA (v6.2#6252)
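For reference, a rough sketch of how the mode matrix above maps onto `spark-submit` flags in Spark 1.0 (host name and jar path are placeholders):
{code}
# client mode: driver runs on the submitting machine
bin/spark-submit --master spark://master:7077 --deploy-mode client app.jar
bin/spark-submit --master yarn-client app.jar

# cluster mode: driver runs inside the cluster
bin/spark-submit --master spark://master:7077 --deploy-mode cluster app.jar
bin/spark-submit --master yarn-cluster app.jar

# local mode: driver and executors share one JVM on the client
bin/spark-submit --master local[4] app.jar
{code}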