[jira] [Resolved] (SPARK-958) When iteration in ALS increases to 10 running in local mode, spark throws out error of StackOverflowError

2014-04-01 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-958.
-

Resolution: Duplicate

 When iteration in ALS increases to 10 running in local mode, spark throws out 
 error of StackOverflowError
 -

 Key: SPARK-958
 URL: https://issues.apache.org/jira/browse/SPARK-958
 Project: Spark
  Issue Type: Bug
Reporter: Qiuzhuang Lian

 I tried to use the ml-100k data to test ALS running in local mode in the mllib project. 
 If I specify fewer than 10 iterations, it works well. However, when the number of 
 iterations is increased to more than 10, Spark throws a StackOverflowError.
 Attached is the log file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1410) Class not found exception with application launched from sbt 0.13.x

2014-04-03 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1410:


 Summary: Class not found exception with application launched from 
sbt 0.13.x
 Key: SPARK-1410
 URL: https://issues.apache.org/jira/browse/SPARK-1410
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Xiangrui Meng


sbt 0.13.x uses its own class loader, which is not available on the worker side:

org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: 
sbt.classpath.ClasspathFilter@47ed40d

A workaround is to switch to sbt 0.12.4.
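For a standard sbt project layout, the workaround amounts to pinning the launcher version in project/build.properties (illustrative sketch):

{code}
# project/build.properties -- pin sbt to the 0.12.x line as a workaround
sbt.version=0.12.4
{code}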



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1410) Class not found exception with application launched from sbt 0.13.x

2014-04-03 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13959363#comment-13959363
 ] 

Xiangrui Meng commented on SPARK-1410:
--

The code is available at https://github.com/mengxr/mllib-grid-search

If sbt.version is changed to 0.13.0 or 0.13.1, `sbt/sbt run ...` throws the 
following exception:

~~~
[error] (run-main-0) org.apache.spark.SparkException: Job aborted: 
ClassNotFound with classloader: sbt.classpath.ClasspathFilter@45307fb
org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: 
sbt.classpath.ClasspathFilter@45307fb
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1010)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1008)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1008)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:595)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:595)
at scala.Option.foreach(Option.scala:236)
at 
org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:595)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:146)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
~~~

 Class not found exception with application launched from sbt 0.13.x
 ---

 Key: SPARK-1410
 URL: https://issues.apache.org/jira/browse/SPARK-1410
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Xiangrui Meng

 sbt 0.13.x uses its own class loader, which is not available on the worker side:
 org.apache.spark.SparkException: Job aborted: ClassNotFound with classloader: 
 sbt.classpath.ClasspathFilter@47ed40d
 A workaround is to switch to sbt 0.12.4.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1434) Make labelParser Java friendly.

2014-04-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1434:
-

Component/s: MLlib

 Make labelParser Java friendly.
 ---

 Key: SPARK-1434
 URL: https://issues.apache.org/jira/browse/SPARK-1434
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor
 Fix For: 1.0.0


 MLUtils#loadLibSVMData uses an anonymous function for the label parser, which is 
 awkward for Java users. So I made a trait for LabelParser and provided two 
 implementations: binary and multiclass.
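 A minimal sketch of what such a trait could look like (illustrative names, not necessarily the merged MLlib API):

{code}
// Illustrative sketch; the actual names in MLlib may differ.
trait LabelParser extends Serializable {
  def parse(labelString: String): Double
}

// Binary labels: non-positive values map to 0.0, positive values to 1.0.
object BinaryLabelParser extends LabelParser {
  override def parse(labelString: String): Double =
    if (labelString.toDouble > 0) 1.0 else 0.0
}

// Multiclass labels: keep the numeric value as-is.
object MulticlassLabelParser extends LabelParser {
  override def parse(labelString: String): Double = labelString.toDouble
}
{code}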



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib

2014-04-07 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13962048#comment-13962048
 ] 

Xiangrui Meng commented on SPARK-1406:
--

I think we should support PMML import/export in MLlib. PMML also provides 
feature transformations, for which MLlib has very limited support at this time. The 
questions are 1) how we can leverage existing PMML packages, and 2) how many 
people will volunteer.

Sean, it would be super helpful if you could share some experience with Oryx's PMML 
support, since I'm also not sure whether this is the right time to start.

 PMML model evaluation support via MLib
 --

 Key: SPARK-1406
 URL: https://issues.apache.org/jira/browse/SPARK-1406
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Thomas Darimont

 It would be useful if Spark provided support for the evaluation of PMML 
 models (http://www.dmg.org/v4-2/GeneralStructure.html).
 This would allow analytical models created with a statistical modeling tool like 
 R, SAS, or SPSS to be used with Spark (MLlib), which would perform the actual 
 model evaluation for a given input tuple. The PMML model would then just contain 
 the parameterization of an analytical model.
 Other projects like JPMML-Evaluator do a similar thing.
 https://github.com/jpmml/jpmml/tree/master/pmml-evaluator



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1218) Minibatch SGD with random sampling

2014-04-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1218.
--

   Resolution: Fixed
Fix Version/s: 0.9.0

Fixed in 0.9.0 or an earlier version.

 Minibatch SGD with random sampling
 --

 Key: SPARK-1218
 URL: https://issues.apache.org/jira/browse/SPARK-1218
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ameet Talwalkar
Assignee: Shivaram Venkataraman
 Fix For: 0.9.0


 Takes a gradient function as input.  At each iteration, we run stochastic 
 gradient descent locally on each worker with a fraction of the data points 
 selected randomly and with replacement (i.e., sampled points may overlap 
 across iterations).
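 A hedged sketch of the sampling step described above, using the plain RDD.sample API (`data`, `numIterations`, and the fraction are assumptions for illustration):

{code}
// Sketch: draw one mini-batch per iteration, sampled randomly with replacement,
// so the same point may appear more than once within and across iterations.
val miniBatchFraction = 0.1
for (iter <- 1 to numIterations) {
  val miniBatch = data.sample(withReplacement = true, fraction = miniBatchFraction, seed = 42L + iter)
  // compute gradients on miniBatch locally on each worker and update the weights
}
{code}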



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1217) Add proximal gradient updater.

2014-04-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1217.
--

   Resolution: Fixed
Fix Version/s: 0.9.0

 Add proximal gradient updater.
 --

 Key: SPARK-1217
 URL: https://issues.apache.org/jira/browse/SPARK-1217
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Ameet Talwalkar
 Fix For: 0.9.0


 Add proximal gradient updater, in particular for L1 regularization.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1219) Minibatch SGD with disjoint partitions

2014-04-07 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1219.
--

Resolution: Fixed

Implemented in 0.9.0 or an earlier version.

 Minibatch SGD with disjoint partitions
 --

 Key: SPARK-1219
 URL: https://issues.apache.org/jira/browse/SPARK-1219
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ameet Talwalkar

 Takes a gradient function as input.  At each iteration, we run stochastic 
 gradient descent locally on each worker with a fraction (alpha) of the data 
 points selected randomly and disjointly (i.e., we ensure that we touch all 
 datapoints after at most 1/alpha iterations).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1357) [MLLIB] Annotate developer and experimental API's

2014-04-09 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13964405#comment-13964405
 ] 

Xiangrui Meng commented on SPARK-1357:
--

Hi Sean, 

Actually, you came in just in time. This was only the first pass, and we are 
still accepting API visibility/annotation patches during the QA period. MLlib 
is still a beta component of Spark, so 1.0 doesn't mean it is stable. And we 
still accept additions (JIRAs submitted before April 1) to MLlib, as Patrick 
announced on the dev mailing list.

(I do want to mark all of MLlib experimental to reserve the right to change in 
the future, but we need to find a balance point here.)

I agree that switching the id type from Int to Long in ALS is more future-proof. The 
extra storage requirement is 8 bytes per rating. Inside ALS, we also 
re-partition the ratings, which needs extra storage. We need to consider 
whether we want to switch to Long completely or provide an option to use Long 
ids. Could you submit a patch, either marking ALS experimental or allowing 
Long ids?

I don't think a String id type is necessary because we can always create a map 
between String ids and Long ids. A String id usually costs more than a Long id. 
For the same reason, classification uses Double for labels.
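For illustration, a hedged sketch of building such a map with plain RDD operations (not an existing MLlib API), assuming ratings arrive as (user, product, rating) triples with String ids:

{code}
import org.apache.spark.SparkContext._  // pair-RDD implicits in Spark 1.x

// Sketch: assign Long ids to the String ids and translate the ratings.
// Assumes ratings: RDD[(String, String, Double)].
val userIds = ratings.map(_._1).distinct().zipWithUniqueId().collectAsMap()
val productIds = ratings.map(_._2).distinct().zipWithUniqueId().collectAsMap()
val longIdRatings = ratings.map { case (user, product, r) =>
  (userIds(user), productIds(product), r)
}
{code}

For very large id sets one would join against the id RDDs instead of collecting the maps to the driver, but this shows the idea.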

Please submit a patch for any APIs that you are not comfortable calling stable, or 
that I marked experimental/developer but you think should be otherwise. It would be 
great to keep the discussion going. Thanks!

Best,
Xiangrui

 [MLLIB] Annotate developer and experimental API's
 -

 Key: SPARK-1357
 URL: https://issues.apache.org/jira/browse/SPARK-1357
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Patrick Wendell
Assignee: Xiangrui Meng
 Fix For: 1.0.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1215) Clustering: Index out of bounds error

2014-04-15 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13969718#comment-13969718
 ] 

Xiangrui Meng commented on SPARK-1215:
--

The error was due to a small number of points and a large k. The k-means|| 
initialization doesn't collect more than k candidates. This is very unlikely to 
appear in practice because k is usually much smaller than the number of points. I will 
revisit this issue once we implement better weighted sampling algorithms.

 Clustering: Index out of bounds error
 -

 Key: SPARK-1215
 URL: https://issues.apache.org/jira/browse/SPARK-1215
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: dewshick
Assignee: Xiangrui Meng
Priority: Minor

 code:
 import org.apache.spark.mllib.clustering._
 val test = sc.makeRDD(Array(4, 4, 4, 4, 4).map(e => Array(e.toDouble)))
 val kmeans = new KMeans().setK(4)
 kmeans.run(test) fails with java.lang.ArrayIndexOutOfBoundsException
 error:
 14/01/17 12:35:54 INFO scheduler.DAGScheduler: Stage 25 (collectAsMap at 
 KMeans.scala:243) finished in 0.047 s
 14/01/17 12:35:54 INFO spark.SparkContext: Job finished: collectAsMap at 
 KMeans.scala:243, took 16.389537116 s
 Exception in thread main java.lang.reflect.InvocationTargetException
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at com.simontuffs.onejar.Boot.run(Boot.java:340)
   at com.simontuffs.onejar.Boot.main(Boot.java:166)
 Caused by: java.lang.ArrayIndexOutOfBoundsException: -1
   at 
 org.apache.spark.mllib.clustering.LocalKMeans$.kMeansPlusPlus(LocalKMeans.scala:47)
   at 
 org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:247)
   at 
 org.apache.spark.mllib.clustering.KMeans$$anonfun$19.apply(KMeans.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
   at scala.collection.immutable.Range.foreach(Range.scala:81)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
   at scala.collection.immutable.Range.map(Range.scala:46)
   at 
 org.apache.spark.mllib.clustering.KMeans.initKMeansParallel(KMeans.scala:244)
   at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:124)
   at Clustering$$anonfun$1.apply$mcDI$sp(Clustering.scala:21)
   at Clustering$$anonfun$1.apply(Clustering.scala:19)
   at Clustering$$anonfun$1.apply(Clustering.scala:19)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:233)
   at scala.collection.immutable.Range.foreach(Range.scala:78)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
   at scala.collection.immutable.Range.map(Range.scala:46)
   at Clustering$.main(Clustering.scala:19)
   at Clustering.main(Clustering.scala)
   ... 6 more



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1503) Implement Nesterov's accelerated first-order method

2014-04-15 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1503:


 Summary: Implement Nesterov's accelerated first-order method
 Key: SPARK-1503
 URL: https://issues.apache.org/jira/browse/SPARK-1503
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


Nesterov's accelerated first-order method is a drop-in replacement for steepest 
descent but it converges much faster. We should implement this method and 
compare its performance with existing algorithms, including SGD and L-BFGS.

TFOCS (http://cvxr.com/tfocs/) is a reference implementation of Nesterov's 
method and its variants on composite objectives.
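A rough sketch of the update (standard FISTA-style acceleration on a smooth objective; the function and parameter names are illustrative, not the proposed MLlib API):

{code}
// Illustrative sketch of Nesterov's accelerated gradient for a smooth objective.
// `gradient`, `stepSize` (roughly 1/L), and `numIters` are assumptions for the sketch.
def nesterov(gradient: Array[Double] => Array[Double],
             x0: Array[Double], stepSize: Double, numIters: Int): Array[Double] = {
  var x = x0.clone()
  var y = x0.clone()
  var t = 1.0
  for (_ <- 1 to numIters) {
    val g = gradient(y)
    val xNext = y.zip(g).map { case (yi, gi) => yi - stepSize * gi }   // gradient step at y
    val tNext = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0             // momentum schedule
    y = xNext.zip(x).map { case (xn, xo) => xn + (t - 1.0) / tNext * (xn - xo) }
    x = xNext
    t = tNext
  }
  x
}
{code}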



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1506) Documentation improvements for MLlib 1.0

2014-04-15 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1506:


 Summary: Documentation improvements for MLlib 1.0
 Key: SPARK-1506
 URL: https://issues.apache.org/jira/browse/SPARK-1506
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Blocker


Proposed TOC:

Linear algebra

* vector and matrix
* distributed matrix

Classification and regression

* generalized linear models
* support vector machine
* decision tree
* naive Bayes

Collaborative filtering

* alternating least squares (ALS)

Clustering

* k-means

Dimension reduction

* principal component analysis (PCA)

Optimization

* stochastic gradient descent
* limited-memory BFGS



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1520) Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6

2014-04-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973291#comment-13973291
 ] 

Xiangrui Meng commented on SPARK-1520:
--

I'm using the Java 6 JDK located at 
/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home on a Mac. It 
can create a jar with more than 65536 files. I also found this JIRA:

https://bugs.openjdk.java.net/browse/JDK-4828461 (Support Zip files with more 
than 64k entries)

which was fixed in version 6. Note that this is for OpenJDK.

I'm going to check the headers of the assembly jars created by Java 6 and Java 7.

 Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
 -

 Key: SPARK-1520
 URL: https://issues.apache.org/jira/browse/SPARK-1520
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Spark Core
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 This is a real doozie - when compiling a Spark assembly with JDK7, the 
 produced jar does not work well with JRE6. I confirmed the byte code being 
 produced is JDK 6 compatible (major version 50). What happens is that, 
 silently, the JRE will not load any class files from the assembled jar.
 {code}
 $ sbt/sbt assembly/assembly
 $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] 
 [FIFO|FAIR]
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 Exception in thread main java.lang.NoClassDefFoundError: 
 org/apache/spark/ui/UIWorkloadGenerator
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.ui.UIWorkloadGenerator
   at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
 Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. 
 Program will exit.
 {code}
 I also noticed that if the jar is unzipped, and the classpath set to the 
 current directory, it just works. Finally, if the assembly jar is
 compiled with JDK6, it also works. The error is seen with any class, not just 
 the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only 
 in master.
 *Isolation*
 -I ran a git bisection and this appeared after the MLLib sparse vector patch 
 was merged:-
 https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534
 SPARK-1212
 -I narrowed this down specifically to the inclusion of the breeze library. 
 Just adding breeze to an older (unaffected) build triggered the issue.-
 I've found that if I just unpack and re-pack the jar (using `jar` from java 6 
 or 7) it always works:
 {code}
 $ cd assembly/target/scala-2.10/
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # fails
 $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
 $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar *
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # succeeds
 {code}
 I also noticed something of note. The Breeze package contains single 
 directories that have huge numbers of files in them (e.g. 2000+ class files 
 in one directory). It's possible we are hitting some weird bugs/corner cases 
 with compatibility of the internal storage format of the jar itself.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1520) Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6

2014-04-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973306#comment-13973306
 ] 

Xiangrui Meng commented on SPARK-1520:
--

When I try to use jar-1.6 to list the assembly jar created by Java 7, I get:

~~~
java.util.zip.ZipException: invalid CEN header (bad signature)
at java.util.zip.ZipFile.open(Native Method)
at java.util.zip.ZipFile.init(ZipFile.java:128)
at java.util.zip.ZipFile.init(ZipFile.java:89)
at sun.tools.jar.Main.list(Main.java:977)
at sun.tools.jar.Main.run(Main.java:222)
at sun.tools.jar.Main.main(Main.java:1147)
~~~

7z shows:

~~~
Path = spark-assembly-1.6.jar
Type = zip
Physical Size = 119682511

Path = spark-assembly-1.7.jar
Type = zip
64-bit = +
Physical Size = 119682587
~~~

I think the limit on the number of files was already increased in Java 6 (at least in 
the latest update), but Java 7 uses the zip64 format for more than 64k files, 
and that format cannot be read by Java 6.
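
A hedged way to check how close an assembly is to the 65536-entry boundary is to count the entries with the standard ZipFile API (run with Java 7; on a zip64 jar, Java 6's ZipFile fails with the same "invalid CEN header" error):

{code}
// Sketch: count the entries in the assembly jar; above 65535 entries Java 7 switches to zip64.
import java.util.zip.ZipFile
val jar = new ZipFile("assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar")
println("entries = " + jar.size())
jar.close()
{code}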

 Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
 -

 Key: SPARK-1520
 URL: https://issues.apache.org/jira/browse/SPARK-1520
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Spark Core
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 This is a real doozie - when compiling a Spark assembly with JDK7, the 
 produced jar does not work well with JRE6. I confirmed the byte code being 
 produced is JDK 6 compatible (major version 50). What happens is that, 
 silently, the JRE will not load any class files from the assembled jar.
 {code}
 $ sbt/sbt assembly/assembly
 $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] 
 [FIFO|FAIR]
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 Exception in thread main java.lang.NoClassDefFoundError: 
 org/apache/spark/ui/UIWorkloadGenerator
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.ui.UIWorkloadGenerator
   at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
 Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. 
 Program will exit.
 {code}
 I also noticed that if the jar is unzipped, and the classpath set to the 
 current directory, it just works. Finally, if the assembly jar is
 compiled with JDK6, it also works. The error is seen with any class, not just 
 the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only 
 in master.
 *Isolation*
 -I ran a git bisection and this appeared after the MLLib sparse vector patch 
 was merged:-
 https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534
 SPARK-1212
 -I narrowed this down specifically to the inclusion of the breeze library. 
 Just adding breeze to an older (unaffected) build triggered the issue.-
 I've found that if I just unpack and re-pack the jar (using `jar` from java 6 
 or 7) it always works:
 {code}
 $ cd assembly/target/scala-2.10/
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # fails
 $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
 $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar *
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # succeeds
 {code}
 I also noticed something of note. The Breeze package contains single 
 directories that have huge numbers of files in them (e.g. 2000+ class files 
 in one directory). It's possible we are hitting some weird bugs/corner cases 
 with compatibility of the internal storage format of the jar itself.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1520) Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6

2014-04-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973326#comment-13973326
 ] 

Xiangrui Meng edited comment on SPARK-1520 at 4/17/14 7:59 PM:
---

The quick fix may be removing fastutil, so Java 7 still generates the assembly 
jar in zip format instead of zip64.

In RDD#countApproxDistinct, we use HyperLogLog from 
com.clearspring.analytics:stream, which depends on fastutil. If this is the 
only place that introduces the fastutil dependency, we should implement HyperLogLog 
ourselves and remove fastutil completely from Spark's dependencies.


was (Author: mengxr):
The quick fix may be removing fastutil.

In RDD#countApproxDistinct, we use HyperLogLog from 
com.clearspring.analytics:stream, which depends on fastutil. If this is the 
only place that introduces the fastutil dependency, we should implement HyperLogLog 
ourselves and remove fastutil completely from Spark's dependencies.

 Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
 -

 Key: SPARK-1520
 URL: https://issues.apache.org/jira/browse/SPARK-1520
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Spark Core
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 This is a real doozie - when compiling a Spark assembly with JDK7, the 
 produced jar does not work well with JRE6. I confirmed the byte code being 
 produced is JDK 6 compatible (major version 50). What happens is that, 
 silently, the JRE will not load any class files from the assembled jar.
 {code}
 $ sbt/sbt assembly/assembly
 $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] 
 [FIFO|FAIR]
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 Exception in thread main java.lang.NoClassDefFoundError: 
 org/apache/spark/ui/UIWorkloadGenerator
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.ui.UIWorkloadGenerator
   at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
 Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. 
 Program will exit.
 {code}
 I also noticed that if the jar is unzipped, and the classpath set to the 
 current directory, it just works. Finally, if the assembly jar is
 compiled with JDK6, it also works. The error is seen with any class, not just 
 the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only 
 in master.
 *Isolation*
 -I ran a git bisection and this appeared after the MLLib sparse vector patch 
 was merged:-
 https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534
 SPARK-1212
 -I narrowed this down specifically to the inclusion of the breeze library. 
 Just adding breeze to an older (unaffected) build triggered the issue.-
 I've found that if I just unpack and re-pack the jar (using `jar` from java 6 
 or 7) it always works:
 {code}
 $ cd assembly/target/scala-2.10/
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # fails
 $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
 $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar *
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # succeeds
 {code}
 I also noticed something of note. The Breeze package contains single 
 directories that have huge numbers of files in them (e.g. 2000+ class files 
 in one directory). It's possible we are hitting some weird bugs/corner cases 
 with compatibility of the internal storage format of the jar itself.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1520) Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6

2014-04-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973326#comment-13973326
 ] 

Xiangrui Meng commented on SPARK-1520:
--

The quick fix may be removing fastutil.

In RDD#countApproxDistinct, we use HyperLogLog from 
com.clearspring.analytics:stream, which depends on fastutil. If this is the 
only place that introduces the fastutil dependency, we should implement HyperLogLog 
ourselves and remove fastutil completely from Spark's dependencies.

 Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
 -

 Key: SPARK-1520
 URL: https://issues.apache.org/jira/browse/SPARK-1520
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Spark Core
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 This is a real doozie - when compiling a Spark assembly with JDK7, the 
 produced jar does not work well with JRE6. I confirmed the byte code being 
 produced is JDK 6 compatible (major version 50). What happens is that, 
 silently, the JRE will not load any class files from the assembled jar.
 {code}
 $ sbt/sbt assembly/assembly
 $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] 
 [FIFO|FAIR]
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 Exception in thread main java.lang.NoClassDefFoundError: 
 org/apache/spark/ui/UIWorkloadGenerator
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.ui.UIWorkloadGenerator
   at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
 Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. 
 Program will exit.
 {code}
 I also noticed that if the jar is unzipped, and the classpath set to the 
 current directory, it just works. Finally, if the assembly jar is
 compiled with JDK6, it also works. The error is seen with any class, not just 
 the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only 
 in master.
 *Isolation*
 -I ran a git bisection and this appeared after the MLLib sparse vector patch 
 was merged:-
 https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534
 SPARK-1212
 -I narrowed this down specifically to the inclusion of the breeze library. 
 Just adding breeze to an older (unaffected) build triggered the issue.-
 I've found that if I just unpack and re-pack the jar (using `jar` from java 6 
 or 7) it always works:
 {code}
 $ cd assembly/target/scala-2.10/
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # fails
 $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
 $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar *
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # succeeds
 {code}
 I also noticed something of note. The Breeze package contains single 
 directories that have huge numbers of files in them (e.g. 2000+ class files 
 in one directory). It's possible we are hitting some weird bugs/corner cases 
 with compatibility of the internal storage format of the jar itself.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1520) Assembly Jar with more than 65536 files won't work when compiled on JDK7 and run on JDK6

2014-04-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13973463#comment-13973463
 ] 

Xiangrui Meng commented on SPARK-1520:
--

It seems HyperLogLog doesn't need fastutil, so we can exclude fastutil 
directly. Will send a patch.
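
A hedged sketch of what the exclusion could look like in SparkBuild.scala (the version string is illustrative, not necessarily the one Spark pins):

{code}
// Sketch only: drop the transitive fastutil dependency pulled in by clearspring stream.
"com.clearspring.analytics" % "stream" % "2.5.1" exclude("it.unimi.dsi", "fastutil")
{code}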

 Assembly Jar with more than 65536 files won't work when compiled on  JDK7 and 
 run on JDK6
 -

 Key: SPARK-1520
 URL: https://issues.apache.org/jira/browse/SPARK-1520
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Spark Core
Reporter: Patrick Wendell
Priority: Blocker
 Fix For: 1.0.0


 This is a real doozie - when compiling a Spark assembly with JDK7, the 
 produced jar does not work well with JRE6. I confirmed the byte code being 
 produced is JDK 6 compatible (major version 50). What happens is that, 
 silently, the JRE will not load any class files from the assembled jar.
 {code}
 $ sbt/sbt assembly/assembly
 $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] 
 [FIFO|FAIR]
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 Exception in thread main java.lang.NoClassDefFoundError: 
 org/apache/spark/ui/UIWorkloadGenerator
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.ui.UIWorkloadGenerator
   at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
 Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. 
 Program will exit.
 {code}
 I also noticed that if the jar is unzipped, and the classpath set to the 
 current directory, it just works. Finally, if the assembly jar is
 compiled with JDK6, it also works. The error is seen with any class, not just 
 the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only 
 in master.
 h1. Isolation and Cause
 The package-time behavior of Java 6 and 7 differ with respect to the format 
 used for jar files:
 ||Number of entries||JDK 6||JDK 7||
 |<= 65536|zip|zip|
 |> 65536|zip*|zip64|
 zip* is a workaround for the original zip format [described in 
 JDK-6828461|https://bugs.openjdk.java.net/browse/JDK-4828461] that allows 
 some versions of Java 6 to support larger assembly jars.
 The Scala libraries we depend on have added a large number of classes which 
 bumped us over the limit. This causes the Java 7 packaging to not work with 
 Java 6. We can probably go back under the limit by clearing out some 
 accidental inclusion of FastUtil, but eventually we'll go over again.
 The real answer is to force people to build with JDK 6 if they want to run 
 Spark on JRE 6.
 -I've found that if I just unpack and re-pack the jar (using `jar`) it always 
 works:-
 {code}
 $ cd assembly/target/scala-2.10/
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # fails
 $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
 $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar *
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # succeeds
 {code}
 -I also noticed something of note. The Breeze package contains single 
 directories that have huge numbers of files in them (e.g. 2000+ class files 
 in one directory). It's possible we are hitting some weird bugs/corner cases 
 with compatibility of the internal storage format of the jar itself.-
 -I narrowed this down specifically to the inclusion of the breeze library. 
 Just adding breeze to an older (unaffected) build triggered the issue.-
 -I ran a git bisection and this appeared after the MLLib sparse vector patch 
 was merged:-
 https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534
 SPARK-1212



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1464) Update MLLib Examples to Use Breeze

2014-04-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1464.
--

Resolution: Duplicate

 Update MLLib Examples to Use Breeze
 ---

 Key: SPARK-1464
 URL: https://issues.apache.org/jira/browse/SPARK-1464
 Project: Spark
  Issue Type: Task
  Components: MLlib
Reporter: Patrick Wendell
Assignee: Xiangrui Meng
Priority: Blocker
 Fix For: 1.0.0


 If we want to deprecate the vector class we need to update all of the 
 examples to use Breeze.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-1520) Assembly Jar with more than 65536 files won't work when compiled on JDK7 and run on JDK6

2014-04-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-1520:


Assignee: Xiangrui Meng

 Assembly Jar with more than 65536 files won't work when compiled on  JDK7 and 
 run on JDK6
 -

 Key: SPARK-1520
 URL: https://issues.apache.org/jira/browse/SPARK-1520
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Spark Core
Reporter: Patrick Wendell
Assignee: Xiangrui Meng
Priority: Blocker
 Fix For: 1.0.0


 This is a real doozie - when compiling a Spark assembly with JDK7, the 
 produced jar does not work well with JRE6. I confirmed the byte code being 
 produced is JDK 6 compatible (major version 50). What happens is that, 
 silently, the JRE will not load any class files from the assembled jar.
 {code}
 $ sbt/sbt assembly/assembly
 $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] 
 [FIFO|FAIR]
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 Exception in thread main java.lang.NoClassDefFoundError: 
 org/apache/spark/ui/UIWorkloadGenerator
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.ui.UIWorkloadGenerator
   at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
 Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. 
 Program will exit.
 {code}
 I also noticed that if the jar is unzipped, and the classpath set to the 
 current directory, it just works. Finally, if the assembly jar is
 compiled with JDK6, it also works. The error is seen with any class, not just 
 the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only 
 in master.
 h1. Isolation and Cause
 The package-time behavior of Java 6 and 7 differ with respect to the format 
 used for jar files:
 ||Number of entries||JDK 6||JDK 7||
 |<= 65536|zip|zip|
 |> 65536|zip*|zip64|
 zip* is a workaround for the original zip format [described in 
 JDK-6828461|https://bugs.openjdk.java.net/browse/JDK-4828461] that allows 
 some versions of Java 6 to support larger assembly jars.
 The Scala libraries we depend on have added a large number of classes which 
 bumped us over the limit. This causes the Java 7 packaging to not work with 
 Java 6. We can probably go back under the limit by clearing out some 
 accidental inclusion of FastUtil, but eventually we'll go over again.
 The real answer is to force people to build with JDK 6 if they want to run 
 Spark on JRE 6.
 -I've found that if I just unpack and re-pack the jar (using `jar`) it always 
 works:-
 {code}
 $ cd assembly/target/scala-2.10/
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # fails
 $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
 $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar *
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # succeeds
 {code}
 -I also noticed something of note. The Breeze package contains single 
 directories that have huge numbers of files in them (e.g. 2000+ class files 
 in one directory). It's possible we are hitting some weird bugs/corner cases 
 with compatibility of the internal storage format of the jar itself.-
 -I narrowed this down specifically to the inclusion of the breeze library. 
 Just adding breeze to an older (unaffected) build triggered the issue.-
 -I ran a git bisection and this appeared after the MLLib sparse vector patch 
 was merged:-
 https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534
 SPARK-1212



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1533) The (kill) button in the web UI is visible to everyone.

2014-04-18 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1533:


 Summary: The (kill) button in the web UI is visible to everyone.
 Key: SPARK-1533
 URL: https://issues.apache.org/jira/browse/SPARK-1533
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Xiangrui Meng


We can kill jobs from the web UI now, which is great. But there is no 
authentication in standalone mode, e.g., on clusters created by spark-ec2, 
so anyone who can reach a standalone server's web UI can kill jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1485) Implement AllReduce

2014-04-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1485:
-

Affects Version/s: (was: 1.0.0)

 Implement AllReduce
 ---

 Key: SPARK-1485
 URL: https://issues.apache.org/jira/browse/SPARK-1485
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 The current implementations of machine learning algorithms rely on the driver 
 for some computation and data broadcasting. This will create a bottleneck at 
 the driver for both computation and communication, especially in multi-model 
 training. An efficient implementation of AllReduce (or AllAggregate) can help 
 free the driver:
 allReduce(RDD[T], (T, T) => T): RDD[T]
 This JIRA was created to discuss how to implement AllReduce efficiently 
 and possible alternatives.
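 For comparison, a hedged baseline that matches the proposed signature but still routes through the driver (exactly the bottleneck this JIRA wants to remove):

{code}
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Naive baseline sketch, not the proposed implementation: reduce to the driver,
// broadcast the result, and emit it once per element of the original RDD.
def allReduce[T: ClassTag](rdd: RDD[T], f: (T, T) => T): RDD[T] = {
  val total = rdd.reduce(f)                    // funnels through the driver
  val bc = rdd.sparkContext.broadcast(total)
  rdd.map(_ => bc.value)
}
{code}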



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1561) sbt/sbt assembly generates too many local files

2014-04-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1561:


 Summary: sbt/sbt assembly generates too many local files
 Key: SPARK-1561
 URL: https://issues.apache.org/jira/browse/SPARK-1561
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Xiangrui Meng


Running `find ./ | wc -l` after `sbt/sbt assembly` returned 

564365

This hits the default inode limit of an 8GB ext filesystem (the default volume size 
for an EC2 instance), which means you can do nothing after `sbt/sbt assembly` 
on such a partition.

Most of the small files are under assembly/target/streams and the same folder 
under examples/.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1561) sbt/sbt assembly generates too many local files

2014-04-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1561:
-

Description: 
Running `find ./ | wc -l` after `sbt/sbt assembly` returned 



This hits the default limit of #INode of an 8GB EXT FS (the default volume size 
for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` 
on such a partition.

Most of the small files are under assembly/target/streams and the same folder 
under examples/.


  was:
Running `find ./ | wc -l` after `sbt/sbt assembly` returned 

564365

This hits the default limit of #INode of an 8GB EXT FS (the default volume size 
for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` 
on such a partition.

Most of the small files are under assembly/target/streams and the same folder 
under examples/.



 sbt/sbt assembly generates too many local files
 ---

 Key: SPARK-1561
 URL: https://issues.apache.org/jira/browse/SPARK-1561
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Xiangrui Meng

 Running `find ./ | wc -l` after `sbt/sbt assembly` returned 
 This hits the default limit of #INode of an 8GB EXT FS (the default volume 
 size for an EC2 instance), which means you can do nothing after 'sbt/sbt 
 assembly` on such a partition.
 Most of the small files are under assembly/target/streams and the same folder 
 under examples/.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1561) sbt/sbt assembly generates too many local files

2014-04-22 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13976487#comment-13976487
 ] 

Xiangrui Meng commented on SPARK-1561:
--

Tried adding 

{code}
assemblyOption in assembly ~= { _.copy(cacheUnzip = false) },
assemblyOption in assembly ~= { _.copy(cacheOutput = false) },
{code}

to SparkBuild.scala, but it didn't help reduce the number of files.

 sbt/sbt assembly generates too many local files
 ---

 Key: SPARK-1561
 URL: https://issues.apache.org/jira/browse/SPARK-1561
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Xiangrui Meng

 Running `find ./ | wc -l` after `sbt/sbt assembly` returned 
 564365
 This hits the default limit of #INode of an 8GB EXT FS (the default volume 
 size for an EC2 instance), which means you can do nothing after 'sbt/sbt 
 assembly` on such a partition.
 Most of the small files are under assembly/target/streams and the same folder 
 under examples/.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1561) sbt/sbt assembly generates too many local files

2014-04-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1561:
-

Description: 
Running `find ./ | wc -l` after `sbt/sbt assembly` returned 

564365

This hits the default limit of #INode of an 8GB EXT FS (the default volume size 
for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` 
on such a partition.

Most of the small files are under assembly/target/streams and the same folder 
under examples/.


  was:
Running `find ./ | wc -l` after `sbt/sbt assembly` returned 



This hits the default limit of #INode of an 8GB EXT FS (the default volume size 
for an EC2 instance), which means you can do nothing after 'sbt/sbt assembly` 
on such a partition.

Most of the small files are under assembly/target/streams and the same folder 
under examples/.



 sbt/sbt assembly generates too many local files
 ---

 Key: SPARK-1561
 URL: https://issues.apache.org/jira/browse/SPARK-1561
 Project: Spark
  Issue Type: Improvement
Affects Versions: 1.0.0
Reporter: Xiangrui Meng

 Running `find ./ | wc -l` after `sbt/sbt assembly` returned 
 564365
 This hits the default limit of #INode of an 8GB EXT FS (the default volume 
 size for an EC2 instance), which means you can do nothing after 'sbt/sbt 
 assembly` on such a partition.
 Most of the small files are under assembly/target/streams and the same folder 
 under examples/.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1595) Remove VectorRDDs

2014-04-23 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1595:
-

Affects Version/s: 1.0.0

 Remove VectorRDDs
 -

 Key: SPARK-1595
 URL: https://issues.apache.org/jira/browse/SPARK-1595
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 This is basically an RDD#map with Vectors.dense. It may be useful for Java users, 
 but it is simple enough to write directly.
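 For example, the equivalent map is roughly (sketch, assuming data: RDD[Array[Double]]):

{code}
import org.apache.spark.mllib.linalg.Vectors
// Sketch: the VectorRDDs functionality is essentially this one-liner.
val vectorRDD = data.map(values => Vectors.dense(values))
{code}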



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1599) Allow to use intercept in Ridge and Lasso

2014-04-23 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1599:


 Summary: Allow to use intercept in Ridge and Lasso
 Key: SPARK-1599
 URL: https://issues.apache.org/jira/browse/SPARK-1599
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


AddIntercept in Ridge and Lasso was disabled in 0.9.1 as a quick fix. For 1.0, 
we should remove this restriction.

Penalizing the intercept variable may not be the right thing to do. We can 
solve this problem by adding weighted regularization later.
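
As a hedged illustration of what weighted regularization means here (not an existing MLlib API): give each coefficient its own penalty weight and set the intercept's weight to zero:

{code}
// Sketch: weighted L2 penalty that leaves the intercept (assumed at index 0) unpenalized.
def weightedL2(weights: Array[Double], regParam: Double, penaltyWeights: Array[Double]): Double = {
  require(weights.length == penaltyWeights.length)
  0.5 * regParam * weights.zip(penaltyWeights).map { case (w, p) => p * w * w }.sum
}

// Example: intercept first, then two features; the intercept contributes nothing.
val penalty = weightedL2(Array(3.0, 1.0, -2.0), regParam = 0.1, penaltyWeights = Array(0.0, 1.0, 1.0))
{code}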



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1634) Java API docs contain test cases

2014-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1634:
-

Summary: Java API docs contain test cases  (was: JavaDoc contains test 
cases)

 Java API docs contain test cases
 

 Key: SPARK-1634
 URL: https://issues.apache.org/jira/browse/SPARK-1634
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Blocker

 The generated Java API docs contains all test cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1634) Java API docs contain test cases

2014-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1634:
-

Description: The generated Java API docs contain all test cases.  (was: The 
generated Java API docs contains all test cases.)

 Java API docs contain test cases
 

 Key: SPARK-1634
 URL: https://issues.apache.org/jira/browse/SPARK-1634
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Blocker

 The generated Java API docs contain all test cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1635) Java API docs do not show annotation.

2014-04-25 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1635:


 Summary: Java API docs do not show annotation.
 Key: SPARK-1635
 URL: https://issues.apache.org/jira/browse/SPARK-1635
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Xiangrui Meng


The generated Java API docs do not show the Developer/Experimental annotations, 
though the raw ':: Developer/Experimental ::' tag text does appear in the generated docs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1636) Move main methods to examples

2014-04-25 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1636:


 Summary: Move main methods to examples
 Key: SPARK-1636
 URL: https://issues.apache.org/jira/browse/SPARK-1636
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


Move the main methods to examples and make them compatible with spark-submit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1598) Mark main methods experimental

2014-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1598.
--

Resolution: Duplicate

We will move main methods to examples instead.

 Mark main methods experimental
 --

 Key: SPARK-1598
 URL: https://issues.apache.org/jira/browse/SPARK-1598
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We should treat the parameters in the main methods as part of our APIs. They 
 are not quite consistent at this time, so we should mark them experimental 
 and look for a unified solution in the next sprint.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-1634) Java API docs contain test cases

2014-04-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-1634.


   Resolution: Not a Problem
Fix Version/s: 1.0.0
 Assignee: Xiangrui Meng

Re-tried with `sbt/sbt clean`. The generated docs for test cases were gone.

 Java API docs contain test cases
 

 Key: SPARK-1634
 URL: https://issues.apache.org/jira/browse/SPARK-1634
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Blocker
 Fix For: 1.0.0


 The generated Java API docs contain all test cases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1668) Add implicit preference as an option to examples/MovieLensALS

2014-04-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1668:


 Summary: Add implicit preference as an option to 
examples/MovieLensALS
 Key: SPARK-1668
 URL: https://issues.apache.org/jira/browse/SPARK-1668
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Priority: Minor


Add --implicitPrefs as a command-line option to the example app MovieLensALS 
under examples/. For evaluation, we should map ratings to the range [0, 1] and 
compare them with the predictions. It would be better if we also added unobserved 
ratings (assumed negative) to the evaluation.
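
A hedged sketch of that evaluation step (one possible mapping of explicit 1-5 ratings into [0, 1], compared with predictions via RMSE; `ratingsAndPredictions` is an assumption):

{code}
import org.apache.spark.SparkContext._  // for DoubleRDDFunctions.mean in Spark 1.x

// Sketch: scale MovieLens 1-5 ratings into [0, 1] and compare with ALS predictions.
// Assumes ratingsAndPredictions: RDD[(Double, Double)] of (rating, prediction).
val mse = ratingsAndPredictions.map { case (rating, prediction) =>
  val target = (rating - 1.0) / 4.0   // one possible mapping of 1..5 into [0, 1]
  val err = target - prediction
  err * err
}.mean()
val rmse = math.sqrt(mse)
{code}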



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1674) Interrupted system call error in pyspark's RDD.pipe

2014-04-29 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1674:


 Summary: Interrupted system call error in pyspark's RDD.pipe
 Key: SPARK-1674
 URL: https://issues.apache.org/jira/browse/SPARK-1674
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


RDD.pipe's doctest throws interrupted system call exception on Mac. It can be 
fixed by wrapping pipe.stdout.readline in an iterator.
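
A minimal Python 2-style sketch of that pattern (the subprocess command here is 
just a stand-in; this is not the actual pyspark code):

{code}
from subprocess import Popen, PIPE

pipe = Popen(["cat"], stdin=PIPE, stdout=PIPE)
pipe.stdin.write("a\nb\nc\n")
pipe.stdin.close()

# for line in pipe.stdout: ...              # iterating the file object can hit EINTR
for line in iter(pipe.stdout.readline, ''):  # wrap readline in an iterator instead
    print(line.rstrip('\n'))
{code}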



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1674) Interrupted system call error in pyspark's RDD.pipe

2014-04-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1674.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

 Interrupted system call error in pyspark's RDD.pipe
 ---

 Key: SPARK-1674
 URL: https://issues.apache.org/jira/browse/SPARK-1674
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.0.0


 RDD.pipe's doctest throws interrupted system call exception on Mac. It can be 
 fixed by wrapping pipe.stdout.readline in an iterator.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1520) Assembly Jar with more than 65536 files won't work when compiled on JDK7 and run on JDK6

2014-05-04 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13989206#comment-13989206
 ] 

Xiangrui Meng commented on SPARK-1520:
--

Koert, which JDK6 did you use? This problem was fixed in a later version of 
openjdk-6. If you are using openjdk-6, you can try upgrading it to the latest 
version and see whether the problem still exists.

 Assembly Jar with more than 65536 files won't work when compiled on  JDK7 and 
 run on JDK6
 -

 Key: SPARK-1520
 URL: https://issues.apache.org/jira/browse/SPARK-1520
 Project: Spark
  Issue Type: Bug
  Components: MLlib, Spark Core
Reporter: Patrick Wendell
Assignee: Xiangrui Meng
Priority: Blocker
 Fix For: 1.0.0


 This is a real doozie - when compiling a Spark assembly with JDK7, the 
 produced jar does not work well with JRE6. I confirmed the byte code being 
 produced is JDK 6 compatible (major version 50). What happens is that, 
 silently, the JRE will not load any class files from the assembled jar.
 {code}
 $ sbt/sbt assembly/assembly
 $ /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] 
 [FIFO|FAIR]
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
  org.apache.spark.ui.UIWorkloadGenerator
 Exception in thread "main" java.lang.NoClassDefFoundError: 
 org/apache/spark/ui/UIWorkloadGenerator
 Caused by: java.lang.ClassNotFoundException: 
 org.apache.spark.ui.UIWorkloadGenerator
   at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
   at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
 Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. 
 Program will exit.
 {code}
 I also noticed that if the jar is unzipped, and the classpath set to the 
 current directory, it just works. Finally, if the assembly jar is 
 compiled with JDK6, it also works. The error is seen with any class, not just 
 the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only 
 in master.
 h1. Isolation and Cause
 The package-time behavior of Java 6 and 7 differ with respect to the format 
 used for jar files:
 ||Number of entries||JDK 6||JDK 7||
 |<= 65536|zip|zip|
 |> 65536|zip*|zip64|
 zip* is a workaround for the original zip format, [described in 
 JDK-6828461|https://bugs.openjdk.java.net/browse/JDK-4828461], that allows 
 some versions of Java 6 to support larger assembly jars.
 The Scala libraries we depend on have added a large number of classes which 
 bumped us over the limit. This causes the Java 7 packaging to not work with 
 Java 6. We can probably go back under the limit by clearing out some 
 accidental inclusion of FastUtil, but eventually we'll go over again.
 The real answer is to force people to build with JDK 6 if they want to run 
 Spark on JRE 6.
 -I've found that if I just unpack and re-pack the jar (using `jar`) it always 
 works:-
 {code}
 $ cd assembly/target/scala-2.10/
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # fails
 $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
 $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar *
 $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp 
 ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar 
 org.apache.spark.ui.UIWorkloadGenerator # succeeds
 {code}
 -I also noticed something of note. The Breeze package contains single 
 directories that have huge numbers of files in them (e.g. 2000+ class files 
 in one directory). It's possible we are hitting some weird bugs/corner cases 
 with compatibility of the internal storage format of the jar itself.-
 -I narrowed this down specifically to the inclusion of the breeze library. 
 Just adding breeze to an older (unaffected) build triggered the issue.-
 -I ran a git bisection and this appeared after the MLLib sparse vector patch 
 was merged:-
 https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534
 SPARK-1212



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1723) Add saveAsLibSVMFile

2014-05-05 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1723:


 Summary: Add saveAsLibSVMFile
 Key: SPARK-1723
 URL: https://issues.apache.org/jira/browse/SPARK-1723
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


Provide a method to save labeled data in LIBSVM format.
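
For reference, the LIBSVM text format such a method would write, sketched with a 
hypothetical helper (one labeled point per line, 1-based feature indices):

{code}
def to_libsvm_line(label, indices, values):
    # "<label> <index1>:<value1> <index2>:<value2> ..."
    features = " ".join("%d:%s" % (i + 1, v) for i, v in zip(indices, values))
    return "%s %s" % (label, features)

print(to_libsvm_line(1.0, [0, 2], [1.5, 3.0]))   # 1.0 1:1.5 3:3.0
{code}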



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1724) Add appendBias

2014-05-05 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1724:


 Summary: Add appendBias
 Key: SPARK-1724
 URL: https://issues.apache.org/jira/browse/SPARK-1724
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


Add `appendBias` to MLUtils to avoid adding the bias term in computation.
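
For context, a small numpy sketch of the idea (the helper below is illustrative, 
not the MLUtils implementation):

{code}
import numpy as np

# Append a constant 1.0 feature so the intercept can live inside the weight vector.
def append_bias(x):
    return np.append(x, 1.0)

print(append_bias(np.array([0.5, 2.0])))   # [ 0.5  2.   1. ]
{code}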



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1595) Remove VectorRDDs

2014-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1595.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

https://github.com/apache/spark/pull/524

 Remove VectorRDDs
 -

 Key: SPARK-1595
 URL: https://issues.apache.org/jira/browse/SPARK-1595
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.0.0


 This is basically an RDD#map with Vectors.dense; it may be useful for Java users, 
 but it is simple enough for them to write themselves.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1723) Add saveAsLibSVMFile

2014-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1723.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

https://github.com/apache/spark/pull/524

 Add saveAsLibSVMFile
 

 Key: SPARK-1723
 URL: https://issues.apache.org/jira/browse/SPARK-1723
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.0.0


 Provide a method to save labeled data in LIBSVM format.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1596) Re-arrange public methods in evaluation.

2014-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1596.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

https://github.com/apache/spark/pull/524

 Re-arrange public methods in evaluation.
 

 Key: SPARK-1596
 URL: https://issues.apache.org/jira/browse/SPARK-1596
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.0.0


 binary.BinaryClassificationMetrics is the only public API under evaluation. 
 We should move it one level up, and possibly clean the names.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1724) Add appendBias

2014-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1724.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

https://github.com/apache/spark/pull/524

 Add appendBias
 --

 Key: SPARK-1724
 URL: https://issues.apache.org/jira/browse/SPARK-1724
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.0.0


 Add `appendBias` to MLUtils to avoid adding the bias term in computation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1599) Allow to use intercept in Ridge and Lasso

2014-05-05 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1599.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

https://github.com/apache/spark/pull/524

 Allow to use intercept in Ridge and Lasso
 -

 Key: SPARK-1599
 URL: https://issues.apache.org/jira/browse/SPARK-1599
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.0.0


 AddIntercept in Ridge and Lasso was disabled in 0.9.1 for a quick fix. For 
 1.0, we should remove this restriction.
 Penalizing the intercept variable may not be the right thing to do. We can 
 solve this problem by adding weighted regularization later.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1741) Add predict(JavaRDD) to predictive models

2014-05-06 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1741:


 Summary: Add predict(JavaRDD) to predictive models
 Key: SPARK-1741
 URL: https://issues.apache.org/jira/browse/SPARK-1741
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng


`model.predict` returns an RDD of a Scala primitive type (Int/Double), which is 
recognized as Object in Java. Adding predict(JavaRDD) could make life easier 
for Java users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1743) Add mllib.util.MLUtils.{loadLibSVMFile, saveAsLibSVMFile} to pyspark

2014-05-06 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1743:


 Summary: Add mllib.util.MLUtils.{loadLibSVMFile, saveAsLibSVMFile} 
to pyspark
 Key: SPARK-1743
 URL: https://issues.apache.org/jira/browse/SPARK-1743
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng


Make loading/saving labeled data easier for pyspark users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1743) Add mllib.util.MLUtils.{loadLibSVMFile, saveAsLibSVMFile} to pyspark

2014-05-06 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13991373#comment-13991373
 ] 

Xiangrui Meng commented on SPARK-1743:
--

PR: https://github.com/apache/spark/pull/672

 Add mllib.util.MLUtils.{loadLibSVMFile, saveAsLibSVMFile} to pyspark
 

 Key: SPARK-1743
 URL: https://issues.apache.org/jira/browse/SPARK-1743
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 Make loading/saving labeled data easier for pyspark users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1741) Add predict(JavaRDD) to predictive models

2014-05-06 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13991378#comment-13991378
 ] 

Xiangrui Meng commented on SPARK-1741:
--

https://github.com/apache/spark/pull/670

 Add predict(JavaRDD) to predictive models
 -

 Key: SPARK-1741
 URL: https://issues.apache.org/jira/browse/SPARK-1741
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng

 `model.predict` returns an RDD of a Scala primitive type (Int/Double), which is 
 recognized as Object in Java. Adding predict(JavaRDD) could make life easier 
 for Java users.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1783) Title contains html code in MLlib guide

2014-05-14 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1783:


 Summary: Title contains html code in MLlib guide
 Key: SPARK-1783
 URL: https://issues.apache.org/jira/browse/SPARK-1783
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Minor


We use 

---
layout: global
title: <a href="mllib-guide.html">MLlib</a> - Clustering
---

to create a link in the title to the main page of MLlib's guide. However, the 
generated title contains raw html code, which shows up in the tab or title bar 
of the browser.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1675) Make clear whether computePrincipalComponents centers data

2014-05-14 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998103#comment-13998103
 ] 

Xiangrui Meng commented on SPARK-1675:
--

Centering in PCA should be the standard practice.

 Make clear whether computePrincipalComponents centers data
 --

 Key: SPARK-1675
 URL: https://issues.apache.org/jira/browse/SPARK-1675
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1668) Add implicit preference as an option to examples/MovieLensALS

2014-05-14 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1668.
--

   Resolution: Implemented
Fix Version/s: 1.0.0

 Add implicit preference as an option to examples/MovieLensALS
 -

 Key: SPARK-1668
 URL: https://issues.apache.org/jira/browse/SPARK-1668
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Sandeep Singh
Priority: Minor
 Fix For: 1.0.0


 Add --implicitPrefs as a command-line option to the example app MovieLensALS 
 under examples/. For evaluation, we should map ratings to the range [0, 1] and 
 compare them with the predictions. It would be better if we also added unobserved 
 ratings (assumed negatives) to the evaluation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1635) Java API docs do not show annotation.

2014-05-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1635:
-

Priority: Minor  (was: Major)

 Java API docs do not show annotation.
 -

 Key: SPARK-1635
 URL: https://issues.apache.org/jira/browse/SPARK-1635
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Minor

 The generated Java API docs do not show the Developer/Experimental 
 annotations, but the raw :: Developer/Experimental :: tag text still appears in 
 the generated doc.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1696) RowMatrix.dspr is not using parameter alpha for DenseVector

2014-05-15 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998127#comment-13998127
 ] 

Xiangrui Meng commented on SPARK-1696:
--

Thanks! I sent a PR: https://github.com/apache/spark/pull/778

 RowMatrix.dspr is not using parameter alpha for DenseVector
 ---

 Key: SPARK-1696
 URL: https://issues.apache.org/jira/browse/SPARK-1696
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Anish Patel
Assignee: Xiangrui Meng
Priority: Minor

 In the master branch, method dspr of RowMatrix takes parameter alpha, but 
 does not use it when given a DenseVector.
 This probably slid by because when method computeGramianMatrix calls dspr, it 
 provides an alpha value of 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1696) RowMatrix.dspr is not using parameter alpha for DenseVector

2014-05-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1696.
--

   Resolution: Fixed
Fix Version/s: 1.0.0

 RowMatrix.dspr is not using parameter alpha for DenseVector
 ---

 Key: SPARK-1696
 URL: https://issues.apache.org/jira/browse/SPARK-1696
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Anish Patel
Assignee: Xiangrui Meng
Priority: Minor
 Fix For: 1.0.0


 In the master branch, method dspr of RowMatrix takes parameter alpha, but 
 does not use it when given a DenseVector.
 This probably slid by because when method computeGramianMatrix calls dspr, it 
 provides an alpha value of 1.0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1605) Improve mllib.linalg.Vector

2014-05-15 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13998106#comment-13998106
 ] 

Xiangrui Meng commented on SPARK-1605:
--

`toBreeze` exposes a breeze type. We might want to mark it DeveloperApi and 
make it public, but I'm not sure whether we should do that in v1.0. Given a 
`mllib.linalg.Vector`, you can call `toArray` to get the values or operate 
directly on DenseVector/SparseVector.

 Improve mllib.linalg.Vector
 ---

 Key: SPARK-1605
 URL: https://issues.apache.org/jira/browse/SPARK-1605
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Sandeep Singh

 We can make current Vector a wrapper around Breeze.linalg.Vector ?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1646) ALS micro-optimisation

2014-05-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1646.
--

   Resolution: Implemented
Fix Version/s: 1.0.0

PR: https://github.com/apache/spark/pull/568

 ALS micro-optimisation
 --

 Key: SPARK-1646
 URL: https://issues.apache.org/jira/browse/SPARK-1646
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Tor Myklebust
Assignee: Tor Myklebust
Priority: Trivial
 Fix For: 1.0.0


 Scala for loop bodies turn into methods and the loops themselves into 
 repeated invocations of the body method.  This may lead HotSpot to make poor 
 optimisation decisions.  (Xiangrui mentioned that there was a speed 
 improvement from doing similar transformations elsewhere.)
 The loops on i and p in the ALS training code are prime candidates for this 
 transformation, as is the foreach loop doing regularisation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1359) SGD implementation is not efficient

2014-05-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1359:
-

Affects Version/s: 1.0.0

 SGD implementation is not efficient
 ---

 Key: SPARK-1359
 URL: https://issues.apache.org/jira/browse/SPARK-1359
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 0.9.0, 1.0.0
Reporter: Xiangrui Meng

 The SGD implementation samples a mini-batch to compute the stochastic 
 gradient. This is not efficient because examples are provided via an iterator 
 interface. We have to scan all of them to obtain a sample.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1485) Implement AllReduce

2014-05-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1485:
-

Priority: Critical  (was: Major)

 Implement AllReduce
 ---

 Key: SPARK-1485
 URL: https://issues.apache.org/jira/browse/SPARK-1485
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
 Fix For: 1.1.0


 The current implementations of machine learning algorithms rely on the driver 
 for some computation and data broadcasting. This will create a bottleneck at 
 the driver for both computation and communication, especially in multi-model 
 training. An efficient implementation of AllReduce (or AllAggregate) can help 
 free the driver:
 allReduce(RDD[T], (T, T) => T): RDD[T]
 This JIRA is created for discussing how to implement AllReduce efficiently 
 and possible alternatives.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK

2014-05-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999499#comment-13999499
 ] 

Xiangrui Meng commented on SPARK-1782:
--

Btw, this approach only gives us \Sigma and V. If we compute U via A V 
\Sigma^{-1} (current implementation in MLlib), very likely we lose 
orthogonality. For sparse SVD, the best package is PROPACK, which implements 
Lanczos bidiagonalization with partial reorthogonalization 
(http://sun.stanford.edu/~rmunk/PROPACK/). But let us use ARPACK now since we 
can call it from Breeze.

 svd for sparse matrix using ARPACK
 --

 Key: SPARK-1782
 URL: https://issues.apache.org/jira/browse/SPARK-1782
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Li Pu
   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently the svd implementation in mllib calls the dense matrix svd in 
 breeze, which has a limitation of fitting n^2 Gram matrix entries in memory 
 (n is the number of rows or number of columns of the matrix, whichever is 
 smaller). In many use cases, the original matrix is sparse but the Gram 
 matrix might not be, and we often need only the largest k singular 
 values/vectors. To make svd really scalable, the memory usage must be 
 proportional to the number of non-zero entries in the matrix. 
 One solution is to call the de facto standard eigen-decomposition package 
 ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors 
 of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the 
 eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a 
 reverse communication interface. The user provides a function to multiply a 
 square matrix to be decomposed with a dense vector provided by ARPACK, and 
 return the resulting dense vector to ARPACK. Inside ARPACK it uses an 
 Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we 
 need to provide are two matrix-vector multiplications, first M*x then M^t*x. 
 These multiplications can be done in Spark in a distributed manner.
 The working memory used by ARPACK is O(n*k). When k (the number of desired 
 singular values) is small, it can easily fit into the memory of the master 
 machine. The overall model is that the master machine runs ARPACK and distributes 
 the matrix-vector multiplications onto worker executors in each iteration. 
 I made a PR to breeze with an ARPACK-backed svds interface 
 (https://github.com/scalanlp/breeze/pull/240). The interface takes anything 
 that can be multiplied by a DenseVector. On the Spark/mllib side, we just need to 
 implement the sparse matrix-vector multiplication. 
 It might take some time to optimize and fully test this implementation, so 
 set the workload estimate to 4 weeks. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1752) Standardize input/output format for vectors and labeled points

2014-05-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1752:
-

Fix Version/s: 1.1.0

 Standardize input/output format for vectors and labeled points
 --

 Key: SPARK-1752
 URL: https://issues.apache.org/jira/browse/SPARK-1752
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
 Fix For: 1.1.0


 We should standardize the text format used to represent vectors and labeled 
 points. The proposed formats are the following:
 1. dense vector: [v0,v1,..]
 2. sparse vector: (size,[i0,i1],[v0,v1])
 3. labeled point: (label,vector)
 where (..) indicates a tuple and [...] indicates an array. Those are 
 compatible with Python's syntax and can be easily parsed using `eval`.
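
 For illustration, a small Python sketch of parsing these formats 
 (ast.literal_eval is used here as a safer stand-in for the eval mentioned above):

 {code}
 import ast

 dense = ast.literal_eval("[1.0,0.0,3.0]")                 # [1.0, 0.0, 3.0]
 sparse = ast.literal_eval("(3,[0,2],[1.0,3.0])")          # (3, [0, 2], [1.0, 3.0])
 label, vector = ast.literal_eval("(1.0,[1.0,0.0,3.0])")   # labeled point
 print(dense, sparse, label, vector)
 {code}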



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1553) Support alternating nonnegative least-squares

2014-05-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1553:
-

Fix Version/s: 1.1.0

 Support alternating nonnegative least-squares
 -

 Key: SPARK-1553
 URL: https://issues.apache.org/jira/browse/SPARK-1553
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 0.9.0
Reporter: Tor Myklebust
Assignee: Tor Myklebust
 Fix For: 1.1.0


 There's already an ALS implementation.  It can be tweaked to support 
 nonnegative least-squares by conditionally running a nonnegative 
 least-squares solve instead of a least-squares solver.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1585) Not robust Lasso causes Infinity on weights and losses

2014-05-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999123#comment-13999123
 ] 

Xiangrui Meng commented on SPARK-1585:
--

I think the gradient should pull the weights back. If I'm wrong, could you 
create some example code to demonstrate the problem? -Xiangrui

 Not robust Lasso causes Infinity on weights and losses
 --

 Key: SPARK-1585
 URL: https://issues.apache.org/jira/browse/SPARK-1585
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 0.9.1
Reporter: Xusen Yin
Assignee: Xusen Yin
 Fix For: 1.1.0


 Lasso uses LeastSquaresGradient and L1Updater, but 
 diff = brzWeights.dot(brzData) - label
 in LeastSquaresGradient can produce too big a diff, which then affects the 
 L1Updater and increases the weights exponentially. A small shrinkage value 
 cannot lasso the weights back to zero, so the weights and losses eventually reach 
 Infinity.
 For example, with data = (0.5 repeated 10k times) and weights = (0.6 repeated 10k 
 times), data.dot(weights) is approximately 3000, and the diff will be of the same 
 order. The L1Updater then sets the weights to a similarly large magnitude; the next 
 iteration blows them up further, and so on.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-16 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1855:


 Summary: Provide memory-and-local-disk RDD checkpointing
 Key: SPARK-1855
 URL: https://issues.apache.org/jira/browse/SPARK-1855
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Spark Core
Affects Versions: 1.0.0
Reporter: Xiangrui Meng


Checkpointing is used to cut long lineage while maintaining fault tolerance. 
The current implementation is HDFS-based. Using the BlockRDD we can create 
in-memory-and-local-disk (with replication) checkpoints that are not as 
reliable as the HDFS-based solution but are faster.

It can help applications that require many iterations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1485) Implement AllReduce

2014-05-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1485:
-

Fix Version/s: 1.1.0

 Implement AllReduce
 ---

 Key: SPARK-1485
 URL: https://issues.apache.org/jira/browse/SPARK-1485
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
 Fix For: 1.1.0


 The current implementations of machine learning algorithms rely on the driver 
 for some computation and data broadcasting. This will create a bottleneck at 
 the driver for both computation and communication, especially in multi-model 
 training. An efficient implementation of AllReduce (or AllAggregate) can help 
 free the driver:
 allReduce(RDD[T], (T, T) => T): RDD[T]
 This JIRA is created for discussing how to implement AllReduce efficiently 
 and possible alternatives.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK

2014-05-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13999493#comment-13999493
 ] 

Xiangrui Meng commented on SPARK-1782:
--

This sounds good to me. Let's assume that A is m-by-n where n is small (< 1e6). 
Note that (A^T A) x = A^T (A x) can be computed in a single pass, as 
sum_i (a_i^T x) a_i. So we don't need to implement A^T y, which simplifies the 
task.
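
A small numpy check of that identity (illustrative only; in MLlib the sum over the 
rows a_i would become a distributed map-and-reduce over the row RDD):

{code}
import numpy as np

A = np.array([[1.0, 2.0], [0.0, 3.0], [4.0, 0.0]])
x = np.array([1.0, -1.0])
single_pass = sum(a * a.dot(x) for a in A)            # sum_i (a_i^T x) a_i
print(np.allclose(single_pass, A.T.dot(A).dot(x)))    # True
{code}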

 svd for sparse matrix using ARPACK
 --

 Key: SPARK-1782
 URL: https://issues.apache.org/jira/browse/SPARK-1782
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Li Pu
   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently the svd implementation in mllib calls the dense matrix svd in 
 breeze, which has a limitation of fitting n^2 Gram matrix entries in memory 
 (n is the number of rows or number of columns of the matrix, whichever is 
 smaller). In many use cases, the original matrix is sparse but the Gram 
 matrix might not be, and we often need only the largest k singular 
 values/vectors. To make svd really scalable, the memory usage must be 
 proportional to the number of non-zero entries in the matrix. 
 One solution is to call the de facto standard eigen-decomposition package 
 ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors 
 of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the 
 eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a 
 reverse communication interface. The user provides a function to multiply a 
 square matrix to be decomposed with a dense vector provided by ARPACK, and 
 return the resulting dense vector to ARPACK. Inside ARPACK it uses an 
 Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we 
 need to provide are two matrix-vector multiplications, first M*x then M^t*x. 
 These multiplications can be done in Spark in a distributed manner.
 The working memory used by ARPACK is O(n*k). When k (the number of desired 
 singular values) is small, it can easily fit into the memory of the master 
 machine. The overall model is that the master machine runs ARPACK and distributes 
 the matrix-vector multiplications onto worker executors in each iteration. 
 I made a PR to breeze with an ARPACK-backed svds interface 
 (https://github.com/scalanlp/breeze/pull/240). The interface takes anything 
 that can be multiplied by a DenseVector. On the Spark/mllib side, we just need to 
 implement the sparse matrix-vector multiplication. 
 It might take some time to optimize and fully test this implementation, so 
 set the workload estimate to 4 weeks. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1580) ALS: Estimate communication and computation costs given a partitioner

2014-05-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1580:
-

Fix Version/s: 1.1.0

 ALS: Estimate communication and computation costs given a partitioner
 -

 Key: SPARK-1580
 URL: https://issues.apache.org/jira/browse/SPARK-1580
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Tor Myklebust
Priority: Minor
 Fix For: 1.1.0


 It would be nice to be able to estimate the amount of work needed to solve an 
 ALS problem.  The chief components of this work are computation time---time 
 spent forming and solving the least squares problems---and communication 
 cost---the number of bytes sent across the network.  Communication cost 
 depends heavily on how the users and products are partitioned.
 We currently do not try to cluster users or products so that fewer feature 
 vectors need to be communicated.  This is intended as a first step toward 
 that end---we ought to be able to tell whether one partitioning is better 
 than another.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1856) Standardize MLlib interfaces

2014-05-16 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1856:


 Summary: Standardize MLlib interfaces
 Key: SPARK-1856
 URL: https://issues.apache.org/jira/browse/SPARK-1856
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xiangrui Meng
Priority: Critical
 Fix For: 1.1.0


Instead of expanding MLlib based on the current class naming scheme 
(ProblemWithAlgorithm), we should standardize MLlib's interfaces so that they clearly 
separate datasets, formulations, algorithms, parameter sets, and models.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1861) ArrayIndexOutOfBoundsException when reading bzip2 files

2014-05-16 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1861:


 Summary: ArrayIndexOutOfBoundsException when reading bzip2 files
 Key: SPARK-1861
 URL: https://issues.apache.org/jira/browse/SPARK-1861
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Xiangrui Meng


Hadoop uses CBZip2InputStream to decode bzip2 files. However, the 
implementation is not threadsafe and Spark may run multiple tasks in the same 
JVM, which leads to this error. This is not a problem for Hadoop MapReduce 
because Hadoop runs each task in a separate JVM.

A workaround is to set `SPARK_WORKER_CORES=1` in spark-env.sh for a standalone 
cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1486) Support multi-model training in MLlib

2014-05-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1486:
-

Priority: Critical  (was: Major)

 Support multi-model training in MLlib
 -

 Key: SPARK-1486
 URL: https://issues.apache.org/jira/browse/SPARK-1486
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical
 Fix For: 1.1.0


 It is rare in practice to train just one model with a given set of 
 parameters. Usually, this is done by training multiple models with different 
 sets of parameters and then selecting the best based on their performance on the 
 validation set. MLlib should provide native support for multi-model 
 training/scoring. It requires decoupling of concepts like problem, 
 formulation, algorithm, parameter set, and model, which are missing in MLlib 
 now. MLI implements similar concepts, which we can borrow. There are 
 different approaches for multi-model training:
 0) Keep one copy of the data, and train models one after another (or maybe in 
 parallel, depending on the scheduler).
 1) Keep one copy of the data, and train multiple models at the same time 
 (similar to `runs` in KMeans).
 2) Make multiple copies of the data (still stored distributively), and use 
 more cores to distribute the work.
 3) Collect the data, make the entire dataset available on workers, and train 
 one or more models on each worker.
 Users should be able to choose which execution mode they want to use. Note 
 that 3) could cover many use cases in practice when the training data is not 
 huge, e.g., 1GB.
 This task will be divided into sub-tasks and this JIRA is created to discuss 
 the design and track the overall progress.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1553) Support alternating nonnegative least-squares

2014-05-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1553:
-

Priority: Major  (was: Minor)

 Support alternating nonnegative least-squares
 -

 Key: SPARK-1553
 URL: https://issues.apache.org/jira/browse/SPARK-1553
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 0.9.0
Reporter: Tor Myklebust
Assignee: Tor Myklebust
 Fix For: 1.1.0


 There's already an ALS implementation.  It can be tweaked to support 
 nonnegative least-squares by conditionally running a nonnegative 
 least-squares solve instead of a least-squares solver.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1782) svd for sparse matrix using ARPACK

2014-05-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000620#comment-14000620
 ] 

Xiangrui Meng commented on SPARK-1782:
--

If you need the latest Breeze to use eigs, I would prefer calling ARPACK 
directly.

 svd for sparse matrix using ARPACK
 --

 Key: SPARK-1782
 URL: https://issues.apache.org/jira/browse/SPARK-1782
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Li Pu
   Original Estimate: 672h
  Remaining Estimate: 672h

 Currently the svd implementation in mllib calls the dense matrix svd in 
 breeze, which has a limitation of fitting n^2 Gram matrix entries in memory 
 (n is the number of rows or number of columns of the matrix, whichever is 
 smaller). In many use cases, the original matrix is sparse but the Gram 
 matrix might not be, and we often need only the largest k singular 
 values/vectors. To make svd really scalable, the memory usage must be 
 proportional to the number of non-zero entries in the matrix. 
 One solution is to call the de facto standard eigen-decomposition package 
 ARPACK. For an input matrix M, we compute a few eigenvalues and eigenvectors 
 of M^t*M (or M*M^t if its size is smaller) using ARPACK, then use the 
 eigenvalues/vectors to reconstruct singular values/vectors. ARPACK has a 
 reverse communication interface. The user provides a function to multiply a 
 square matrix to be decomposed with a dense vector provided by ARPACK, and 
 return the resulting dense vector to ARPACK. Inside ARPACK it uses an 
 Implicitly Restarted Lanczos Method for symmetric matrix. Outside what we 
 need to provide are two matrix-vector multiplications, first M*x then M^t*x. 
 These multiplications can be done in Spark in a distributed manner.
 The working memory used by ARPACK is O(n*k). When k (the number of desired 
 singular values) is small, it can easily fit into the memory of the master 
 machine. The overall model is that the master machine runs ARPACK and distributes 
 the matrix-vector multiplications onto worker executors in each iteration. 
 I made a PR to breeze with an ARPACK-backed svds interface 
 (https://github.com/scalanlp/breeze/pull/240). The interface takes anything 
 that can be multiplied by a DenseVector. On the Spark/mllib side, we just need to 
 implement the sparse matrix-vector multiplication. 
 It might take some time to optimize and fully test this implementation, so 
 set the workload estimate to 4 weeks. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1861) ArrayIndexOutOfBoundsException when reading bzip2 files

2014-05-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14000677#comment-14000677
 ] 

Xiangrui Meng commented on SPARK-1861:
--

Patch available at https://issues.apache.org/jira/browse/HADOOP-10614 .

 ArrayIndexOutOfBoundsException when reading bzip2 files
 ---

 Key: SPARK-1861
 URL: https://issues.apache.org/jira/browse/SPARK-1861
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 Hadoop uses CBZip2InputStream to decode bzip2 files. However, the 
 implementation is not threadsafe and Spark may run multiple tasks in the same 
 JVM, which leads to this error. This is not a problem for Hadoop MapReduce 
 because Hadoop runs each task in a separate JVM.
 A workaround is to set `SPARK_WORKER_CORES=1` in spark-env.sh for a 
 standalone cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results

2014-05-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001136#comment-14001136
 ] 

Xiangrui Meng edited comment on SPARK-1859 at 5/18/14 5:27 PM:
---

The step size should be smaller than 1 over the Lipschitz constant L. Your 
example contains a term 0.5 * (1500 * w - 2400)^2, whose Hessian is 1500 * 
1500. To make it converge, you need to set step size smaller than 
(1.0/1500/1500). Yes, it looks like a simple problem, but it is actually 
ill-conditioned.

scikit-learn may use line search or directly solve the least square problem, 
while we didn't implement line search in LinearRegressionWithSGD. You can try 
LBFGS in the current master, which should work for your example.


was (Author: mengxr):
The step size should be smaller than the Lipschitz constant L. Your example 
contains a term 0.5 * (1500 * w - 2400)^2, whose Hessian is 1500 * 1500. To 
make it converge, you need to set step size smaller than (1.0/1500/1500). Yes, 
it looks like a simple problem, but it is actually ill-conditioned.

scikit-learn may use line search or directly solve the least square problem, 
while we didn't implement line search in LinearRegressionWithSGD. You can try 
LBFGS in the current master, which should work for your example.

 Linear, Ridge and Lasso Regressions with SGD yield unexpected results
 -

 Key: SPARK-1859
 URL: https://issues.apache.org/jira/browse/SPARK-1859
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 0.9.1
 Environment: OS: Ubuntu Server 12.04 x64
 PySpark
Reporter: Vlad Frolov
  Labels: algorithm, machine_learning, regression

 Issue:
 Linear Regression with SGD doesn't work as expected on any data but lpsa.dat 
 (the example one).
 Ridge Regression with SGD *sometimes* works ok.
 Lasso Regression with SGD *sometimes* works ok.
 Code example (PySpark) based on 
 http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
 {code:title=regression_example.py}
 from numpy import array
 from pyspark.mllib.regression import LinearRegressionWithSGD
 # sc is the SparkContext provided by the pyspark shell
 parsedData = sc.parallelize([
 array([2400., 1500.]),
 array([240., 150.]),
 array([24., 15.]),
 array([2.4, 1.5]),
 array([0.24, 0.15])
 ])
 # Build the model
 model = LinearRegressionWithSGD.train(parsedData)
 print model._coeffs
 {code}
 So we have a line ({{f(X) = 1.6 * X}}) here. Fortunately, {{f(X) = X}} works! 
 :)
 The resulting model has nan coeffs: {{array([ nan])}}.
 Furthermore, if you comment records line by line you will get:
 * [-1.55897475e+296] coeff (the first record is commented), 
 * [-8.62115396e+104] coeff (the first two records are commented),
 * etc
 It looks like the implemented regression algorithms diverge somehow.
 I get almost the same results on Ridge and Lasso.
 I've also tested these inputs in scikit-learn and it works as expected there.
 However, I'm still not sure whether it's a bug or SGD 'feature'. Should I 
 preprocess my datasets somehow?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results

2014-05-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001136#comment-14001136
 ] 

Xiangrui Meng commented on SPARK-1859:
--

The step size should be smaller than the Lipschitz constant L. Your example 
contains a term 0.5 * (1500 * w - 2400)^2, whose Hessian is 1500 * 1500. To 
make it converge, you need to set step size smaller than (1.0/1500/1500). Yes, 
it looks like a simple problem, but it is actually ill-conditioned.

scikit-learn may use line search or directly solve the least square problem, 
while we didn't implement line search in LinearRegressionWithSGD. You can try 
LBFGS in the current master, which should work for your example.
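
A quick worked version of that bound (added for illustration, not part of the 
original comment): the term quoted above is 
$f(w) = \tfrac{1}{2}(1500\,w - 2400)^2$, so $f''(w) = 1500^2 = 2.25 \times 10^6 = L$, 
and plain gradient descent needs a step size $\gamma < 1/L \approx 4.4 \times 10^{-7}$, 
matching the (1.0/1500/1500) suggested above.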

 Linear, Ridge and Lasso Regressions with SGD yield unexpected results
 -

 Key: SPARK-1859
 URL: https://issues.apache.org/jira/browse/SPARK-1859
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 0.9.1
 Environment: OS: Ubuntu Server 12.04 x64
 PySpark
Reporter: Vlad Frolov
  Labels: algorithm, machine_learning, regression

 Issue:
 Linear Regression with SGD doesn't work as expected on any data but lpsa.dat 
 (the example one).
 Ridge Regression with SGD *sometimes* works ok.
 Lasso Regression with SGD *sometimes* works ok.
 Code example (PySpark) based on 
 http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
 {code:title=regression_example.py}
 from numpy import array
 from pyspark.mllib.regression import LinearRegressionWithSGD
 # sc is the SparkContext provided by the pyspark shell
 parsedData = sc.parallelize([
 array([2400., 1500.]),
 array([240., 150.]),
 array([24., 15.]),
 array([2.4, 1.5]),
 array([0.24, 0.15])
 ])
 # Build the model
 model = LinearRegressionWithSGD.train(parsedData)
 print model._coeffs
 {code}
 So we have a line ({{f(X) = 1.6 * X}}) here. Fortunately, {{f(X) = X}} works! 
 :)
 The resulting model has nan coeffs: {{array([ nan])}}.
 Furthermore, if you comment records line by line you will get:
 * [-1.55897475e+296] coeff (the first record is commented), 
 * [-8.62115396e+104] coeff (the first two records are commented),
 * etc
 It looks like the implemented regression algorithms diverge somehow.
 I get almost the same results on Ridge and Lasso.
 I've also tested these inputs in scikit-learn and it works as expected there.
 However, I'm still not sure whether it's a bug or SGD 'feature'. Should I 
 preprocess my datasets somehow?



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1783) Title contains html code in MLlib guide

2014-05-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001183#comment-14001183
 ] 

Xiangrui Meng commented on SPARK-1783:
--

Added a `displayTitle` variable to the global layout. If it is defined, it is used 
instead of `title` for the page title in the `h1`.

 Title contains html code in MLlib guide
 ---

 Key: SPARK-1783
 URL: https://issues.apache.org/jira/browse/SPARK-1783
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor

 We use 
 ---
 layout: global
 title: <a href="mllib-guide.html">MLlib</a> - Clustering
 ---
 to create a link in the title to the main page of MLlib's guide. However, the 
 generated title contains raw html code, which shows up in the tab or title 
 bar of the browser.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1871) Improve MLlib guide for v1.0

2014-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1871:
-

Summary: Improve MLlib guide for v1.0  (was: Improve MLlib guide)

 Improve MLlib guide for v1.0
 

 Key: SPARK-1871
 URL: https://issues.apache.org/jira/browse/SPARK-1871
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Reporter: Xiangrui Meng





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1872) Update api links for unidoc

2014-05-18 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1872:


 Summary: Update api links for unidoc
 Key: SPARK-1872
 URL: https://issues.apache.org/jira/browse/SPARK-1872
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Reporter: Xiangrui Meng


Should use unidoc for API links.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1783) Title contains html code in MLlib guide

2014-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1783:
-

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-1871

 Title contains html code in MLlib guide
 ---

 Key: SPARK-1783
 URL: https://issues.apache.org/jira/browse/SPARK-1783
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Minor

 We use 
 ---
 layout: global
 title: <a href="mllib-guide.html">MLlib</a> - Clustering
 ---
 to create a link in the title to the main page of MLlib's guide. However, the 
 generated title contains raw html code, which shows up in the tab or title 
 bar of the browser.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1871) Improve MLlib guide for v1.0

2014-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1871:
-

Description: More improvements to MLlib guide.

 Improve MLlib guide for v1.0
 

 Key: SPARK-1871
 URL: https://issues.apache.org/jira/browse/SPARK-1871
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Reporter: Xiangrui Meng

 More improvements to MLlib guide.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN

2014-05-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001195#comment-14001195
 ] 

Xiangrui Meng commented on SPARK-1870:
--

I specified the jar via `--jars` and added it with `sc.addJar` explicitly. In the 
Web UI, I see:

{code}
/mnt/yarn/nm/usercache/ubuntu/appcache/application_1398708946838_0152/container_1398708946838_0152_01_01/hello_2.10.jar
 System Classpath
http://10.45.133.8:43576/jars/hello_2.10.jar    Added By User
{code}

So it is in distributed cache as well as served by master via http. However, I 
still got ClassNotFoundException.

 Jars specified via --jars in spark-submit are not added to executor classpath 
 for YARN
 --

 Key: SPARK-1870
 URL: https://issues.apache.org/jira/browse/SPARK-1870
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Critical

 With `spark-submit`, jars specified via `--jars` are added to distributed 
 cache in `yarn-cluster` mode. The executor should add cached jars to 
 classpath. However, 
 {code}
 sc.parallelize(0 to 10, 10).map { i =>
   System.getProperty("java.class.path")
 }.collect().foreach(println)
 {code}
 shows only system jars, `app.jar`, and `spark.jar` but not other jars in the 
 distributed cache.
 The workaround is using assembly jar.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN

2014-05-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001195#comment-14001195
 ] 

Xiangrui Meng edited comment on SPARK-1870 at 5/18/14 8:36 PM:
---

I specified the jar via `--jars` and added it with `sc.addJar` explicitly. In 
the Web UI, I see:

{code}
/mnt/yarn/nm/usercache/ubuntu/appcache/application_1398708946838_0152/
container_1398708946838_0152_01_01/hello_2.10.jar   System Classpath
http://10.45.133.8:43576/jars/hello_2.10.jar    Added By User
{code}

So it is in distributed cache as well as served by master via http. However, I 
still got ClassNotFoundException.


was (Author: mengxr):
I specified the jar via `--jars` and added it with `sc.addJar` explicitly. In 
the Web UI, I see:

{code}
/mnt/yarn/nm/usercache/ubuntu/appcache/application_1398708946838_0152/container_1398708946838_0152_01_01/hello_2.10.jar
 System Classpath
http://10.45.133.8:43576/jars/hello_2.10.jar    Added By User
{code}

So it is in distributed cache as well as served by master via http. However, I 
still got ClassNotFoundException.

 Jars specified via --jars in spark-submit are not added to executor classpath 
 for YARN
 --

 Key: SPARK-1870
 URL: https://issues.apache.org/jira/browse/SPARK-1870
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Critical

 With `spark-submit`, jars specified via `--jars` are added to distributed 
 cache in `yarn-cluster` mode. The executor should add cached jars to 
 classpath. However, 
 {code}
 sc.parallelize(0 to 10, 10).map { i =>
   System.getProperty("java.class.path")
 }.collect().foreach(println)
 {code}
 shows only system jars, `app.jar`, and `spark.jar` but not other jars in the 
 distributed cache.
 The workaround is using assembly jar.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1871) Improve MLlib guide

2014-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1871:
-

Component/s: Documentation

 Improve MLlib guide
 ---

 Key: SPARK-1871
 URL: https://issues.apache.org/jira/browse/SPARK-1871
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Reporter: Xiangrui Meng





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1871) Improve MLlib guide

2014-05-18 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1871:


 Summary: Improve MLlib guide
 Key: SPARK-1871
 URL: https://issues.apache.org/jira/browse/SPARK-1871
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Xiangrui Meng






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1874) Clean up MLlib sample data

2014-05-18 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14001254#comment-14001254
 ] 

Xiangrui Meng commented on SPARK-1874:
--

Is `data/mllib` a better place than `mllib/data`?

 Clean up MLlib sample data
 --

 Key: SPARK-1874
 URL: https://issues.apache.org/jira/browse/SPARK-1874
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Matei Zaharia
 Fix For: 1.0.0


 - Replace logistic regression example data with linear to make 
 mllib.LinearRegression example easier to run
 - Move files from mllib/data into data/mllib to make them easier to find
 - Add a simple MovieLens data file



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1874) Clean up MLlib sample data

2014-05-19 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002535#comment-14002535
 ] 

Xiangrui Meng commented on SPARK-1874:
--

There are three files under `data/`: `kmeans_data.txt`, `lr_data.txt`, and 
`pagerank_data.txt`, while there are more files under `mllib/data`. It feels more 
natural to me to keep the sample data under `mllib/data`. Anyway, I will create 
the sample data first.

 Clean up MLlib sample data
 --

 Key: SPARK-1874
 URL: https://issues.apache.org/jira/browse/SPARK-1874
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Matei Zaharia
 Fix For: 1.0.0


 - Replace logistic regression example data with linear to make 
 mllib.LinearRegression example easier to run
 - Move files from mllib/data into data/mllib to make them easier to find
 - Add a simple MovieLens data file



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN

2014-05-19 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14002613#comment-14002613
 ] 

Xiangrui Meng commented on SPARK-1870:
--

I tested it on a Spark 1.0RC standalone cluster but didn't find the same 
problem. Classes are loaded properly.
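
One way to confirm this kind of thing is to load the user class inside a task on every partition; a minimal sketch (the class name is a placeholder, not the actual test code):

{code}
sc.parallelize(0 to 10, 10).map { _ =>
  // throws ClassNotFoundException on the executor if the jar is missing from its classpath
  Class.forName("example.HelloApp").getName
}.collect().foreach(println)
{code}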

 Jars specified via --jars in spark-submit are not added to executor classpath 
 for YARN
 --

 Key: SPARK-1870
 URL: https://issues.apache.org/jira/browse/SPARK-1870
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Critical

 With `spark-submit`, jars specified via `--jars` are added to the distributed 
 cache in `yarn-cluster` mode. The executors should add the cached jars to 
 their classpath. However, 
 {code}
 sc.parallelize(0 to 10, 10).map { i =>
   System.getProperty("java.class.path")
 }.collect().foreach(println)
 {code}
 shows only the system jars, `app.jar`, and `spark.jar`, but not the other jars 
 in the distributed cache.
 The workaround is to use an assembly jar.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1861) ArrayIndexOutOfBoundsException when reading bzip2 files

2014-05-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1861.
--

Resolution: Implemented

The patch will be included in the next Hadoop releases (1.3.0, 2.5.0).

 ArrayIndexOutOfBoundsException when reading bzip2 files
 ---

 Key: SPARK-1861
 URL: https://issues.apache.org/jira/browse/SPARK-1861
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 Hadoop uses CBZip2InputStream to decode bzip2 files. However, the 
 implementation is not thread-safe, and Spark may run multiple tasks in the same 
 JVM, which leads to this error. This is not a problem for Hadoop MapReduce 
 because Hadoop runs each task in a separate JVM.
 A workaround is to set `SPARK_WORKER_CORES=1` in spark-env.sh for a 
 standalone cluster.
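
A minimal sketch of that workaround on a standalone cluster; it trades parallelism for correctness, since each worker then runs only one task at a time:

{code}
# conf/spark-env.sh on every worker node (restart the workers afterwards)
export SPARK_WORKER_CORES=1
{code}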



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN

2014-05-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1870:
-

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-1905

 Jars specified via --jars in spark-submit are not added to executor classpath 
 for YARN
 --

 Key: SPARK-1870
 URL: https://issues.apache.org/jira/browse/SPARK-1870
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Reporter: Xiangrui Meng
Priority: Critical
 Fix For: 1.0.0


 With `spark-submit`, jars specified via `--jars` are added to the distributed 
 cache in `yarn-cluster` mode. The executors should add the cached jars to 
 their classpath. However, 
 {code}
 sc.parallelize(0 to 10, 10).map { i =>
   System.getProperty("java.class.path")
 }.collect().foreach(println)
 {code}
 shows only the system jars, `app.jar`, and `spark.jar`, but not the other jars 
 in the distributed cache.
 The workaround is to use an assembly jar.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1906) spark-submit doesn't send master URL to Driver in standalone cluster mode

2014-05-22 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1906:


 Summary: spark-submit doesn't send master URL to Driver in 
standalone cluster mode
 Key: SPARK-1906
 URL: https://issues.apache.org/jira/browse/SPARK-1906
 Project: Spark
  Issue Type: Sub-task
  Components: Deploy
Reporter: Xiangrui Meng
Priority: Minor


`spark-submit` doesn't send `spark.master` to `DriverWrapper`, so the latter 
creates an empty SparkConf and throws an exception:

{code}
A master URL must be set in your configuration
{code}

The workaround is to set the master explicitly in the user application.
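
A minimal sketch of that workaround (the master URL and app name are placeholders):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// set the master in the application instead of relying on spark-submit
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("MyApp")
val sc = new SparkContext(conf)
{code}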



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1908) Support local app jar in standalone cluster mode

2014-05-22 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1908:


 Summary: Support local app jar in standalone cluster mode
 Key: SPARK-1908
 URL: https://issues.apache.org/jira/browse/SPARK-1908
 Project: Spark
  Issue Type: Sub-task
  Components: Deploy
Reporter: Xiangrui Meng
Priority: Minor


Standalone cluster mode only supports an app jar with a URL that is accessible 
from all cluster nodes, e.g., a jar on HDFS or a local file that is available 
on each cluster node. It would be nice to support an app jar that is only 
available on the client.
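
A sketch of what works today under this restriction (host name and paths are illustrative):

{code}
# the app jar must already be readable from every node, e.g. on HDFS
bin/spark-submit --master spark://master-host:7077 --deploy-mode cluster \
  --class example.MyApp hdfs:///user/ubuntu/myapp.jar
{code}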



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1909) --jars is not supported in standalone cluster mode

2014-05-22 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-1909:


 Summary: --jars is not supported in standalone cluster mode
 Key: SPARK-1909
 URL: https://issues.apache.org/jira/browse/SPARK-1909
 Project: Spark
  Issue Type: Sub-task
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Minor


--jars is not processed in `spark-submit` for standalone cluster mode. The 
workaround is to build an assembly app jar. It might be easy to support user 
jars that are accessible from the cluster nodes by setting the `spark.jars` 
property correctly and passing it to `DriverWrapper`.
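
A rough, untested sketch of the mechanism suggested in the last sentence, setting `spark.jars` from the application itself (the paths are illustrative and must be readable from the cluster nodes):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// SparkContext picks up spark.jars and adds the listed jars for the executors to fetch
val conf = new SparkConf()
  .set("spark.jars", "hdfs:///user/ubuntu/dep1.jar,hdfs:///user/ubuntu/dep2.jar")
val sc = new SparkContext(conf)
{code}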



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1900) Fix running PySpark files on YARN

2014-05-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1900:
-

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-1652

 Fix running PySpark files on YARN 
 --

 Key: SPARK-1900
 URL: https://issues.apache.org/jira/browse/SPARK-1900
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.0.0


 If I run the following on a YARN cluster
 {code}
 bin/spark-submit sheep.py --master yarn-client
 {code}
 it fails because of a mismatch in paths: `spark-submit` thinks that 
 `sheep.py` resides on HDFS, and balks when it can't find the file there. A 
 natural workaround is to add the `file:` prefix to the file:
 {code}
 bin/spark-submit file:/path/to/sheep.py --master yarn-client
 {code}
 However, this also fails, this time because Python does not understand 
 URI schemes.
 This PR fixes this by properly resolving all paths passed as command-line 
 arguments to `spark-submit`. This has the added benefit of 
 keeping file and jar paths consistent across the different cluster modes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1900) Fix running PySpark files on YARN

2014-05-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1900:
-

Issue Type: Bug  (was: Sub-task)
Parent: (was: SPARK-1905)

 Fix running PySpark files on YARN 
 --

 Key: SPARK-1900
 URL: https://issues.apache.org/jira/browse/SPARK-1900
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Blocker
 Fix For: 1.0.0


 If I run the following on a YARN cluster
 {code}
 bin/spark-submit sheep.py --master yarn-client
 {code}
 it fails because of a mismatch in paths: `spark-submit` thinks that 
 `sheep.py` resides on HDFS, and balks when it can't find the file there. A 
 natural workaround is to add the `file:` prefix to the file:
 {code}
 bin/spark-submit file:/path/to/sheep.py --master yarn-client
 {code}
 However, this also fails, this time because Python does not understand 
 URI schemes.
 This PR fixes this by properly resolving all paths passed as command-line 
 arguments to `spark-submit`. This has the added benefit of 
 keeping file and jar paths consistent across the different cluster modes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1870) Jars specified via --jars in spark-submit are not added to executor classpath for YARN

2014-05-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1870:
-

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-1652

 Jars specified via --jars in spark-submit are not added to executor classpath 
 for YARN
 --

 Key: SPARK-1870
 URL: https://issues.apache.org/jira/browse/SPARK-1870
 Project: Spark
  Issue Type: Sub-task
  Components: YARN
Reporter: Xiangrui Meng
Priority: Critical
 Fix For: 1.0.0


 With `spark-submit`, jars specified via `--jars` are added to the distributed 
 cache in `yarn-cluster` mode. The executors should add the cached jars to 
 their classpath. However, 
 {code}
 sc.parallelize(0 to 10, 10).map { i =>
   System.getProperty("java.class.path")
 }.collect().foreach(println)
 {code}
 shows only the system jars, `app.jar`, and `spark.jar`, but not the other jars 
 in the distributed cache.
 The workaround is to use an assembly jar.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1906) spark-submit doesn't send master URL to Driver in standalone cluster mode

2014-05-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1906:
-

Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-1905)

 spark-submit doesn't send master URL to Driver in standalone cluster mode
 -

 Key: SPARK-1906
 URL: https://issues.apache.org/jira/browse/SPARK-1906
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: Xiangrui Meng
Priority: Minor

 `spark-submit` doesn't send `spark.master` to `DriverWrapper`, so the latter 
 creates an empty SparkConf and throws an exception:
 {code}
 A master URL must be set in your configuration
 {code}
 The workaround is to set the master explicitly in the user application.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1909) --jars is not supported in standalone cluster mode

2014-05-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1909:
-

Issue Type: Improvement  (was: Sub-task)
Parent: (was: SPARK-1905)

 --jars is not supported in standalone cluster mode
 

 Key: SPARK-1909
 URL: https://issues.apache.org/jira/browse/SPARK-1909
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Xiangrui Meng
Priority: Minor

 --jars is not processed in `spark-submit` for standalone cluster mode. The 
 workaround is to build an assembly app jar. It might be easy to support user 
 jars that are accessible from the cluster nodes by setting the `spark.jars` 
 property correctly and passing it to `DriverWrapper`.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1908) Support local app jar in standalone cluster mode

2014-05-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1908:
-

Issue Type: Bug  (was: Sub-task)
Parent: (was: SPARK-1905)

 Support local app jar in standalone cluster mode
 

 Key: SPARK-1908
 URL: https://issues.apache.org/jira/browse/SPARK-1908
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Xiangrui Meng
Priority: Minor

 Standalone cluster mode only supports app jar with a URL that is accessible 
 from all cluster nodes, e.g., a jar on HDFS or a local file that is available 
 on each cluster nodes. It would be nice to support app jar that is only 
 available on the client.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1908) Support local app jar in standalone cluster mode

2014-05-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1908:
-

Issue Type: Sub-task  (was: Bug)
Parent: SPARK-1652

 Support local app jar in standalone cluster mode
 

 Key: SPARK-1908
 URL: https://issues.apache.org/jira/browse/SPARK-1908
 Project: Spark
  Issue Type: Sub-task
  Components: Deploy
Reporter: Xiangrui Meng
Priority: Minor

 Standalone cluster mode only supports an app jar with a URL that is accessible 
 from all cluster nodes, e.g., a jar on HDFS or a local file that is available 
 on each cluster node. It would be nice to support an app jar that is only 
 available on the client.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1906) spark-submit doesn't send master URL to Driver in standalone cluster mode

2014-05-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-1906:
-

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-1652

 spark-submit doesn't send master URL to Driver in standalone cluster mode
 -

 Key: SPARK-1906
 URL: https://issues.apache.org/jira/browse/SPARK-1906
 Project: Spark
  Issue Type: Sub-task
  Components: Deploy
Reporter: Xiangrui Meng
Priority: Minor

 `spark-submit` doesn't send `spark.master` to `DriverWrapper`, so the latter 
 creates an empty SparkConf and throws an exception:
 {code}
 A master URL must be set in your configuration
 {code}
 The workaround is to set the master explicitly in the user application.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1905) Issues with `spark-submit`

2014-05-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-1905.
--

Resolution: Duplicate

 Issues with `spark-submit`
 --

 Key: SPARK-1905
 URL: https://issues.apache.org/jira/browse/SPARK-1905
 Project: Spark
  Issue Type: Task
  Components: Deploy
Reporter: Xiangrui Meng

 Main JIRA for tracking the issues with `spark-submit` in the different deploy 
 modes.
 For deploy modes, `client` means running the driver on the client side (outside 
 the cluster), while `cluster` means running the driver inside the cluster. 
 Ideally, we want `spark-submit` to support all deploy modes in a unified 
 way, which includes:
 * documented
 ** local client
 ** standalone client
 ** standalone cluster
 ** yarn client
 ** yarn cluster
 ** mesos client
 * undocumented (for testing only)
 ** local cluster
 * not supported
 ** mesos cluster
 In each deploy mode, we also need to test launching Python jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

