Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread Xiangrui Meng
I don't think it is easy to make sparse faster than dense with this sparsity and feature dimension. You can try rcv1.binary, which should show the difference easily. David, the breeze operators used here are: 1. DenseVector dot SparseVector; 2. axpy DenseVector SparseVector. However, the

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread Xiangrui Meng
Hi DB, I saw you are using yarn-cluster mode for the benchmark. I tested the yarn-cluster mode and found that YARN does not always give you the exact number of executors requested. Just want to confirm that you've checked the number of executors. The second thing to check is that in the

Re: Fw: Is there any way to make a quick test on some pre-commit code?

2014-04-24 Thread Prashant Sharma
Not sure, but I use sbt/sbt ~compile instead of package. Is there any reason we use package instead of compile (which is slightly faster, of course)? Prashant Sharma On Thu, Apr 24, 2014 at 1:32 PM, Patrick Wendell pwend...@gmail.com wrote: This is already on the wiki:

Problem creating objects through reflection

2014-04-24 Thread Piotr Kołaczkowski
Hi, I'm working on Cassandra-Spark integration and I hit a pretty severe problem. One of the provided features is mapping Cassandra rows into objects of user-defined classes. E.g. like this: class MyRow(val key: String, val data: Int) sc.cassandraTable(keyspace, table).select(key,

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread DB Tsai
Hi Xiangrui, Yes, I'm using yarn-cluster mode, and I did check that the number of executors I specified matches the number of executors actually running. For caching and materialization, I start the timer in the optimizer after calling count(); as a result, the time for materialization in cache isn't in the benchmark.

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread Xiangrui Meng
I don't understand why sparse falls so far behind dense at the very first iteration. I don't see count() being called in https://github.com/dbtsai/spark-lbfgs-benchmark/blob/master/src/main/scala/org/apache/spark/mllib/benchmark/BinaryLogisticRegression.scala . Maybe you have local uncommitted

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread DB Tsai
I start the timer in runMiniBatchSGD after val numExamples = data.count(). See the following. Running the rcv1 dataset now, and will update soon. val startTime = System.nanoTime() for (i <- 1 to numIterations) { // Sample a subset (fraction miniBatchFraction) of the total data

Re: MLlib - logistic regression with GD vs LBFGS, sparse vs dense benchmark result

2014-04-24 Thread DB Tsai
rcv1.binary is too sparse (0.15% non-zero elements), so the dense format will not run because it runs out of memory. But the sparse format runs really well. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

Re: Problem creating objects through reflection

2014-04-24 Thread Michael Armbrust
The Spark REPL is slightly modified from the normal Scala REPL to prevent work from being done twice when closures are deserialized on the workers. I'm not sure exactly why this causes your problem, but it's probably worth filing a JIRA about it. Here is another issue with classes defined in the