[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS as optimizer for L...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r115745818

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala ---
@@ -154,22 +159,23 @@ class LinearSVCSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
   test("linearSVC with sample weights") {
     def modelEquals(m1: LinearSVCModel, m2: LinearSVCModel): Unit = {
-      assert(m1.coefficients ~== m2.coefficients absTol 0.05)
+      assert(m1.coefficients ~== m2.coefficients absTol 0.07)
       assert(m1.intercept ~== m2.intercept absTol 0.05)
     }
-
-    val estimator = new LinearSVC().setRegParam(0.01).setTol(0.01)
-    val dataset = smallBinaryDataset
-    MLTestingUtils.testArbitrarilyScaledWeights[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, modelEquals)
-    MLTestingUtils.testOutliersWithSmallWeights[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, 2, modelEquals, outlierRatio = 3)
-    MLTestingUtils.testOversamplingVsWeighting[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, modelEquals, 42L)
+    LinearSVC.supportedOptimizers.foreach { opt =>
+      val estimator = new LinearSVC().setRegParam(0.02).setTol(0.01).setSolver(opt)
+      val dataset = smallBinaryDataset
+      MLTestingUtils.testArbitrarilyScaledWeights[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, modelEquals)
+      MLTestingUtils.testOutliersWithSmallWeights[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, 2, modelEquals, outlierRatio = 3)
+      MLTestingUtils.testOversamplingVsWeighting[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, modelEquals, 42L)
+    }
   }

-  test("linearSVC comparison with R e1071 and scikit-learn") {
-    val trainer1 = new LinearSVC()
+  test("linearSVC OWLQN comparison with R e1071 and scikit-learn") {
+    val trainer1 = new LinearSVC().setSolver(LinearSVC.OWLQN)
       .setRegParam(0.2) // set regParam = 2.0 / datasize / c
--- End diff --

@hhbyyh I saw some posts saying that hinge loss is not differentiable but squared hinge loss is, for practical purposes... can you please point to a reference on squared hinge loss?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
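The differentiability point above can be checked with a small standalone sketch (plain Scala, independent of the PR): the hinge loss has a kink at margin 1, while the squared hinge's gradient is continuous there.

```scala
// Hinge loss L(m) = max(0, 1 - m) has a kink at margin m = 1: the left
// derivative is -1 and the right derivative is 0, so no gradient exists
// there. Squared hinge L(m) = max(0, 1 - m)^2 has derivative
// -2 * max(0, 1 - m), which tends to 0 from both sides at m = 1, so it
// is continuously differentiable (though not twice).
object HingeSmoothness {
  def hinge(m: Double): Double = math.max(0.0, 1.0 - m)
  def squaredHinge(m: Double): Double = { val h = hinge(m); h * h }
  def squaredHingeGrad(m: Double): Double = -2.0 * hinge(m)

  def main(args: Array[String]): Unit = {
    val eps = 1e-6
    // One-sided difference quotients of the plain hinge at m = 1 disagree:
    val hingeLeft  = (hinge(1.0) - hinge(1.0 - eps)) / eps // ~ -1
    val hingeRight = (hinge(1.0 + eps) - hinge(1.0)) / eps // ~  0
    // ...but for the squared hinge both sides agree (gradient ~ 0):
    val sqLeft  = (squaredHinge(1.0) - squaredHinge(1.0 - eps)) / eps
    val sqRight = (squaredHinge(1.0 + eps) - squaredHinge(1.0)) / eps
    println(f"hinge: left=$hingeLeft%.3f right=$hingeRight%.3f")
    println(f"squared hinge: left=$sqLeft%.8f right=$sqRight%.8f")
  }
}
```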
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS as optimizer for L...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r115741206

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala ---
@@ -154,22 +159,23 @@ class LinearSVCSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
   test("linearSVC with sample weights") {
     def modelEquals(m1: LinearSVCModel, m2: LinearSVCModel): Unit = {
-      assert(m1.coefficients ~== m2.coefficients absTol 0.05)
+      assert(m1.coefficients ~== m2.coefficients absTol 0.07)
       assert(m1.intercept ~== m2.intercept absTol 0.05)
     }
-
-    val estimator = new LinearSVC().setRegParam(0.01).setTol(0.01)
-    val dataset = smallBinaryDataset
-    MLTestingUtils.testArbitrarilyScaledWeights[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, modelEquals)
-    MLTestingUtils.testOutliersWithSmallWeights[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, 2, modelEquals, outlierRatio = 3)
-    MLTestingUtils.testOversamplingVsWeighting[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, modelEquals, 42L)
+    LinearSVC.supportedOptimizers.foreach { opt =>
+      val estimator = new LinearSVC().setRegParam(0.02).setTol(0.01).setSolver(opt)
+      val dataset = smallBinaryDataset
+      MLTestingUtils.testArbitrarilyScaledWeights[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, modelEquals)
+      MLTestingUtils.testOutliersWithSmallWeights[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, 2, modelEquals, outlierRatio = 3)
+      MLTestingUtils.testOversamplingVsWeighting[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, modelEquals, 42L)
+    }
   }

-  test("linearSVC comparison with R e1071 and scikit-learn") {
-    val trainer1 = new LinearSVC()
+  test("linearSVC OWLQN comparison with R e1071 and scikit-learn") {
+    val trainer1 = new LinearSVC().setSolver(LinearSVC.OWLQN)
       .setRegParam(0.2) // set regParam = 2.0 / datasize / c
--- End diff --

Hinge loss is not differentiable... how are you smoothing it before you can use a quasi-Newton solver? Since the papers smooth the max, a Newton/quasi-Newton solver should work well... if you are keeping the non-differentiable loss, it would be better to use a sub-gradient solver as suggested by the talk... I will evaluate the formulation...
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS as optimizer for L...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r115659479

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala ---
@@ -154,22 +159,23 @@ class LinearSVCSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
   test("linearSVC with sample weights") {
     def modelEquals(m1: LinearSVCModel, m2: LinearSVCModel): Unit = {
-      assert(m1.coefficients ~== m2.coefficients absTol 0.05)
+      assert(m1.coefficients ~== m2.coefficients absTol 0.07)
       assert(m1.intercept ~== m2.intercept absTol 0.05)
     }
-
-    val estimator = new LinearSVC().setRegParam(0.01).setTol(0.01)
-    val dataset = smallBinaryDataset
-    MLTestingUtils.testArbitrarilyScaledWeights[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, modelEquals)
-    MLTestingUtils.testOutliersWithSmallWeights[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, 2, modelEquals, outlierRatio = 3)
-    MLTestingUtils.testOversamplingVsWeighting[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, modelEquals, 42L)
+    LinearSVC.supportedOptimizers.foreach { opt =>
+      val estimator = new LinearSVC().setRegParam(0.02).setTol(0.01).setSolver(opt)
+      val dataset = smallBinaryDataset
+      MLTestingUtils.testArbitrarilyScaledWeights[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, modelEquals)
+      MLTestingUtils.testOutliersWithSmallWeights[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, 2, modelEquals, outlierRatio = 3)
+      MLTestingUtils.testOversamplingVsWeighting[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, modelEquals, 42L)
+    }
   }

-  test("linearSVC comparison with R e1071 and scikit-learn") {
-    val trainer1 = new LinearSVC()
+  test("linearSVC OWLQN comparison with R e1071 and scikit-learn") {
+    val trainer1 = new LinearSVC().setSolver(LinearSVC.OWLQN)
       .setRegParam(0.2) // set regParam = 2.0 / datasize / c
--- End diff --

These slides also explain it... please see slide 32... the max can be replaced by a soft-max whose softness parameter lambda can be tuned... log-sum-exp is a standard soft-max that can be used; it is similar to the ReLU function and we can re-use it from MLP: ftp://ftp.cs.wisc.edu/math-prog/talks/informs99ssv.ps ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.pdf I can add the formulation if there is interest... it needs some tuning of the soft-max parameter, but convergence will be good with LBFGS (OWLQN is not needed).
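The log-sum-exp smoothing suggested above can be sketched as follows (a minimal standalone Scala example; the `lambda` parameter and helper names are illustrative, not part of the PR):

```scala
// Soft-max smoothing of the hinge: max(0, z), with z = 1 - y * f(x), is
// replaced by the scaled log-sum-exp (softplus)
//   p_lambda(z) = (1/lambda) * log(1 + exp(lambda * z)),
// which is infinitely differentiable and converges to max(0, z) as
// lambda grows (the gap is bounded by log(2) / lambda).
object SoftHinge {
  // Numerically stable softplus: avoids overflow for large lambda * z.
  def softplus(z: Double, lambda: Double): Double = {
    val t = lambda * z
    (math.max(t, 0.0) + math.log1p(math.exp(-math.abs(t)))) / lambda
  }

  def hinge(z: Double): Double = math.max(0.0, z)

  def main(args: Array[String]): Unit = {
    for (lambda <- Seq(1.0, 10.0, 100.0); z <- Seq(-2.0, 0.0, 0.5, 2.0)) {
      val gap = softplus(z, lambda) - hinge(z)
      // The smooth surrogate upper-bounds the hinge by at most log(2)/lambda.
      assert(gap >= 0.0 && gap <= math.log(2.0) / lambda + 1e-12)
      println(f"lambda=$lambda%5.1f z=$z%4.1f gap=$gap%.6f")
    }
  }
}
```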
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS as optimizer for LinearSV...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/17862

@hhbyyh can we smooth the hinge-loss using soft-max (a variant of ReLU) and then use LBFGS?
[GitHub] spark issue #12574: [SPARK-13857][ML][WIP] Add "recommend all" functionality...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/12574

test
[GitHub] spark issue #14473: [SPARK-16495] [MLlib]Add ADMM optimizer in mllib package
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/14473

ADMM is already available as a Breeze solver (alongside BFGS and OWLQN: NonlinearMinimizer), which is integrated with ml/mllib... It would be great if you could look into it and let me know if you need pointers for running comparisons with OWLQN: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala This is implemented based on the paper you cited.
[GitHub] spark issue #12574: [SPARK-13857][ML][WIP] Add "recommend all" functionality...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/12574

Can we close it? Looks like SPARK-18235 opened up recommendForAll.
[GitHub] spark issue #12574: [SPARK-13857][ML][WIP] Add "recommend all" functionality...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/12574

test
[GitHub] spark issue #12574: [SPARK-13857][ML][WIP] Add "recommend all" functionality...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/12574

I will take a pass at the PR as well...
[GitHub] spark issue #12574: [SPARK-13857][ML][WIP] Add "recommend all" functionality...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/12574

@MLnick I recently visited IBM STC but unfortunately missed you at the meeting... we discussed the ML/MLlib changes for matrix factorization...
[GitHub] spark issue #458: [SPARK-1543][MLlib] Add ADMM for solving Lasso (and elasti...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/458

ADMM is already implemented as part of Breeze's proximal NonlinearMinimizer, where the ADMM solver stays in the master and the gradient calculator is plugged in the same way Breeze LBFGS/OWLQN have been... I did not open a PR since OWLQN was chosen for L1 logistic regression...
[GitHub] spark issue #1110: [SPARK-2174][MLLIB] treeReduce and treeAggregate
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/1110

@mengxr say I have 20 nodes with 16 cores each: do you recommend running treeReduce with 320 partitions and OpenBLAS with numThreads=1 per partition for the seqOp, OR treeReduce with 20 partitions and OpenBLAS with numThreads=16 per partition for the seqOp? Do you have further ideas for decreasing network shuffle using treeReduce/treeAggregate, or is there an open JIRA where we can move this discussion? Looks like shuffle is already compressed by default in Spark using snappy... do you recommend compressing the vector logically?

SparkContext: 20 nodes, 16 cores, sc.defaultParallelism = 320

def gramSize(n: Int) = n * (n + 1) / 2 // packed upper-triangular size

val combOp = (v1: Array[Float], v2: Array[Float]) => {
  var i = 0
  while (i < v1.length) {
    v1(i) += v2(i)
    i += 1
  }
  v1
}

val n = gramSize(4096)
val vv = sc.parallelize(0 until sc.defaultParallelism).map(i => Array.fill[Float](n)(0))

Option 1: 320 partitions, 1 thread on combOp per partition

val start = System.nanoTime(); vv.treeReduce(combOp, 2); val reduceTime = (System.nanoTime() - start)*1e-9
reduceTime: Double = 5.639030243006

Option 2: 20 partitions, 1 thread on combOp per partition

val coalescedvv = vv.coalesce(20)
coalescedvv.count
val start = System.nanoTime(); coalescedvv.treeReduce(combOp, 2); val reduceTime = (System.nanoTime() - start)*1e-9
reduceTime: Double = 3.914068564004

Option 3: 20 partitions, OpenBLAS numThreads=16 per partition

Setting up OpenBLAS on the cluster; I will update soon.

Let me know your thoughts. I think if the underlying operations are dense BLAS level-1, level-2 or level-3, running with more OpenBLAS threads and fewer partitions should help decrease cross-partition shuffle.
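The combOp in the benchmark above can be exercised locally with a pairwise tree reduction (a standalone sketch of what treeReduce's combOp does, not Spark's actual implementation):

```scala
// Local sketch: partial Gram vectors from each partition are summed
// pairwise in ~log2(numPartitions) rounds, so with fewer, larger
// partitions each round moves fewer (but bigger) arrays over the network.
object TreeReduceSketch {
  def combOp(v1: Array[Float], v2: Array[Float]): Array[Float] = {
    var i = 0
    while (i < v1.length) { v1(i) += v2(i); i += 1 }
    v1
  }

  // Reduce partition results pairwise, level by level.
  def treeReduce(parts: Seq[Array[Float]]): Array[Float] = {
    var level = parts
    while (level.size > 1)
      level = level.grouped(2).map(g => g.reduce(combOp)).toSeq
    level.head
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 8
    val n = 10 // stand-in for gramSize(4096)
    val parts = Seq.fill(numPartitions)(Array.fill[Float](n)(1.0f))
    val total = treeReduce(parts)
    // Each slot accumulated one 1.0f contribution per partition.
    assert(total.forall(_ == numPartitions.toFloat))
  }
}
```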
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5869#issuecomment-162240882

@srowen actually I am not sure if MAP calculation got added to the ML pipeline or not... I will look into it, and if someone else has already added it, I will close the PR.
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-136503426

@rezazadeh got busy with the spark streaming version of KNN :-) I will open up 2 PRs over the weekend as we discussed.
[GitHub] spark pull request: [MLLIB][WIP] SPARK-4638: Kernels feature for M...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5503#issuecomment-120658511

@dbtsai @mandar2812 I found the kernel abstraction explained in my PR https://github.com/apache/spark/pull/6213 more generic for practical use-cases than the usual interface available in scikit-learn... It will be great if we can come up with a strategy such that this PR calls IndexedRowMatrix.rowSimilarity to get the kernel from data represented as RDD[LabeledPoint].
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-109654217

Internally we are using this code for euclidean/rbf driving PIC for example... but sure, we can focus on cosine first...
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-107113056

@rezazadeh sure, I will do that. Could you add a JIRA for 3 (Kernel Clustering / PIC) so that we can add the RBFKernel flow and implement PIC with vector-matrix multiply for comparisons? Also, in general topK can decrease the kernel size; it is a cross validation parameter for seeing the degradation of the clustering compared to the full kernel, which is always difficult to keep as the rows grow... no such experiments have been done for PIC. I am experimenting with gemv based optimization for SparseVector x SparseMatrix, and if I get further speedup compared to the level-1 flow, most likely we will provide both options to the users in SPARK-4823.
[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3536#issuecomment-105026856

Let's continue the validation discussion on https://github.com/apache/spark/pull/6213. The PR introduces batch gemm based similarity computation in MatrixFactorizationModel using the kernel abstraction. Do we need the online version that Steven added as well, or can it be extracted from the batch results? My focus was more on speeding up the batch computation...
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-104968079

Runtime comparisons are posted on SPARK-4823 for the MovieLens1m dataset (8 cores, 4 GB executor memory on my laptop).

Stages 24-35 are the row similarity flow. Total runtime ~ 20 s
Stage 64 is the col similarity mapPartitions. Total runtime ~ 4.6 min

I have not yet gone to gemv, which will decrease the runtime further but will add some approximation in RBFKernel. I think we should give users both the vector based flow and the gemv based flow and let them choose. I updated the driver code in examples.mllib.MovieLensSimilarity.

@MLnick @sowen could you please take a look at examples.mllib.MovieLensSimilarity? I am running ALS in implicit mode with no regularization (basically full RMSE optimization) and comparing the similarities generated from raw features with the item similarities. I take topK=50 from raw features as golden labels and compute MAP on the top-50 predictions from MatrixFactorizationModel.similarItems() that this PR added. I will add a test case for RBFKernel and add the PowerIterationClustering driver using the IndexedRowMatrix.rowSimilarities code before taking the WIP label off the PR.
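The evaluation described above can be sketched in standalone Scala (names are illustrative, not the PR's actual API): the topK items from raw-feature similarities serve as "golden" labels, and mean average precision (MAP) scores the model's ranked predictions against them.

```scala
// Hypothetical MAP evaluation sketch: average precision per ranked list,
// then the mean over all queries.
object MapEval {
  // Average precision of one ranked prediction list against a label set.
  def averagePrecision(predicted: Seq[Int], labels: Set[Int]): Double = {
    var hits = 0
    var sumPrec = 0.0
    predicted.zipWithIndex.foreach { case (p, i) =>
      if (labels.contains(p)) {
        hits += 1
        sumPrec += hits.toDouble / (i + 1) // precision at this rank
      }
    }
    if (labels.isEmpty) 0.0 else sumPrec / math.min(labels.size, predicted.size)
  }

  def meanAveragePrecision(all: Seq[(Seq[Int], Set[Int])]): Double =
    all.map { case (pred, lab) => averagePrecision(pred, lab) }.sum / all.size

  def main(args: Array[String]): Unit = {
    // A perfect ranking gives AP = 1.0.
    assert(averagePrecision(Seq(1, 2, 3), Set(1, 2, 3)) == 1.0)
    // A miss at rank 1 lowers it: hits land at ranks 2 and 3.
    val ap = averagePrecision(Seq(9, 1, 2), Set(1, 2))
    assert(math.abs(ap - (0.5 + 2.0 / 3.0) / 2.0) < 1e-12)
    println(f"AP with one miss at rank 1: $ap%.4f")
  }
}
```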
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-104970859

Refactoring MatrixFactorizationModel.recommendForAll to a common place like Vectors/Matrices will help users who have dense data with a modest number of columns (~1000-10K; most IoT data falls in this category) reuse the dgemm based kernel computation. I am not sure where a good place for this code would be?
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-104936678

Internally, the vector flow in IndexedRowMatrix has helped us do additional optimization through user defined kernels and cut computation, which won't happen if we go to dgemv, since the matrix compute would be done before norm-based filters can be applied... I think we should keep the vector based kernel compute and get user feedback first...
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-104934928

@mengxr I generalized MatrixFactorizationModel.recommendAll and use it for similarUsers and similarProducts with dgemm... In IndexedRowMatrix I only exposed rowSimilarity as the public API, and it uses blocked BLAS level-1 computation... It is easy to use gemv in IndexedRowMatrix.rowSimilarity for the CosineKernel, but for the RBFKernel things get tricky: for sparse vectors, I don't think we can write the squared euclidean distance as norm1*norm1 + norm2*norm2 - 2*dot(x, y) without letting go of some accuracy, which might be OK compared to the runtime benefits... I am looking further into the RBF computation using dgemv...
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-103771669

Actually for both Euclidean and RBF it is possible, since ||x - y||^2 can be decomposed as ||x||^2 + ||y||^2 - 2*dot(x, y), where dot(x, y) can be computed through dgemv... We can't use dgemm yet since BLAS does not have SparseMatrix x SparseMatrix... Is there an open PR for it?
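The decomposition referenced above is easy to sanity-check in standalone Scala: a precomputed squared norm per row plus one dot product (a gemv over a block of rows) recovers the squared Euclidean distance.

```scala
// Verifies the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * dot(x, y)
// on a small dense example.
object EuclideanDecomposition {
  def dot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  def main(args: Array[String]): Unit = {
    val x = Array(1.0, -2.0, 0.5)
    val y = Array(0.0, 3.0, -1.5)
    // Direct squared distance.
    val direct = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
    // Decomposed form: squared norms plus one dot product.
    val decomposed = dot(x, x) + dot(y, y) - 2.0 * dot(x, y)
    assert(math.abs(direct - decomposed) < 1e-12)
    println(f"direct=$direct%.4f decomposed=$decomposed%.4f")
  }
}
```

The floating-point caveat in the comment above is the cancellation in `- 2*dot(x, y)` when x and y are nearly identical; mathematically the identity is exact.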
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-103925316

For gemv it is not clear how to re-use the scratch space for the result vector... if we can't reuse the result vector over multiple calls to kernel.compute, we won't get much runtime benefit... I am considering that for vector based IndexedRowMatrix, we define the kernel as the traditional (vector, vector) compute and use level-1 BLAS as done in this PR. The big runtime benefit will come from approximate KNN, which I will open up next, but we still need brute-force KNN for cross validation. For (Long, Array[Double]) from the matrix factorization model (similarUsers and similarProducts) we can use dgemm, specifically for DenseMatrix x DenseMatrix... @mengxr what do you think? That way we can use dgemm when the features are dense. Also, the (Long, Array[Double]) data structure could be defined in the recommendation/linalg package and re-used by the dense kernel computation. Or perhaps for similarity/KNN computation it is fine to stay in vector space and not do the gemv/gemm optimization?
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-103615439

Thinking about it more: maybe EuclideanKernel can be decomposed using matrix x vector as well.
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-103614290

SparseMatrix x SparseVector got merged to master today: https://github.com/apache/spark/pull/6209. I will update the PR and separate the code path for CosineKernel/ProductKernel and EuclideanKernel/RBFKernel to see the runtime improvements.
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-102841783

@mengxr the failures are related to the YARN suite, which does not look related to my changes... the tests I added ran fine:

[info] *** 1 TEST FAILED ***
[error] Failed: Total 39, Failed 1, Errors 0, Passed 38
[error] Failed tests:
[error] org.apache.spark.deploy.yarn.YarnClusterSuite
[GitHub] spark pull request: [SPARK-7681][MLlib] Add SparseVector support f...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6209#issuecomment-102840355 Are there runtime comparisons posted for these changes, BLAS-1 vs BLAS-2 (SparseMatrix * SparseVector compared to Array[SparseVector] x SparseVector)?
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-102843964 For CosineKernel and ProductKernel, we should be able to have a separate code path with BLAS-2 once SparseMatrix x SparseVector merges and BLAS-3 once SparseMatrix x SparseMatrix merges..Basically refactor blockify from MatrixFactorizationModel to IndexedRowMatrix...Right now the sparse features are not in master yet...For Euclidean, RBF and Pearson, even with these changes merged, I think we still have to stay in BLAS-1
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
GitHub user debasish83 opened a pull request: https://github.com/apache/spark/pull/6213 [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity @mengxr @srowen For RowMatrix with 100K columns, colSimilarity with bruteforce/dimsum sampling is used. This PR adds rowSimilarity to IndexedRowMatrix, which outputs a CoordinateMatrix. For matrices with ~1M columns, the rowSimilarity flow scales better than the column similarity flow. For most applications, the topK similar items requirement is much smaller than all available items, and therefore the rowSimilarity API takes topK and threshold as input; topK and threshold help reduce shuffle space. For a MatrixFactorizationModel, the columns of both user and product factors are typically ~50-200, so the column similarity flow does not work for such cases. This PR also adds batch similarUsers and similarProducts (SPARK-4675). The following ideas are added:
1. Similarity computation is abstracted as Kernel
2. Kernel implementations for Cosine, RBF, Euclidean and Product (for distributed matrix multiply) are added
3. Tests cover the Cosine kernel. More tests will be added for the Euclidean, RBF and Product kernels.
4. The IndexedRowMatrix object adds a kernelized distributed matrix multiply which is used by the similarity computation.
5. In examples, MovieLensSimilarity is added that shows col and row based flows on MovieLens as a runtime experiment.
6. Level-1 BLAS is used so that the kernel abstraction can be used. We can either design the Kernel abstraction with Level-3 BLAS (might be difficult) or use BlockMatrix for distributed matrix multiply.
Next steps:
1. In MovieLensSimilarity add an ALS + similarItems example
2. Use RBF similarity in the power iteration clustering flow
From internal experiments, we have run 6M users, 1.6M items and 351M ratings through the row similarity flow with topK=200 in 1.1 hr with 240 cores running over 30 nodes.
We had a difficult time in scaling the column similarity flow since the topK optimization can't be added until the reduce phase is done in that flow. On the MovieLens-1M and Netflix datasets I will report row and col similarity runtime comparisons. You can merge this pull request into a Git repository by running: $ git pull https://github.com/debasish83/spark similarity Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6213.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #6213 commit f9fd6fbfb1a55142a9eb8f2129d3729ca25ab501 Author: Debasish Das debasish@one.verizon.com Date: 2015-05-17T00:05:52Z blocked kernalized row similarity calculation and tests commit 66176f9f346c324b9c77c252be369e24f7fdd991 Author: Debasish Das debasish@one.verizon.com Date: 2015-05-17T00:06:36Z Cosine, Euclidean, RBF and Product Kernel added commit 3f96963f80a40f3a4fce6b6dbd97c20605ebaecc Author: Debasish Das debasish@one.verizon.com Date: 2015-05-17T00:07:28Z row similarity API added to drive MatrixFactorizationModel similarUsers and similarProducts commit 6dc9e18d507cfe0d2ee12e768ca6bddb5c3c4b38 Author: Debasish Das debasish@one.verizon.com Date: 2015-05-17T00:09:24Z MovieLens flow to demonstrate item similarity calculation using raw features and ALS factors commit 71f24a4629cf54c39af4e9e598d9808d85952532 Author: Debasish Das debasish@one.verizon.com Date: 2015-05-17T00:09:45Z import cleanup commit cc4e104b7430e3fe2e6bf71489638321076428a3 Author: Debasish Das debasish@one.verizon.com Date: 2015-05-17T00:11:15Z Merge branch 'similarity' of https://github.com/debasish83/spark into similarity
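The Kernel abstraction described in the PR above (items 1 and 2) can be sketched in plain Scala. This is an illustrative sketch, not the PR's actual API: the trait and object names here are assumptions, and the real implementation would use BLAS calls over Spark rows rather than local arrays.

```scala
// Hypothetical sketch of the Kernel abstraction: each kernel maps a pair
// of row vectors to a similarity score.
trait Kernel {
  def compute(x: Array[Double], y: Array[Double]): Double
}

// Product kernel: a plain dot product (what BLAS-1 ddot computes).
object ProductKernel extends Kernel {
  def compute(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum
}

// Cosine kernel: dot product normalized by the two vector norms.
object CosineKernel extends Kernel {
  def compute(x: Array[Double], y: Array[Double]): Double = {
    val dot = ProductKernel.compute(x, y)
    val nx  = math.sqrt(ProductKernel.compute(x, x))
    val ny  = math.sqrt(ProductKernel.compute(y, y))
    if (nx == 0.0 || ny == 0.0) 0.0 else dot / (nx * ny)
  }
}

// RBF kernel: exp(-gamma * squared Euclidean distance).
class RBFKernel(gamma: Double) extends Kernel {
  def compute(x: Array[Double], y: Array[Double]): Double = {
    val sqDist = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
    math.exp(-gamma * sqDist)
  }
}
```

Cosine and Product decompose into dot products, which is why they can move to BLAS-2/BLAS-3 block multiplies, while Euclidean/RBF need the pairwise distance term.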
[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3536#issuecomment-99098372 @MLnick yes that's what I did...I have to convince users why use factor vectors :-) For user-item recommendation, convincing is easy by showing the ranking improvement through ALS @srowen without coming up with a validation strategy, someone might propose to run a different algorithm (KMeans on raw feature space followed by (item-cluster) join (cluster-items)) and claims his item-item results are better...how do we know whether ALS based flow is producing better result or KMeans based flow ? NNALS can be thought of soft-kmeans as well and so these flows are very similar. I am focused on implicit feedback here because then only we can run either KMeans or Similarity on raw feature space...With explicit feedback, I agree that cosine similarity is not valid in original feature space. But in most practical datasets, we are dealing with implicit feedback. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
GitHub user debasish83 reopened a pull request: https://github.com/apache/spark/pull/5869 [SPARK-4231][MLLIB][Examples] MAP calculation added to examples.MovieLensALS MAP calculation driver to MovieLensALS was not part of SPARK-3066 merge. Added the driver in this PR. @mengxr the results changed compared to my old runs. Any idea if some internal ALS tuning has changed (I remember per user regularization change for implicit feedback but that should not change explicit results) ? MAP calculation: ./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g ./examples/target/spark-examples_2.10-1.4.0-SNAPSHOT.jar --lambda 0.065 --metrics map ~/datasets/ml-1m/ratings.dat Got 1000209 ratings from 6040 users on 3706 movies. Training: 800163, test: 200046. Test users 6035 MAP 0.019697998843987024 RMSE calculation: ./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g ./examples/target/spark-examples_2.10-1.4.0-SNAPSHOT.jar --lambda 0.065 --metrics rmse ~/datasets/ml-1m/ratings.dat Got 1000209 ratings from 6040 users on 3706 movies. Training: 800116, test: 200093. 
Test RMSE = 0.8558133665979457 You can merge this pull request into a Git repository by running: $ git pull https://github.com/debasish83/spark irmetrics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5869.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5869 commit 9b3951f558e5673eb475c575f14876421b5a3abc Author: Debasish Das debasish@one.verizon.com Date: 2014-11-05T01:23:09Z validate user/product on MovieLens dataset through user input and compute map measure along with rmse commit cd3ab31cb9b244bae2b45396a6269ed1dc59151b Author: Debasish Das debasish@one.verizon.com Date: 2014-11-05T22:43:11Z merged with AbstractParams serialization bug commit 4bbae0f248ca8747b47ecf852d5aba19c9b39dab Author: Debasish Das debasish@one.verizon.com Date: 2014-11-05T23:23:02Z comments fixed as per scalastyle commit 9fa063e1eb172d68248e03797a54acc738543592 Author: Debasish Das debasish@one.verizon.com Date: 2014-11-06T00:05:24Z import scala.math.round commit 10cbb37a7881867d801ae6630ffc0d09b3feebf9 Author: Debasish Das debasish@one.verizon.com Date: 2014-11-08T06:31:40Z provide ratio for topN product validation; generate MAP and prec@k metric for movielens dataset commit f38a1b59e27907f2aa9bd732c5f9147b738d3a0f Author: Debasish Das debasish@one.verizon.com Date: 2014-11-08T06:45:13Z use sampleByKey for per user sampling commit d144f57a58c9424365f1242f90961386c016641e Author: Debasish Das debasish@one.verizon.com Date: 2014-11-12T04:56:46Z recommendAll API to MatrixFactorizationModel, uses topK finding using BoundedPriorityQueue similar to RDD.top commit 7163a5c21b394d8bd89694a9f08aa1b446c71956 Author: Debasish Das debasish@one.verizon.com Date: 2014-11-19T21:58:45Z Added API for batch user and product recommendation; MAP calculation for product recommendation per user using randomized split commit 
3f97c499004aa58dfa1b51b8d2cbd6e5776f5fb1 Author: Debasish Das debasish@one.verizon.com Date: 2014-11-19T23:38:45Z fixed spark coding style for imports commit ee9957144bc2d145c91fc4a4b894ccd2ee6bc2b9 Author: Debasish Das debasish@one.verizon.com Date: 2015-04-01T01:52:27Z addressed initial review comments;merged with master;added tests for batch predict APIs in matrix factorization commit 98fa4243dc6041290bdde51e1e899a8be7576470 Author: Debasish Das debasish@one.verizon.com Date: 2015-04-01T01:59:57Z updated with master commit 3a0c4eb7f81ee0845f4945d395f6652c965f941b Author: Debasish Das debasish@one.verizon.com Date: 2015-04-01T04:31:01Z updated with spark master commit 3640409ac2dd2ea7ab5e67a520726f2387d137e3 Author: Debasish Das debasish@one.verizon.com Date: 2015-05-02T23:17:45Z MAP calculation driver added to MovieLensALS
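The MAP numbers reported above are mean average precision over per-user ranked recommendations against held-out test items. A minimal local sketch of the metric, assuming items are integer ids (this is not the Spark driver itself; the function names are hypothetical):

```scala
// Average precision for one user: precision at each hit position,
// averaged over min(|relevant|, |ranked|).
def averagePrecision(ranked: Seq[Int], relevant: Set[Int]): Double = {
  if (relevant.isEmpty) return 0.0
  var hits = 0
  var sumPrec = 0.0
  ranked.zipWithIndex.foreach { case (item, i) =>
    if (relevant.contains(item)) {
      hits += 1
      sumPrec += hits.toDouble / (i + 1) // precision at rank i+1
    }
  }
  sumPrec / math.min(relevant.size, ranked.size)
}

// MAP: mean of the per-user average precisions.
def meanAveragePrecision(perUser: Seq[(Seq[Int], Set[Int])]): Double =
  perUser.map { case (ranked, rel) => averagePrecision(ranked, rel) }.sum / perUser.size
```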
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 closed the pull request at: https://github.com/apache/spark/pull/5869
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5869#issuecomment-98827606 @mengxr if you could please point to the ML pipeline module where I should add it, I can do the change...
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5869#issuecomment-98504419 Implicit lambda should not affect the explicit results. I will take a closer look into recommendForAll and compare with my old version...
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5869#issuecomment-98491289 @srowen ideally we should move both the utilities to compute rmse and MAP on a MatrixFactorizationModel to a common place from examples since they are the APIs that a user can directly call during model cross validation..maybe it can be moved into the ml pipeline ?
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5869#issuecomment-98502996 Stats from my old run: ./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g ./examples/target/spark-examples_2.10-1.4.0-SNAPSHOT.jar --lambda 0.065 --metrics map ~/datasets/ml-1m/ratings.dat rank = default Got 1000209 ratings from 6040 users on 3706 movies. Training: 800187, test: 200022. Test users 6035 MAP 0.03499984595868497
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5869#issuecomment-98504679 RMSE is similar in my old runs..so the ALS core did not change...the MAP driver code is also same since I just migrated it from my PR. TUSCA09LMLVT00C:spark-irmetrics v606014$ ./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g ./examples/target/spark-examples_2.10-1.4.0-SNAPSHOT.jar --lambda 0.065 --metrics rmse ~/datasets/ml-1m/ratings.dat 2015-05-03 09:58:04.904 java[33124:1903] Unable to load realm mapping info from SCDynamicStore 15/05/03 09:58:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Got 1000209 ratings from 6040 users on 3706 movies. Training: 800952, test: 199257. Test RMSE = 0.8558204583570717 I will compare the recommendForAll output from my branch and the merged code.
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
GitHub user debasish83 opened a pull request: https://github.com/apache/spark/pull/5869 [SPARK-4231][MLLIB][Examples] MAP calculation added to examples.MovieLensALS
[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3536#issuecomment-98425139 @MLnick @srowen I did an experiment where I computed brute force topK similar items using cosine distance and compared the intersection with item factor based brute force topK similar items using cosine distance after running implicit factorization...the intersection is only 42%...this is in line with the Google Correlate paper, where they have to do an additional reorder step in real feature space to increase the recall (intersection)...did you guys also see similar results for item-item validation ?
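The 42% figure above is the overlap between two top-K neighbor lists for the same item: one computed on raw features, one on ALS factors. That validation step is trivial to state; a sketch with an illustrative (hypothetical) function name:

```scala
// Fraction of the raw-feature top-K list that also appears in the
// factor-space top-K list (recall of one list against the other).
def topKOverlap(rawTopK: Seq[Int], factorTopK: Seq[Int]): Double = {
  require(rawTopK.nonEmpty, "need a non-empty reference list")
  rawTopK.toSet.intersect(factorTopK.toSet).size.toDouble / rawTopK.size
}
```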
[GitHub] spark pull request: [SPARK-3066][MLLIB] Support recommendAll in ma...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5829#issuecomment-98058780 @mengxr looks good to me...I will fix SPARK-4321 based on this merge...I need blockify for rowSimilarities (tall skinny sparse matrices for row similarities)...should we extract it out to IndexedRow ? I can do that cleanup in my row similarities PR...
[GitHub] spark pull request: [MLLib]SPARK-5027:add SVMWithLBFGS interface i...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3890#issuecomment-98188550 @dlwh we should simply use your smooth max and make max(0, 1 - ya'x) differentiable for the first version...that needs no change to breeze...and then if needed we use the paper...don't you have log sum exp f and grad already implemented in breeze that can be used ? I can help with soft-max alpha tuning if @loachli can put together the formulation in mllib...
[GitHub] spark pull request: [MLLib]SPARK-5027:add SVMWithLBFGS interface i...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3890#issuecomment-98190811 I mean for svm the formulation is over all rows right...the smooth max will be done on every row and label...max(0, 1 - y_i a_i*x)...so only change will be a diff function that calculates the logsumexp and gradient of logsumexp from each data row and we aggregate it on the master and solve using BFGS...as long as the alpha of logsumexp has been tuned (smooth at first, as we go down, tighten it) BFGS will converge to a good solution...
[GitHub] spark pull request: [MLLib]SPARK-5027:add SVMWithLBFGS interface i...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3890#issuecomment-98189044 nope...logistic is feature space...svm is data space...the gradient calculation / BFGS CostFun will change
[GitHub] spark pull request: [MLLib]SPARK-5027:add SVMWithLBFGS interface i...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3890#issuecomment-98073720 this is linear svm strictly in primal form...there are ways to fix it through going to dual space but that needs a linear / nonlinear kernel generation which might be an overkill
[GitHub] spark pull request: [MLLib]SPARK-5027:add SVMWithLBFGS interface i...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3890#issuecomment-98073658 @loachli hinge loss in linear svm is max(0, 1 - y*a'x) right ? Just replace max with a smooth max and you should be able to smooth hinge gradient and then it can be directly aggregated on master and solved by BFGS...smooth max has an alpha that you can tune over iteration...start with a large lambda (smooth) and tighten it as you go down...breeze already has smooth max and grad implemented I think...
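The smooth-max idea in this thread: replace max(0, z), with z = 1 - y*a'x, by the log-sum-exp softening (1/alpha) * log(exp(0) + exp(alpha*z)) = (1/alpha) * log(1 + exp(alpha*z)), which is differentiable everywhere and tightens toward the true hinge as alpha grows. A minimal sketch of the loss and its gradient w.r.t. the margin (an illustration of the technique, not breeze's API):

```scala
// Smooth hinge: (1/alpha) * log(1 + exp(alpha * z)), computed in a
// numerically stable form (avoids overflow for large alpha * z).
def smoothHinge(z: Double, alpha: Double): Double = {
  val t = alpha * z
  (math.max(t, 0.0) + math.log1p(math.exp(-math.abs(t)))) / alpha
}

// Derivative of smoothHinge w.r.t. z: the logistic sigmoid of alpha * z.
// This is the per-row term a BFGS DiffFunction would aggregate.
def smoothHingeGrad(z: Double, alpha: Double): Double =
  1.0 / (1.0 + math.exp(-alpha * z))
```

As alpha increases, smoothHinge(z, alpha) approaches max(0, z) and the gradient approaches the hinge subgradient (0 for z < 0, 1 for z > 0), which is why the tuning schedule starts smooth and tightens over iterations.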
[GitHub] spark pull request: [SPARK-3066][MLLIB] Support recommendAll in ma...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/5829#discussion_r29494261 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -137,20 +141,113 @@ class MatrixFactorizationModel( MatrixFactorizationModel.SaveLoadV1_0.save(this, path) } + /** + * Recommends topK products for all users. + * + * @param num how many products to return for every user. + * @return [(Int, Array[Rating])] objects, where every tuple contains a userID and an array of + * rating objects which contains the same userId, recommended productID and a score in the + * rating field. Semantics of score is same as recommendProducts API + */ + def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = { +MatrixFactorizationModel.recommendForAll(rank, userFeatures, productFeatures, num).map { + case (user, top) => +val ratings = top.map { case (product, rating) => Rating(user, product, rating) } +(user, ratings) +} + } + + + /** + * Recommends topK users for all products. + * + * @param num how many users to return for every product. + * @return [(Int, Array[Rating])] objects, where every tuple contains a productID and an array + * of rating objects which contains the recommended userId, same productID and a score in the + * rating field. Semantics of score is same as recommendUsers API + */ + def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = { +MatrixFactorizationModel.recommendForAll(rank, productFeatures, userFeatures, num).map { + case (product, top) => +val ratings = top.map { case (user, rating) => Rating(user, product, rating) } +(product, ratings) +} + } +} +object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] { + + import org.apache.spark.mllib.util.Loader._ + + /** + * Makes recommendations for a single user (or product). 
+ */ private def recommend( recommendToFeatures: Array[Double], recommendableFeatures: RDD[(Int, Array[Double])], num: Int): Array[(Int, Double)] = { -val scored = recommendableFeatures.map { case (id,features) => +val scored = recommendableFeatures.map { case (id, features) => (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1)) } scored.top(num)(Ordering.by(_._2)) } -} -object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] { + /** + * Makes recommendations for all users (or products). + * @param rank rank + * @param srcFeatures src features to receive recommendations + * @param dstFeatures dst features used to make recommendations + * @param num number of recommendations for each record + * @return an RDD of (srcId: Int, recommendations), where recommendations are stored as an array + * of (dstId, rating) pairs. + */ + private def recommendForAll( + rank: Int, + srcFeatures: RDD[(Int, Array[Double])], + dstFeatures: RDD[(Int, Array[Double])], + num: Int): RDD[(Int, Array[(Int, Double)])] = { +val srcBlocks = blockify(rank, srcFeatures) +val dstBlocks = blockify(rank, dstFeatures) +val ratings = srcBlocks.cartesian(dstBlocks).flatMap { --- End diff -- I also like it better as it should scale fine assuming cartesian keys are under control...say to 100M x 10M with 400 factors
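The blockify-plus-cartesian pattern in the diff above can be illustrated locally: group factor rows into blocks, score every (src block, dst block) pair, and keep the top `num` destinations per source. A sketch with plain Scala collections standing in for RDDs (the function name and block size are illustrative; Spark does the per-block scoring with BLAS over the blocked factors):

```scala
// Local sketch of recommendForAll: block the src and dst factor rows,
// take the "cartesian" of blocks, score each (srcId, dstId) pair with a
// dot product, and keep the top `num` scores per src id.
def recommendForAllLocal(
    src: Seq[(Int, Array[Double])],
    dst: Seq[(Int, Array[Double])],
    num: Int,
    blockSize: Int = 2): Map[Int, Array[(Int, Double)]] = {
  val srcBlocks = src.grouped(blockSize).toSeq
  val dstBlocks = dst.grouped(blockSize).toSeq
  val scored = for {
    sb <- srcBlocks; db <- dstBlocks          // cartesian over blocks
    (sid, sf) <- sb; (did, df) <- db          // score every pair in a block
  } yield (sid, (did, sf.zip(df).map { case (a, b) => a * b }.sum))
  scored.groupBy(_._1).map { case (sid, rows) =>
    sid -> rows.map(_._2).sortBy(-_._2).take(num).toArray
  }
}
```

Blocking keeps the number of cartesian keys (block pairs) small relative to the number of row pairs, which is the scaling concern raised in the comment.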
[GitHub] spark pull request: [SPARK-3066][MLLIB] Support recommendAll in ma...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/5829#discussion_r29492705

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala ---
@@ -39,7 +39,7 @@ class MLPairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) extends Serializable
    * @return an RDD that contains the top k values for each key
    */
   def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
-    self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
+    self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord.reverse))(
--- End diff --

I have to look closely into it tomorrow...I have been using topByKey internally and did not remember seeing this bug...
[GitHub] spark pull request: [SPARK-3066][MLLIB] Support recommendAll in ma...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/5829#discussion_r29492840

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala ---
@@ -39,7 +39,7 @@ class MLPairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) extends Serializable
    * @return an RDD that contains the top k values for each key
    */
   def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
-    self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
+    self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord.reverse))(
--- End diff --

Yup, topByKey behavior as implemented was correct...
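For reference, the semantics topByKey must provide -- keep the `num` largest values per key -- can be sketched with a bounded min-heap. `heapq` evicts the smallest element first, which is exactly the role the ordering handed to Scala's `BoundedPriorityQueue` plays in the diff above. This is an illustrative sketch, not Spark code.

```python
import heapq
from collections import defaultdict

def top_by_key(pairs, num):
    """Keep the num LARGEST values per key, returned in decreasing order."""
    heaps = defaultdict(list)
    for k, v in pairs:
        h = heaps[k]
        if len(h) < num:
            heapq.heappush(h, v)
        elif v > h[0]:              # only admit values beating the current minimum
            heapq.heapreplace(h, v)
    return {k: sorted(h, reverse=True) for k, h in heaps.items()}

ratings = [("u1", 0.2), ("u1", 0.9), ("u1", 0.5), ("u2", 0.1), ("u2", 0.7)]
print(top_by_key(ratings, 2))  # {'u1': [0.9, 0.5], 'u2': [0.7, 0.1]}
```

If a max-heap ordering were used for the bound instead, the queue would evict the largest element and end up keeping the bottom-k -- which is the confusion the review comments are settling.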
[GitHub] spark pull request: [MLLIB] SPARK-4231: Add RankingMetrics to exam...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-98073840

Changed the title to add driver for recommendAll API once SPARK-3066 merges to master...
[GitHub] spark pull request: [SPARK-3066][MLLIB] Support recommendAll in ma...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/5829#discussion_r29493623

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -137,20 +141,113 @@ class MatrixFactorizationModel(
     MatrixFactorizationModel.SaveLoadV1_0.save(this, path)
   }

+  /**
+   * Recommends topK products for all users.
+   *
+   * @param num how many products to return for every user.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a userID and an array of
+   *         rating objects which contains the same userId, recommended productID and a score in
+   *         the rating field. Semantics of score is same as the recommendProducts API.
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    MatrixFactorizationModel.recommendForAll(rank, userFeatures, productFeatures, num).map {
+      case (user, top) =>
+        val ratings = top.map { case (product, rating) => Rating(user, product, rating) }
+        (user, ratings)
+    }
+  }
+
+  /**
+   * Recommends topK users for all products.
+   *
+   * @param num how many users to return for every product.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a productID and an array
+   *         of rating objects which contains the recommended userId, same productID and a score
+   *         in the rating field. Semantics of score is same as the recommendUsers API.
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+    MatrixFactorizationModel.recommendForAll(rank, productFeatures, userFeatures, num).map {
+      case (product, top) =>
+        val ratings = top.map { case (user, rating) => Rating(user, product, rating) }
+        (product, ratings)
+    }
+  }
+}
+
+object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] {
+
+  import org.apache.spark.mllib.util.Loader._
+
+  /**
+   * Makes recommendations for a single user (or product).
+   */
   private def recommend(
       recommendToFeatures: Array[Double],
       recommendableFeatures: RDD[(Int, Array[Double])],
       num: Int): Array[(Int, Double)] = {
-    val scored = recommendableFeatures.map { case (id,features) =>
+    val scored = recommendableFeatures.map { case (id, features) =>
      (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
    }
    scored.top(num)(Ordering.by(_._2))
  }
-}

-object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] {
+  /**
+   * Makes recommendations for all users (or products).
+   * @param rank rank
+   * @param srcFeatures src features to receive recommendations
+   * @param dstFeatures dst features used to make recommendations
+   * @param num number of recommendations for each record
+   * @return an RDD of (srcId: Int, recommendations), where recommendations are stored as an array
+   *         of (dstId, rating) pairs.
+   */
+  private def recommendForAll(
+      rank: Int,
+      srcFeatures: RDD[(Int, Array[Double])],
+      dstFeatures: RDD[(Int, Array[Double])],
+      num: Int): RDD[(Int, Array[(Int, Double)])] = {
+    val srcBlocks = blockify(rank, srcFeatures)
+    val dstBlocks = blockify(rank, dstFeatures)
+    val ratings = srcBlocks.cartesian(dstBlocks).flatMap {
--- End diff --

Normally items are skinny ~ 1M...and ranks are low...50...so 1Mx50 bytes ~ 50 MB...with 8M products, it's 400 MB...I still think that cartesian will be slower than the version I added in terms of runtime. Did you run any benchmark with the old code?
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-98021307

@mengxr please go ahead...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-96403986

Was very busy the last few weeks...will update it in the next few days...
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-91869124

ohh sorry I don't know about requester pays...let me look into it
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-91710700

@jkbradley we still could not access the wikipedia dataset on ec2...will it be possible for you to upload the 1 Billion token dataset on EC2? I wanted to do a sparse coding scalability run on the large dataset as well...
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-91710827

@jkbradley let me know if you need vzcloud access and I can create a few nodes for you...ec2 might be easier for others to access as well...
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-90950074

If you look into breeze.optimize.proximal.Proximal, I added a library of projection/proximal operators...in my experiments it looks like projection-based algorithms (SPG, for example) do not work that well for L1 and sparsity constraints, but work well for positivity and bounds, for example...I am thinking of extending breeze's linear CG / NNLS to handle simple projections and hopefully consolidating both into one linear CG with projection...

I support these constraints through a Cholesky/LDL-based ADMM solver, but I wanted to write an iterative version using linear CG to see if ADMM performance can be improved...for well-conditioned QPs, papers have found ADMM faster than FISTA, but I did not see comparisons with a linear CG variant...
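For readers unfamiliar with the operators being discussed, here is a small Python sketch of two of them: projection onto a box (covering the positivity and bound constraints that projection methods handle well) and the L1 proximal operator (soft-thresholding, which induces sparsity). Projected/proximal gradient methods apply one of these after each gradient step. This mirrors the idea behind breeze.optimize.proximal.Proximal but is not its API.

```python
def project_box(x, lo=0.0, hi=float("inf")):
    """Euclidean projection onto [lo, hi]^n -- handles positivity and bound constraints."""
    return [min(max(xi, lo), hi) for xi in x]

def prox_l1(x, lam):
    """Soft-thresholding: the proximal operator of lam * ||x||_1."""
    return [max(abs(xi) - lam, 0.0) * (1.0 if xi >= 0 else -1.0) for xi in x]

print(project_box([-0.5, 0.3, 2.0], lo=0.0, hi=1.0))  # [0.0, 0.3, 1.0]
print(prox_l1([-1.5, 0.25, 2.0], lam=0.5))            # [-1.0, 0.0, 1.5]
```

The distinction in the comment falls out of these: the box projection composes cleanly with a CG or SPG iteration, while L1 (soft-thresholding) shrinks rather than clips, which is why sparsity is usually handled by proximal methods or ADMM instead of pure projection.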
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-90942562

@tmyklebu do you have the original NNLS paper in English? Breeze also has a linear CG...I am wondering if it is possible to merge simple projections, like positivity and bounds, with the linear CG...CG-based linear solves can be extended to handle projection, similar to SPG...but NNLS looks like it does some specific optimization for x >= 0...can NNLS be extended to other projection/proximal operators?
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-90950364

The application is topic modeling, using sparsity constraints like L1 and the probability simplex, and supporting bounds in ALS.
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-90753041

@mengxr @josephk In my internal testing, I am finding the sparse formulations useful for extracting genre/topic information out of the netflix/movielens datasets. The formulations are:

1. Sparse coding: L2 on users/words, L1 on documents/movies
2. L2 on users/words, probability simplex on documents/movies

The reference, Sparse Latent Semantic Analysis (SDM 2011, some of it is implemented in GraphLab): https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf, showed sparse coding producing better results than LDA...I am considering whether it makes sense to add a 20 newsgroups flow to the examples, as was shown in the paper? Also, do we have perplexity implemented so that we can start comparing topic models? The ALS runtimes with the sparse formulations are also pretty good.
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-90753271

Sure...let me do that and point you to the repo...most likely it will be a breeze-based branch and I will copy the mllib implementation over there...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89729377

I meant MAP...what's the MAP on the netflix dataset you have seen before, and with what lambda? I am running MAP experiments with various factorization formulations, including log-likelihood loss with normalization constraints...also, how do you define MAP for implicit feedback (binary dataset, click is 1 and no click is 0)? In the label set every rating is 1.0, so there is no ranking defined as such...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89729777

Agreed with the implicit MAP calculation. For the netflix dataset, I got 0.014...maybe I need to use better regularization...was that 0.05-0.1 number from using lambda = 0.065?
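Since MAP comes up repeatedly in this thread, a small Python sketch of average precision for top-N recommendation may help. The normalizer used here, min(|relevant|, N), is one common convention; implementations (including Spark's RankingMetrics) may normalize differently, so absolute numbers are not directly comparable across tools. For implicit feedback the relevant set is just the clicked items, which is why MAP stays well defined even though every label is 1.0.

```python
def average_precision(predicted, relevant):
    """Average the precision at each rank where a relevant item appears."""
    if not relevant or not predicted:
        return 0.0
    hits, score = 0, 0.0
    for i, p in enumerate(predicted):
        if p in relevant:
            hits += 1
            score += hits / (i + 1.0)   # precision at rank i+1
    return score / min(len(relevant), len(predicted))

def mean_average_precision(per_user):
    """per_user: iterable of (predicted ranking, relevant set) pairs."""
    aps = [average_precision(pred, rel) for pred, rel in per_user]
    return sum(aps) / len(aps)

# user 1: 2 of 3 relevant items retrieved, at ranks 1 and 3
ap1 = average_precision([10, 99, 30], {10, 30, 40})
# user 2: perfect top-1
ap2 = average_precision([7], {7})
```

For an implicit (click/no-click) dataset, `relevant` is the held-out click set per user, so a rank-dependent metric like MAP still rewards putting clicked items early, even though the labels carry no ordering among themselves.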
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27769592

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -167,23 +169,66 @@ object MovieLensALS {
       .setProductBlocks(params.numProductBlocks)
       .run(training)

-    val rmse = computeRmse(model, test, params.implicitPrefs)
-
-    println(s"Test RMSE = $rmse.")
+    params.metrics match {
+      case "rmse" =>
+        val rmse = computeRmse(model, test, params.implicitPrefs)
+        println(s"Test RMSE = $rmse")
+      case "map" =>
+        val (map, users) = computeRankingMetrics(model, training, test, numMovies.toInt)
+        println(s"Test users $users MAP $map")
+      case _ => println(s"Metrics not defined, options are rmse/map")
+    }

     sc.stop()
   }

   /** Compute RMSE (Root Mean Squared Error). */
-  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean)
-    : Double = {
-
-    def mapPredictedRating(r: Double) = if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
-
+  def computeRmse(
+      model: MatrixFactorizationModel,
+      data: RDD[Rating],
+      implicitPrefs: Boolean): Double = {
     val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
-    val predictionsAndRatings = predictions.map{ x =>
-      ((x.user, x.product), mapPredictedRating(x.rating))
+    val predictionsAndRatings = predictions.map { x =>
+      ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
     }.join(data.map(x => ((x.user, x.product), x.rating))).values
     math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+    if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
+  }
+
+  /** Compute MAP (Mean Average Precision) statistics for top N product Recommendation */
+  def computeRankingMetrics(
+      model: MatrixFactorizationModel,
+      train: RDD[Rating],
+      test: RDD[Rating],
+      n: Int): (Double, Long) = {
+    val ord = Ordering.by[(Int, Double), Double](x => x._2)
+
+    val testUserLabels = test.map {
--- End diff --

I will update with topByKey. Is there a better place to move this function? Maybe inside the ALS object, for example? That way I can add a test case to guard it.
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-89594722

@mengxr any insight on it? The runtime issue is only in the first iteration, and I think you can point out if there is any obvious issue in the way I call the solver...looks like something to do with initialization...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89697236

@srowen For the netflix dataset, what's the MAP you have seen before? I started experiments on the netflix dataset...lambda is 0.065 for netflix as well, right? For MovieLens, 0.065 works well...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89706247

@coderxiang @mengxr If I have a dataset with implicit feedback (click or 0), then MAP is not that well defined, right, since in the label set everything is 1.0 and so there is no ordering defined...should we add a rank-independent metric for implicit datasets?
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27712646

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -138,14 +141,122 @@ class MatrixFactorizationModel(
   }

   private def recommend(
-      recommendToFeatures: Array[Double],
-      recommendableFeatures: RDD[(Int, Array[Double])],
-      num: Int): Array[(Int, Double)] = {
-    val scored = recommendableFeatures.map { case (id,features) =>
-      (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
+      recommendToFeatures: Array[Double],
+      recommendableFeatures: RDD[(Int, Array[Double])],
+      num: Int): Array[(Int, Double)] = {
+    val recommendToVector = Vectors.dense(recommendToFeatures)
+    val scored = recommendableFeatures.map {
+      case (id, features) =>
+        (id, BLAS.dot(recommendToVector, Vectors.dense(features)))
     }
     scored.top(num)(Ordering.by(_._2))
   }
+
+  /**
+   * Recommends topK products for all users
+   *
+   * @param num how many products to return for every user.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a userID and an array of
+   *         rating objects which contains the same userId, recommended productID and a score in
+   *         the rating field. Semantics of score is same as recommendProducts API
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = userFeatures.map { x => (x._1, num) }
--- End diff --

For cross validation we use a variable num internally, but for the final recommendation a global num is fine...I thought having a topK RDD satisfies both use cases...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88346990

I reran the MAP computation on MovieLens with varying ranks. Example run:

./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g ./examples/target/spark-examples_2.10-1.4.0-SNAPSHOT.jar --lambda 0.065 --metrics map ~/datasets/ml-1m/ratings.dat

rank = default
Got 1000209 ratings from 6040 users on 3706 movies. Training: 800187, test: 200022.
Test users 6035 MAP 0.03499984595868497

rank = 25
Got 1000209 ratings from 6040 users on 3706 movies. Training: 799385, test: 200824.
Test users 6034 MAP 0.042580954047373255

rank = 50
Got 1000209 ratings from 6040 users on 3706 movies. Training: 800289, test: 199920.
Test users 6036 MAP 0.048958415806933275

rank = 100
Got 1000209 ratings from 6040 users on 3706 movies. Training: 801148, test: 199061.
Test users 6038 MAP 0.05503487765882986

The numbers are consistent with my earlier runs.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88347022

@mengxr could you please do another pass? I might have missed the JavaRDD compatibility issue, but fixed the rest of your comments...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27533769

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
     recommend(productFeatures.lookup(product).head, userFeatures, num)
       .map(t => Rating(t._1, product, t._2))

+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the given productID and a
+   *         score in the rating field. Each represents one recommended user, and they are sorted
+   *         by score, decreasing. The first returned is the one predicted to be most strongly
+   *         recommended to the product. The score is an opaque value that indicates how strongly
+   *         recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = userFeatures.map { x => (x._1, num) }
+    recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = productFeatures.map { x => (x._1, num) }
+    recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x => x.rating)
--- End diff --

I am a bit confused...recommendProducts is also a public member, but that's not in the companion object...recommendProductsForUsers is a very similar API, right?
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27525485

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -74,6 +75,9 @@ object MovieLensALS {
       opt[Unit]("implicitPrefs")
         .text("use implicit preference")
         .action((_, c) => c.copy(implicitPrefs = true))
+      opt[Unit]("validateRecommendation")
--- End diff --

Cleaned up --validateRecommendation to --metrics.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528071

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {

     println(s"Test RMSE = $rmse.")

+    if (params.validateRecommendation) {
+      val (map, users) = computeRankingMetrics(model,
+        training, test, numMovies.toInt)
+      println(s"Test users $users MAP $map")
+    }
+
     sc.stop()
   }

   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean) = {
-
-    def mapPredictedRating(r: Double) = if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
-
     val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
-    val predictionsAndRatings = predictions.map{ x =>
-      ((x.user, x.product), mapPredictedRating(x.rating))
+    val predictionsAndRatings = predictions.map { x =>
+      ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
     }.join(data.map(x => ((x.user, x.product), x.rating))).values
     math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+    if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+    else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
--- End diff --

Followed the indentation from the current code.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528120

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+      train: RDD[Rating], test: RDD[Rating], n: Int) = {
--- End diff --

added
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528991

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+      train: RDD[Rating], test: RDD[Rating], n: Int) = {
+
+    val ord = Ordering.by[(Int, Double), Double](x => x._2)
+
+    val testUserLabels = test.map {
+      x => (x.user, (x.product, x.rating))
+    }.groupByKey.map {
+      case (userId, products) =>
+        val sortedProducts = products.toArray.sorted(ord.reverse)
+        (userId, sortedProducts.map { _._1 })
+    }
+
+    val trainUserLabels = train.map {
+      x => (x.user, x.product)
+    }.groupByKey.map {
+      case (userId, products) => (userId, products.toArray)
+    }
+
+    val rankings = model.recommendProductsForUsers(n).join(trainUserLabels).map {
+      case (userId, (pred, train)) => {
+        val predictedProducts = pred.map { _.product }
--- End diff --

done
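The computeRankingMetrics hunk above builds each user's ranked predicted products and their relevant test products so MAP can be computed. A self-contained sketch of the mean-average-precision arithmetic itself, with plain collections and illustrative names (the production path would use MLlib's RankingMetrics over RDDs):

```scala
// Sketch of the MAP statistic that computeRankingMetrics reports,
// over in-memory (predicted ranking, relevant set) pairs per user.
object MapSketch {
  // Average precision of one user's ranked predictions against the relevant set.
  def averagePrecision(predicted: Seq[Int], relevant: Set[Int]): Double = {
    if (relevant.isEmpty) return 0.0
    var hits = 0
    var sum = 0.0
    predicted.zipWithIndex.foreach { case (p, i) =>
      if (relevant.contains(p)) {
        hits += 1
        sum += hits.toDouble / (i + 1) // precision at this cut-off
      }
    }
    sum / math.min(predicted.size, relevant.size)
  }

  // MAP: mean of the per-user average precisions.
  def meanAveragePrecision(users: Seq[(Seq[Int], Set[Int])]): Double =
    users.map { case (pred, rel) => averagePrecision(pred, rel) }.sum / users.size
}
```

For a user with ranking (1, 2, 3) and relevant set {1, 3}, the hits at ranks 1 and 3 give (1/1 + 2/3) / 2 = 5/6.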
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528959

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+    val trainUserLabels = train.map {
+      x => (x.user, x.product)
+    }.groupByKey.map {
+      case (userId, products) => (userId, products.toArray)
--- End diff --

merged
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27525568

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+    if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
--- End diff --

fixed...can be fit in one line
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528198

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+    val testUserLabels = test.map {
+      x => (x.user, (x.product, x.rating))
+    }.groupByKey.map {
--- End diff --

fixed
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528238

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+    }.groupByKey.map {
+      case (userId, products) =>
+        val sortedProducts = products.toArray.sorted(ord.reverse)
+        (userId, sortedProducts.map { _._1 })
--- End diff --

fixed
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529347

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -35,33 +41,33 @@ import org.apache.spark.rdd.RDD
 *        and the features computed for this product.
 */
class MatrixFactorizationModel private[mllib] (
-    val rank: Int,
-    val userFeatures: RDD[(Int, Array[Double])],
-    val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {
+  val rank: Int,
+  val userFeatures: RDD[(Int, Array[Double])],
+  val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {

   /** Predict the rating of one user for one product. */
   def predict(user: Int, product: Int): Double = {
-    val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
-    val productVector = new DoubleMatrix(productFeatures.lookup(product).head)
-    userVector.dot(productVector)
+    val userVector = Vectors.dense(userFeatures.lookup(user).head)
--- End diff --

I cleaned netlib.ddot to BLAS.dot...they will be the same for these cases
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529308

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
class MatrixFactorizationModel private[mllib] (
-    val rank: Int,
+  val rank: Int,
--- End diff --

after merge this is fixed
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529218

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -17,14 +17,20 @@
package org.apache.spark.mllib.recommendation

-import java.lang.{Integer => JavaInteger}
-
-import org.jblas.DoubleMatrix
+import java.lang.{ Integer => JavaInteger }

import org.apache.spark.SparkContext._
-import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
+import org.apache.spark.api.java.{ JavaPairRDD, JavaRDD }
import org.apache.spark.rdd.RDD
+import org.apache.spark.util.collection.Utils
+import org.apache.spark.util.BoundedPriorityQueue
+
+import scala.Ordering
--- End diff --

By organizing imports, do you mean that imports from the same package will be merged onto one line? Right?

Old:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.BLAS

New:
import org.apache.spark.mllib.linalg.{Vectors, Vector, BLAS}
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529231

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
-import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
+import org.apache.spark.api.java.{ JavaPairRDD, JavaRDD }
--- End diff --

cleaned
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529681

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
     recommend(productFeatures.lookup(product).head, userFeatures, num)
       .map(t => Rating(t._1, product, t._2))

+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the given productID and a
+   *         score in the rating field. Each represents one recommended user, and they are sorted
+   *         by score, decreasing. The first returned is the one predicted to be most strongly
+   *         recommended to the product. The score is an opaque value that indicates how strongly
+   *         recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = userFeatures.map { x => (x._1, num) }
+    recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = productFeatures.map { x => (x._1, num) }
+    recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x => x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommend topK products for users in userTopK RDD
+   */
+  def recommendProductsForUsers(
+      userTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+    val userFeaturesTopK = userFeatures.join(userTopK).map {
+      case (userId, (userFeature, topK)) =>
+        (userId, FeatureTopK(Vectors.dense(userFeature), topK))
+    }
+    val productVectors = productFeatures.map {
+      x => (x._1, Vectors.dense(x._2))
+    }.collect

+    userFeaturesTopK.map {
+      case (userId, userFeatureTopK) => {
+        val predictions = productVectors.map {
+          case (productId, productVector) =>
+            Rating(userId, productId,
+              BLAS.dot(userFeatureTopK.feature, productVector))
--- End diff --

I will bring in a lot of level 3 BLAS in the next PR...I am writing the dgemv and dgemm versions for several of these APIs...For now I will add a TODO
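The batch recommendation path above scores every product factor vector against a user's factor vector with BLAS.dot and keeps the top K by score. A minimal sketch of that per-user scoring step, with plain arrays and a sort standing in for BLAS.dot and the priority-queue selection (names here are illustrative, not the PR's API):

```scala
// Sketch of the per-user top-K scoring inside recommendProductsForUsers:
// dot every product factor against the user factor, keep the k best.
object RecommendSketch {
  // Stand-in for BLAS.dot on dense factor vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Score all products for one user, return the k highest-scoring (id, score) pairs.
  def recommend(userFeature: Array[Double],
                products: Seq[(Int, Array[Double])],
                k: Int): Seq[(Int, Double)] =
    products.map { case (id, f) => (id, dot(userFeature, f)) }
      .sortBy(-_._2)
      .take(k)
}
```

The real implementation avoids the full sort per user (a bounded priority queue is O(n log k)), and stacking the product factors into a matrix turns the inner loop into a single dgemv, which is the level-3/level-2 BLAS upgrade the comment refers to.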
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27535273

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
+  /**
+   * Recommend topK products for users in userTopK RDD
--- End diff --

documented the public batch prediction APIs
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88291470

@mengxr I also added 2 test cases for the batch predict APIs. These features are useful if users are interested in computing MAP measures...Let me know if I should move the functions computeRankingMetrics and computeRMSE to the companion object of ml.recommendation.ALS? Currently both of them are in examples...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88292172

If we move computeRankingMetrics and computeRMSE to a better place, I can guard them through tests...
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-87342283

What are MiMa tests? I am a bit confused about them...
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-87276063

Updated the PR with breeze 0.11.2...Except the first iteration, the rest of them are at par:

Breeze NNLS:
TUSCA09LMLVT00C:spark-brznnls v606014$ grep solveTime ./work/app-20150328110507-0003/0/stderr
15/03/28 11:05:16 INFO ALS: solveTime 228.358 ms
15/03/28 11:05:16 INFO ALS: solveTime 80.773 ms
15/03/28 11:05:17 INFO ALS: solveTime 96.837 ms
15/03/28 11:05:17 INFO ALS: solveTime 92.252 ms
15/03/28 11:05:18 INFO ALS: solveTime 55.923 ms
15/03/28 11:05:18 INFO ALS: solveTime 53.503 ms
15/03/28 11:05:19 INFO ALS: solveTime 96.827 ms
15/03/28 11:05:20 INFO ALS: solveTime 99.835 ms
15/03/28 11:05:20 INFO ALS: solveTime 56.032 ms
15/03/28 11:05:21 INFO ALS: solveTime 55.832 ms

mllib NNLS:
TUSCA09LMLVT00C:spark-brznnls v606014$ grep solveTime ./work/app-20150328110532-0004/0/stderr
15/03/28 11:05:41 INFO ALS: solveTime 92.086 ms
15/03/28 11:05:41 INFO ALS: solveTime 59.103 ms
15/03/28 11:05:42 INFO ALS: solveTime 80.177 ms
15/03/28 11:05:42 INFO ALS: solveTime 78.755 ms
15/03/28 11:05:43 INFO ALS: solveTime 51.966 ms
15/03/28 11:05:43 INFO ALS: solveTime 46.426 ms
15/03/28 11:05:44 INFO ALS: solveTime 93.656 ms
15/03/28 11:05:44 INFO ALS: solveTime 84.458 ms
15/03/28 11:05:45 INFO ALS: solveTime 49.22 ms
15/03/28 11:05:45 INFO ALS: solveTime 45.626 ms

export solver=mllib runs the mllib NNLS...I will wait for the feedback...
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-86949884

@mengxr any updates on it? breeze 0.11.2 is now integrated with Spark
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-86950106

@mengxr any updates on it? breeze 0.11.2 is now integrated with Spark...I can clean up the PR for reviews
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-87165211

I integrated with Breeze 0.11.2. The only visible difference is the first iteration.

Breeze QuadraticMinimizer:
TUSCA09LMLVT00C:spark-qp-als v606014$ grep solveTime ./work/app-20150327221722-/0/stderr
15/03/27 22:17:32 INFO ALS: solveTime 234.153 ms
15/03/27 22:17:32 INFO ALS: solveTime 82.499 ms
15/03/27 22:17:33 INFO ALS: solveTime 83.579 ms
15/03/27 22:17:33 INFO ALS: solveTime 83.039 ms
15/03/27 22:17:34 INFO ALS: solveTime 35.545 ms
15/03/27 22:17:34 INFO ALS: solveTime 30.707 ms
15/03/27 22:17:35 INFO ALS: solveTime 53.025 ms
15/03/27 22:17:36 INFO ALS: solveTime 53.021 ms
15/03/27 22:17:36 INFO ALS: solveTime 31.329 ms
15/03/27 22:17:37 INFO ALS: solveTime 32.136 ms

mllib CholeskySolver:
TUSCA09LMLVT00C:spark-qp-als v606014$ grep solveTime ./work/app-20150327221803-0001/0/stderr
15/03/27 22:18:11 INFO ALS: solveTime 98.692 ms
15/03/27 22:18:12 INFO ALS: solveTime 38.997 ms
15/03/27 22:18:12 INFO ALS: solveTime 62.361 ms
15/03/27 22:18:13 INFO ALS: solveTime 60.316 ms
15/03/27 22:18:13 INFO ALS: solveTime 36.569 ms
15/03/27 22:18:14 INFO ALS: solveTime 36.321 ms
15/03/27 22:18:14 INFO ALS: solveTime 60.007 ms
15/03/27 22:18:15 INFO ALS: solveTime 59.771 ms
15/03/27 22:18:15 INFO ALS: solveTime 36.519 ms
15/03/27 22:18:16 INFO ALS: solveTime 38.295 ms

The visible difference is in the first 2 iterations, as shown in previous experiments as well. I fixed the random seed test now and so different runs will not produce the same result. I need this structure to build ALM, as ALM extends mllib.ALS and adds LossType in the constructor along with userConstraint and itemConstraint... Right now I am experimenting with LeastSquare (for tests with ALS) and LogLikelihood loss...

For this PR I have updated MovieLensALS with userConstraint and itemConstraint, and I am considering whether we should add a Sparse Coding formulation in examples now or bring that in a separate PR. I have not cleaned up CholeskySolver from ALS yet and am waiting for feedback, but I have added test cases in ml.ALSSuite for all the constraints. At the ALS flow level I need to construct more test cases, and I can bring them in a separate PR as well...
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-85814758

@mengxr I discussed with David, and the only reason I can think of is that inside the solvers I am using DenseMatrix and DenseVector in place of primitive arrays for workspace creation; that might be causing the first-iteration runtime difference, due to loading up the interface classes and other features that come with DenseMatrix and DenseVector... I can move to primitive arrays for the workspace, but then the code will look ugly... Let me know if I should? I am surprised that this issue does not show up after the first call!
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-84827225

I looked more into it, and I will open up an API in the Breeze QuadraticMinimizer where, in place of a DenseMatrix gram, an upper triangular gram can be passed in. But the inner workspace has to be n x n, because for Cholesky we need to compute LL' and for a quasi-definite system we have to compute LDL' / LU, and both of them need n x n space... so I won't be able to decrease the QuadraticMinimizer workspace size... For dposv, BLAS allocates the memory for LL' and it is not visible to the user...
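[Editor's note] The point about the n x n workspace can be seen from the factorization itself: even if the input gram is supplied as an upper triangle, the factor L occupies a full n x n array. A hedged NumPy sketch of an unpivoted LDL' (valid for positive-definite matrices; real quasi-definite solvers like the one discussed add more machinery):

```python
import numpy as np

def ldl_no_pivot(A):
    """Unpivoted LDL' factorization: A = L @ diag(D) @ L.T.
    Works for symmetric positive-definite input; the factor L needs a
    full n x n workspace even if A is stored as an upper triangle."""
    n = A.shape[0]
    L = np.eye(n)
    D = np.zeros(n)
    for j in range(n):
        # diagonal entry: A[j,j] minus contributions of earlier columns
        D[j] = A[j, j] - (L[j, :j] ** 2) @ D[:j]
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - (L[i, :j] * L[j, :j]) @ D[:j]) / D[j]
    return L, D

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
gram = X.T @ X + 0.1 * np.eye(4)   # SPD Gram matrix, as in the ALS subproblems
L, D = ldl_no_pivot(gram)
```

For dposv the same LL' storage exists, it is just allocated inside LAPACK (in-place over the input triangle) rather than by the caller.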
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-85348266

All the runtime enhancements are being added to Breeze in this PR: https://github.com/scalanlp/breeze/pull/386 Please let me know if there is additional feedback.
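[Editor's note] The NNLS problem referenced in this PR is min 0.5 x'Hx + c'x subject to x >= 0. The Breeze solver is CG-based; as a simple stand-in, a projected-gradient sketch in NumPy (function name and step-size choice are mine):

```python
import numpy as np

def nnls_projected_gradient(H, c, iters=2000):
    """Minimize 0.5*x'Hx + c'x subject to x >= 0 by projected gradient.
    Step size 1/||H||_2 guarantees descent for a convex quadratic."""
    n = H.shape[0]
    x = np.zeros(n)
    step = 1.0 / np.linalg.norm(H, 2)
    for _ in range(iters):
        # gradient step followed by projection onto the nonnegative orthant
        x = np.maximum(0.0, x - step * (H @ x + c))
    return x

rng = np.random.default_rng(7)
A = rng.standard_normal((50, 4))
x_true = np.array([1.0, 0.0, 2.0, 0.5])   # nonnegative ground truth
b = A @ x_true
H, c = A.T @ A, -(A.T @ b)
x = nnls_projected_gradient(H, c)
```

Because the unconstrained minimizer here is already nonnegative, the constrained solution coincides with it, which makes the sketch easy to check; the CG variant in the Breeze PR converges far faster on ill-conditioned grams.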
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-85351062

All the runtime enhancements are being added to Breeze in this PR: https://github.com/scalanlp/breeze/pull/386 Please let me know if there is additional feedback.
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-85161041

@mengxr I added the optimization for the lower triangular matrix and now they are very close... Let me know what you think and if there are any other tricks you would like me to try... Note that with these optimizations, QuadraticMinimizer with the POSITIVE constraint will also run much faster.

Breeze QuadraticMinimizer (default):

unset solver; ./bin/spark-submit --master spark://tusca09lmlvt00c.uswin.ad.vzwcorp.com:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 1 ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --rank 50 --numIterations 2 ~/datasets/ml-1m/ratings.dat
Got 1000209 ratings from 6040 users on 3706 movies. Training: 800670, test: 199539.
Quadratic minimization userConstraint SMOOTH productConstraint SMOOTH
Running Breeze QuadraticMinimizer for users with constraint SMOOTH
Running Breeze QuadraticMinimizer for items with constraint SMOOTH
Test RMSE = 2.4985081126233846.
15/03/23 12:26:55 INFO ALS: solveTime 205.379 ms
15/03/23 12:26:55 INFO ALS: solveTime 72.116 ms
15/03/23 12:26:56 INFO ALS: solveTime 74.034 ms
15/03/23 12:26:56 INFO ALS: solveTime 77.379 ms
15/03/23 12:26:57 INFO ALS: solveTime 36.532 ms
15/03/23 12:26:57 INFO ALS: solveTime 29.775 ms
15/03/23 12:26:58 INFO ALS: solveTime 48.925 ms
15/03/23 12:26:58 INFO ALS: solveTime 51.904 ms
15/03/23 12:26:59 INFO ALS: solveTime 30.882 ms
15/03/23 12:26:59 INFO ALS: solveTime 30.658 ms

ML CholeskySolver:

export solver=mllib; ./bin/spark-submit --master spark://tusca09lmlvt00c.uswin.ad.vzwcorp.com:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 1 ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --rank 50 --numIterations 2 ~/datasets/ml-1m/ratings.dat
Got 1000209 ratings from 6040 users on 3706 movies. Training: 800670, test: 199539.
Quadratic minimization userConstraint SMOOTH productConstraint SMOOTH
Test RMSE = 2.4985081126233846.
TUSCA09LMLVT00C:spark-qp-als v606014$ grep solveTime ./work/app-20150323122612-0002/0/stderr
15/03/23 12:26:20 INFO ALS: solveTime 102.243 ms
15/03/23 12:26:21 INFO ALS: solveTime 38.195 ms
15/03/23 12:26:21 INFO ALS: solveTime 60.583 ms
15/03/23 12:26:22 INFO ALS: solveTime 59.882 ms
15/03/23 12:26:22 INFO ALS: solveTime 36.59 ms
15/03/23 12:26:23 INFO ALS: solveTime 36.021 ms
15/03/23 12:26:23 INFO ALS: solveTime 59.271 ms
15/03/23 12:26:24 INFO ALS: solveTime 59.217 ms
15/03/23 12:26:24 INFO ALS: solveTime 36.344 ms
15/03/23 12:26:25 INFO ALS: solveTime 35.838 ms

I am running only 2 iterations, but you can see that in the tail the solvers run on par...
[GitHub] spark pull request: [MLLib]SPARK-5027:add SVMWithLBFGS interface i...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3890#issuecomment-84624771

Can we discuss this in the JIRA? For SVM with OWLQN, what is the orthant-wise constraint you are adding? There are ways to handle the non-differentiability of the max in BFGS as well, but I am not sure how well they work...
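[Editor's note] The non-differentiability referred to above comes from the hinge loss max(0, 1 - y * w'x), which has a kink at margin 1. A common pragmatic choice when feeding it to quasi-Newton code is to use a fixed subgradient at the kink. A NumPy sketch (names are mine; this is not the Spark implementation):

```python
import numpy as np

def hinge_loss_and_subgradient(w, X, y):
    """Hinge loss sum(max(0, 1 - y * (X @ w))) and one valid subgradient.
    At the kink (margin exactly 1) we pick the zero branch, a standard
    choice when using a nonsmooth loss with (L-)BFGS-style solvers."""
    margins = 1.0 - y * (X @ w)
    loss = np.maximum(0.0, margins).sum()
    active = margins > 0.0             # strictly positive margins contribute
    grad = -X[active].T @ y[active]    # d/dw of the active hinge terms
    return loss, grad

# tiny worked example: two unit points, one per class, at w = 0
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
loss, grad = hinge_loss_and_subgradient(np.zeros(2), X, y)
```

Alternatives the thread alludes to include the squared hinge (smooth, plain L-BFGS applies) and OWL-QN, which handles an added L1 term by restricting steps to an orthant rather than by smoothing the loss.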
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-84643641

I am adding ml.QuadraticSolver tests that build upon the normal equations (similar to the CholeskySolver tests) for 1 - 5 basically... will update in a bit...
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-84708094

@witgo there are a lot of useful building blocks in your RBM PR... are you planning to consolidate them in this PR?