[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS as optimizer for L...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r115745818

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala ---
@@ -154,22 +159,23 @@ class LinearSVCSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
   test("linearSVC with sample weights") {
     def modelEquals(m1: LinearSVCModel, m2: LinearSVCModel): Unit = {
-      assert(m1.coefficients ~== m2.coefficients absTol 0.05)
+      assert(m1.coefficients ~== m2.coefficients absTol 0.07)
       assert(m1.intercept ~== m2.intercept absTol 0.05)
     }
-
-    val estimator = new LinearSVC().setRegParam(0.01).setTol(0.01)
-    val dataset = smallBinaryDataset
-    MLTestingUtils.testArbitrarilyScaledWeights[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, modelEquals)
-    MLTestingUtils.testOutliersWithSmallWeights[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, 2, modelEquals, outlierRatio = 3)
-    MLTestingUtils.testOversamplingVsWeighting[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, modelEquals, 42L)
+    LinearSVC.supportedOptimizers.foreach { opt =>
+      val estimator = new LinearSVC().setRegParam(0.02).setTol(0.01).setSolver(opt)
+      val dataset = smallBinaryDataset
+      MLTestingUtils.testArbitrarilyScaledWeights[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, modelEquals)
+      MLTestingUtils.testOutliersWithSmallWeights[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, 2, modelEquals, outlierRatio = 3)
+      MLTestingUtils.testOversamplingVsWeighting[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, modelEquals, 42L)
+    }
   }

-  test("linearSVC comparison with R e1071 and scikit-learn") {
-    val trainer1 = new LinearSVC()
+  test("linearSVC OWLQN comparison with R e1071 and scikit-learn") {
+    val trainer1 = new LinearSVC().setSolver(LinearSVC.OWLQN)
       .setRegParam(0.2) // set regParam = 2.0 / datasize / c
--- End diff --

@hhbyyh I saw some posts saying that hinge loss is not differentiable but squared hinge loss is, for practical purposes... can you please point to a reference on squared hinge loss?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
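The differentiability point above can be checked with a small standalone sketch (plain Scala, independent of the PR): the hinge loss has a kink at margin 1, while the squared hinge's gradient is continuous there.

```scala
// Hinge loss L(m) = max(0, 1 - m) has a kink at margin m = 1: the left
// derivative is -1 and the right derivative is 0, so no gradient exists
// there. Squared hinge L(m) = max(0, 1 - m)^2 has derivative
// -2 * max(0, 1 - m), which tends to 0 from both sides at m = 1, so it
// is continuously differentiable (though not twice).
object HingeSmoothness {
  def hinge(m: Double): Double = math.max(0.0, 1.0 - m)
  def squaredHinge(m: Double): Double = { val h = hinge(m); h * h }
  def squaredHingeGrad(m: Double): Double = -2.0 * hinge(m)

  def main(args: Array[String]): Unit = {
    val eps = 1e-6
    // One-sided difference quotients of the plain hinge at m = 1 disagree:
    val hingeLeft  = (hinge(1.0) - hinge(1.0 - eps)) / eps // ~ -1
    val hingeRight = (hinge(1.0 + eps) - hinge(1.0)) / eps // ~  0
    // ...but for the squared hinge both sides agree (gradient ~ 0):
    val sqLeft  = (squaredHinge(1.0) - squaredHinge(1.0 - eps)) / eps
    val sqRight = (squaredHinge(1.0 + eps) - squaredHinge(1.0)) / eps
    println(f"hinge: left=$hingeLeft%.3f right=$hingeRight%.3f")
    println(f"squared hinge: left=$sqLeft%.8f right=$sqRight%.8f")
  }
}
```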
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS as optimizer for L...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r115741206

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala ---
@@ -154,22 +159,23 @@ class LinearSVCSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
   test("linearSVC with sample weights") {
     def modelEquals(m1: LinearSVCModel, m2: LinearSVCModel): Unit = {
-      assert(m1.coefficients ~== m2.coefficients absTol 0.05)
+      assert(m1.coefficients ~== m2.coefficients absTol 0.07)
       assert(m1.intercept ~== m2.intercept absTol 0.05)
     }
-
-    val estimator = new LinearSVC().setRegParam(0.01).setTol(0.01)
-    val dataset = smallBinaryDataset
-    MLTestingUtils.testArbitrarilyScaledWeights[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, modelEquals)
-    MLTestingUtils.testOutliersWithSmallWeights[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, 2, modelEquals, outlierRatio = 3)
-    MLTestingUtils.testOversamplingVsWeighting[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, modelEquals, 42L)
+    LinearSVC.supportedOptimizers.foreach { opt =>
+      val estimator = new LinearSVC().setRegParam(0.02).setTol(0.01).setSolver(opt)
+      val dataset = smallBinaryDataset
+      MLTestingUtils.testArbitrarilyScaledWeights[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, modelEquals)
+      MLTestingUtils.testOutliersWithSmallWeights[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, 2, modelEquals, outlierRatio = 3)
+      MLTestingUtils.testOversamplingVsWeighting[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, modelEquals, 42L)
+    }
   }

-  test("linearSVC comparison with R e1071 and scikit-learn") {
-    val trainer1 = new LinearSVC()
+  test("linearSVC OWLQN comparison with R e1071 and scikit-learn") {
+    val trainer1 = new LinearSVC().setSolver(LinearSVC.OWLQN)
       .setRegParam(0.2) // set regParam = 2.0 / datasize / c
--- End diff --

Hinge loss is not differentiable... how are you smoothing it before you can use a quasi-Newton solver? Since the papers smooth the max, a Newton/quasi-Newton solver should work well... if you are keeping the non-differentiable loss, it would be better to use a sub-gradient solver as suggested by the talk... I will evaluate the formulation...
[GitHub] spark pull request #17862: [SPARK-20602] [ML]Adding LBFGS as optimizer for L...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/17862#discussion_r115659479

--- Diff: mllib/src/test/scala/org/apache/spark/ml/classification/LinearSVCSuite.scala ---
@@ -154,22 +159,23 @@ class LinearSVCSuite extends SparkFunSuite with MLlibTestSparkContext with Defau
   test("linearSVC with sample weights") {
     def modelEquals(m1: LinearSVCModel, m2: LinearSVCModel): Unit = {
-      assert(m1.coefficients ~== m2.coefficients absTol 0.05)
+      assert(m1.coefficients ~== m2.coefficients absTol 0.07)
       assert(m1.intercept ~== m2.intercept absTol 0.05)
     }
-
-    val estimator = new LinearSVC().setRegParam(0.01).setTol(0.01)
-    val dataset = smallBinaryDataset
-    MLTestingUtils.testArbitrarilyScaledWeights[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, modelEquals)
-    MLTestingUtils.testOutliersWithSmallWeights[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, 2, modelEquals, outlierRatio = 3)
-    MLTestingUtils.testOversamplingVsWeighting[LinearSVCModel, LinearSVC](
-      dataset.as[LabeledPoint], estimator, modelEquals, 42L)
+    LinearSVC.supportedOptimizers.foreach { opt =>
+      val estimator = new LinearSVC().setRegParam(0.02).setTol(0.01).setSolver(opt)
+      val dataset = smallBinaryDataset
+      MLTestingUtils.testArbitrarilyScaledWeights[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, modelEquals)
+      MLTestingUtils.testOutliersWithSmallWeights[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, 2, modelEquals, outlierRatio = 3)
+      MLTestingUtils.testOversamplingVsWeighting[LinearSVCModel, LinearSVC](
+        dataset.as[LabeledPoint], estimator, modelEquals, 42L)
+    }
   }

-  test("linearSVC comparison with R e1071 and scikit-learn") {
-    val trainer1 = new LinearSVC()
+  test("linearSVC OWLQN comparison with R e1071 and scikit-learn") {
+    val trainer1 = new LinearSVC().setSolver(LinearSVC.OWLQN)
       .setRegParam(0.2) // set regParam = 2.0 / datasize / c
--- End diff --

These slides also explain it... please see slide 32... the max can be replaced by a soft-max whose softness parameter lambda can be tuned... log-sum-exp is a standard soft-max that can be used; it is similar to the ReLU function and we can re-use it from MLP: ftp://ftp.cs.wisc.edu/math-prog/talks/informs99ssv.ps ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/99-03.pdf I can add the formulation if there is interest... it needs some tuning of the soft-max parameter, but convergence will be good with LBFGS (OWLQN is not needed).
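The log-sum-exp smoothing suggested above can be sketched as follows (a minimal standalone Scala example; the `lambda` parameter and helper names are illustrative, not part of the PR):

```scala
// Soft-max smoothing of the hinge: max(0, z), with z = 1 - y * f(x), is
// replaced by the scaled log-sum-exp (softplus)
//   p_lambda(z) = (1/lambda) * log(1 + exp(lambda * z)),
// which is infinitely differentiable and converges to max(0, z) as
// lambda grows (the gap is bounded by log(2) / lambda).
object SoftHinge {
  // Numerically stable softplus: avoids overflow for large lambda * z.
  def softplus(z: Double, lambda: Double): Double = {
    val t = lambda * z
    (math.max(t, 0.0) + math.log1p(math.exp(-math.abs(t)))) / lambda
  }

  def hinge(z: Double): Double = math.max(0.0, z)

  def main(args: Array[String]): Unit = {
    for (lambda <- Seq(1.0, 10.0, 100.0); z <- Seq(-2.0, 0.0, 0.5, 2.0)) {
      val gap = softplus(z, lambda) - hinge(z)
      // The smooth surrogate upper-bounds the hinge by at most log(2)/lambda.
      assert(gap >= 0.0 && gap <= math.log(2.0) / lambda + 1e-12)
      println(f"lambda=$lambda%5.1f z=$z%4.1f gap=$gap%.6f")
    }
  }
}
```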
[GitHub] spark issue #17862: [SPARK-20602] [ML]Adding LBFGS as optimizer for LinearSV...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/17862

@hhbyyh can we smooth the hinge-loss using soft-max (a variant of ReLU) and then use LBFGS?
[GitHub] spark issue #12574: [SPARK-13857][ML][WIP] Add "recommend all" functionality...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/12574

test
[GitHub] spark issue #14473: [SPARK-16495] [MLlib]Add ADMM optimizer in mllib package
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/14473

ADMM is already available as a Breeze solver (alongside BFGS and OWLQN: NonlinearMinimizer), which is integrated with ml/mllib... It would be great if you could look into it and let me know if you need pointers for running comparisons with OWLQN: https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/optimize/proximal/NonlinearMinimizer.scala This is implemented based on the paper you cited.
[GitHub] spark issue #12574: [SPARK-13857][ML][WIP] Add "recommend all" functionality...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/12574

Can we close it? Looks like SPARK-18235 opened up recommendForAll.
[GitHub] spark issue #12574: [SPARK-13857][ML][WIP] Add "recommend all" functionality...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/12574

test
[GitHub] spark issue #12574: [SPARK-13857][ML][WIP] Add "recommend all" functionality...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/12574

I will take a pass at the PR as well...
[GitHub] spark issue #12574: [SPARK-13857][ML][WIP] Add "recommend all" functionality...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/12574

@MLnick I recently visited IBM STC but unfortunately missed you at the meeting... we discussed the ML/MLlib changes for matrix factorization...
[GitHub] spark issue #458: [SPARK-1543][MLlib] Add ADMM for solving Lasso (and elasti...
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/458

ADMM is already implemented as part of Breeze's proximal NonlinearMinimizer, where the ADMM solver stays in the master and the gradient calculator is plugged in the same way Breeze LBFGS/OWLQN have been... I did not open a PR since OWLQN was chosen for L1 logistic regression...
[GitHub] spark issue #1110: [SPARK-2174][MLLIB] treeReduce and treeAggregate
Github user debasish83 commented on the issue: https://github.com/apache/spark/pull/1110

@mengxr say I have 20 nodes with 16 cores each: do you recommend running treeReduce with 320 partitions and OpenBLAS with numThreads=1 per partition for the seqOp, OR treeReduce with 20 partitions and OpenBLAS with numThreads=16 per partition for the seqOp? Do you have further ideas for decreasing network shuffle using treeReduce/treeAggregate, or is there an open JIRA where we can move this discussion? Looks like shuffle is already compressed by default in Spark using snappy... do you recommend compressing the vector logically?

SparkContext: 20 nodes, 16 cores, sc.defaultParallelism = 320

def gramSize(n: Int) = n * (n + 1) / 2 // packed upper-triangular size

val combOp = (v1: Array[Float], v2: Array[Float]) => {
  var i = 0
  while (i < v1.length) {
    v1(i) += v2(i)
    i += 1
  }
  v1
}

val n = gramSize(4096)
val vv = sc.parallelize(0 until sc.defaultParallelism).map(i => Array.fill[Float](n)(0))

Option 1: 320 partitions, 1 thread on combOp per partition

val start = System.nanoTime(); vv.treeReduce(combOp, 2); val reduceTime = (System.nanoTime() - start)*1e-9
reduceTime: Double = 5.639030243006

Option 2: 20 partitions, 1 thread on combOp per partition

val coalescedvv = vv.coalesce(20)
coalescedvv.count
val start = System.nanoTime(); coalescedvv.treeReduce(combOp, 2); val reduceTime = (System.nanoTime() - start)*1e-9
reduceTime: Double = 3.914068564004

Option 3: 20 partitions, OpenBLAS numThreads=16 per partition

Setting up OpenBLAS on the cluster; I will update soon.

Let me know your thoughts. I think if the underlying operations are dense BLAS level-1, level-2 or level-3, running with more OpenBLAS threads and fewer partitions should help decrease cross-partition shuffle.
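The combOp in the benchmark above can be exercised locally with a pairwise tree reduction (a standalone sketch of what treeReduce's combOp does, not Spark's actual implementation):

```scala
// Local sketch: partial Gram vectors from each partition are summed
// pairwise in ~log2(numPartitions) rounds, so with fewer, larger
// partitions each round moves fewer (but bigger) arrays over the network.
object TreeReduceSketch {
  def combOp(v1: Array[Float], v2: Array[Float]): Array[Float] = {
    var i = 0
    while (i < v1.length) { v1(i) += v2(i); i += 1 }
    v1
  }

  // Reduce partition results pairwise, level by level.
  def treeReduce(parts: Seq[Array[Float]]): Array[Float] = {
    var level = parts
    while (level.size > 1)
      level = level.grouped(2).map(g => g.reduce(combOp)).toSeq
    level.head
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 8
    val n = 10 // stand-in for gramSize(4096)
    val parts = Seq.fill(numPartitions)(Array.fill[Float](n)(1.0f))
    val total = treeReduce(parts)
    // Each slot accumulated one 1.0f contribution per partition.
    assert(total.forall(_ == numPartitions.toFloat))
  }
}
```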
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5869#issuecomment-162240882

@srowen actually I am not sure if MAP calculation got added to the ML pipeline or not... I will look into it, and if someone else has already added it, I will close the PR.
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-136503426

@rezazadeh got busy with the spark streaming version of KNN :-) I will open up 2 PRs over the weekend as we discussed.
[GitHub] spark pull request: [MLLIB][WIP] SPARK-4638: Kernels feature for M...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5503#issuecomment-120658511

@dbtsai @mandar2812 I found the kernel abstraction explained in my PR https://github.com/apache/spark/pull/6213 more generic for practical use-cases than the usual interface available in scikit-learn... It will be great if we can come up with a strategy such that this PR calls IndexedRowMatrix.rowSimilarity to get the kernel from data represented as RDD[LabeledPoint].
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-109654217

Internally we are using this code for euclidean/rbf driving PIC for example... but sure, we can focus on cosine first...
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-107113056

@rezazadeh sure, I will do that. Could you add a JIRA for 3 (Kernel Clustering / PIC) so that we can add the RBFKernel flow and implement PIC with vector-matrix multiply for comparisons? Also, in general topK can decrease the kernel size; it is a cross validation parameter for seeing the degradation of the clustering compared to the full kernel, which is always difficult to keep as the rows grow... no such experiments have been done for PIC. I am experimenting with gemv based optimization for SparseVector x SparseMatrix, and if I get further speedup compared to the level-1 flow, most likely we will provide both options to the users in SPARK-4823.
[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3536#issuecomment-105026856

Let's continue the validation discussion on https://github.com/apache/spark/pull/6213. The PR introduces batch gemm based similarity computation in MatrixFactorizationModel using the kernel abstraction. Do we need the online version that Steven added as well, or can it be extracted from the batch results? My focus was more on speeding up the batch computation...
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-104968079

Runtime comparisons are posted on SPARK-4823 for the MovieLens1m dataset (8 cores, 4 GB executor memory on my laptop).

Stages 24-35 are the row similarity flow. Total runtime ~ 20 s
Stage 64 is the col similarity mapPartitions. Total runtime ~ 4.6 min

I have not yet gone to gemv, which will decrease the runtime further but will add some approximation in RBFKernel. I think we should give users both the vector based flow and the gemv based flow and let them choose. I updated the driver code in examples.mllib.MovieLensSimilarity.

@MLnick @sowen could you please take a look at examples.mllib.MovieLensSimilarity? I am running ALS in implicit mode with no regularization (basically full RMSE optimization) and comparing the similarities generated from raw features with the item similarities. I take topK=50 from raw features as golden labels and compute MAP on the top-50 predictions from MatrixFactorizationModel.similarItems() that this PR added. I will add a test case for RBFKernel and add the PowerIterationClustering driver using the IndexedRowMatrix.rowSimilarities code before taking the WIP label off the PR.
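The evaluation described above can be sketched in standalone Scala (names are illustrative, not the PR's actual API): the topK items from raw-feature similarities serve as "golden" labels, and mean average precision (MAP) scores the model's ranked predictions against them.

```scala
// Hypothetical MAP evaluation sketch: average precision per ranked list,
// then the mean over all queries.
object MapEval {
  // Average precision of one ranked prediction list against a label set.
  def averagePrecision(predicted: Seq[Int], labels: Set[Int]): Double = {
    var hits = 0
    var sumPrec = 0.0
    predicted.zipWithIndex.foreach { case (p, i) =>
      if (labels.contains(p)) {
        hits += 1
        sumPrec += hits.toDouble / (i + 1) // precision at this rank
      }
    }
    if (labels.isEmpty) 0.0 else sumPrec / math.min(labels.size, predicted.size)
  }

  def meanAveragePrecision(all: Seq[(Seq[Int], Set[Int])]): Double =
    all.map { case (pred, lab) => averagePrecision(pred, lab) }.sum / all.size

  def main(args: Array[String]): Unit = {
    // A perfect ranking gives AP = 1.0.
    assert(averagePrecision(Seq(1, 2, 3), Set(1, 2, 3)) == 1.0)
    // A miss at rank 1 lowers it: hits land at ranks 2 and 3.
    val ap = averagePrecision(Seq(9, 1, 2), Set(1, 2))
    assert(math.abs(ap - (0.5 + 2.0 / 3.0) / 2.0) < 1e-12)
    println(f"AP with one miss at rank 1: $ap%.4f")
  }
}
```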
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-104970859

Refactoring MatrixFactorizationModel.recommendForAll to a common place like Vectors/Matrices will help users who have dense data with a modest number of columns (~1000-10K; most IoT data falls in this category) reuse the dgemm based kernel computation. I am not sure where a good place for this code would be?
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-104936678

Internally, the vector flow in IndexedRowMatrix has helped us do additional optimization through user defined kernels and cut computation, which won't happen if we go to dgemv, since the matrix compute would be done before norm-based filters can be applied... I think we should keep the vector based kernel compute and get user feedback first...
[GitHub] spark pull request: [WIP][MLLIB][SPARK-4675][SPARK-4823]RowSimilar...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-104934928

@mengxr I generalized MatrixFactorizationModel.recommendAll and use it for similarUsers and similarProducts with dgemm... In IndexedRowMatrix I only exposed rowSimilarity as the public API, and it uses blocked BLAS level-1 computation... It is easy to use gemv in IndexedRowMatrix.rowSimilarity for the CosineKernel, but for the RBFKernel things get tricky: for sparse vectors, I don't think we can write the squared euclidean distance as norm1*norm1 + norm2*norm2 - 2*dot(x, y) without letting go of some accuracy, which might be OK compared to the runtime benefits... I am looking further into the RBF computation using dgemv...
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-103771669

Actually for both Euclidean and RBF it is possible, since ||x - y||^2 can be decomposed as ||x||^2 + ||y||^2 - 2*dot(x, y), where dot(x, y) can be computed through dgemv... We can't use dgemm yet since BLAS does not have SparseMatrix x SparseMatrix... Is there an open PR for it?
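The decomposition referenced above is easy to sanity-check in standalone Scala: a precomputed squared norm per row plus one dot product (a gemv over a block of rows) recovers the squared Euclidean distance.

```scala
// Verifies the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2 * dot(x, y)
// on a small dense example.
object EuclideanDecomposition {
  def dot(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum

  def main(args: Array[String]): Unit = {
    val x = Array(1.0, -2.0, 0.5)
    val y = Array(0.0, 3.0, -1.5)
    // Direct squared distance.
    val direct = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
    // Decomposed form: squared norms plus one dot product.
    val decomposed = dot(x, x) + dot(y, y) - 2.0 * dot(x, y)
    assert(math.abs(direct - decomposed) < 1e-12)
    println(f"direct=$direct%.4f decomposed=$decomposed%.4f")
  }
}
```

The floating-point caveat in the comment above is the cancellation in `- 2*dot(x, y)` when x and y are nearly identical; mathematically the identity is exact.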
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-103925316

For gemv it is not clear how to re-use the scratch space for the result vector... if we can't reuse the result vector over multiple calls to kernel.compute, we won't get much runtime benefit... I am considering that for vector based IndexedRowMatrix, we define the kernel as the traditional (vector, vector) compute and use level-1 BLAS as done in this PR. The big runtime benefit will come from approximate KNN, which I will open up next, but we still need brute-force KNN for cross validation. For (Long, Array[Double]) from the matrix factorization model (similarUsers and similarProducts) we can use dgemm, specifically for DenseMatrix x DenseMatrix... @mengxr what do you think? That way we can use dgemm when the features are dense. Also, the (Long, Array[Double]) data structure could be defined in the recommendation/linalg package and re-used by the dense kernel computation. Or perhaps for similarity/KNN computation it is fine to stay in vector space and not do the gemv/gemm optimization?
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-103615439

Thinking about it more: maybe EuclideanKernel can be decomposed using matrix x vector as well.
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-103614290

SparseMatrix x SparseVector got merged to master today: https://github.com/apache/spark/pull/6209. I will update the PR and separate the code path for CosineKernel/ProductKernel and EuclideanKernel/RBFKernel to see the runtime improvements.
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-102841783

@mengxr the failures are related to the YARN suite, which does not look related to my changes... the tests I added ran fine:

[info] *** 1 TEST FAILED ***
[error] Failed: Total 39, Failed 1, Errors 0, Passed 38
[error] Failed tests:
[error] org.apache.spark.deploy.yarn.YarnClusterSuite
[GitHub] spark pull request: [SPARK-7681][MLlib] Add SparseVector support f...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6209#issuecomment-102840355 Are there runtime comparisons posted for these changes, BLAS-1 vs BLAS-2 (SparseMatrix * SparseVector compared to Array[SparseVector] x SparseVector)?
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/6213#issuecomment-102843964 For CosineKernel and ProductKernel, we should be able to have a separate code path with BLAS-2 once SparseMatrix x SparseVector merges and BLAS-3 once SparseMatrix x SparseMatrix merges..Basically refactor blockify from MatrixFactorizationModel to IndexedRowMatrix...Right now the sparse features are not in master yet...For Euclidean, RBF and Pearson, even with these changes merged, I think we still have to stay in BLAS-1
[GitHub] spark pull request: [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity
GitHub user debasish83 opened a pull request: https://github.com/apache/spark/pull/6213 [MLLIB][SPARK-4675, SPARK-4823] RowSimilarity @mengxr @srowen For RowMatrix with 100K columns, colSimilarity with bruteforce/dimsum sampling is used. This PR adds rowSimilarity to IndexedRowMatrix, which outputs a CoordinateMatrix. For matrices with ~1M columns, the rowSimilarity flow scales better than the column similarity flow. For most applications, the topK similar items requirement is much smaller than all available items, and therefore the rowSimilarity API takes topK and threshold as input; topK and threshold help reduce shuffle space. For a MatrixFactorizationModel, the columns of both user and product factors are typically ~50-200, so the column similarity flow does not work for such cases. This PR also adds batch similarUsers and similarProducts (SPARK-4675). The following ideas are added:
1. Similarity computation is abstracted as Kernel
2. Kernel implementations for Cosine, RBF, Euclidean and Product (for distributed matrix multiply) are added
3. Tests cover the Cosine kernel. More tests will be added for the Euclidean, RBF and Product kernels.
4. The IndexedRowMatrix object adds a kernelized distributed matrix multiply which is used by the similarity computation.
5. In examples, MovieLensSimilarity is added that shows col and row based flows on MovieLens as a runtime experiment.
6. Level-1 BLAS is used so that the kernel abstraction can be used. We can either design the Kernel abstraction with Level-3 BLAS (might be difficult) or use BlockMatrix for distributed matrix multiply.
Next steps:
1. In MovieLensSimilarity add an ALS + similarItems example
2. Use RBF similarity in the power iteration clustering flow
From internal experiments, we have run 6M users, 1.6M items and 351M ratings through the row similarity flow with topK=200 in 1.1 hr with 240 cores running over 30 nodes.
We had a difficult time in scaling the column similarity flow since the topK optimization can't be added until the reduce phase is done in that flow. On the MovieLens-1M and Netflix datasets I will report row and col similarity runtime comparisons. You can merge this pull request into a Git repository by running: $ git pull https://github.com/debasish83/spark similarity Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/6213.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #6213 commit f9fd6fbfb1a55142a9eb8f2129d3729ca25ab501 Author: Debasish Das debasish@one.verizon.com Date: 2015-05-17T00:05:52Z blocked kernalized row similarity calculation and tests commit 66176f9f346c324b9c77c252be369e24f7fdd991 Author: Debasish Das debasish@one.verizon.com Date: 2015-05-17T00:06:36Z Cosine, Euclidean, RBF and Product Kernel added commit 3f96963f80a40f3a4fce6b6dbd97c20605ebaecc Author: Debasish Das debasish@one.verizon.com Date: 2015-05-17T00:07:28Z row similarity API added to drive MatrixFactorizationModel similarUsers and similarProducts commit 6dc9e18d507cfe0d2ee12e768ca6bddb5c3c4b38 Author: Debasish Das debasish@one.verizon.com Date: 2015-05-17T00:09:24Z MovieLens flow to demonstrate item similarity calculation using raw features and ALS factors commit 71f24a4629cf54c39af4e9e598d9808d85952532 Author: Debasish Das debasish@one.verizon.com Date: 2015-05-17T00:09:45Z import cleanup commit cc4e104b7430e3fe2e6bf71489638321076428a3 Author: Debasish Das debasish@one.verizon.com Date: 2015-05-17T00:11:15Z Merge branch 'similarity' of https://github.com/debasish83/spark into similarity
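The Kernel abstraction described in the PR above (items 1 and 2) can be sketched in plain Scala. This is an illustrative sketch, not the PR's actual API: the trait and object names here are assumptions, and the real implementation would use BLAS calls over Spark rows rather than local arrays.

```scala
// Hypothetical sketch of the Kernel abstraction: each kernel maps a pair
// of row vectors to a similarity score.
trait Kernel {
  def compute(x: Array[Double], y: Array[Double]): Double
}

// Product kernel: a plain dot product (what BLAS-1 ddot computes).
object ProductKernel extends Kernel {
  def compute(x: Array[Double], y: Array[Double]): Double =
    x.zip(y).map { case (a, b) => a * b }.sum
}

// Cosine kernel: dot product normalized by the two vector norms.
object CosineKernel extends Kernel {
  def compute(x: Array[Double], y: Array[Double]): Double = {
    val dot = ProductKernel.compute(x, y)
    val nx  = math.sqrt(ProductKernel.compute(x, x))
    val ny  = math.sqrt(ProductKernel.compute(y, y))
    if (nx == 0.0 || ny == 0.0) 0.0 else dot / (nx * ny)
  }
}

// RBF kernel: exp(-gamma * squared Euclidean distance).
class RBFKernel(gamma: Double) extends Kernel {
  def compute(x: Array[Double], y: Array[Double]): Double = {
    val sqDist = x.zip(y).map { case (a, b) => (a - b) * (a - b) }.sum
    math.exp(-gamma * sqDist)
  }
}
```

Cosine and Product decompose into dot products, which is why they can move to BLAS-2/BLAS-3 block multiplies, while Euclidean/RBF need the pairwise distance term.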
[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3536#issuecomment-99098372 @MLnick yes that's what I did...I have to convince users why use factor vectors :-) For user-item recommendation, convincing is easy by showing the ranking improvement through ALS @srowen without coming up with a validation strategy, someone might propose to run a different algorithm (KMeans on raw feature space followed by (item-cluster) join (cluster-items)) and claims his item-item results are better...how do we know whether ALS based flow is producing better result or KMeans based flow ? NNALS can be thought of soft-kmeans as well and so these flows are very similar. I am focused on implicit feedback here because then only we can run either KMeans or Similarity on raw feature space...With explicit feedback, I agree that cosine similarity is not valid in original feature space. But in most practical datasets, we are dealing with implicit feedback. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
GitHub user debasish83 reopened a pull request: https://github.com/apache/spark/pull/5869 [SPARK-4231][MLLIB][Examples] MAP calculation added to examples.MovieLensALS MAP calculation driver to MovieLensALS was not part of SPARK-3066 merge. Added the driver in this PR. @mengxr the results changed compared to my old runs. Any idea if some internal ALS tuning has changed (I remember per user regularization change for implicit feedback but that should not change explicit results) ? MAP calculation: ./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g ./examples/target/spark-examples_2.10-1.4.0-SNAPSHOT.jar --lambda 0.065 --metrics map ~/datasets/ml-1m/ratings.dat Got 1000209 ratings from 6040 users on 3706 movies. Training: 800163, test: 200046. Test users 6035 MAP 0.019697998843987024 RMSE calculation: ./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g ./examples/target/spark-examples_2.10-1.4.0-SNAPSHOT.jar --lambda 0.065 --metrics rmse ~/datasets/ml-1m/ratings.dat Got 1000209 ratings from 6040 users on 3706 movies. Training: 800116, test: 200093. 
Test RMSE = 0.8558133665979457 You can merge this pull request into a Git repository by running: $ git pull https://github.com/debasish83/spark irmetrics Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/5869.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #5869 commit 9b3951f558e5673eb475c575f14876421b5a3abc Author: Debasish Das debasish@one.verizon.com Date: 2014-11-05T01:23:09Z validate user/product on MovieLens dataset through user input and compute map measure along with rmse commit cd3ab31cb9b244bae2b45396a6269ed1dc59151b Author: Debasish Das debasish@one.verizon.com Date: 2014-11-05T22:43:11Z merged with AbstractParams serialization bug commit 4bbae0f248ca8747b47ecf852d5aba19c9b39dab Author: Debasish Das debasish@one.verizon.com Date: 2014-11-05T23:23:02Z comments fixed as per scalastyle commit 9fa063e1eb172d68248e03797a54acc738543592 Author: Debasish Das debasish@one.verizon.com Date: 2014-11-06T00:05:24Z import scala.math.round commit 10cbb37a7881867d801ae6630ffc0d09b3feebf9 Author: Debasish Das debasish@one.verizon.com Date: 2014-11-08T06:31:40Z provide ratio for topN product validation; generate MAP and prec@k metric for movielens dataset commit f38a1b59e27907f2aa9bd732c5f9147b738d3a0f Author: Debasish Das debasish@one.verizon.com Date: 2014-11-08T06:45:13Z use sampleByKey for per user sampling commit d144f57a58c9424365f1242f90961386c016641e Author: Debasish Das debasish@one.verizon.com Date: 2014-11-12T04:56:46Z recommendAll API to MatrixFactorizationModel, uses topK finding using BoundedPriorityQueue similar to RDD.top commit 7163a5c21b394d8bd89694a9f08aa1b446c71956 Author: Debasish Das debasish@one.verizon.com Date: 2014-11-19T21:58:45Z Added API for batch user and product recommendation; MAP calculation for product recommendation per user using randomized split commit 
3f97c499004aa58dfa1b51b8d2cbd6e5776f5fb1 Author: Debasish Das debasish@one.verizon.com Date: 2014-11-19T23:38:45Z fixed spark coding style for imports commit ee9957144bc2d145c91fc4a4b894ccd2ee6bc2b9 Author: Debasish Das debasish@one.verizon.com Date: 2015-04-01T01:52:27Z addressed initial review comments;merged with master;added tests for batch predict APIs in matrix factorization commit 98fa4243dc6041290bdde51e1e899a8be7576470 Author: Debasish Das debasish@one.verizon.com Date: 2015-04-01T01:59:57Z updated with master commit 3a0c4eb7f81ee0845f4945d395f6652c965f941b Author: Debasish Das debasish@one.verizon.com Date: 2015-04-01T04:31:01Z updated with spark master commit 3640409ac2dd2ea7ab5e67a520726f2387d137e3 Author: Debasish Das debasish@one.verizon.com Date: 2015-05-02T23:17:45Z MAP calculation driver added to MovieLensALS
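The MAP numbers reported above are mean average precision over per-user ranked recommendations against held-out test items. A minimal local sketch of the metric, assuming items are integer ids (this is not the Spark driver itself; the function names are hypothetical):

```scala
// Average precision for one user: precision at each hit position,
// averaged over min(|relevant|, |ranked|).
def averagePrecision(ranked: Seq[Int], relevant: Set[Int]): Double = {
  if (relevant.isEmpty) return 0.0
  var hits = 0
  var sumPrec = 0.0
  ranked.zipWithIndex.foreach { case (item, i) =>
    if (relevant.contains(item)) {
      hits += 1
      sumPrec += hits.toDouble / (i + 1) // precision at rank i+1
    }
  }
  sumPrec / math.min(relevant.size, ranked.size)
}

// MAP: mean of the per-user average precisions.
def meanAveragePrecision(perUser: Seq[(Seq[Int], Set[Int])]): Double =
  perUser.map { case (ranked, rel) => averagePrecision(ranked, rel) }.sum / perUser.size
```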
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 closed the pull request at: https://github.com/apache/spark/pull/5869
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5869#issuecomment-98827606 @mengxr if you could please point to the ML pipeline module where I should add it, I can do the change...
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5869#issuecomment-98504419 Implicit lambda should not affect the explicit results. I will take a closer look into recommendForAll and compare with my old version...
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5869#issuecomment-98491289 @srowen ideally we should move both the utilities to compute rmse and MAP on a MatrixFactorizationModel to a common place from examples since they are the APIs that a user can directly call during model cross validation..maybe it can be moved into the ml pipeline ?
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5869#issuecomment-98502996 Stats from my old run: ./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g ./examples/target/spark-examples_2.10-1.4.0-SNAPSHOT.jar --lambda 0.065 --metrics map ~/datasets/ml-1m/ratings.dat rank = default Got 1000209 ratings from 6040 users on 3706 movies. Training: 800187, test: 200022. Test users 6035 MAP 0.03499984595868497
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5869#issuecomment-98504679 RMSE is similar in my old runs..so the ALS core did not change...the MAP driver code is also same since I just migrated it from my PR. TUSCA09LMLVT00C:spark-irmetrics v606014$ ./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g ./examples/target/spark-examples_2.10-1.4.0-SNAPSHOT.jar --lambda 0.065 --metrics rmse ~/datasets/ml-1m/ratings.dat 2015-05-03 09:58:04.904 java[33124:1903] Unable to load realm mapping info from SCDynamicStore 15/05/03 09:58:06 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Got 1000209 ratings from 6040 users on 3706 movies. Training: 800952, test: 199257. Test RMSE = 0.8558204583570717 I will compare the recommendForAll output from my branch and the merged code.
[GitHub] spark pull request: [SPARK-4231][MLLIB][Examples] MAP calculation ...
GitHub user debasish83 opened a pull request: https://github.com/apache/spark/pull/5869 [SPARK-4231][MLLIB][Examples] MAP calculation added to examples.MovieLensALS
[GitHub] spark pull request: [MLLIB][SPARK-4675] Find similar products and ...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3536#issuecomment-98425139 @MLnick @srowen I did an experiment where I computed brute force topK similar items using cosine distance and compared the intersection with item factor based brute force topK similar items using cosine distance after running implicit factorization...the intersection is only 42%...this is in line with the Google Correlate paper, where they have to do an additional reorder step in real feature space to increase the recall (intersection)...did you guys also see similar results for item-item validation ?
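The 42% figure above is the overlap between two top-K neighbor lists for the same item: one computed on raw features, one on ALS factors. That validation step is trivial to state; a sketch with an illustrative (hypothetical) function name:

```scala
// Fraction of the raw-feature top-K list that also appears in the
// factor-space top-K list (recall of one list against the other).
def topKOverlap(rawTopK: Seq[Int], factorTopK: Seq[Int]): Double = {
  require(rawTopK.nonEmpty, "need a non-empty reference list")
  rawTopK.toSet.intersect(factorTopK.toSet).size.toDouble / rawTopK.size
}
```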
[GitHub] spark pull request: [SPARK-3066][MLLIB] Support recommendAll in ma...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5829#issuecomment-98058780 @mengxr looks good to me...I will fix SPARK-4321 based on this merge...I need blockify for rowSimilarities (tall skinny sparse matrices for row similarities)...should we extract it out to IndexedRow ? I can do that cleanup in my row similarities PR...
[GitHub] spark pull request: [MLLib]SPARK-5027:add SVMWithLBFGS interface i...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3890#issuecomment-98188550 @dlwh we should simply use your smooth max and make max(0, 1 - ya'x) differentiable for the first version...that needs no change to breeze...and then if needed we use the paper...don't you have log sum exp f and grad already implemented in breeze that can be used ? I can help with soft-max alpha tuning if @loachli can put together the formulation in mllib...
[GitHub] spark pull request: [MLLib]SPARK-5027:add SVMWithLBFGS interface i...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3890#issuecomment-98190811 I mean for svm the formulation is over all rows right...the smooth max will be done on every row and label...max(0, 1 - y_i a_i*x)...so only change will be a diff function that calculates the logsumexp and gradient of logsumexp from each data row and we aggregate it on the master and solve using BFGS...as long as the alpha of logsumexp has been tuned (smooth at first, as we go down, tighten it) BFGS will converge to a good solution...
[GitHub] spark pull request: [MLLib]SPARK-5027:add SVMWithLBFGS interface i...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3890#issuecomment-98189044 nope...logistic is feature space...svm is data space...the gradient calculation / BFGS CostFun will change
[GitHub] spark pull request: [MLLib]SPARK-5027:add SVMWithLBFGS interface i...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3890#issuecomment-98073720 this is linear svm strictly in primal form...there are ways to fix it through going to dual space but that needs a linear / nonlinear kernel generation which might be an overkill
[GitHub] spark pull request: [MLLib]SPARK-5027:add SVMWithLBFGS interface i...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3890#issuecomment-98073658 @loachli hinge loss in linear svm is max(0, 1 - y*a'x) right ? Just replace max with a smooth max and you should be able to smooth hinge gradient and then it can be directly aggregated on master and solved by BFGS...smooth max has an alpha that you can tune over iteration...start with a large lambda (smooth) and tighten it as you go down...breeze already has smooth max and grad implemented I think...
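The smooth-max idea in this thread: replace max(0, z), with z = 1 - y*a'x, by the log-sum-exp softening (1/alpha) * log(exp(0) + exp(alpha*z)) = (1/alpha) * log(1 + exp(alpha*z)), which is differentiable everywhere and tightens toward the true hinge as alpha grows. A minimal sketch of the loss and its gradient w.r.t. the margin (an illustration of the technique, not breeze's API):

```scala
// Smooth hinge: (1/alpha) * log(1 + exp(alpha * z)), computed in a
// numerically stable form (avoids overflow for large alpha * z).
def smoothHinge(z: Double, alpha: Double): Double = {
  val t = alpha * z
  (math.max(t, 0.0) + math.log1p(math.exp(-math.abs(t)))) / alpha
}

// Derivative of smoothHinge w.r.t. z: the logistic sigmoid of alpha * z.
// This is the per-row term a BFGS DiffFunction would aggregate.
def smoothHingeGrad(z: Double, alpha: Double): Double =
  1.0 / (1.0 + math.exp(-alpha * z))
```

As alpha increases, smoothHinge(z, alpha) approaches max(0, z) and the gradient approaches the hinge subgradient (0 for z < 0, 1 for z > 0), which is why the tuning schedule starts smooth and tightens over iterations.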
[GitHub] spark pull request: [SPARK-3066][MLLIB] Support recommendAll in ma...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/5829#discussion_r29494261 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala --- @@ -137,20 +141,113 @@ class MatrixFactorizationModel( MatrixFactorizationModel.SaveLoadV1_0.save(this, path) } + /** + * Recommends topK products for all users. + * + * @param num how many products to return for every user. + * @return [(Int, Array[Rating])] objects, where every tuple contains a userID and an array of + * rating objects which contains the same userId, recommended productID and a score in the + * rating field. Semantics of score is same as recommendProducts API + */ + def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = { +MatrixFactorizationModel.recommendForAll(rank, userFeatures, productFeatures, num).map { + case (user, top) => +val ratings = top.map { case (product, rating) => Rating(user, product, rating) } +(user, ratings) +} + } + + + /** + * Recommends topK users for all products. + * + * @param num how many users to return for every product. + * @return [(Int, Array[Rating])] objects, where every tuple contains a productID and an array + * of rating objects which contains the recommended userId, same productID and a score in the + * rating field. Semantics of score is same as recommendUsers API + */ + def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = { +MatrixFactorizationModel.recommendForAll(rank, productFeatures, userFeatures, num).map { + case (product, top) => +val ratings = top.map { case (user, rating) => Rating(user, product, rating) } +(product, ratings) +} + } +} +object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] { + + import org.apache.spark.mllib.util.Loader._ + + /** + * Makes recommendations for a single user (or product). 
+ */ private def recommend( recommendToFeatures: Array[Double], recommendableFeatures: RDD[(Int, Array[Double])], num: Int): Array[(Int, Double)] = { -val scored = recommendableFeatures.map { case (id,features) => +val scored = recommendableFeatures.map { case (id, features) => (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1)) } scored.top(num)(Ordering.by(_._2)) } -} -object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] { + /** + * Makes recommendations for all users (or products). + * @param rank rank + * @param srcFeatures src features to receive recommendations + * @param dstFeatures dst features used to make recommendations + * @param num number of recommendations for each record + * @return an RDD of (srcId: Int, recommendations), where recommendations are stored as an array + * of (dstId, rating) pairs. + */ + private def recommendForAll( + rank: Int, + srcFeatures: RDD[(Int, Array[Double])], + dstFeatures: RDD[(Int, Array[Double])], + num: Int): RDD[(Int, Array[(Int, Double)])] = { +val srcBlocks = blockify(rank, srcFeatures) +val dstBlocks = blockify(rank, dstFeatures) +val ratings = srcBlocks.cartesian(dstBlocks).flatMap { --- End diff -- I also like it better as it should scale fine assuming cartesian keys are under control...say to 100M x 10M with 400 factors
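The blockify-plus-cartesian pattern in the diff above can be illustrated locally: group factor rows into blocks, score every (src block, dst block) pair, and keep the top `num` destinations per source. A sketch with plain Scala collections standing in for RDDs (the function name and block size are illustrative; Spark does the per-block scoring with BLAS over the blocked factors):

```scala
// Local sketch of recommendForAll: block the src and dst factor rows,
// take the "cartesian" of blocks, score each (srcId, dstId) pair with a
// dot product, and keep the top `num` scores per src id.
def recommendForAllLocal(
    src: Seq[(Int, Array[Double])],
    dst: Seq[(Int, Array[Double])],
    num: Int,
    blockSize: Int = 2): Map[Int, Array[(Int, Double)]] = {
  val srcBlocks = src.grouped(blockSize).toSeq
  val dstBlocks = dst.grouped(blockSize).toSeq
  val scored = for {
    sb <- srcBlocks; db <- dstBlocks          // cartesian over blocks
    (sid, sf) <- sb; (did, df) <- db          // score every pair in a block
  } yield (sid, (did, sf.zip(df).map { case (a, b) => a * b }.sum))
  scored.groupBy(_._1).map { case (sid, rows) =>
    sid -> rows.map(_._2).sortBy(-_._2).take(num).toArray
  }
}
```

Blocking keeps the number of cartesian keys (block pairs) small relative to the number of row pairs, which is the scaling concern raised in the comment.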
[GitHub] spark pull request: [SPARK-3066][MLLIB] Support recommendAll in ma...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/5829#discussion_r29492705

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala ---
@@ -39,7 +39,7 @@ class MLPairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) extends Serializable
    * @return an RDD that contains the top k values for each key
    */
   def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
-    self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
+    self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord.reverse))(
--- End diff --

I have to look closely into it tomorrow...I have been using topByKey internally and did not remember seeing this bug...
[GitHub] spark pull request: [SPARK-3066][MLLIB] Support recommendAll in ma...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/5829#discussion_r29492840

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/rdd/MLPairRDDFunctions.scala ---
@@ -39,7 +39,7 @@ class MLPairRDDFunctions[K: ClassTag, V: ClassTag](self: RDD[(K, V)]) extends Serializable
    * @return an RDD that contains the top k values for each key
    */
   def topByKey(num: Int)(implicit ord: Ordering[V]): RDD[(K, Array[V])] = {
-    self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord))(
+    self.aggregateByKey(new BoundedPriorityQueue[V](num)(ord.reverse))(
--- End diff --

Yup, topByKey behavior as implemented was correct...
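For reference, the semantics topByKey must provide -- keep the `num` largest values per key -- can be sketched with a bounded min-heap. `heapq` evicts the smallest element first, which is exactly the role the ordering handed to Scala's `BoundedPriorityQueue` plays in the diff above. This is an illustrative sketch, not Spark code.

```python
import heapq
from collections import defaultdict

def top_by_key(pairs, num):
    """Keep the num LARGEST values per key, returned in decreasing order."""
    heaps = defaultdict(list)
    for k, v in pairs:
        h = heaps[k]
        if len(h) < num:
            heapq.heappush(h, v)
        elif v > h[0]:              # only admit values beating the current minimum
            heapq.heapreplace(h, v)
    return {k: sorted(h, reverse=True) for k, h in heaps.items()}

ratings = [("u1", 0.2), ("u1", 0.9), ("u1", 0.5), ("u2", 0.1), ("u2", 0.7)]
print(top_by_key(ratings, 2))  # {'u1': [0.9, 0.5], 'u2': [0.7, 0.1]}
```

If a max-heap ordering were used for the bound instead, the queue would evict the largest element and end up keeping the bottom-k -- which is the confusion the review comments are settling.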
[GitHub] spark pull request: [MLLIB] SPARK-4231: Add RankingMetrics to exam...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-98073840

Changed the title to add driver for recommendAll API once SPARK-3066 merges to master...
[GitHub] spark pull request: [SPARK-3066][MLLIB] Support recommendAll in ma...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/5829#discussion_r29493623

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -137,20 +141,113 @@ class MatrixFactorizationModel(
     MatrixFactorizationModel.SaveLoadV1_0.save(this, path)
   }

+  /**
+   * Recommends topK products for all users.
+   *
+   * @param num how many products to return for every user.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a userID and an array of
+   *         rating objects which contains the same userId, recommended productID and a score in
+   *         the rating field. Semantics of score is same as the recommendProducts API.
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    MatrixFactorizationModel.recommendForAll(rank, userFeatures, productFeatures, num).map {
+      case (user, top) =>
+        val ratings = top.map { case (product, rating) => Rating(user, product, rating) }
+        (user, ratings)
+    }
+  }
+
+  /**
+   * Recommends topK users for all products.
+   *
+   * @param num how many users to return for every product.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a productID and an array
+   *         of rating objects which contains the recommended userId, same productID and a score
+   *         in the rating field. Semantics of score is same as the recommendUsers API.
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+    MatrixFactorizationModel.recommendForAll(rank, productFeatures, userFeatures, num).map {
+      case (product, top) =>
+        val ratings = top.map { case (user, rating) => Rating(user, product, rating) }
+        (product, ratings)
+    }
+  }
+}
+
+object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] {
+
+  import org.apache.spark.mllib.util.Loader._
+
+  /**
+   * Makes recommendations for a single user (or product).
+   */
   private def recommend(
       recommendToFeatures: Array[Double],
       recommendableFeatures: RDD[(Int, Array[Double])],
       num: Int): Array[(Int, Double)] = {
-    val scored = recommendableFeatures.map { case (id,features) =>
+    val scored = recommendableFeatures.map { case (id, features) =>
      (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
    }
    scored.top(num)(Ordering.by(_._2))
  }
-}

-object MatrixFactorizationModel extends Loader[MatrixFactorizationModel] {
+  /**
+   * Makes recommendations for all users (or products).
+   * @param rank rank
+   * @param srcFeatures src features to receive recommendations
+   * @param dstFeatures dst features used to make recommendations
+   * @param num number of recommendations for each record
+   * @return an RDD of (srcId: Int, recommendations), where recommendations are stored as an array
+   *         of (dstId, rating) pairs.
+   */
+  private def recommendForAll(
+      rank: Int,
+      srcFeatures: RDD[(Int, Array[Double])],
+      dstFeatures: RDD[(Int, Array[Double])],
+      num: Int): RDD[(Int, Array[(Int, Double)])] = {
+    val srcBlocks = blockify(rank, srcFeatures)
+    val dstBlocks = blockify(rank, dstFeatures)
+    val ratings = srcBlocks.cartesian(dstBlocks).flatMap {
--- End diff --

Normally items are skinny ~ 1M...and ranks are low...50...so 1Mx50 bytes ~ 50 MB...with 8M products, it's 400 MB...I still think that cartesian will be slower than the version I added in terms of runtime. Did you run any benchmark with the old code?
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-98021307

@mengxr please go ahead...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-96403986

Was very busy the last few weeks...will update it in the next few days...
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-91869124

ohh sorry I don't know about requester pays...let me look into it
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-91710700

@jkbradley we still could not access the wikipedia dataset on ec2...will it be possible for you to upload the 1 Billion token dataset on EC2? I wanted to do a sparse coding scalability run on the large dataset as well...
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-91710827

@jkbradley let me know if you need vzcloud access and I can create a few nodes for you...ec2 might be easier for others to access as well...
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-90950074

If you look into breeze.optimize.proximal.Proximal, I added a library of projection/proximal operators...in my experiments it looks like projection-based algorithms (SPG, for example) do not work that well for L1 and sparsity constraints, but work well for positivity and bounds, for example...I am thinking of extending breeze's linear CG / NNLS to handle simple projections and hopefully consolidating both into one linear CG with projection...

I support these constraints through a Cholesky/LDL-based ADMM solver, but I wanted to write an iterative version using linear CG to see if ADMM performance can be improved...for well-conditioned QPs, papers have found ADMM faster than FISTA, but I did not see comparisons with a linear CG variant...
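For readers unfamiliar with the operators being discussed, here is a small Python sketch of two of them: projection onto a box (covering the positivity and bound constraints that projection methods handle well) and the L1 proximal operator (soft-thresholding, which induces sparsity). Projected/proximal gradient methods apply one of these after each gradient step. This mirrors the idea behind breeze.optimize.proximal.Proximal but is not its API.

```python
def project_box(x, lo=0.0, hi=float("inf")):
    """Euclidean projection onto [lo, hi]^n -- handles positivity and bound constraints."""
    return [min(max(xi, lo), hi) for xi in x]

def prox_l1(x, lam):
    """Soft-thresholding: the proximal operator of lam * ||x||_1."""
    return [max(abs(xi) - lam, 0.0) * (1.0 if xi >= 0 else -1.0) for xi in x]

print(project_box([-0.5, 0.3, 2.0], lo=0.0, hi=1.0))  # [0.0, 0.3, 1.0]
print(prox_l1([-1.5, 0.25, 2.0], lam=0.5))            # [-1.0, 0.0, 1.5]
```

The distinction in the comment falls out of these: the box projection composes cleanly with a CG or SPG iteration, while L1 (soft-thresholding) shrinks rather than clips, which is why sparsity is usually handled by proximal methods or ADMM instead of pure projection.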
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-90942562

@tmyklebu do you have the original NNLS paper in English? Breeze also has a linear CG...I am wondering if it is possible to merge simple projections, like positivity and bounds, with the linear CG...CG-based linear solves can be extended to handle projection, similar to SPG...but NNLS looks like it does some specific optimization for x >= 0...can NNLS be extended to other projection/proximal operators?
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-90950364

The application is topic modeling, using sparsity constraints like L1 and the probability simplex, and supporting bounds in ALS.
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-90753041

@mengxr @josephk In my internal testing, I am finding the sparse formulations useful for extracting genre/topic information out of the netflix/movielens datasets. The formulations are:

1. Sparse coding: L2 on users/words, L1 on documents/movies
2. L2 on users/words, probability simplex on documents/movies

The reference, Sparse Latent Semantic Analysis (SDM 2011, some of it is implemented in GraphLab): https://www.cs.cmu.edu/~xichen/images/SLSA-sdm11-final.pdf, showed sparse coding producing better results than LDA...I am considering whether it makes sense to add a 20 newsgroups flow to the examples, as was shown in the paper? Also, do we have perplexity implemented so that we can start comparing topic models? The ALS runtimes with the sparse formulations are also pretty good.
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-90753271

Sure...let me do that and point you to the repo...most likely it will be a breeze-based branch and I will copy the mllib implementation over there...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89729377

I meant MAP...what's the MAP on the netflix dataset you have seen before, and with what lambda? I am running MAP experiments with various factorization formulations, including log-likelihood loss with normalization constraints...also, how do you define MAP for implicit feedback (binary dataset, click is 1 and no click is 0)? In the label set every rating is 1.0, so there is no ranking defined as such...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89729777

Agreed with the implicit MAP calculation. For the netflix dataset, I got 0.014...maybe I need to use better regularization...was that 0.05-0.1 number from using lambda = 0.065?
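Since MAP comes up repeatedly in this thread, a small Python sketch of average precision for top-N recommendation may help. The normalizer used here, min(|relevant|, N), is one common convention; implementations (including Spark's RankingMetrics) may normalize differently, so absolute numbers are not directly comparable across tools. For implicit feedback the relevant set is just the clicked items, which is why MAP stays well defined even though every label is 1.0.

```python
def average_precision(predicted, relevant):
    """Average the precision at each rank where a relevant item appears."""
    if not relevant or not predicted:
        return 0.0
    hits, score = 0, 0.0
    for i, p in enumerate(predicted):
        if p in relevant:
            hits += 1
            score += hits / (i + 1.0)   # precision at rank i+1
    return score / min(len(relevant), len(predicted))

def mean_average_precision(per_user):
    """per_user: iterable of (predicted ranking, relevant set) pairs."""
    aps = [average_precision(pred, rel) for pred, rel in per_user]
    return sum(aps) / len(aps)

# user 1: 2 of 3 relevant items retrieved, at ranks 1 and 3
ap1 = average_precision([10, 99, 30], {10, 30, 40})
# user 2: perfect top-1
ap2 = average_precision([7], {7})
```

For an implicit (click/no-click) dataset, `relevant` is the held-out click set per user, so a rank-dependent metric like MAP still rewards putting clicked items early, even though the labels carry no ordering among themselves.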
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27769592

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -167,23 +169,66 @@ object MovieLensALS {
       .setProductBlocks(params.numProductBlocks)
       .run(training)

-    val rmse = computeRmse(model, test, params.implicitPrefs)
-
-    println(s"Test RMSE = $rmse.")
+    params.metrics match {
+      case "rmse" =>
+        val rmse = computeRmse(model, test, params.implicitPrefs)
+        println(s"Test RMSE = $rmse")
+      case "map" =>
+        val (map, users) = computeRankingMetrics(model, training, test, numMovies.toInt)
+        println(s"Test users $users MAP $map")
+      case _ => println(s"Metrics not defined, options are rmse/map")
+    }

     sc.stop()
   }

   /** Compute RMSE (Root Mean Squared Error). */
-  def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean)
-    : Double = {
-
-    def mapPredictedRating(r: Double) = if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
-
+  def computeRmse(
+      model: MatrixFactorizationModel,
+      data: RDD[Rating],
+      implicitPrefs: Boolean): Double = {
     val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
-    val predictionsAndRatings = predictions.map{ x =>
-      ((x.user, x.product), mapPredictedRating(x.rating))
+    val predictionsAndRatings = predictions.map { x =>
+      ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
     }.join(data.map(x => ((x.user, x.product), x.rating))).values
     math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+    if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
+  }
+
+  /** Compute MAP (Mean Average Precision) statistics for top N product Recommendation */
+  def computeRankingMetrics(
+      model: MatrixFactorizationModel,
+      train: RDD[Rating],
+      test: RDD[Rating],
+      n: Int): (Double, Long) = {
+    val ord = Ordering.by[(Int, Double), Double](x => x._2)
+
+    val testUserLabels = test.map {
--- End diff --

I will update with topByKey. Is there a better place to move this function? Maybe inside the ALS object, for example? That way I can add a test case to guard it.
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-89594722

@mengxr any insight on it? The runtime issue is only in the first iteration, and I think you can point out if there is any obvious issue in the way I call the solver...looks like something to do with initialization...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89697236

@srowen For the netflix dataset, what's the MAP you have seen before? I started experiments on the netflix dataset...lambda is 0.065 for netflix as well, right? For MovieLens, 0.065 works well...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-89706247

@coderxiang @mengxr If I have a dataset with implicit feedback (click or 0), then MAP is not that well defined, right, since in the label set everything is 1.0 and so there is no ordering defined...should we add a rank-independent metric for implicit datasets?
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27712646

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -138,14 +141,122 @@ class MatrixFactorizationModel(
   }

   private def recommend(
-      recommendToFeatures: Array[Double],
-      recommendableFeatures: RDD[(Int, Array[Double])],
-      num: Int): Array[(Int, Double)] = {
-    val scored = recommendableFeatures.map { case (id,features) =>
-      (id, blas.ddot(features.length, recommendToFeatures, 1, features, 1))
+      recommendToFeatures: Array[Double],
+      recommendableFeatures: RDD[(Int, Array[Double])],
+      num: Int): Array[(Int, Double)] = {
+    val recommendToVector = Vectors.dense(recommendToFeatures)
+    val scored = recommendableFeatures.map {
+      case (id, features) =>
+        (id, BLAS.dot(recommendToVector, Vectors.dense(features)))
     }
     scored.top(num)(Ordering.by(_._2))
   }
+
+  /**
+   * Recommends topK products for all users
+   *
+   * @param num how many products to return for every user.
+   * @return [(Int, Array[Rating])] objects, where every tuple contains a userID and an array of
+   *         rating objects which contains the same userId, recommended productID and a score in
+   *         the rating field. Semantics of score is same as recommendProducts API
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = userFeatures.map { x => (x._1, num) }
--- End diff --

For cross validation we use a variable num internally, but for the final recommendation a global num is fine...I thought having a topK RDD satisfies both use cases...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88346990

I reran the MAP computation on MovieLens with varying ranks. Example run:

./bin/spark-submit --master spark://TUSCA09LMLVT00C.local:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 4 --executor-memory 4g --driver-memory 1g ./examples/target/spark-examples_2.10-1.4.0-SNAPSHOT.jar --lambda 0.065 --metrics map ~/datasets/ml-1m/ratings.dat

rank = default
Got 1000209 ratings from 6040 users on 3706 movies. Training: 800187, test: 200022.
Test users 6035 MAP 0.03499984595868497

rank = 25
Got 1000209 ratings from 6040 users on 3706 movies. Training: 799385, test: 200824.
Test users 6034 MAP 0.042580954047373255

rank = 50
Got 1000209 ratings from 6040 users on 3706 movies. Training: 800289, test: 199920.
Test users 6036 MAP 0.048958415806933275

rank = 100
Got 1000209 ratings from 6040 users on 3706 movies. Training: 801148, test: 199061.
Test users 6038 MAP 0.05503487765882986

The numbers are consistent with my earlier runs.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88347022

@mengxr could you please do another pass? I might have missed the JavaRDD compatibility issue, but fixed the rest of your comments...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27533769

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
     recommend(productFeatures.lookup(product).head, userFeatures, num)
       .map(t => Rating(t._1, product, t._2))

+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the given productID and a
+   *         score in the rating field. Each represents one recommended user, and they are sorted
+   *         by score, decreasing. The first returned is the one predicted to be most strongly
+   *         recommended to the product. The score is an opaque value that indicates how strongly
+   *         recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = userFeatures.map { x => (x._1, num) }
+    recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = productFeatures.map { x => (x._1, num) }
+    recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x => x.rating)
--- End diff --

I am a bit confused...recommendProducts is also a public member, but that's not in the companion object...recommendProductsForUsers is a very similar API, right?
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27525485

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -74,6 +75,9 @@ object MovieLensALS {
       opt[Unit]("implicitPrefs")
         .text("use implicit preference")
         .action((_, c) => c.copy(implicitPrefs = true))
+      opt[Unit]("validateRecommendation")
--- End diff --

Cleaned up --validateRecommendation to --metrics.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528071

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
@@ -171,18 +175,62 @@ object MovieLensALS {

     println(s"Test RMSE = $rmse.")

+    if (params.validateRecommendation) {
+      val (map, users) = computeRankingMetrics(model,
+        training, test, numMovies.toInt)
+      println(s"Test users $users MAP $map")
+    }
+
     sc.stop()
   }

   /** Compute RMSE (Root Mean Squared Error). */
   def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], implicitPrefs: Boolean) = {
-
-    def mapPredictedRating(r: Double) = if (implicitPrefs) math.max(math.min(r, 1.0), 0.0) else r
-
     val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
-    val predictionsAndRatings = predictions.map{ x =>
-      ((x.user, x.product), mapPredictedRating(x.rating))
+    val predictionsAndRatings = predictions.map { x =>
+      ((x.user, x.product), mapPredictedRating(x.rating, implicitPrefs))
     }.join(data.map(x => ((x.user, x.product), x.rating))).values
     math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).mean())
   }
+
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+    if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
+    else r
+  }
+
+  /**
+   * Compute MAP (Mean Average Precision) statistics for top N product Recommendation
+   */
+  def computeRankingMetrics(model: MatrixFactorizationModel,
--- End diff --

Followed the indentation from the current code.
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528120

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+      train: RDD[Rating], test: RDD[Rating], n: Int) = {
--- End diff --

added
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528991

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+  def computeRankingMetrics(model: MatrixFactorizationModel,
+      train: RDD[Rating], test: RDD[Rating], n: Int) = {
+
+    val ord = Ordering.by[(Int, Double), Double](x => x._2)
+
+    val testUserLabels = test.map {
+      x => (x.user, (x.product, x.rating))
+    }.groupByKey.map {
+      case (userId, products) =>
+        val sortedProducts = products.toArray.sorted(ord.reverse)
+        (userId, sortedProducts.map { _._1 })
+    }
+
+    val trainUserLabels = train.map {
+      x => (x.user, x.product)
+    }.groupByKey.map {
+      case (userId, products) => (userId, products.toArray)
+    }
+
+    val rankings = model.recommendProductsForUsers(n).join(trainUserLabels).map {
+      case (userId, (pred, train)) => {
+        val predictedProducts = pred.map { _.product }
--- End diff --

done
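The computeRankingMetrics hunk above builds each user's ranked predicted products and their relevant test products so MAP can be computed. A self-contained sketch of the mean-average-precision arithmetic itself, with plain collections and illustrative names (the production path would use MLlib's RankingMetrics over RDDs):

```scala
// Sketch of the MAP statistic that computeRankingMetrics reports,
// over in-memory (predicted ranking, relevant set) pairs per user.
object MapSketch {
  // Average precision of one user's ranked predictions against the relevant set.
  def averagePrecision(predicted: Seq[Int], relevant: Set[Int]): Double = {
    if (relevant.isEmpty) return 0.0
    var hits = 0
    var sum = 0.0
    predicted.zipWithIndex.foreach { case (p, i) =>
      if (relevant.contains(p)) {
        hits += 1
        sum += hits.toDouble / (i + 1) // precision at this cut-off
      }
    }
    sum / math.min(predicted.size, relevant.size)
  }

  // MAP: mean of the per-user average precisions.
  def meanAveragePrecision(users: Seq[(Seq[Int], Set[Int])]): Double =
    users.map { case (pred, rel) => averagePrecision(pred, rel) }.sum / users.size
}
```

For a user with ranking (1, 2, 3) and relevant set {1, 3}, the hits at ranks 1 and 3 give (1/1 + 2/3) / 2 = 5/6.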
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528959

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+    val trainUserLabels = train.map {
+      x => (x.user, x.product)
+    }.groupByKey.map {
+      case (userId, products) => (userId, products.toArray)
--- End diff --

merged
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27525568

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+  def mapPredictedRating(r: Double, implicitPrefs: Boolean) = {
+    if (implicitPrefs) math.max(math.min(r, 1.0), 0.0)
--- End diff --

fixed...can be fit in one line
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528198

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+    val testUserLabels = test.map {
+      x => (x.user, (x.product, x.rating))
+    }.groupByKey.map {
--- End diff --

fixed
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27528238

--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/MovieLensALS.scala ---
+    }.groupByKey.map {
+      case (userId, products) =>
+        val sortedProducts = products.toArray.sorted(ord.reverse)
+        (userId, sortedProducts.map { _._1 })
--- End diff --

fixed
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529347

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -35,33 +41,33 @@ import org.apache.spark.rdd.RDD
 *        and the features computed for this product.
 */
class MatrixFactorizationModel private[mllib] (
-    val rank: Int,
-    val userFeatures: RDD[(Int, Array[Double])],
-    val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {
+  val rank: Int,
+  val userFeatures: RDD[(Int, Array[Double])],
+  val productFeatures: RDD[(Int, Array[Double])]) extends Serializable {

   /** Predict the rating of one user for one product. */
   def predict(user: Int, product: Int): Double = {
-    val userVector = new DoubleMatrix(userFeatures.lookup(user).head)
-    val productVector = new DoubleMatrix(productFeatures.lookup(product).head)
-    userVector.dot(productVector)
+    val userVector = Vectors.dense(userFeatures.lookup(user).head)
--- End diff --

I cleaned netlib.ddot to BLAS.dot...they will be the same for these cases
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529308

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
class MatrixFactorizationModel private[mllib] (
-    val rank: Int,
+  val rank: Int,
--- End diff --

after merge this is fixed
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529218

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -17,14 +17,20 @@
package org.apache.spark.mllib.recommendation

-import java.lang.{Integer => JavaInteger}
-
-import org.jblas.DoubleMatrix
+import java.lang.{ Integer => JavaInteger }

import org.apache.spark.SparkContext._
-import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
+import org.apache.spark.api.java.{ JavaPairRDD, JavaRDD }
import org.apache.spark.rdd.RDD
+import org.apache.spark.util.collection.Utils
+import org.apache.spark.util.BoundedPriorityQueue
+
+import scala.Ordering
--- End diff --

By organizing imports, do you mean that imports from the same package will be merged onto one line? Right?

Old:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.BLAS

New:
import org.apache.spark.mllib.linalg.{Vectors, Vector, BLAS}
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529231

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
-import org.apache.spark.api.java.{JavaPairRDD, JavaRDD}
+import org.apache.spark.api.java.{ JavaPairRDD, JavaRDD }
--- End diff --

cleaned
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27529681

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
@@ -103,13 +109,106 @@ class MatrixFactorizationModel private[mllib] (
     recommend(productFeatures.lookup(product).head, userFeatures, num)
       .map(t => Rating(t._1, product, t._2))

+  /**
+   * Recommends topK users/products.
+   *
+   * @param num how many users to return. The number returned may be less than this.
+   * @return [Array[Rating]] objects, each of which contains a userID, the given productID and a
+   *         score in the rating field. Each represents one recommended user, and they are sorted
+   *         by score, decreasing. The first returned is the one predicted to be most strongly
+   *         recommended to the product. The score is an opaque value that indicates how strongly
+   *         recommended the user is.
+   */
+
+  /**
+   * Recommend topK products for all users
+   */
+  def recommendProductsForUsers(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = userFeatures.map { x => (x._1, num) }
+    recommendProductsForUsers(topK)
+  }
+
+  /**
+   * Recommend topK users for all products
+   */
+  def recommendUsersForProducts(num: Int): RDD[(Int, Array[Rating])] = {
+    val topK = productFeatures.map { x => (x._1, num) }
+    recommendUsersForProducts(topK)
+  }
+
+  val ord = Ordering.by[Rating, Double](x => x.rating)
+  case class FeatureTopK(feature: Vector, topK: Int)
+
+  /**
+   * Recommend topK products for users in userTopK RDD
+   */
+  def recommendProductsForUsers(
+      userTopK: RDD[(Int, Int)]): RDD[(Int, Array[Rating])] = {
+    val userFeaturesTopK = userFeatures.join(userTopK).map {
+      case (userId, (userFeature, topK)) =>
+        (userId, FeatureTopK(Vectors.dense(userFeature), topK))
+    }
+    val productVectors = productFeatures.map {
+      x => (x._1, Vectors.dense(x._2))
+    }.collect

+    userFeaturesTopK.map {
+      case (userId, userFeatureTopK) => {
+        val predictions = productVectors.map {
+          case (productId, productVector) =>
+            Rating(userId, productId,
+              BLAS.dot(userFeatureTopK.feature, productVector))
--- End diff --

I will bring in a lot of level 3 BLAS in the next PR...I am writing the dgemv and dgemm versions for several of these APIs...For now I will add a TODO
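The batch recommendation path above scores every product factor vector against a user's factor vector with BLAS.dot and keeps the top K by score. A minimal sketch of that per-user scoring step, with plain arrays and a sort standing in for BLAS.dot and the priority-queue selection (names here are illustrative, not the PR's API):

```scala
// Sketch of the per-user top-K scoring inside recommendProductsForUsers:
// dot every product factor against the user factor, keep the k best.
object RecommendSketch {
  // Stand-in for BLAS.dot on dense factor vectors.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Score all products for one user, return the k highest-scoring (id, score) pairs.
  def recommend(userFeature: Array[Double],
                products: Seq[(Int, Array[Double])],
                k: Int): Seq[(Int, Double)] =
    products.map { case (id, f) => (id, dot(userFeature, f)) }
      .sortBy(-_._2)
      .take(k)
}
```

The real implementation avoids the full sort per user (a bounded priority queue is O(n log k)), and stacking the product factors into a matrix turns the inner loop into a single dgemv, which is the level-3/level-2 BLAS upgrade the comment refers to.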
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on a diff in the pull request: https://github.com/apache/spark/pull/3098#discussion_r27535273

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala ---
+  /**
+   * Recommend topK products for users in userTopK RDD
--- End diff --

documented the public batch prediction APIs
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88291470

@mengxr I also added 2 test cases for the batch predict APIs. These features are useful if users are interested in computing MAP measures...Let me know if I should move the functions computeRankingMetrics and computeRMSE to the companion object of ml.recommendation.ALS? Currently both of them are in examples...
[GitHub] spark pull request: [MLLIB] SPARK-4231, SPARK-3066: Add RankingMet...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3098#issuecomment-88292172

If we move computeRankingMetrics and computeRMSE to a better place, I can guard them through tests...
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-87342283

What are MiMa tests? I am a bit confused about them...
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-87276063

Updated the PR with breeze 0.11.2...Except the first iteration, the rest of them are at par:

Breeze NNLS:
TUSCA09LMLVT00C:spark-brznnls v606014$ grep solveTime ./work/app-20150328110507-0003/0/stderr
15/03/28 11:05:16 INFO ALS: solveTime 228.358 ms
15/03/28 11:05:16 INFO ALS: solveTime 80.773 ms
15/03/28 11:05:17 INFO ALS: solveTime 96.837 ms
15/03/28 11:05:17 INFO ALS: solveTime 92.252 ms
15/03/28 11:05:18 INFO ALS: solveTime 55.923 ms
15/03/28 11:05:18 INFO ALS: solveTime 53.503 ms
15/03/28 11:05:19 INFO ALS: solveTime 96.827 ms
15/03/28 11:05:20 INFO ALS: solveTime 99.835 ms
15/03/28 11:05:20 INFO ALS: solveTime 56.032 ms
15/03/28 11:05:21 INFO ALS: solveTime 55.832 ms

mllib NNLS:
TUSCA09LMLVT00C:spark-brznnls v606014$ grep solveTime ./work/app-20150328110532-0004/0/stderr
15/03/28 11:05:41 INFO ALS: solveTime 92.086 ms
15/03/28 11:05:41 INFO ALS: solveTime 59.103 ms
15/03/28 11:05:42 INFO ALS: solveTime 80.177 ms
15/03/28 11:05:42 INFO ALS: solveTime 78.755 ms
15/03/28 11:05:43 INFO ALS: solveTime 51.966 ms
15/03/28 11:05:43 INFO ALS: solveTime 46.426 ms
15/03/28 11:05:44 INFO ALS: solveTime 93.656 ms
15/03/28 11:05:44 INFO ALS: solveTime 84.458 ms
15/03/28 11:05:45 INFO ALS: solveTime 49.22 ms
15/03/28 11:05:45 INFO ALS: solveTime 45.626 ms

export solver=mllib runs the mllib NNLS...I will wait for the feedback...
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-86949884

@mengxr any updates on it? breeze 0.11.2 is now integrated with Spark
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-86950106

@mengxr any updates on it? breeze 0.11.2 is now integrated with Spark...I can clean up the PR for reviews
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-87165211

I integrated with Breeze 0.11.2. The only visible difference is the first iteration.

Breeze QuadraticMinimizer:
TUSCA09LMLVT00C:spark-qp-als v606014$ grep solveTime ./work/app-20150327221722-/0/stderr
15/03/27 22:17:32 INFO ALS: solveTime 234.153 ms
15/03/27 22:17:32 INFO ALS: solveTime 82.499 ms
15/03/27 22:17:33 INFO ALS: solveTime 83.579 ms
15/03/27 22:17:33 INFO ALS: solveTime 83.039 ms
15/03/27 22:17:34 INFO ALS: solveTime 35.545 ms
15/03/27 22:17:34 INFO ALS: solveTime 30.707 ms
15/03/27 22:17:35 INFO ALS: solveTime 53.025 ms
15/03/27 22:17:36 INFO ALS: solveTime 53.021 ms
15/03/27 22:17:36 INFO ALS: solveTime 31.329 ms
15/03/27 22:17:37 INFO ALS: solveTime 32.136 ms

mllib CholeskySolver:
TUSCA09LMLVT00C:spark-qp-als v606014$ grep solveTime ./work/app-20150327221803-0001/0/stderr
15/03/27 22:18:11 INFO ALS: solveTime 98.692 ms
15/03/27 22:18:12 INFO ALS: solveTime 38.997 ms
15/03/27 22:18:12 INFO ALS: solveTime 62.361 ms
15/03/27 22:18:13 INFO ALS: solveTime 60.316 ms
15/03/27 22:18:13 INFO ALS: solveTime 36.569 ms
15/03/27 22:18:14 INFO ALS: solveTime 36.321 ms
15/03/27 22:18:14 INFO ALS: solveTime 60.007 ms
15/03/27 22:18:15 INFO ALS: solveTime 59.771 ms
15/03/27 22:18:15 INFO ALS: solveTime 36.519 ms
15/03/27 22:18:16 INFO ALS: solveTime 38.295 ms

The visible difference is in the first 2 iterations, as shown in previous experiments as well. I fixed the random seed test now and so different runs will not produce the same result. I need this structure to build ALM, as ALM extends mllib.ALS and adds LossType in the constructor along with userConstraint and itemConstraint... Right now I am experimenting with LeastSquare (for tests with ALS) and LogLikelihood loss...

For this PR I have updated MovieLensALS with userConstraint and itemConstraint, and I am considering whether we should add a Sparse Coding formulation in examples now or bring that in a separate PR. I have not cleaned up CholeskySolver from ALS yet and am waiting for feedback, but I have added test cases in ml.ALSSuite for all the constraints. At the ALS flow level I need to construct more test cases, and I can bring them in a separate PR as well...
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-85814758

@mengxr I discussed with David, and the only reason I can think of is that inside the solvers I am using DenseMatrix and DenseVector in place of primitive arrays for workspace creation; that might be causing the first-iteration runtime difference, due to loading up the interface classes and other features that come with DenseMatrix and DenseVector... I can move to primitive arrays for the workspace, but then the code will look ugly... Let me know if I should? I am surprised that this issue does not show up after the first call!
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-84827225

I looked more into it, and I will open up an API in the Breeze QuadraticMinimizer where, in place of a DenseMatrix gram, an upper triangular gram can be passed in. But the inner workspace has to be n x n, because for Cholesky we need to compute LL' and for a quasi-definite system we have to compute LDL' / LU, and both of them need n x n space... so I won't be able to decrease the QuadraticMinimizer workspace size... For dposv, BLAS allocates the memory for LL' and it is not visible to the user...
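[Editor's note] The point about the n x n workspace can be seen from the factorization itself: even if the input gram is supplied as an upper triangle, the factor L occupies a full n x n array. A hedged NumPy sketch of an unpivoted LDL' (valid for positive-definite matrices; real quasi-definite solvers like the one discussed add more machinery):

```python
import numpy as np

def ldl_no_pivot(A):
    """Unpivoted LDL' factorization: A = L @ diag(D) @ L.T.
    Works for symmetric positive-definite input; the factor L needs a
    full n x n workspace even if A is stored as an upper triangle."""
    n = A.shape[0]
    L = np.eye(n)
    D = np.zeros(n)
    for j in range(n):
        # diagonal entry: A[j,j] minus contributions of earlier columns
        D[j] = A[j, j] - (L[j, :j] ** 2) @ D[:j]
        for i in range(j + 1, n):
            L[i, j] = (A[i, j] - (L[i, :j] * L[j, :j]) @ D[:j]) / D[j]
    return L, D

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 4))
gram = X.T @ X + 0.1 * np.eye(4)   # SPD Gram matrix, as in the ALS subproblems
L, D = ldl_no_pivot(gram)
```

For dposv the same LL' storage exists, it is just allocated inside LAPACK (in-place over the input triangle) rather than by the caller.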
[GitHub] spark pull request: [ML] SPARK-2426: Integrate Breeze NNLS with ML...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/5005#issuecomment-85348266

All the runtime enhancements are being added to Breeze in this PR: https://github.com/scalanlp/breeze/pull/386 Please let me know if there is additional feedback.
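[Editor's note] The NNLS problem referenced in this PR is min 0.5 x'Hx + c'x subject to x >= 0. The Breeze solver is CG-based; as a simple stand-in, a projected-gradient sketch in NumPy (function name and step-size choice are mine):

```python
import numpy as np

def nnls_projected_gradient(H, c, iters=2000):
    """Minimize 0.5*x'Hx + c'x subject to x >= 0 by projected gradient.
    Step size 1/||H||_2 guarantees descent for a convex quadratic."""
    n = H.shape[0]
    x = np.zeros(n)
    step = 1.0 / np.linalg.norm(H, 2)
    for _ in range(iters):
        # gradient step followed by projection onto the nonnegative orthant
        x = np.maximum(0.0, x - step * (H @ x + c))
    return x

rng = np.random.default_rng(7)
A = rng.standard_normal((50, 4))
x_true = np.array([1.0, 0.0, 2.0, 0.5])   # nonnegative ground truth
b = A @ x_true
H, c = A.T @ A, -(A.T @ b)
x = nnls_projected_gradient(H, c)
```

Because the unconstrained minimizer here is already nonnegative, the constrained solution coincides with it, which makes the sketch easy to check; the CG variant in the Breeze PR converges far faster on ill-conditioned grams.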
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-85351062

All the runtime enhancements are being added to Breeze in this PR: https://github.com/scalanlp/breeze/pull/386 Please let me know if there is additional feedback.
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-85161041

@mengxr I added the optimization for the lower triangular matrix and now they are very close... Let me know what you think and if there are any other tricks you would like me to try... Note that with these optimizations, QuadraticMinimizer with the POSITIVE constraint will also run much faster.

Breeze QuadraticMinimizer (default):

unset solver; ./bin/spark-submit --master spark://tusca09lmlvt00c.uswin.ad.vzwcorp.com:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 1 ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --rank 50 --numIterations 2 ~/datasets/ml-1m/ratings.dat
Got 1000209 ratings from 6040 users on 3706 movies. Training: 800670, test: 199539.
Quadratic minimization userConstraint SMOOTH productConstraint SMOOTH
Running Breeze QuadraticMinimizer for users with constraint SMOOTH
Running Breeze QuadraticMinimizer for items with constraint SMOOTH
Test RMSE = 2.4985081126233846.
15/03/23 12:26:55 INFO ALS: solveTime 205.379 ms
15/03/23 12:26:55 INFO ALS: solveTime 72.116 ms
15/03/23 12:26:56 INFO ALS: solveTime 74.034 ms
15/03/23 12:26:56 INFO ALS: solveTime 77.379 ms
15/03/23 12:26:57 INFO ALS: solveTime 36.532 ms
15/03/23 12:26:57 INFO ALS: solveTime 29.775 ms
15/03/23 12:26:58 INFO ALS: solveTime 48.925 ms
15/03/23 12:26:58 INFO ALS: solveTime 51.904 ms
15/03/23 12:26:59 INFO ALS: solveTime 30.882 ms
15/03/23 12:26:59 INFO ALS: solveTime 30.658 ms

ML CholeskySolver:

export solver=mllib; ./bin/spark-submit --master spark://tusca09lmlvt00c.uswin.ad.vzwcorp.com:7077 --class org.apache.spark.examples.mllib.MovieLensALS --jars ~/.m2/repository/com/github/scopt/scopt_2.10/3.2.0/scopt_2.10-3.2.0.jar --total-executor-cores 1 ./examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar --rank 50 --numIterations 2 ~/datasets/ml-1m/ratings.dat
Got 1000209 ratings from 6040 users on 3706 movies. Training: 800670, test: 199539.
Quadratic minimization userConstraint SMOOTH productConstraint SMOOTH
Test RMSE = 2.4985081126233846.
TUSCA09LMLVT00C:spark-qp-als v606014$ grep solveTime ./work/app-20150323122612-0002/0/stderr
15/03/23 12:26:20 INFO ALS: solveTime 102.243 ms
15/03/23 12:26:21 INFO ALS: solveTime 38.195 ms
15/03/23 12:26:21 INFO ALS: solveTime 60.583 ms
15/03/23 12:26:22 INFO ALS: solveTime 59.882 ms
15/03/23 12:26:22 INFO ALS: solveTime 36.59 ms
15/03/23 12:26:23 INFO ALS: solveTime 36.021 ms
15/03/23 12:26:23 INFO ALS: solveTime 59.271 ms
15/03/23 12:26:24 INFO ALS: solveTime 59.217 ms
15/03/23 12:26:24 INFO ALS: solveTime 36.344 ms
15/03/23 12:26:25 INFO ALS: solveTime 35.838 ms

I am running only 2 iterations, but you can see that in the tail the solvers run on par...
[GitHub] spark pull request: [MLLib]SPARK-5027:add SVMWithLBFGS interface i...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3890#issuecomment-84624771

Can we discuss this in the JIRA? For SVM with OWLQN, what is the orthant-wise constraint you are adding? There are ways to handle the non-differentiability of the max in BFGS as well, but I am not sure how well they work...
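[Editor's note] The non-differentiability referred to above comes from the hinge loss max(0, 1 - y * w'x), which has a kink at margin 1. A common pragmatic choice when feeding it to quasi-Newton code is to use a fixed subgradient at the kink. A NumPy sketch (names are mine; this is not the Spark implementation):

```python
import numpy as np

def hinge_loss_and_subgradient(w, X, y):
    """Hinge loss sum(max(0, 1 - y * (X @ w))) and one valid subgradient.
    At the kink (margin exactly 1) we pick the zero branch, a standard
    choice when using a nonsmooth loss with (L-)BFGS-style solvers."""
    margins = 1.0 - y * (X @ w)
    loss = np.maximum(0.0, margins).sum()
    active = margins > 0.0             # strictly positive margins contribute
    grad = -X[active].T @ y[active]    # d/dw of the active hinge terms
    return loss, grad

# tiny worked example: two unit points, one per class, at w = 0
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
loss, grad = hinge_loss_and_subgradient(np.zeros(2), X, y)
```

Alternatives the thread alludes to include the squared hinge (smooth, plain L-BFGS applies) and OWL-QN, which handles an added L1 term by restricting steps to an orthant rather than by smoothing the loss.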
[GitHub] spark pull request: [ML][MLLIB] SPARK-2426: Integrate Breeze Quadr...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/3221#issuecomment-84643641

I am adding ml.QuadraticSolver tests that build upon the normal equations (similar to the CholeskySolver tests) for 1 - 5 basically... will update in a bit...
[GitHub] spark pull request: [MLLIB] [spark-2352] Implementation of an Arti...
Github user debasish83 commented on the pull request: https://github.com/apache/spark/pull/1290#issuecomment-84708094

@witgo there are a lot of useful building blocks in your RBM PR... are you planning to consolidate them in this PR?