[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-03-12 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395014#comment-16395014 ]

Valeriy Avanesov commented on SPARK-23437:
--

So, the basic implementation is ready. Please feel free to try it out. 

> [ML] Distributed Gaussian Process Regression for MLlib
> --
>
> Key: SPARK-23437
> URL: https://issues.apache.org/jira/browse/SPARK-23437
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 2.2.1
>Reporter: Valeriy Avanesov
>Assignee: Apache Spark
>Priority: Major
>
> Gaussian Process Regression (GP) is a well-known black-box non-linear 
> regression approach [1]. For years the approach remained inapplicable to 
> large samples due to its cubic computational complexity; however, more recent 
> techniques (Sparse GP) have reduced the complexity to linear. The field 
> continues to attract the interest of researchers – several papers devoted to 
> GP were presented at NIPS 2017. 
> Unfortunately, the non-parametric regression techniques shipped with MLlib 
> are restricted to tree-based approaches.
> I propose to create and include an implementation (which I am going to work 
> on) of the so-called robust Bayesian Committee Machine proposed and 
> investigated in [2].
> [1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
> Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
> The MIT Press.
> [2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian 
> processes. In _Proceedings of the 32nd International Conference on Machine 
> Learning - Volume 37_ (ICML'15), Francis Bach and David Blei (Eds.). 
> JMLR.org, 1481-1490.
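
For readers unfamiliar with [2]: the robust Bayesian Committee Machine (rBCM) fits 
independent GP experts on partitions of the data and combines their predictions by 
precision-weighted averaging against the prior. Below is a minimal sketch of the 
aggregation step only, assuming each expert has already produced a predictive mean 
and variance at a test point; the names are illustrative, not the spark-gp API.

{code:scala}
// Hedged sketch of rBCM prediction aggregation (Deisenroth & Ng, 2015).
// priorVar is the prior predictive variance sigma_**^2 of the GP at the test point.
case class ExpertPrediction(mean: Double, variance: Double)

def rbcmAggregate(experts: Seq[ExpertPrediction], priorVar: Double): ExpertPrediction = {
  // beta_k weighs expert k by its entropy reduction relative to the prior
  val betas = experts.map(e => 0.5 * (math.log(priorVar) - math.log(e.variance)))
  // Aggregated precision: weighted expert precisions plus a correction toward the prior
  val precision =
    experts.zip(betas).map { case (e, b) => b / e.variance }.sum +
      (1.0 - betas.sum) / priorVar
  val variance = 1.0 / precision
  // Aggregated mean: precision-weighted combination of the expert means
  val mean = variance * experts.zip(betas).map { case (e, b) => b * e.mean / e.variance }.sum
  ExpertPrediction(mean, variance)
}
{code}

Since each expert trains only on its partition, training is cubic in the partition 
size rather than the sample size, and the aggregation itself is a cheap reduce, 
which is what makes the method a good fit for Spark.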






[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-03-02 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383980#comment-16383980 ]

Valeriy Avanesov commented on SPARK-23437:
--

I've created a repo: https://github.com/akopich/spark-gp







[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-03-01 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381831#comment-16381831 ]

Valeriy Avanesov commented on SPARK-23437:
--

What does the assignment to Apache Spark mean?







[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-17 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16368198#comment-16368198 ]

Valeriy Avanesov commented on SPARK-23437:
--

[~sethah], thanks for your input.

I believe GPflow implements linear-time GP. However, it is not distributed. 

Regarding the investigation of user demand: can't we just hold a vote among the 
users? 







[jira] [Commented] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-16 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366857#comment-16366857 ]

Valeriy Avanesov commented on SPARK-23437:
--

[~mlnick], is that really supposed to happen to a textbook algorithm filling a 
vacuum? There are currently no non-parametric regression techniques provided by 
MLlib that infer a smooth function. 

Regarding the guidelines: the requirements for the algorithm are 
 # Be widely known
 # Be used and accepted (academic citations and concrete use cases can help 
justify this)
 # Be highly scalable

and I think all of them hold (see the original post). 







[jira] [Updated] (SPARK-23437) [ML] Distributed Gaussian Process Regression for MLlib

2018-02-15 Thread Valeriy Avanesov (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-23437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Valeriy Avanesov updated SPARK-23437:
-
Summary: [ML] Distributed Gaussian Process Regression for MLlib  (was: 
Distributed Gaussian Process Regression for MLlib)







[jira] [Created] (SPARK-23437) Distributed Gaussian Process Regression for MLlib

2018-02-15 Thread Valeriy Avanesov (JIRA)
Valeriy Avanesov created SPARK-23437:


 Summary: Distributed Gaussian Process Regression for MLlib
 Key: SPARK-23437
 URL: https://issues.apache.org/jira/browse/SPARK-23437
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Affects Versions: 2.2.1
Reporter: Valeriy Avanesov


Gaussian Process Regression (GP) is a well-known black-box non-linear regression 
approach [1]. For years the approach remained inapplicable to large samples due 
to its cubic computational complexity; however, more recent techniques (Sparse 
GP) have reduced the complexity to linear. The field continues to attract the 
interest of researchers – several papers devoted to GP were presented at 
NIPS 2017. 

Unfortunately, the non-parametric regression techniques shipped with MLlib are 
restricted to tree-based approaches.

I propose to create and include an implementation (which I am going to work on) 
of the so-called robust Bayesian Committee Machine proposed and investigated in [2].

[1] Carl Edward Rasmussen and Christopher K. I. Williams. 2005. _Gaussian 
Processes for Machine Learning (Adaptive Computation and Machine Learning)_. 
The MIT Press.

[2] Marc Peter Deisenroth and Jun Wei Ng. 2015. Distributed Gaussian processes. 
In _Proceedings of the 32nd International Conference on Machine Learning - 
Volume 37_ (ICML'15), Francis Bach and David Blei (Eds.). JMLR.org, 1481-1490.

 






[jira] [Comment Edited] (SPARK-5564) Support sparse LDA solutions

2017-08-12 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123803#comment-16123803 ]

Valeriy Avanesov edited comment on SPARK-5564 at 8/12/17 10:29 AM:
---

I am considering working on this issue. The question is whether there should be 
a separate EMLDAOptimizerVorontsov or whether the existing EMLDAOptimizer should 
be rewritten.

[~josephkb], what are your thoughts?


was (Author: acopich):
I am considering working on this issue. The question is whether there should be 
another EMLDAOptimizerVorontsov or shall the existing EMLDAOptimizer be 
re-written.



> Support sparse LDA solutions
> 
>
> Key: SPARK-5564
> URL: https://issues.apache.org/jira/browse/SPARK-5564
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>
> Latent Dirichlet Allocation (LDA) currently requires that the priors’ 
> concentration parameters be > 1.0.  It should support values > 0.0, which 
> should encourage sparser topics (phi) and document-topic distributions 
> (theta).
> For EM, this will require adding a projection to the M-step, as in: Vorontsov 
> and Potapenko. "Tutorial on Probabilistic Topic Modeling : Additive 
> Regularization for Stochastic Matrix Factorization." 2014.
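
For concreteness, a minimal sketch of such a projected M-step for a single topic, 
assuming expected word-topic counts n_wk and a concentration parameter beta that 
may lie in (0, 1). This only illustrates the truncation idea from the cited 
tutorial; it is not existing MLlib code.

{code:scala}
// Hedged sketch: MAP M-step for one topic with a Dirichlet prior whose
// concentration beta may be below 1. The raw update n_wk + beta - 1 can then be
// negative, so it is truncated at zero, which zeroes out rare terms and yields
// sparse topics; the result is renormalized over the vocabulary.
def projectedMStep(counts: Array[Double], beta: Double): Array[Double] = {
  val truncated = counts.map(n => math.max(0.0, n + beta - 1.0))
  val norm = truncated.sum
  if (norm == 0.0) truncated else truncated.map(_ / norm)
}
{code}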






[jira] [Commented] (SPARK-14371) OnlineLDAOptimizer should not collect stats for each doc in mini-batch to driver

2017-08-12 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-14371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16124531#comment-16124531 ]

Valeriy Avanesov commented on SPARK-14371:
--

Hi,

I opened a PR for this JIRA yesterday:
https://github.com/apache/spark/pull/18924

However, something seems to be wrong -- the JIRA is still not "In Progress" and 
the PR is not linked to it. Could anyone please check what's wrong? 

> OnlineLDAOptimizer should not collect stats for each doc in mini-batch to 
> driver
> 
>
> Key: SPARK-14371
> URL: https://issues.apache.org/jira/browse/SPARK-14371
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> See this line: 
> https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L437
> The second element in each row of "stats" is a list with one Vector for each 
> document in the mini-batch.  Those are collected to the driver in this line:
> https://github.com/apache/spark/blob/5743c6476dbef50852b7f9873112a2d299966ebd/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L456
> We should not collect those to the driver.  Rather, we should do the 
> necessary maps and aggregations in a distributed manner.  This will involve 
> modifying the Dirichlet expectation implementation.  (This JIRA should be done 
> by someone knowledgeable about online LDA and Spark.)
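
A minimal sketch of the intended pattern, assuming the per-document statistics 
have already been summed into one Breeze matrix per partition; the names are 
illustrative, not the actual LDAOptimizer internals.

{code:scala}
// Hedged sketch: aggregate sufficient statistics on the executors with
// treeAggregate instead of collecting per-document vectors to the driver.
import breeze.linalg.{DenseMatrix => BDM}
import org.apache.spark.rdd.RDD

def aggregateStats(stats: RDD[BDM[Double]], k: Int, vocabSize: Int): BDM[Double] =
  stats.treeAggregate(BDM.zeros[Double](k, vocabSize))(
    seqOp = (acc, m) => acc += m,  // fold each partition's statistics locally
    combOp = (a, b) => a += b      // merge partial sums across partitions
  )
{code}

Only one k x vocabSize matrix per partition then crosses the network, instead of 
one vector per document.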






[jira] [Commented] (SPARK-5564) Support sparse LDA solutions

2017-08-11 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16123803#comment-16123803 ]

Valeriy Avanesov commented on SPARK-5564:
-

I am considering working on this issue. The question is whether there should be 
a separate EMLDAOptimizerVorontsov or whether the existing EMLDAOptimizer should 
be rewritten.









[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2015-01-12 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14273551#comment-14273551 ]

Valeriy Avanesov commented on SPARK-1405:
-

[~josephkb], I've read your proposal, and I suggest considering Stochastic 
Gradient Langevin Dynamics [1]. It was shown to be ~100 times faster than Gibbs 
sampling [2], though I'm not sure whether it can be implemented in terms of RDDs. 

[1] 
http://papers.nips.cc/paper/4883-stochastic-gradient-riemannian-langevin-dynamics-on-the-probability-simplex.pdf
[2] http://www.ics.uci.edu/~sungjia/icml2014_dist_v0.2.pdf
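
For reference, a plain (non-Riemannian) SGLD step looks roughly like the sketch 
below; [1] additionally preconditions the dynamics to stay on the probability 
simplex. The names and signatures are illustrative.

{code:scala}
// Hedged sketch of a single SGLD update: a stochastic gradient step on the
// log-posterior plus Gaussian noise with variance eps (Welling & Teh style).
import scala.util.Random

def sgldStep(theta: Array[Double],
             gradLogPrior: Array[Double] => Array[Double],
             gradLogLikMinibatch: Array[Double] => Array[Double], // scaled by N/n
             eps: Double, rng: Random): Array[Double] = {
  val gPrior = gradLogPrior(theta)
  val gLik = gradLogLikMinibatch(theta)
  theta.indices.map { i =>
    theta(i) + 0.5 * eps * (gPrior(i) + gLik(i)) + math.sqrt(eps) * rng.nextGaussian()
  }.toArray
}
{code}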

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Xusen Yin
Assignee: Guoqiang Li
Priority: Critical
  Labels: features
 Attachments: performance_comparison.png

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Unlike the current machine learning algorithms 
 in MLlib, which use optimization algorithms such as gradient descent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation (import from Lucene), 
 and a Gibbs sampling core.






[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-12-11 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14243398#comment-14243398 ]

Valeriy Avanesov commented on SPARK-2426:
-

> What's the normalization constraint? Each row of W should sum up to 1 and 
> each column of H should sum up to 1, with positivity?
Yes.

> That is similar to PLSA, right, except that PLSA will have a bi-concave loss...
The loss is completely different... BTW, we've used a factorisation with the 
loss you've described as an initial approximation for PLSA. It gave a 
significant speed-up. 

 Quadratic Minimization for MLlib ALS
 

 Key: SPARK-2426
 URL: https://issues.apache.org/jira/browse/SPARK-2426
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Debasish Das
Assignee: Debasish Das
   Original Estimate: 504h
  Remaining Estimate: 504h

 Current ALS supports least squares and nonnegative least squares.
 I presented ADMM and IPM based Quadratic Minimization solvers to be used for 
 the following ALS problems:
 1. ALS with bounds
 2. ALS with L1 regularization
 3. ALS with Equality constraint and bounds
 Initial runtime comparisons are presented at Spark Summit. 
 http://spark-summit.org/2014/talk/quadratic-programing-solver-for-non-negative-matrix-factorization-with-spark
 Based on Xiangrui's feedback I am currently comparing the ADMM based 
 Quadratic Minimization solvers with IPM based QpSolvers and the default 
 ALS/NNLS. I will keep updating the runtime comparison results.
 For integration the detailed plan is as follows:
 1. Add QuadraticMinimizer and Proximal algorithms in mllib.optimization
 2. Integrate QuadraticMinimizer in mllib ALS
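
For illustration, the "ALS with bounds" subproblem above is a box-constrained 
quadratic program per user/item block; a minimal projected-gradient sketch is 
below. It is purely illustrative, not the QuadraticMinimizer proposed here.

{code:scala}
// Hedged sketch: projected gradient for minimize 0.5 * x'Ax - b'x
// subject to elementwise bounds l <= x <= u.
def projectedGradient(a: Array[Array[Double]], b: Array[Double],
                      l: Array[Double], u: Array[Double],
                      step: Double, iters: Int): Array[Double] = {
  var x = Array.fill(b.length)(0.0)
  for (_ <- 0 until iters) {
    // gradient of the quadratic: Ax - b
    val g = a.map(row => row.zip(x).map { case (aij, xj) => aij * xj }.sum)
             .zip(b).map { case (ax, bi) => ax - bi }
    // gradient step followed by clipping onto the box [l, u]
    x = x.indices.map(i => math.min(u(i), math.max(l(i), x(i) - step * g(i)))).toArray
  }
  x
}
{code}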






[jira] [Comment Edited] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-12-02 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231375#comment-14231375 ]

Valeriy Avanesov edited comment on SPARK-2426 at 12/2/14 11:47 AM:
---

I'm not sure I understand your question...

As far as I can see, w_i stands for a row of the matrix W and h_j stands for a 
column of the matrix H.

\sum_i \sum_j (r_ij - w_i*h_j) is not a matrix norm. Probably you are missing an 
absolute value or a square: \sum_i \sum_j |r_ij - w_i*h_j| or 
\sum_i \sum_j (r_ij - w_i*h_j)^2.
It looks like an l2-regularized stochastic matrix decomposition with respect to 
the Frobenius (or l1) norm. But I don't understand why you consider k 
optimization problems (do you? What does k \in {1 ... 25} stand for?). 

Anyway, the l2-regularized stochastic matrix decomposition problem is defined as 
follows: 

Minimize w.r.t. W and H: ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativity and normalization constraints, 

where ||.|| stands for the Frobenius norm (or l1). 

By the way: is the matrix of ranks R stochastic? Stochastic matrix 
decomposition doesn't seem reasonable if it's not. 


was (Author: acopich):
I'm not sure if I understand your question...

As far as I can see, w_i stands for a row of the matrix w and h_j stands for a 
column of the matrix h.  

\sum_i \sum_j ( r_ij - w_i*h_j) -- is not a matrix norm. Probably, you either 
miss abs or square -- \sum_i \sum_j |r_ij - w_i*h_j| or \sum_i \sum_j ( r_ij - 
w_i*h_j)^2
It looks like l2 regularized stochastic matrix decomposition with respect to 
Frobenius (or l1) norm. But I don't understand why do you consider k 
optimization problems (do you? What does k \in {1 ... 25} stand for?). 

Anyway, l2 regularized stochastic matrix decomposition problem is defined as 
follows 

Minimize w.r.t. W and H : ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativeness and normalization constraints. 

||..|| stands for Frobenius norm (or l1). 

By the way: is the matrix of ranks r stochastic? Stochastic matrix 
decomposition doesn't seem reasonable if it's not. 







[jira] [Commented] (SPARK-2426) Quadratic Minimization for MLlib ALS

2014-12-02 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14231375#comment-14231375 ]

Valeriy Avanesov commented on SPARK-2426:
-

I'm not sure I understand your question...

As far as I can see, w_i stands for a row of the matrix W and h_j stands for a 
column of the matrix H.

\sum_i \sum_j (r_ij - w_i*h_j) is not a matrix norm. Probably you are missing an 
absolute value or a square: \sum_i \sum_j |r_ij - w_i*h_j| or 
\sum_i \sum_j (r_ij - w_i*h_j)^2.
It looks like an l2-regularized stochastic matrix decomposition with respect to 
the Frobenius (or l1) norm. But I don't understand why you consider k 
optimization problems (do you? What does k \in {1 ... 25} stand for?). 

Anyway, the l2-regularized stochastic matrix decomposition problem is defined as 
follows: 

Minimize w.r.t. W and H: ||R - W*H|| + \lambda(||W|| + ||H||)
under non-negativity and normalization constraints, 

where ||.|| stands for the Frobenius norm (or l1). 

By the way: is the matrix of ranks R stochastic? Stochastic matrix 
decomposition doesn't seem reasonable if it's not. 
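
For concreteness, the normalization constraint can be enforced by projecting each 
column of W (or H) onto the probability simplex after every update. A sketch of 
the standard sort-based Euclidean projection (Duchi et al., 2008) follows; it is 
illustrative and not tied to any MLlib API.

{code:scala}
// Hedged sketch: Euclidean projection of a vector onto the probability simplex,
// enforcing non-negativity and sum-to-one in one step.
def projectOntoSimplex(v: Array[Double]): Array[Double] = {
  val u = v.sorted(Ordering[Double].reverse)   // sort descending
  val cumsum = u.scanLeft(0.0)(_ + _).tail     // prefix sums of the sorted values
  // rho: the largest index (1-based) where the running threshold stays positive
  val rho = (1 to u.length).reverse
    .find(j => u(j - 1) - (cumsum(j - 1) - 1.0) / j > 0)
    .getOrElse(1)
  val theta = (cumsum(rho - 1) - 1.0) / rho
  v.map(x => math.max(x - theta, 0.0))
}
{code}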










[jira] [Commented] (SPARK-2199) Distributed probabilistic latent semantic analysis in MLlib

2014-06-19 Thread Valeriy Avanesov (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-2199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14037659#comment-14037659 ]

Valeriy Avanesov commented on SPARK-2199:
-

Here is the implementation we currently have: https://github.com/akopich/dplsa
Robust and non-robust PLSA are implemented, but no regularizers are supported 
yet. 

 Distributed probabilistic latent semantic analysis in MLlib
 ---

 Key: SPARK-2199
 URL: https://issues.apache.org/jira/browse/SPARK-2199
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Denis Turdakov
  Labels: features

 Probabilistic latent semantic analysis (PLSA) is a topic model which extracts 
 topics from a text corpus. PLSA was historically a predecessor of LDA; however, 
 recent research shows that modifications of PLSA sometimes perform better 
 than LDA [1]. Furthermore, the most recent paper by the same authors shows that 
 there is a clear way to extend PLSA to LDA and beyond [2].
 We should implement a distributed version of PLSA. In addition, it should be 
 possible to easily add user-defined regularizers or combinations of them. We 
 will implement regularizers that allow us to
 * extract sparse topics
 * extract human-interpretable topics 
 * perform semi-supervised training 
 * sort out non-topic-specific terms. 
 [1] Potapenko, K. Vorontsov. 2013. Robust PLSA performs better than LDA. In 
 Proceedings of ECIR'13.
 [2] Vorontsov, Potapenko. Tutorial on Probabilistic Topic Modeling: Additive 
 Regularization for Stochastic Matrix Factorization. 
 http://www.machinelearning.ru/wiki/images/1/1f/Voron14aist.pdf 
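
One plausible shape for the user-defined regularizers, in the spirit of the 
additive regularization of [2], is a small trait whose correction is added to the 
expected counts in the M-step (with negative results truncated at zero). The 
names below are hypothetical, not the akopich/dplsa API.

{code:scala}
// Hedged sketch of a pluggable regularizer interface (ARTM-style).
trait Regularizer {
  // Correction added to the expected counts n(w, t) in the M-step
  // (in ARTM this is phi_wt * dR/dphi_wt).
  def correction(phi: Array[Array[Double]]): Array[Array[Double]]
}

// Example: a uniform penalty tau that drives small counts to zero after the
// max(0, .) truncation of the M-step, yielding sparse topics.
class SparsityRegularizer(tau: Double) extends Regularizer {
  def correction(phi: Array[Array[Double]]): Array[Array[Double]] =
    phi.map(_.map(_ => -tau))
}

// Combinations of regularizers simply add their corrections, which is what
// makes the framework "additive".
class CombinedRegularizer(rs: Seq[Regularizer]) extends Regularizer {
  def correction(phi: Array[Array[Double]]): Array[Array[Double]] =
    rs.map(_.correction(phi)).reduce { (a, b) =>
      a.zip(b).map { case (ra, rb) => ra.zip(rb).map { case (x, y) => x + y } }
    }
}
{code}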


