[jira] [Commented] (FLINK-1733) Add PCA to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15159603#comment-15159603 ] Thang Nguyen commented on FLINK-1733:

Hey folks, just wanted to update you on my progress: I'm still working through an sPCA implementation, but I was hoping to get some preliminary feedback. Instead of showing something half implemented, I decided to clean up the naive PCA implementation. You can see the code [here|https://github.com/nguyent/flink/commit/8f198edcf26c6a98f8c7cdb7c30ef96632ec6f8c]. Any and all feedback is welcome, especially around code organization and general style.

> Add PCA to machine learning library
> -----------------------------------
>
>                 Key: FLINK-1733
>                 URL: https://issues.apache.org/jira/browse/FLINK-1733
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Till Rohrmann
>            Assignee: Thang Nguyen
>            Priority: Minor
>              Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks.
> Therefore, Flink's machine learning library should contain a principal
> components analysis (PCA) implementation. Maria-Florina Balcan et al. [1]
> propose a distributed PCA. A more recent publication [2] describes another
> scalable PCA implementation.
>
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (FLINK-1733) Add PCA to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133521#comment-15133521 ] Thang Nguyen commented on FLINK-1733:

| I'm not sure that {{DenseMatrix}} fits for sPCA

What if the matrix is relatively small? From the paper, where d is the number of principal components:

| matrix C, which is of size D × d (recall that d is typically small). For example, in our experiments with a 94 GB dataset, the size of matrix C was 30 MB, which can easily fit in memory.

This matrix C is broadcast to the workers and is used to redundantly recompute an intermediate matrix (in favor of cutting down communication complexity). The distributed algorithm also only requires accessing a single row at a time to compute a partial result, and then sums the partials at the end. Is the lack of a distributed matrix/vector implementation enough of a blocker to be worried about, or should I continue?
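The broadcast-then-sum-partials pattern described above can be sketched in plain Python (illustrative only; the real version would use Flink broadcast variables and a DataSet reduce, and all names here are made up for the sketch). Each "worker" sees one data row at a time plus the small broadcast matrix C, emits a D × d partial, and the partials are summed at the end:

```python
from functools import reduce

def row_partial(row, C):
    """One row's D x d contribution: outer product of the row with (row . C).
    This mimics how each worker only needs a single row plus the broadcast C."""
    d = len(C[0])
    # row . C -> length-d vector
    rc = [sum(row[k] * C[k][j] for k in range(len(row))) for j in range(d)]
    # outer product row^T * rc -> D x d partial matrix
    return [[row[i] * rc[j] for j in range(d)] for i in range(len(row))]

def sum_partials(a, b):
    """Element-wise sum of two D x d partial matrices (the reduce step)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

rows = [[1.0, 2.0], [3.0, 4.0]]   # two data rows, D = 2
C = [[1.0, 0.0], [0.0, 1.0]]      # the small D x d broadcast matrix (d = 2)

# with C = identity this reduces to X^T X, summed one row-partial at a time
result = reduce(sum_partials, (row_partial(r, C) for r in rows))
```

With C as the identity the summed partials equal X^T X, which makes the pattern easy to check by hand; the communication cost is one small C per worker rather than any large intermediate.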
[jira] [Commented] (FLINK-1733) Add PCA to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133568#comment-15133568 ] Chiwan Park commented on FLINK-1733:

Oh, you can continue working on it, [~thang]. Sorry for the confusion.
[jira] [Commented] (FLINK-1733) Add PCA to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133465#comment-15133465 ] Chiwan Park commented on FLINK-1733:

Hi [~thang],

You can use {{breeze.linalg.DenseMatrix}}, but you have to convert it to Flink's {{DenseMatrix}} at the end of the computation. I recommend implementing an implicit conversion method between Breeze's {{DenseMatrix}} and Flink's {{DenseMatrix}}. You can find the conversion implementation for Flink's {{DenseVector}} in {{DenseVector.scala}}.

But I'm not sure that {{DenseMatrix}} fits for sPCA. To achieve scalability, we need distributed matrix and vector implementations. Currently there is no distributed matrix or vector implementation in FlinkML (https://issues.apache.org/jira/browse/FLINK-1873).

The distribution of a DataSet depends on the source of the data. If the data come from a distributed file system, they are already well distributed by the file system, and Flink takes advantage of that distribution. In the typical case, you don't need to worry about the distribution of the data.
[jira] [Commented] (FLINK-1733) Add PCA to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15133447#comment-15133447 ] Thang Nguyen commented on FLINK-1733:

Thanks [~till.rohrmann], I've been flipping through the Odersky book and it is indeed an excellent resource. I have some questions that may be obvious, but the answers seem to elude me for whatever reason... For context, I have read the sPCA paper a few times and have the Spark implementation of sPCA running locally with a remote debugger hooked up to validate my incremental work.
- Is it fine to use {{breeze.linalg.DenseMatrix}} for this sPCA? Matrix multiplication with {{flink.ml.math.DenseMatrix}} doesn't seem to be implemented, as far as I can tell.
- How are DataSets partitioned across nodes when there isn't a key explicitly specified? Are they evenly distributed based on the size of the DataSet?
- How does parallel execution on an arbitrarily large DataSet happen from a code perspective? Does the optimizer take care of most of the heavy lifting as long as the code is written in a functional manner? (Asking specifically about the FNormJob/YtXJob in the paper.) I am aware of the plan visualizer; however, I haven't gotten to that point just yet...
[jira] [Commented] (FLINK-1733) Add PCA to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15116993#comment-15116993 ] Till Rohrmann commented on FLINK-1733:

Hi [~thang],

I think for a first version your interface definition sounds fine. The user provides the number of principal components they want to obtain and receives a {{DataSet[Vector]}} or {{DataSet[DenseVector]}} containing the principal components. Your description of the standard PCA is also correct. However, I think the distributed execution might look a bit different. It's best to check out the linked resources or google for papers describing a distributed PCA implementation on MapReduce. Be aware that if you want to order the vectors contained in the resulting {{DataSet}}, you have to give them IDs or assign them their eigenvalues, because a {{DataSet}} does not allow you to store the data in order.

If you're new to Scala, then I can recommend reading http://www.artima.com/pins1ed/. It's a good book even though it is getting a bit long in the tooth.
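The ordering point above can be made concrete with a small sketch: since a distributed DataSet has no inherent order, each principal component can be emitted as an (eigenvalue, vector) pair so the caller recovers the "top N by significance" ordering themselves. Plain Python stand-in, not Flink API; the numbers are the eigenpairs from the 2-D example in the PCA tutorial cited later in this thread:

```python
# Each component is paired with its eigenvalue, because the collection
# itself carries no ordering guarantee (like a Flink DataSet).
eig_pairs = [
    (0.0490833989, [-0.735178656, 0.677873399]),
    (1.28402771,   [-0.677873399, -0.735178656]),
]

# The caller sorts by descending eigenvalue to recover significance order.
ordered = sorted(eig_pairs, key=lambda p: p[0], reverse=True)
top_1 = [vec for _, vec in ordered[:1]]   # the leading principal component
```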
[jira] [Commented] (FLINK-1733) Add PCA to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15113112#comment-15113112 ] Thang Nguyen commented on FLINK-1733:

Hi [~till.rohrmann]! I am a software engineer professionally; however, I am new to Scala. I did learn some functional programming in undergrad, so the trickiest thing for me to wrap my head around is Scala's type system.

For context: I have a naive PCA implementation and some trivial tests for it (using the method and test data from this paper: http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf). Currently, the method accepts an Int (the number of principal components) and a DataSet[DenseVector]. For this implementation, I create a covariance matrix (BreezeMatrix) and call breeze.linalg.eigSym on it. Then I return the top N (user parameter) principal components as a DataSet[Vector]. I will be refactoring/throwing out a lot of my code (except the tests), so I hesitate to show anything I've written just yet.

Questions:
- Does the method signature make sense? What _exactly_ should I be returning? The concept of PCA is new to me, but it sounds like I should be returning the top N vectors (based on their eigenvalues, ordered by significance). Should the output also be DataSet[DenseVector]?
- Pointers on how to implement sPCA? I have taken a cursory look at the rest of the ML library, but I am still learning Scala.
- If you have any recommended resources on learning Scala (specifically the type system), I would also appreciate that.

Thanks!
Thang
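The naive recipe described in that comment (covariance matrix, eigendecomposition, top N by eigenvalue) can be sketched end to end in plain Python on the 2-D data from the cited tutorial. This is illustrative only: a FlinkML version would take a DataSet[DenseVector] and call Breeze's eigSym, whereas here a closed-form eigendecomposition of the symmetric 2×2 covariance matrix stands in for it:

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def cov(xs, ys):
    """Sample covariance (n - 1 denominator), as in the tutorial."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def pca_2d(xs, ys, k):
    """Naive PCA: build the covariance matrix [[a, b], [b, c]], take its
    eigenpairs in closed form, and return the top k by eigenvalue."""
    a, b, c = cov(xs, xs), cov(xs, ys), cov(ys, ys)
    mid = (a + c) / 2.0
    rad = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    pairs = []
    for lam in (mid + rad, mid - rad):
        vx, vy = b, lam - a              # eigenvector direction for lam (b != 0)
        n = math.hypot(vx, vy)
        pairs.append((lam, (vx / n, vy / n)))
    pairs.sort(key=lambda p: p[0], reverse=True)   # order by significance
    return pairs[:k]

# The 2-D dataset from the Smith PCA tutorial referenced in the comment.
xs = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
ys = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]
top = pca_2d(xs, ys, 1)
```

Running this reproduces the tutorial's leading eigenvalue (about 1.284), which gives the trivial tests mentioned above something concrete to assert against (eigenvector signs may flip, so tests should compare directions up to sign).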
[jira] [Commented] (FLINK-1733) Add PCA to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101580#comment-15101580 ] Till Rohrmann commented on FLINK-1733:

Hi [~thang], welcome to the Flink community :-) Great to hear that you want to pick up the issue. I've assigned the JIRA to you. If you have any questions then don't hesitate to ask us :-)
[jira] [Commented] (FLINK-1733) Add PCA to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15099236#comment-15099236 ] Chiwan Park commented on FLINK-1733:

Hi [~thang], welcome to the Flink community. Currently, you are not in the contributors group of Flink; a committer with the required permission will assign this issue to you. I think you can start on this issue now. The assignment will be done in a few days. :)
[jira] [Commented] (FLINK-1733) Add PCA to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098786#comment-15098786 ] Thang Nguyen commented on FLINK-1733:

Hi [~till.rohrmann], I'm currently an Insight Data Engineering fellow and I'm interested in taking over this ticket as my project. Would it be possible for me to get assigned to this? (Assuming no one else is working on it at the moment.)
[jira] [Commented] (FLINK-1733) Add PCA to machine learning library
[ https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551753#comment-14551753 ] Raghav Chalapathy commented on FLINK-1733:

Hi Till, let me go through the paper; I shall present my analysis of it.

With regards,
Raghav