[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2016-02-23 Thread Thang Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15159603#comment-15159603
 ] 

Thang Nguyen commented on FLINK-1733:
-

Hey folks, just wanted to update you on my progress:

I'm still working through an sPCA implementation, but I was hoping to get some 
preliminary feedback. 
Instead of showing something half implemented, I decided to clean up the naive 
PCA implementation. You can see the code 
[here|https://github.com/nguyent/flink/commit/8f198edcf26c6a98f8c7cdb7c30ef96632ec6f8c]

Any and all feedback is welcome, especially around code organization and 
general style. 

> Add PCA to machine learning library
> ---
>
> Key: FLINK-1733
> URL: https://issues.apache.org/jira/browse/FLINK-1733
> Project: Flink
>  Issue Type: New Feature
>  Components: Machine Learning Library
>Reporter: Till Rohrmann
>Assignee: Thang Nguyen
>Priority: Minor
>  Labels: ML
>
> Dimension reduction is a crucial prerequisite for many data analysis tasks. 
> Therefore, Flink's machine learning library should contain a principal 
> components analysis (PCA) implementation. Maria-Florina Balcan et al. [1] 
> proposes a distributed PCA. A more recent publication [2] describes another 
> scalable PCA implementation.
> Resources:
> [1] [http://arxiv.org/pdf/1408.5823v5.pdf]
> [2] [http://ds.qcri.org/images/profile/tarek_elgamal/sigmod2015.pdf]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2016-02-04 Thread Thang Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133521#comment-15133521
 ] 

Thang Nguyen commented on FLINK-1733:
-

| I'm not sure that {{DenseMatrix}} fits for sPCA

What if the matrix is relatively small? 

From the paper, where d is the # of principal components:

| matrix C, which is of size D × d (recall that d is typically small). For 
example, in our experiments with a 94 GB dataset, the size of matrix C was 30 
MB, which can easily fit in memory.

This matrix C is broadcast to the workers and used to redundantly recompute an 
intermediate matrix (in favor of cutting down communication complexity). The 
distributed algorithm also only needs to access a single row at a time to 
compute a partial result, and then sums the partials at the end. 

Is the lack of a distributed matrix/vector implementation enough of a blocker 
to be worried about, or should I continue?
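As a rough illustration of that row-at-a-time pattern, here is a plain NumPy sketch (NumPy stands in for the distributed DataSet, and all sizes and names here are made up for the example — this is not the actual sPCA code):

```python
import numpy as np

# Hypothetical sizes: D input dimensions, d principal components (d << D).
D, d, n_rows = 6, 2, 100
rng = np.random.default_rng(0)
X = rng.standard_normal((n_rows, D))  # the large, row-partitioned dataset
C = rng.standard_normal((D, d))       # the small matrix broadcast to all workers

def partial_product(rows, C):
    # Contribution of this partition's rows to X^T (X C): each worker only
    # needs its own rows plus the broadcast C to compute a D x d partial.
    return rows.T @ (rows @ C)

# Stand-in for four parallel workers, each holding a slice of the rows.
partitions = np.array_split(X, 4)
result = sum(partial_product(p, C) for p in partitions)

# Summing the partials reproduces the full product computed in one shot.
assert np.allclose(result, X.T @ (X @ C))
```

The point is that only the small D × d matrices cross worker boundaries; the large X never has to be shuffled.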



[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2016-02-04 Thread Chiwan Park (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133568#comment-15133568
 ] 

Chiwan Park commented on FLINK-1733:


Oh, you can continue working on it, [~thang]. Sorry for the confusion.



[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2016-02-04 Thread Chiwan Park (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133465#comment-15133465
 ] 

Chiwan Park commented on FLINK-1733:


Hi [~thang],

You can use {{breeze.linalg.DenseMatrix}}, but you have to convert it to 
Flink's {{DenseMatrix}} at the end of the computation. I recommend implementing 
an implicit conversion method between Breeze's {{DenseMatrix}} and Flink's 
{{DenseMatrix}}. You can find the corresponding conversion for Flink's 
{{DenseVector}} in {{DenseVector.scala}}.

But I'm not sure that {{DenseMatrix}} fits sPCA. To achieve scalability, we 
need distributed matrix and vector implementations, and FlinkML currently has 
none. (https://issues.apache.org/jira/browse/FLINK-1873)

The distribution status of a DataSet depends on the source of the data. If the 
data come from a distributed file system, they are already well distributed by 
the file system, and Flink takes advantage of that distribution. In the typical 
case, you don't need to worry about how the data are distributed.



[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2016-02-04 Thread Thang Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133447#comment-15133447
 ] 

Thang Nguyen commented on FLINK-1733:
-

Thanks [~till.rohrmann], I've been flipping through the Odersky book and it is 
indeed an excellent resource. 

I have some questions that may be obvious, but their answers seem to elude me 
for whatever reason...

For context, I have read the sPCA paper a few times and have the Spark 
implementation of sPCA running locally with a remote debugger hooked up to 
validate my incremental work.

- Is it fine to use {{breeze.linalg.DenseMatrix}} for this sPCA? Matrix 
multiplication with {{flink.ml.math.DenseMatrix}} doesn't seem to be 
implemented as far as I can tell.

- How are DataSets partitioned across nodes, when there isn't a key explicitly 
specified? Are they evenly distributed based on the size of the DataSet? 

- How does parallel execution on an arbitrarily large DataSet happen from a 
code perspective? Does the optimizer take care of most of the heavy lifting as 
long as the code is written in a functional manner? (Asking specifically about 
the FNormJob/YtXJob in the paper). I am aware of the plan visualizer, however I 
haven't gotten to that point just yet...




[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2016-01-26 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116993#comment-15116993
 ] 

Till Rohrmann commented on FLINK-1733:
--

Hi [~thang], I think your interface definition sounds fine for a first version. 
The user provides the number of principal components they want to obtain and 
receives a {{DataSet[Vector]}} or {{DataSet[DenseVector]}} containing the 
principal components.

Your description of the standard PCA is also correct. However, I think the 
distributed execution might look a bit different. Best check out the linked 
resources or search for papers describing a distributed PCA implementation on 
MapReduce.

Be aware that if you want to order the vectors contained in the resulting 
{{DataSet}}, you have to give them IDs or attach their eigenvalues, because a 
{{DataSet}} does not store the data in order.
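One way to make that ordering recoverable, sketched in NumPy (the point is pairing each component with its eigenvalue, not any particular API — names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.standard_normal((50, 4))
cov = np.cov(samples, rowvar=False)

# For a symmetric matrix, eigh returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Pair each principal component with its eigenvalue so an unordered
# collection (such as a DataSet) can be re-sorted by significance later.
pairs = [(eigenvalues[i], eigenvectors[:, i]) for i in range(len(eigenvalues))]
pairs.sort(key=lambda p: p[0], reverse=True)  # most significant first
```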

If you're new to Scala, then I can recommend reading 
http://www.artima.com/pins1ed/. It's a good book, even though it is getting a 
bit long in the tooth.




[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2016-01-22 Thread Thang Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15113112#comment-15113112
 ] 

Thang Nguyen commented on FLINK-1733:
-

Hi [~till.rohrmann]! 

I am a software engineer professionally, but I am new to Scala. I learned some 
functional programming in undergrad, so the trickiest thing for me to wrap my 
head around is Scala's type system. 

For context: 
I have a naive PCA implementation and some trivial tests for it (using the 
method & test data from this paper: 
http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf). 

Currently, the method accepts an Int (the number of principal components) and a 
DataSet[DenseVector].

For this implementation, I create a covariance matrix (a Breeze matrix) and 
call breeze.linalg.eigSym on it. 
Then I return the top N (user parameter) principal components as a DataSet[Vector].
I will be re-factoring/throwing out a lot of my code (except the tests), so I 
hesitate to show anything I've written just yet.
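The steps described above (covariance matrix, symmetric eigendecomposition, top-N selection) can be sketched in NumPy, with np.linalg.eigh standing in for breeze.linalg.eigSym; this is an illustrative sketch under those assumptions, not the actual code:

```python
import numpy as np

def naive_pca(data, k):
    """Return the top-k principal components of `data` (rows are samples)."""
    centered = data - data.mean(axis=0)              # center each feature
    cov = np.cov(centered, rowvar=False)             # covariance matrix
    eigenvalues, eigenvectors = np.linalg.eigh(cov)  # ascending eigenvalues
    # Keep the k eigenvectors with the largest eigenvalues.
    top = np.argsort(eigenvalues)[::-1][:k]
    return eigenvectors[:, top]

rng = np.random.default_rng(2)
data = rng.standard_normal((100, 5))
components = naive_pca(data, 2)  # one column per principal component
assert components.shape == (5, 2)
```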

Questions:
Does the method signature make sense? 
What _exactly_ should I be returning? The concept of PCA is new to me, but it 
sounds like I should return the top N vectors (based on their eigenvalues, 
ordered by significance). 
Should the output also be DataSet[DenseVector]?
Any pointers on how to implement sPCA? 

I have taken a cursory look at the rest of the ML library, but I am still 
learning Scala. 
If you have any recommended resources on learning Scala (specifically the type 
system), I would also appreciate that. 

Thanks! 

Thang



[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2016-01-15 Thread Till Rohrmann (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101580#comment-15101580
 ] 

Till Rohrmann commented on FLINK-1733:
--

Hi [~thang], welcome to the Flink community :-) Great to hear that you want to 
pick up the issue. I've assigned the JIRA to you. If you have any questions 
then don't hesitate to ask us :-) 



[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2016-01-14 Thread Chiwan Park (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15099236#comment-15099236
 ] 

Chiwan Park commented on FLINK-1733:


Hi [~thang], welcome to the Flink community. Currently, you are not in the 
contributors group of Flink, so a committer with the necessary permission will 
have to assign this issue to you. I think you can start on the issue now; the 
assignment will be done in a few days. :)



[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2016-01-14 Thread Thang Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098786#comment-15098786
 ] 

Thang Nguyen commented on FLINK-1733:
-

Hi [~till.rohrmann], I'm currently an Insight Data Engineering fellow and I'm 
interested in taking over this ticket as my project. 

Would it be possible for me to get assigned to this? (Assuming no one else is 
working on it at the moment.)



[jira] [Commented] (FLINK-1733) Add PCA to machine learning library

2015-05-19 Thread Raghav Chalapathy (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551753#comment-14551753
 ] 

Raghav Chalapathy commented on FLINK-1733:
--

Hi Till,
Let me go through the paper, and I shall present my analysis of it.
With regards,
Raghav

