[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-06-05 Thread Miao Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502733#comment-16502733
 ] 

Miao Wang commented on SPARK-15784:
---

[~WeichenXu123] Thank you very much! 

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-06-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501017#comment-16501017
 ] 

Apache Spark commented on SPARK-15784:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/21493

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-06-04 Thread Weichen Xu (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16501020#comment-16501020
 ] 

Weichen Xu commented on SPARK-15784:


[~wm624] Thanks for your enthusiasm, but we need this to be done ASAP, so I 
created a PR.

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-06-04 Thread Xiangrui Meng (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16500809#comment-16500809
 ] 

Xiangrui Meng commented on SPARK-15784:
---

Discussed with [~WeichenXu123] offline. I think we should change the APIs to 
the following:

{code}
class PowerIterationClustering extends Params with HasWeightCol with 
DefaultReadWrite {
  def srcCol: Param[String]
  def dstCol: Param[String]
  def weightCol: Param[String]
  def assignClusters(dataset: Dataset[_]): DataFrame[id: Long, cluster: Int]
}
{code}

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-06-02 Thread Miao Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16499143#comment-16499143
 ] 

Miao Wang commented on SPARK-15784:
---

[~josephkb] I just saw your comments. Let me try to fix it. I am traveling now 
and will return to the US in mid-June. I will try to work on it. Otherwise, I 
will let [~shahid] know. Thanks!

 

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-05-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478328#comment-16478328
 ] 

Joseph K. Bradley commented on SPARK-15784:
---

[~shahid] Thanks for offering!  If [~wm624] wants to (and has time to) take 
this, then I'd suggest that.  But if not, then please go ahead, thanks!

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-05-11 Thread shahid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471545#comment-16471545
 ] 

shahid commented on SPARK-15784:


Hi [~josephkb] , I can work on it.

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-05-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470701#comment-16470701
 ] 

Joseph K. Bradley commented on SPARK-15784:
---

So... we originally agreed to make this a Transformer (in the discussion 
above), but [SPARK-24213] and [SPARK-24217] brought up the issue that we can't 
have this be a Row -> Row Transformer:
* The input data need to have one graph edge pair (i,j) for each edge, not 
duplicated ones (i,j) and (j,i).
* That means that there could be between 0 and numVertices/2 vertices which do 
not have corresponding Rows.
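
The two points above can be illustrated with a minimal sketch in plain Scala 
(no Spark). It assumes that each input Row is keyed by the source vertex of one 
undirected edge; the names `EdgeRows`, `Edge`, and `verticesWithoutRow` are 
hypothetical and exist only for this illustration.

```scala
object EdgeRows {
  // One entry per undirected edge: (src, dst, weight).
  // The reverse pair (dst, src) is deliberately NOT stored.
  type Edge = (Long, Long, Double)

  // Every vertex mentioned anywhere in the edge list.
  def allVertices(edges: Seq[Edge]): Set[Long] =
    edges.flatMap { case (s, d, _) => Seq(s, d) }.toSet

  // Vertices that only ever appear as a destination. Under the assumption
  // that a Row is keyed by its source vertex, these have no Row of their
  // own, so a Row -> Row transformation has nowhere to put their cluster.
  def verticesWithoutRow(edges: Seq[Edge]): Set[Long] =
    allVertices(edges) -- edges.map(_._1).toSet

  def main(args: Array[String]): Unit = {
    val edges: Seq[Edge] = Seq(
      (0L, 1L, 1.0),
      (0L, 2L, 1.0),
      (1L, 2L, 1.0),
      (3L, 4L, 1.0)
    )
    // Vertices 2 and 4 never appear as src.
    println(verticesWithoutRow(edges).toSeq.sorted.mkString(", ")) // prints "2, 4"
  }
}
```

In this toy graph, two of the five vertices would get no output Row, which is 
the gap described in the second bullet.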

This greatly lessens the value of presenting this as a Transformer.  I 
recommend we rewrite the API before Spark 2.4 and make PIC a utility in 
spark.ml.stat.  We can have it inherit from Params but not make it a 
Transformer.

How does this sound?

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-04-17 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441448#comment-16441448
 ] 

Apache Spark commented on SPARK-15784:
--

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/21090

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-04-17 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441353#comment-16441353
 ] 

Miao Wang commented on SPARK-15784:
---

[~josephkb] You can start the new PR now. :)

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-04-03 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424609#comment-16424609
 ] 

Yanbo Liang commented on SPARK-15784:
-

[~josephkb] Please take this over. I've been very busy recently and don't have 
time to shepherd it. Thanks very much.

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-04-03 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16424592#comment-16424592
 ] 

Joseph K. Bradley commented on SPARK-15784:
---

[~yanboliang] Would you like for me to take over shepherding this?  I have 
bandwidth now.

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2017-04-27 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15987588#comment-15987588
 ] 

Joseph K. Bradley commented on SPARK-15784:
---

Retargeting since 2.2 has been cut

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-11-04 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637189#comment-15637189
 ] 

Miao Wang commented on SPARK-15784:
---

I created a new PR to implement PIC as a Transformer.

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-11-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637132#comment-15637132
 ] 

Apache Spark commented on SPARK-15784:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/15770

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-11-02 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15628347#comment-15628347
 ] 

Yanbo Liang commented on SPARK-15784:
-

I prefer #1 and #3, and it looks like we can achieve both goals. 
A graph can be represented by GraphX/GraphFrames or by a DataFrame/RDD. A PIC 
model can be trained on either, but we use GraphX operators in the internal 
implementation, which means the input data must be converted to a GraphX 
representation if it is an RDD of tuples. So it is straightforward to make PIC 
one of the algorithms in GraphX (or GraphFrames, when it is merged back into 
Spark). However, users may load their graph as a DataFrame/RDD and transform it 
via an ML Pipeline, which should also be supported, so it would be better to 
wrap the GraphX/GraphFrames PIC as a Pipeline stage so that ML users can use it 
as well. 
For historical reasons, I propose splitting this task into the following 
steps:
* Put PIC in Pipeline as a Transformer, using the GraphX operators in the 
implementation (this is consistent with [~josephkb]'s proposal).
* Add the PIC algorithm to GraphFrames when it is merged into Spark.
* Make the ML PIC a wrapper that calls the GraphFrames PIC implementation.

I think this plan should work better for different users (ML users and 
GraphFrames users), but I'm open to hearing your thoughts. Thanks.




> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-11-01 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15627045#comment-15627045
 ] 

Miao Wang commented on SPARK-15784:
---

[~josephkb] I am good with the Transformer approach too. I will start revising 
the code if [~yanboliang] and [~mlnick] have no comments. Right now I am 
creating a performance testing application for structured streaming, so I aim 
to finish PIC within 2 weeks. Thanks!

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-11-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626692#comment-15626692
 ] 

Joseph K. Bradley commented on SPARK-15784:
---

I'm all for the Transformer approach.  If that sounds good to you, then I think 
you can reuse most of your code.

Btw, I'm told 2.1's RC1 is being cut soon, so I'm going to retarget this for 
2.2.

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-11-01 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15626169#comment-15626169
 ] 

Miao Wang commented on SPARK-15784:
---

I just closed the PR. Let us continue the design discussion here, and I will 
rework it once we agree on the design. Thanks! I will update the design doc 
according to our discussion and my initial PR.

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-10-31 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623661#comment-15623661
 ] 

Seth Hendrickson commented on SPARK-15784:
--

This seems like it fits the framework of a feature transformer. We could 
generate a real-valued feature column using the PIC algorithm, where the values 
are just the components of the pseudo-eigenvector. Alternatively, we could 
pipeline a KMeans clustering on the end, but I think it makes more sense to let 
users do that themselves. That's up for debate, though.

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-10-31 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15622915#comment-15622915
 ] 

Joseph K. Bradley commented on SPARK-15784:
---

[~wangmiao1981] Sorry for the slow response here.  I do want us to add PIC to 
spark.ml, but we should discuss the design before the PR.  Could you please 
close the PR for now but save the branch to re-open after discussion?

Let's have a design discussion first.

I agree that the big issue is that there isn't a clear way to make predictions 
on new data points.  In fact, I've never heard of people trying to do so.  Has 
anyone else?

Assuming that prediction is not meaningful for PIC, then I don't think the 
algorithm fits within the Pipeline framework, though it's debatable.  I see a 
few options:
* Put PIC in Pipelines as a Transformer, not an Estimator.  We would just need 
to document that it is a very expensive Transformer.
* Put PIC in spark.ml as a static method.  We may have to do this anyways to 
support all of spark.mllib's Statistics.
* Put PIC in GraphFrames (and push harder for GraphFrames to be merged back 
into Spark, which will include a much longer set of improvements).

My top choice is PIC as a Transformer.  What do you think?

CC [~yanboliang] [~sethah] [~mlnick] opinions?

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-06-15 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15332185#comment-15332185
 ] 

Miao Wang commented on SPARK-15784:
---

[~josephkb][~mengxr][~yanboliang] I am trying to add PIC to spark.ml and I have 
some questions regarding model.predict and saveImpl. The basic PIC algorithm 
has the following steps:

Input: A row-normalized affinity matrix W and the number of clusters k
Output: Clusters C_1, C_2, …, C_k

Pick an initial vector v_0
Repeat
    Set v_{t+1} ← W v_t
    Set δ_{t+1} ← |v_{t+1} – v_t|
    Increment t
Stop when |δ_t – δ_{t-1}| ≈ 0
Use k-means to cluster points on v_t and return clusters C_1, C_2, …, C_k
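
For reference, the iteration above (everything before the final k-means step) 
can be sketched in plain Scala without Spark. This is only an illustration, not 
the MLlib implementation: `PicSketch` and its method names are hypothetical, 
the initial vector is supplied by the caller, and the rescaling of each iterate 
to unit L1 norm is an extra step not listed in the pseudocode, added here to 
keep the values bounded.

```scala
object PicSketch {
  type Matrix = Array[Array[Double]]

  // Row-normalize an affinity matrix so that every non-empty row sums to 1.
  def rowNormalize(a: Matrix): Matrix =
    a.map { row =>
      val s = row.sum
      if (s == 0.0) row.clone() else row.map(_ / s)
    }

  // One step: v_{t+1} = W v_t, then rescale to unit L1 norm (assumed extra).
  def step(w: Matrix, v: Array[Double]): Array[Double] = {
    val next = w.map(row => row.zip(v).map { case (x, y) => x * y }.sum)
    val norm = next.map(math.abs).sum
    if (norm == 0.0) next else next.map(_ / norm)
  }

  def l1Distance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => math.abs(x - y) }.sum

  // Iterate until the change in delta between two consecutive steps falls
  // below tol, or until maxIter steps have been taken.
  def pseudoEigenvector(
      w: Matrix,
      v0: Array[Double],
      maxIter: Int = 100,
      tol: Double = 1e-6): Array[Double] = {
    var v = v0
    var prevDelta = Double.MaxValue
    var t = 0
    var converged = false
    while (t < maxIter && !converged) {
      val next = step(w, v)
      val delta = l1Distance(next, v)
      converged = math.abs(delta - prevDelta) < tol
      prevDelta = delta
      v = next
      t += 1
    }
    v
  }

  def main(args: Array[String]): Unit = {
    // Two triangles joined by one weak edge between vertices 2 and 3.
    val affinity: Matrix = Array(
      Array(0.0, 1.0, 1.0, 0.0, 0.0, 0.0),
      Array(1.0, 0.0, 1.0, 0.0, 0.0, 0.0),
      Array(1.0, 1.0, 0.0, 0.01, 0.0, 0.0),
      Array(0.0, 0.0, 0.01, 0.0, 1.0, 1.0),
      Array(0.0, 0.0, 0.0, 1.0, 0.0, 1.0),
      Array(0.0, 0.0, 0.0, 1.0, 1.0, 0.0)
    )
    val v = pseudoEigenvector(rowNormalize(affinity), Array(1.0, 0.0, 0.0, 0.0, 0.0, 0.0))
    // One embedding value per vertex; k-means would be run on these values.
    println(v.mkString(", "))
  }
}
```

The result has one value per input vertex, so it is tied to the graph it was 
computed on and there is nothing in it that could be applied to unseen 
vertices.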

In the last step, k-means takes the pseudo-eigenvector `v` generated by PIC to 
do the clustering. Therefore, model.predict should use the trained k-means to 
do the prediction. However, to obtain the vector `v` for the data to be 
predicted, PIC has to be run again on that data. So there is no trained model 
for predicting a new data set: model.predict actually trains again using the 
PIC.fit method. In this case, PIC.fit and PIC.predict call the same run method 
in the MLlib implementation. 

Since we have to train on the data anyway, saving the model is not useful, as 
there is no model to be saved. In the MLlib implementation, the save function 
saves the cluster assignments of the current data set, which can't be used for 
clustering new data. The only use of the saved result is that when the same 
data is given, we don't have to train again. However, we have no way of knowing 
whether the given data is the same as the training data behind the saved model.

Please correct me if I misunderstand anything. Thanks!

Miao




> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-06-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328061#comment-15328061
 ] 

Apache Spark commented on SPARK-15784:
--

User 'wangmiao1981' has created a pull request for this issue:
https://github.com/apache/spark/pull/13647

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2016-06-08 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15320879#comment-15320879
 ] 

Miao Wang commented on SPARK-15784:
---

I can work on this. Thanks!

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Xinh Huynh
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.


