[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict

2015-06-29 Thread Rakesh Chalasani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605639#comment-14605639
 ] 

Rakesh Chalasani commented on SPARK-8587:
-

Hi Sam,

computeCost currently returns the cumulative cost over a dataset rather than the 
cost per sample, which I think is what this JIRA is asking for. Internally, 
predict does compute the distance to the nearest center but returns only the 
index of the predicted center. So adding a separate method that returns 
distances would do the same work twice, which is what was pointed out above by 
Bradley. In Pipelines, on the other hand, this can be handled more gracefully 
and efficiently by adding a column to the returned DataFrame. 

If that works for you, could you close this JIRA? I will create another one for 
adding distances to the KMeans pipeline once that is merged. Thanks.
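
For illustration only, here is a minimal pure-Python sketch of the point made in the comment above: the nearest-center search already produces the per-sample cost, so the index and the cost can come out of a single pass, and the cumulative cost is just the sum of the per-sample costs. The names predict_with_cost and compute_cost are hypothetical and are not MLlib functions; the sketch assumes the cost is the squared Euclidean distance to the nearest center.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_with_cost(centers: List[Sequence[float]],
                      point: Sequence[float]) -> Tuple[int, float]:
    """Hypothetical helper: (index of nearest center, squared distance to it)."""
    best_index, best_cost = -1, float("inf")
    for i, center in enumerate(centers):
        cost = squared_distance(center, point)
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost


def compute_cost(centers: List[Sequence[float]],
                 points: List[Sequence[float]]) -> float:
    """Cumulative cost over a dataset: the sum of the per-sample costs."""
    return sum(predict_with_cost(centers, p)[1] for p in points)


centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 0.0), (9.0, 10.0), (0.0, 2.0)]

per_point = [predict_with_cost(centers, p) for p in points]
total = compute_cost(centers, points)
```

With these toy inputs, per_point holds one (index, cost) pair per sample and total is the sum of the costs in those pairs.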

 Return cost and cluster index KMeansModel.predict
 -

 Key: SPARK-8587
 URL: https://issues.apache.org/jira/browse/SPARK-8587
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Sam Stoelinga
Priority: Minor

 Looking at the PySpark implementation of KMeansModel.predict:
 https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102
 Currently:
 it calculates the cost of the closest cluster but returns only the index.
 My expectation:
 an easy way for the same function, or a new function, to return the cost along 
 with the index.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict

2015-06-26 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603592#comment-14603592
 ] 

Joseph K. Bradley commented on SPARK-8587:
--

Based on feedback we've gotten about the Pipelines API, we are trying to focus 
more on it.  We will continue to support the original API, but I do think that, 
eventually, new development will happen in Pipelines.




[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict

2015-06-25 Thread Sam Stoelinga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600825#comment-14600825
 ] 

Sam Stoelinga commented on SPARK-8587:
--

I also agree that this should have the same API across the different 
languages. There is already a function computeCost, but the problem is that it 
doesn't return the index; the problem with predict is that it returns only the 
index and not the cost.




[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict

2015-06-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599006#comment-14599006
 ] 

Apache Spark commented on SPARK-8587:
-

User 'samos123' has created a pull request for this issue:
https://github.com/apache/spark/pull/6979




[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict

2015-06-24 Thread Sam Stoelinga (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599007#comment-14599007
 ] 

Sam Stoelinga commented on SPARK-8587:
--

Implemented a code example for PySpark: https://github.com/apache/spark/pull/6979 
Feel free to discard this pull request in favor of a proper implementation that 
also covers Scala and Java.




[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict

2015-06-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600237#comment-14600237
 ] 

Joseph K. Bradley commented on SPARK-8587:
--

I agree; we should not change the behavior of the existing function, and we 
will need to maintain matching APIs for Scala/Java and Python.  I think this 
will be easily supported within the Pipelines API, where KMeans is currently 
being added: [SPARK-7879].  The initial PR will add only a prediction column 
(predicted cluster), but a follow-up could add a column of costs or of soft/raw 
predictions (which could be 1/cost).

Would you be able to help out with this extension of Pipelines, once the 
initial PR gets in?  If so, we could close this JIRA and PR for now.  Thanks!
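
As a rough illustration of the column-based idea in the comment above, the following pure-Python sketch uses a list of dicts as a stand-in for a DataFrame and appends a prediction column, a cost column, and a raw prediction column computed as 1/cost. This is not the Pipelines API: the function add_prediction_columns and the column names are hypothetical, and the cost is assumed to be the squared Euclidean distance to the nearest center.

```python
from typing import Dict, List, Sequence


def add_prediction_columns(rows: List[Dict[str, object]],
                           centers: List[Sequence[float]],
                           features_col: str = "features") -> List[Dict[str, object]]:
    """Return copies of the rows with 'prediction', 'cost' and 'rawPrediction' added."""
    out = []
    for row in rows:
        point = row[features_col]
        # Squared distance from this point to every center.
        costs = [sum((x - c) ** 2 for x, c in zip(point, center))
                 for center in centers]
        index = min(range(len(costs)), key=costs.__getitem__)
        cost = costs[index]
        new_row = dict(row)  # leave the input row untouched
        new_row["prediction"] = index
        new_row["cost"] = cost
        # 1/cost as a soft score; a point lying exactly on a center gets infinity.
        new_row["rawPrediction"] = float("inf") if cost == 0.0 else 1.0 / cost
        out.append(new_row)
    return out


centers = [(0.0, 0.0), (10.0, 10.0)]
rows = [
    {"id": 1, "features": (1.0, 0.0)},
    {"id": 2, "features": (10.0, 12.0)},
    {"id": 3, "features": (0.0, 0.0)},
]
scored = add_prediction_columns(rows, centers)
```

Because all three values come from the same distance computation, adding the extra columns costs no additional pass over the data in this sketch.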




[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict

2015-06-24 Thread Rakesh Chalasani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600568#comment-14600568
 ] 

Rakesh Chalasani commented on SPARK-8587:
-

Sure, I can add this to the KMeans pipeline whenever that gets added (I will 
watch out for it).

On a slightly different topic that would help our own development: since we 
are more inclined here to add these features to ML Pipelines rather than MLlib, 
will MLlib eventually become unsupported, with future development happening on 
the Pipelines API alone? Thanks.






[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict

2015-06-24 Thread Rakesh Chalasani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600079#comment-14600079
 ] 

Rakesh Chalasani commented on SPARK-8587:
-

+1 for this. 

But we can't do what you did in the above PR for Java/Scala. It's better to 
have a different function, computeDistance. I will send a separate PR for 
that.
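
A minimal sketch of the separate-function approach suggested above, written in Python for brevity: predict keeps returning only the cluster index, and a second method exposes the distance. SketchKMeansModel is a toy stand-in, not the MLlib class. computeDistance is only the name proposed in this comment, and returning the Euclidean (not squared) distance is an assumption.

```python
import math
from typing import List, Sequence, Tuple


class SketchKMeansModel:
    """Toy stand-in for a k-means model; not the MLlib KMeansModel."""

    def __init__(self, centers: List[Sequence[float]]):
        self.centers = centers

    def _nearest(self, point: Sequence[float]) -> Tuple[int, float]:
        # Shared helper: index of the nearest center and squared distance to it.
        costs = [sum((x - c) ** 2 for x, c in zip(point, center))
                 for center in self.centers]
        index = min(range(len(costs)), key=costs.__getitem__)
        return index, costs[index]

    def predict(self, point: Sequence[float]) -> int:
        # Existing behaviour is preserved: only the cluster index is returned.
        return self._nearest(point)[0]

    def computeDistance(self, point: Sequence[float]) -> float:
        # Separate method (hypothetical): Euclidean distance to the nearest center.
        return math.sqrt(self._nearest(point)[1])


model = SketchKMeansModel([(0.0, 0.0), (10.0, 10.0)])
index = model.predict((3.0, 4.0))
distance = model.computeDistance((3.0, 4.0))
```

Existing callers of predict are unaffected by this design. The trade-off is that calling both methods for the same point repeats the nearest-center search.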
