[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605639#comment-14605639 ]

Rakesh Chalasani commented on SPARK-8587:

Hi Sam,

computeCost currently returns the cumulative cost over a dataset rather than the cost per sample, and I think the per-sample cost is what this JIRA is asking for. Internally, predict does compute the distance to the nearest center, but it returns only the index of the predicted center. So adding a separate method that returns distances would do the same work twice, which is the point raised above with Bradley. In Pipelines, on the other hand, this can be handled more gracefully and efficiently by adding a column to the returned DF.

If that works for you, can you close this JIRA? I will create another one for adding distances to the KMeans pipeline once that is merged.

Thanks.

Return cost and cluster index KMeansModel.predict

Key: SPARK-8587
URL: https://issues.apache.org/jira/browse/SPARK-8587
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Sam Stoelinga
Priority: Minor

Looking at PySpark the implementation of KMeansModel.predict
https://github.com/apache/spark/blob/master/python/pyspark/mllib/clustering.py#L102 :
Currently: it calculates the cost of the closest cluster and returns the index only.
My expectation: Easy way to let the same function or a new function to return the cost with the index.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
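To make the distinction in the comment above concrete, here is a minimal sketch in plain Python (no Spark required). It is not the MLlib implementation: the names predict_with_cost and compute_cost are hypothetical, and it assumes that "cost" means the squared Euclidean distance to the nearest center. It only illustrates that the index and the per-point cost fall out of the same pass over the centers, while a cumulative cost is the sum of those per-point values.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict_with_cost(centers: Sequence[Sequence[float]],
                      point: Sequence[float]) -> Tuple[int, float]:
    """Hypothetical helper: return (index of nearest center, cost to it).

    One scan over the centers yields both values, so nothing is computed twice.
    """
    best_index, best_cost = -1, float("inf")
    for i, center in enumerate(centers):
        cost = squared_distance(center, point)
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index, best_cost


def compute_cost(centers: Sequence[Sequence[float]],
                 points: Sequence[Sequence[float]]) -> float:
    """Hypothetical helper: cumulative cost, i.e. the sum of per-point costs."""
    return sum(predict_with_cost(centers, p)[1] for p in points)


centers: List[Tuple[float, float]] = [(0.0, 0.0), (10.0, 10.0)]
points: List[Tuple[float, float]] = [(1.0, 1.0), (9.0, 10.0)]

per_point = [predict_with_cost(centers, p) for p in points]
total = compute_cost(centers, points)
```

With these two centers, the first point is assigned to center 0 and the second to center 1, and total equals the sum of the two per-point costs.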
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603592#comment-14603592 ]

Joseph K. Bradley commented on SPARK-8587:

Based on feedback we've gotten about the Pipelines API, we are trying to focus more on it. We will continue to support the original API, but I do think that, eventually, new development will happen in Pipelines.
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600825#comment-14600825 ]

Sam Stoelinga commented on SPARK-8587:

I also agree that this should have the same API across the different languages. There is already a function computeCost, but the problem is that it doesn't return the index. The problem with predict is that it returns only the index and not the cost.
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599006#comment-14599006 ]

Apache Spark commented on SPARK-8587:

User 'samos123' has created a pull request for this issue:
https://github.com/apache/spark/pull/6979
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14599007#comment-14599007 ]

Sam Stoelinga commented on SPARK-8587:

Implemented a code example for PySpark: https://github.com/apache/spark/pull/6979

Feel free to discard this pull request in favor of a proper implementation that also covers Scala and Java.
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600237#comment-14600237 ]

Joseph K. Bradley commented on SPARK-8587:

I agree; we should not change the behavior of the existing function, and we will need to maintain matching APIs for Scala/Java and Python.

I think this will be easily supported within the Pipelines API, where KMeans is currently being added: [SPARK-7879]. The initial PR will add only a prediction column (the predicted cluster), but a follow-up could add a column of costs or of soft/raw predictions (which could be 1/cost).

Would you be able to help out with this extension of Pipelines once the initial PR gets in? If so, we could close this JIRA and PR for now.

Thanks!
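The column-based idea in the comment above could look roughly like the following sketch. It uses plain Python dictionaries as stand-ins for DataFrame rows and does not call any Spark API. The function with_cluster_columns and the column names "features", "prediction", "cost", and "rawPrediction" are assumptions made for illustration; cost is assumed to be the squared Euclidean distance to the nearest center, and the raw prediction is taken as 1/cost as suggested, with infinity used when a point coincides with its center.

```python
import math
from typing import Dict, List, Sequence, Tuple


def nearest_center(centers: Sequence[Sequence[float]],
                   point: Sequence[float]) -> Tuple[int, float]:
    """Return (index, squared distance) of the center closest to point."""
    costs = [sum((x - y) ** 2 for x, y in zip(c, point)) for c in centers]
    index = min(range(len(costs)), key=costs.__getitem__)
    return index, costs[index]


def with_cluster_columns(rows: List[Dict],
                         centers: Sequence[Sequence[float]]) -> List[Dict]:
    """Hypothetical transform: copy each row and append three extra columns."""
    out = []
    for row in rows:
        index, cost = nearest_center(centers, row["features"])
        # 1/cost is undefined at cost == 0; infinity is an assumed convention.
        raw = math.inf if cost == 0.0 else 1.0 / cost
        out.append({**row, "prediction": index, "cost": cost,
                    "rawPrediction": raw})
    return out


centers = [(0.0, 0.0), (4.0, 0.0)]
rows = [
    {"features": (1.0, 0.0)},
    {"features": (4.0, 2.0)},
    {"features": (4.0, 0.0)},
]
result = with_cluster_columns(rows, centers)
```

Because the extra values are appended as columns, the existing prediction output is unchanged and callers that only read the predicted cluster are unaffected.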
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600568#comment-14600568 ]

Rakesh Chalasani commented on SPARK-8587:

Sure, I can add this to the KMeans pipeline once that gets added (I will watch out for it).

On a slightly different topic that can help with our own development: since we are more inclined here to add these features to ML Pipelines than to MLlib, will MLlib eventually stop being supported, with future development happening on the Pipelines API alone?

Thanks.
[jira] [Commented] (SPARK-8587) Return cost and cluster index KMeansModel.predict
[ https://issues.apache.org/jira/browse/SPARK-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600079#comment-14600079 ]

Rakesh Chalasani commented on SPARK-8587:

+1 for this. But we can't do what you did in the above PR for Java/Scala. It's better to have a different function, computeDistance. I will send a different PR for that.