[jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057801#comment-14057801 ]

Bertrand Dechoux commented on SPARK-2433:
-----------------------------------------

A Jira ticket is the first step; the second would have been to provide a diff patch or a GitHub pull request. You can also write a test to prove your point and make sure that the fix stays in place. I will second Sean: 1) work with the latest version (1.0); 2) your report is not clear, which is why a diff patch or pull request is welcome.

In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
----------------------------------------------------------------------------------------

Key: SPARK-2433
URL: https://issues.apache.org/jira/browse/SPARK-2433
Project: Spark
Issue Type: Bug
Components: MLlib, PySpark
Affects Versions: 0.9.1
Environment: Any
Reporter: Rahul K Bhojwani
Labels: easyfix, test
Original Estimate: 1h
Remaining Estimate: 1h

I don't have much experience with reporting errors; this is my first time. If something is not clear, please feel free to contact me (details given below).

In the PySpark MLlib library:

Path: \spark-0.9.1\python\pyspark\mllib\classification.py
Class: NaiveBayesModel
Method: self.predict

Earlier implementation:

{code}
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
{code}

New implementation No. 1:

{code}
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
{code}

New implementation No. 2:

{code}
def predict(self, x):
    """Return the most likely class for a data vector x"""
    return numpy.argmax(self.pi + dot(x, self.theta.T))
{code}

Explanation: No. 1 is correct according to me; I don't know about No. 2.

Error one: the matrix self.theta has dimensions [n_classes, n_features], while the matrix x has dimensions [1, n_features]. Taking the dot product will not work, as it is [1, n_features] x [n_classes, n_features]. It will always raise:

ValueError: matrices are not aligned

In the commented example given in classification.py, n_classes = n_features = 2, which is why there is no error. Both implementation No. 1 and implementation No. 2 take care of it.

Error two: the basic formulation of naive Bayes is

P(class_n | sample) = count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n * P(feature_n | class_n) * P(class_n) / (the constant P(sample))

taking the class with the maximum value. That's what implementation No. 1 is doing. Implementation No. 2 basically takes the class with the maximum value of:

exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) * P(feature_n | class_n) * P(class_n)

I don't know if it gives the exact result.

Thanks,
Rahul Bhojwani
rahulbhojwani2...@gmail.com

--
This message was sent by Atlassian JIRA
(v6.2#6252)
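A minimal NumPy sketch of the shape problem and of the fix described in the report; the array names (pi, theta) and the shapes follow the report's description, and the log-space scoring mirrors the form of implementation No. 2. This is illustrative only, not the actual Spark source:

```python
import numpy as np

n_classes, n_features = 3, 5
rng = np.random.default_rng(0)

# Log class priors, shape [n_classes]
pi = np.log(np.full(n_classes, 1.0 / n_classes))
# Log conditional feature probabilities, shape [n_classes, n_features]
theta = np.log(rng.dirichlet(np.ones(n_features), n_classes))
# Feature counts for one sample, shape [n_features]
x = rng.integers(0, 4, n_features).astype(float)

# np.dot(x, theta) fails whenever n_classes != n_features:
# [n_features] cannot be aligned with [n_classes, n_features].
# Transposing (or swapping the operands) aligns the shapes and keeps
# the whole computation in log space:
scores = pi + np.dot(theta, x)       # shape [n_classes]
prediction = int(np.argmax(scores))  # index of the most likely class
```

With n_classes = n_features, the buggy form happens to run but silently multiplies against the wrong axis, which is why the two-by-two example in classification.py never surfaced the error.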
[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057801#comment-14057801 ]

Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:47 PM:
-----------------------------------------------------------------

A Jira ticket is the first step; the second would have been to provide a diff patch or a GitHub pull request. You can also write a test to prove your point and make sure that the fix stays in place. I will second Sean: 1) work with the latest version (1.0); 2) your report is not clear, which is why a diff patch or pull request is welcome.

And there is a transpose() in the current implementation, so I believe that the bug is actually already fixed.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057801#comment-14057801 ]

Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:50 PM:
-----------------------------------------------------------------

A Jira ticket is the first step; the second would have been to provide a diff patch or a GitHub pull request. You can also write a test to prove your point and make sure that the fix stays in place. I will second Sean: 1) work with the latest version (1.0); 2) your report is not clear, which is why a diff patch or pull request is welcome.

And there is a transpose() in the current implementation, so I believe that the bug is actually already fixed; see https://github.com/apache/spark/commit/4f2f093c5b65b74869068d5690a4d2b0e0b5f759

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057801#comment-14057801 ]

Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:52 PM:
-----------------------------------------------------------------

A Jira ticket is the first step; the second would have been to provide a diff patch or a GitHub pull request. You can also write a test to prove your point and make sure that the fix stays in place. I will second Sean: 1) work with the latest version (1.0); 2) your report is not clear, which is why a diff patch or pull request is welcome.

And there is a transpose() in the current implementation, so the bug is actually already fixed; see https://github.com/apache/spark/pull/463

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.
[ https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057801#comment-14057801 ]

Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:57 PM:
-----------------------------------------------------------------

A Jira ticket is the first step; the second would have been to provide a diff patch or a GitHub pull request. You can also write a test to prove your point and make sure that the fix stays in place. I will second Sean: 1) work with the latest version (1.0); 2) your report is not clear, which is why a diff patch or pull request is welcome.

And there is a transpose() in the current implementation, so the bug is actually already fixed; see https://github.com/apache/spark/pull/463

You might want to read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for next time.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (SPARK-9883) Distance to each cluster given a point
[ https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694920#comment-14694920 ]

Bertrand Dechoux commented on SPARK-9883:
-----------------------------------------

A colleague of mine is working on it for MLlib. Figuring it out for the Pipelines API would be a nice next step.

Distance to each cluster given a point
--------------------------------------

Key: SPARK-9883
URL: https://issues.apache.org/jira/browse/SPARK-9883
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Bertrand Dechoux
Priority: Minor

Right now KMeansModel provides only a 'predict' method, which returns the index of the closest cluster.
https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector)
It would be nice to have a method giving the distance to all clusters.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
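The requested method can be prototyped with plain NumPy; `centers` here stands in for the model's cluster centers, and the Euclidean metric is an assumption. This is a sketch of the proposed behaviour, not an existing MLlib API:

```python
import numpy as np

def distances_to_clusters(point, centers):
    """Distance from one point to every cluster center.

    point:   array-like, shape [n_features]
    centers: array-like, shape [n_clusters, n_features]
    returns: np.ndarray,  shape [n_clusters]
    """
    diff = np.asarray(centers, dtype=float) - np.asarray(point, dtype=float)
    return np.sqrt((diff * diff).sum(axis=1))

centers = np.array([[0.0, 0.0], [3.0, 4.0]])
d = distances_to_clusters([0.0, 0.0], centers)
# predict() would only return argmin(d); the full vector d is what the
# ticket asks for.
closest = int(np.argmin(d))  # → 0, with d[1] == 5.0
```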
[jira] [Comment Edited] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662333#comment-14662333 ]

Bertrand Dechoux edited comment on SPARK-9720 at 8/7/15 7:59 PM:
----------------------------------------------------------------

I could take care of it. Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that identifiable types should be identifiable by their toString? It does make sense. The following question is: can we introduce a potentially API-breaking change in order to do it? If the answer is yes, the easy way would be to make Identifiable.toString final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String

  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}

Is there a committer that could validate this proposal?

spark.ml Identifiable types should have UID in toString methods
---------------------------------------------------------------

Key: SPARK-9720
URL: https://issues.apache.org/jira/browse/SPARK-9720
Project: Spark
Issue Type: Improvement
Components: ML
Reporter: Joseph K. Bradley
Priority: Minor
Labels: starter

It would be nice to print the UID (instance name) in toString methods. That's the default behavior for Identifiable, but some types override the default toString and do not print the UID.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662333#comment-14662333 ]

Bertrand Dechoux edited comment on SPARK-9720 at 8/7/15 8:41 PM:
----------------------------------------------------------------

I could take care of it. Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that identifiable types should be identifiable by their toString? It does make sense. The following question is: can we introduce potentially API-breaking changes in order to do so? If the answer is yes, the easy way would be to make Identifiable.toString final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String

  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}

Is there a committer that could validate this proposal?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662333#comment-14662333 ]

Bertrand Dechoux edited comment on SPARK-9720 at 8/7/15 8:53 PM:
----------------------------------------------------------------

I could take care of it. Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that identifiable types should be identifiable by their toString? It does make sense. The following question is: can we introduce potentially API-breaking changes in order to do so? If the answer is yes, the easy way would be to make Identifiable.toString final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String

  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}

Could you, or a committer, validate this proposal?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
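The composition trick proposed above, a final toString that always emits the uid plus an overridable suffix that defaults to empty, can be illustrated with an analogous Python sketch. The class and method names here are hypothetical, and Python has no enforced `final`, so this only mirrors the intent of the Scala proposal:

```python
class Identifiable:
    """Base class whose string form always starts with the uid."""

    def __init__(self, uid):
        self.uid = uid

    def _to_string_suffix(self):
        # Subclasses override this hook instead of __str__ itself,
        # so the uid can never be dropped from the output.
        return ""

    def __str__(self):
        return self.uid + self._to_string_suffix()


class DecisionTreeModel(Identifiable):
    def __init__(self, uid, depth):
        super().__init__(uid)
        self.depth = depth

    def _to_string_suffix(self):
        return " (depth=%d)" % self.depth


print(DecisionTreeModel("dtc_4a5", 3))  # → dtc_4a5 (depth=3)
```

The design point is the same in both languages: subclasses customize the suffix hook, never the top-level string method, so the UID requirement cannot be overridden away.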
[jira] [Created] (SPARK-9748) Centriod typo in KMeansModel
Bertrand Dechoux created SPARK-9748:
------------------------------------

Summary: Centriod typo in KMeansModel
Key: SPARK-9748
URL: https://issues.apache.org/jira/browse/SPARK-9748
Project: Spark
Issue Type: Task
Components: MLlib
Affects Versions: 1.4.1
Reporter: Bertrand Dechoux
Priority: Trivial

A minor typo (centriod -> centroid). Readable variable names help every user.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9748) Centriod typo in KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662269#comment-14662269 ]

Bertrand Dechoux commented on SPARK-9748:
-----------------------------------------

Pull request done: https://github.com/apache/spark/pull/8037

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662248#comment-14662248 ]

Bertrand Dechoux commented on SPARK-9720:
-----------------------------------------

I might not understand, but isn't this already the case for the master branch?
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/util/Identifiable.scala

{code}
trait Identifiable { override def toString: String = uid }
{code}

And many Identifiables have a default constructor using Identifiable.randomUID(keyword) for the uid.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala

Do you have specific counter-examples?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-9748) Centriod typo in KMeansModel
[ https://issues.apache.org/jira/browse/SPARK-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bertrand Dechoux updated SPARK-9748:
------------------------------------
Comment: was deleted

(was: Pull request done : https://github.com/apache/spark/pull/8037)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662333#comment-14662333 ]

Bertrand Dechoux commented on SPARK-9720:
-----------------------------------------

I could take care of it. Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that identifiable types should be identifiable by their toString? It does make sense. The following question is: can we introduce a potentially API-breaking change in order to do it? If the answer is yes, the easy way would be to make Identifiable.toString final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String

  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}

Is there a committer that could validate this proposal?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-9883) Distance to each cluster given a point
Bertrand Dechoux created SPARK-9883:
------------------------------------

Summary: Distance to each cluster given a point
Key: SPARK-9883
URL: https://issues.apache.org/jira/browse/SPARK-9883
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.4.1
Reporter: Bertrand Dechoux
Priority: Minor

Right now KMeansModel provides only a 'predict' method, which returns the index of the closest cluster.
https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector)
It would be nice to have a method giving the distance to all clusters.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9883) Distance to each cluster given a point (KMeansModel)
[ https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bertrand Dechoux updated SPARK-9883: Summary: Distance to each cluster given a point (KMeansModel) (was: Distance to each cluster given a point) > Distance to each cluster given a point (KMeansModel) > > > Key: SPARK-9883 > URL: https://issues.apache.org/jira/browse/SPARK-9883 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Bertrand Dechoux >Priority: Minor > > Right now KMeansModel provides only a 'predict' method which returns the > index of the closest cluster. > https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector) > It would be nice to have a method giving the distance to all clusters.
[jira] [Commented] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods
[ https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742161#comment-14742161 ] Bertrand Dechoux commented on SPARK-9720: - The pull request can be merged. > spark.ml Identifiable types should have UID in toString methods > --- > > Key: SPARK-9720 > URL: https://issues.apache.org/jira/browse/SPARK-9720 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Bertrand Dechoux >Priority: Minor > Labels: starter > > It would be nice to include the UID (instance name) in toString methods. > That's the default behavior for Identifiable, but some types override the > default toString and do not include the UID.
[jira] [Commented] (SPARK-9883) Distance to each cluster given a point
[ https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908598#comment-14908598 ] Bertrand Dechoux commented on SPARK-9883: - The patch is now ready for MLlib and is waiting for a technical review. I will see about the Pipelines API for the next step. > Distance to each cluster given a point > -- > > Key: SPARK-9883 > URL: https://issues.apache.org/jira/browse/SPARK-9883 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Bertrand Dechoux >Priority: Minor > > Right now KMeansModel provides only a 'predict' method which returns the > index of the closest cluster. > https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector) > It would be nice to have a method giving the distance to all clusters.
[jira] [Commented] (SPARK-16705) Kafka Direct Stream is not experimental anymore
[ https://issues.apache.org/jira/browse/SPARK-16705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391506#comment-15391506 ] Bertrand Dechoux commented on SPARK-16705: -- See PR: https://github.com/apache/spark/pull/14343 > Kafka Direct Stream is not experimental anymore > --- > > Key: SPARK-16705 > URL: https://issues.apache.org/jira/browse/SPARK-16705 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.6.2 >Reporter: Bertrand Dechoux >Priority: Minor > > http://spark.apache.org/docs/latest/streaming-kafka-integration.html > {quote} > Note that this is an experimental feature introduced in Spark 1.3 for the > Scala and Java API, in Spark 1.4 for the Python API. > {quote} > The feature was indeed marked as experimental for Spark 1.3 but is not > anymore. > * > https://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html > * > https://spark.apache.org/docs/1.6.2/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html
[jira] [Created] (SPARK-16705) Kafka Direct Stream is not experimental anymore
Bertrand Dechoux created SPARK-16705: Summary: Kafka Direct Stream is not experimental anymore Key: SPARK-16705 URL: https://issues.apache.org/jira/browse/SPARK-16705 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 1.6.2 Reporter: Bertrand Dechoux Priority: Minor http://spark.apache.org/docs/latest/streaming-kafka-integration.html {quote} Note that this is an experimental feature introduced in Spark 1.3 for the Scala and Java API, in Spark 1.4 for the Python API. {quote} The feature was indeed marked as experimental for Spark 1.3 but is not anymore. * https://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html * https://spark.apache.org/docs/1.6.2/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html