[jira] [Commented] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.

2014-07-10 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057801#comment-14057801
 ] 

Bertrand Dechoux commented on SPARK-2433:
-

A Jira ticket is the first step; the second would have been to provide a diff 
patch or a GitHub pull request. You can also write a test to prove your 
point and make sure that the fix stays in place.

I will second Sean:
1) work with the latest version (1.0)
2) your report is not clear, which is why a diff patch or pull request would be welcome

 In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an 
 implementation bug.
 

 Key: SPARK-2433
 URL: https://issues.apache.org/jira/browse/SPARK-2433
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 0.9.1
 Environment: Any 
Reporter: Rahul K Bhojwani
  Labels: easyfix, test
   Original Estimate: 1h
  Remaining Estimate: 1h

 Don't have much experience with reporting errors. This is my first time. If 
 something is not clear, please feel free to contact me (details given below).
 In the pyspark mllib library:
 Path: \spark-0.9.1\python\pyspark\mllib\classification.py
 Class: NaiveBayesModel
 Method: predict
 Earlier Implementation:
 def predict(self, x):
 """Return the most likely class for a data vector x"""
 return numpy.argmax(self.pi + dot(x, self.theta))
 
 New Implementation:
 No:1
 def predict(self, x):
 """Return the most likely class for a data vector x"""
 return numpy.argmax(self.pi + numpy.log(dot(numpy.exp(self.theta), x)))
 No:2
 def predict(self, x):
 """Return the most likely class for a data vector x"""
 return numpy.argmax(self.pi + dot(x, self.theta.T))
 Explanation:
 No:1 is correct according to me. Don't know about No:2.
 Error one:
 The matrix self.theta is of dimension [n_classes, n_features],
 while the matrix x is of dimension [1, n_features].
 Taking the dot will not work, as it is [1, n_features] x [n_classes, n_features].
 It will always give the error: ValueError: matrices are not aligned
 In the commented example given in classification.py, n_classes =
 n_features = 2, which is why there is no error there.
 Both Implementation No:1 and Implementation No:2 take care of it.
 Error two:
 The basic implementation of naive bayes is: P(class_n | sample) =
 count_feature_1 * P(feature_1 | class_n) * ... * count_feature_n *
 P(feature_n | class_n) * P(class_n) / P(sample), where P(sample) is a constant,
 and taking the class with the max value.
 That's what Implementation No:1 is doing.
 Implementation No:2 is basically the class with the max value of:
 exp(count_feature_1) * P(feature_1 | class_n) * ... * exp(count_feature_n) *
 P(feature_n | class_n) * P(class_n)
 Don't know if it gives the exact result.
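 A minimal sketch of the shape problem and the transpose fix, assuming numpy
 and made-up uniform values for pi and theta (3 classes, 4 features):

 import numpy
 from numpy import dot

 n_classes, n_features = 3, 4
 # log class priors, shape [n_classes]
 pi = numpy.log(numpy.full(n_classes, 1.0 / n_classes))
 # log P(feature | class), shape [n_classes, n_features]
 theta = numpy.log(numpy.full((n_classes, n_features), 1.0 / n_features))
 # feature counts for one sample, shape [n_features]
 x = numpy.ones(n_features)

 # dot(x, theta) would raise "ValueError: matrices are not aligned",
 # since [n_features] x [n_classes, n_features] does not match.
 # Transposing theta aligns the shapes and gives one log-score per class:
 scores = pi + dot(x, theta.T)
 print(numpy.argmax(scores))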
 Thanks
 Rahul Bhojwani
 rahulbhojwani2...@gmail.com





[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.

2014-07-10 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057801#comment-14057801
 ] 

Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:47 PM:
--

A Jira ticket is the first step; the second would have been to provide a diff 
patch or a GitHub pull request. You can also write a test to prove your 
point and make sure that the fix stays in place.

I will second Sean:
1) work with the latest version (1.0)
2) your report is not clear, which is why a diff patch or pull request would be welcome

And there is a transpose() in the current implementation so I believe that the 
bug is actually already fixed.


was (Author: bdechoux):
A Jira ticket is the first step; the second would have been to provide a diff 
patch or a GitHub pull request. You can also write a test to prove your 
point and make sure that the fix stays in place.

I will second Sean:
1) work with the latest version (1.0)
2) your report is not clear, which is why a diff patch or pull request would be welcome



[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.

2014-07-10 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057801#comment-14057801
 ] 

Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:50 PM:
--

A Jira ticket is the first step; the second would have been to provide a diff 
patch or a GitHub pull request. You can also write a test to prove your 
point and make sure that the fix stays in place.

I will second Sean:
1) work with the latest version (1.0)
2) your report is not clear, which is why a diff patch or pull request would be welcome

And there is a transpose() in the current implementation so I believe that the 
bug is actually already fixed.

see 
https://github.com/apache/spark/commit/4f2f093c5b65b74869068d5690a4d2b0e0b5f759


was (Author: bdechoux):
A Jira ticket is the first step; the second would have been to provide a diff 
patch or a GitHub pull request. You can also write a test to prove your 
point and make sure that the fix stays in place.

I will second Sean:
1) work with the latest version (1.0)
2) your report is not clear, which is why a diff patch or pull request would be welcome

And there is a transpose() in the current implementation so I believe that the 
bug is actually already fixed.



[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.

2014-07-10 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057801#comment-14057801
 ] 

Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:52 PM:
--

A Jira ticket is the first step; the second would have been to provide a diff 
patch or a GitHub pull request. You can also write a test to prove your 
point and make sure that the fix stays in place.

I will second Sean:
1) work with the latest version (1.0)
2) your report is not clear, which is why a diff patch or pull request would be welcome

And there is a transpose() in the current implementation, so the bug is 
actually already fixed; see https://github.com/apache/spark/pull/463



was (Author: bdechoux):
A Jira ticket is the first step; the second would have been to provide a diff 
patch or a GitHub pull request. You can also write a test to prove your 
point and make sure that the fix stays in place.

I will second Sean:
1) work with the latest version (1.0)
2) your report is not clear, which is why a diff patch or pull request would be welcome

And there is a transpose() in the current implementation so I believe that the 
bug is actually already fixed.

see 
https://github.com/apache/spark/commit/4f2f093c5b65b74869068d5690a4d2b0e0b5f759



[jira] [Comment Edited] (SPARK-2433) In MLlib, implementation for Naive Bayes in Spark 0.9.1 is having an implementation bug.

2014-07-10 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14057801#comment-14057801
 ] 

Bertrand Dechoux edited comment on SPARK-2433 at 7/10/14 6:57 PM:
--

A Jira ticket is the first step; the second would have been to provide a diff 
patch or a GitHub pull request. You can also write a test to prove your 
point and make sure that the fix stays in place.

I will second Sean:
1) work with the latest version (1.0)
2) your report is not clear, which is why a diff patch or pull request would be welcome

And there is a transpose() in the current implementation, so the bug is 
actually already fixed; see https://github.com/apache/spark/pull/463

You might want to read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark for 
next time.


was (Author: bdechoux):
A Jira ticket is the first step; the second would have been to provide a diff 
patch or a GitHub pull request. You can also write a test to prove your 
point and make sure that the fix stays in place.

I will second Sean:
1) work with the latest version (1.0)
2) your report is not clear, which is why a diff patch or pull request would be welcome

And there is a transpose() in the current implementation, so the bug is 
actually already fixed; see https://github.com/apache/spark/pull/463




[jira] [Commented] (SPARK-9883) Distance to each cluster given a point

2015-08-13 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694920#comment-14694920
 ] 

Bertrand Dechoux commented on SPARK-9883:
-

A colleague of mine is working on it for MLlib. Figuring it out for the 
Pipelines API would be a nice next step.

 Distance to each cluster given a point
 --

 Key: SPARK-9883
 URL: https://issues.apache.org/jira/browse/SPARK-9883
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Bertrand Dechoux
Priority: Minor

 Right now KMeansModel provides only a 'predict' method which returns the 
 index of the closest cluster.
 https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector)
 It would be nice to have a method giving the distance to all clusters.
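 A rough sketch of such a helper on the driver side, assuming plain numpy and
 the centers taken from KMeansModel.clusterCenters (distances_to_clusters is a
 hypothetical name, not an existing API):

 import numpy

 def distances_to_clusters(cluster_centers, point):
     # Euclidean distance from a single point to every cluster center.
     point = numpy.asarray(point)
     return [float(numpy.linalg.norm(numpy.asarray(center) - point))
             for center in cluster_centers]

 # For example, with two 2-D centers:
 # distances_to_clusters([[0.0, 0.0], [10.0, 10.0]], [1.0, 1.0])
 # -> [1.4142135623730951, 12.727922061357855]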






[jira] [Comment Edited] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods

2015-08-07 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662333#comment-14662333
 ] 

Bertrand Dechoux edited comment on SPARK-9720 at 8/7/15 7:59 PM:
-

I could take care of it.

Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that Identifiable types are 
identifiable by their toString?

It does make sense. The next question is: can we introduce a potentially 
API-breaking change in order to do it?

If the answer is yes, the easy way would be to make Identifiable.toString 
final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}

Is there a committer who could validate this proposal?


was (Author: bdechoux):
I could take care of it.

Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that Identifiable types are 
identifiable by their toString?
It does make sense. The next question is: can we introduce a potentially 
API-breaking change in order to do it?

If the answer is yes, the easy way would be to make Identifiable.toString 
final and compose it with an overridable, empty-by-default suffix:

private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}

Is there a committer who could validate this proposal?

 spark.ml Identifiable types should have UID in toString methods
 ---

 Key: SPARK-9720
 URL: https://issues.apache.org/jira/browse/SPARK-9720
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 It would be nice to print the UID (instance name) in toString methods.  
 That's the default behavior for Identifiable, but some types override the 
 default toString and do not print the UID.






[jira] [Comment Edited] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods

2015-08-07 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662333#comment-14662333
 ] 

Bertrand Dechoux edited comment on SPARK-9720 at 8/7/15 8:41 PM:
-

I could take care of it.

Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that Identifiable types are 
identifiable by their toString?

It does make sense. The next question is: can we introduce potentially 
API-breaking changes in order to do so?

If the answer is yes, the easy way would be to make Identifiable.toString 
final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}

Is there a committer who could validate this proposal?


was (Author: bdechoux):
I could take care of it.

Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that Identifiable types are 
identifiable by their toString?

It does make sense. The next question is: can we introduce a potentially 
API-breaking change in order to do it?

If the answer is yes, the easy way would be to make Identifiable.toString 
final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}

Is there a committer who could validate this proposal?




[jira] [Comment Edited] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods

2015-08-07 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662333#comment-14662333
 ] 

Bertrand Dechoux edited comment on SPARK-9720 at 8/7/15 8:53 PM:
-

I could take care of it.

Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that Identifiable types are 
identifiable by their toString?

It does make sense. The next question is: can we introduce potentially 
API-breaking changes in order to do so?

If the answer is yes, the easy way would be to make Identifiable.toString 
final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}

Could you, or a committer, validate this proposal?


was (Author: bdechoux):
I could take care of it.

Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that Identifiable types are 
identifiable by their toString?

It does make sense. The next question is: can we introduce potentially 
API-breaking changes in order to do so?

If the answer is yes, the easy way would be to make Identifiable.toString 
final and compose it with an overridable, empty-by-default suffix:

{code}
private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}
{code}

Is there a committer who could validate this proposal?




[jira] [Created] (SPARK-9748) Centriod typo in KMeansModel

2015-08-07 Thread Bertrand Dechoux (JIRA)
Bertrand Dechoux created SPARK-9748:
---

 Summary: Centriod typo in KMeansModel
 Key: SPARK-9748
 URL: https://issues.apache.org/jira/browse/SPARK-9748
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Bertrand Dechoux
Priority: Trivial


A minor typo (centriod -> centroid). Readable variable names help every user.






[jira] [Commented] (SPARK-9748) Centriod typo in KMeansModel

2015-08-07 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662269#comment-14662269
 ] 

Bertrand Dechoux commented on SPARK-9748:
-

Pull request done: https://github.com/apache/spark/pull/8037

 Centriod typo in KMeansModel
 

 Key: SPARK-9748
 URL: https://issues.apache.org/jira/browse/SPARK-9748
 Project: Spark
  Issue Type: Task
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Bertrand Dechoux
Priority: Trivial
  Labels: typo

 A minor typo (centriod -> centroid). Readable variable names help every user.






[jira] [Commented] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods

2015-08-07 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662248#comment-14662248
 ] 

Bertrand Dechoux commented on SPARK-9720:
-

I might not understand, but isn't this already the case on the master branch?

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/util/Identifiable.scala
trait Identifiable {
  override def toString: String = uid
}

And many Identifiables have a default constructor using 
Identifiable.randomUID(keyword) for uid.
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/BinaryClassificationEvaluator.scala

Do you have specific counter examples?

 spark.ml Identifiable types should have UID in toString methods
 ---

 Key: SPARK-9720
 URL: https://issues.apache.org/jira/browse/SPARK-9720
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Priority: Minor
  Labels: starter

 It would be nice to print the UID (instance name) in toString methods.






[jira] [Issue Comment Deleted] (SPARK-9748) Centriod typo in KMeansModel

2015-08-07 Thread Bertrand Dechoux (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bertrand Dechoux updated SPARK-9748:

Comment: was deleted

(was: Pull request done: https://github.com/apache/spark/pull/8037)




[jira] [Commented] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods

2015-08-07 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14662333#comment-14662333
 ] 

Bertrand Dechoux commented on SPARK-9720:
-

I could take care of it.

Here is the list (only in spark.ml):
* DecisionTreeClassificationModel
* DecisionTreeRegressionModel
* GBTClassificationModel
* GBTRegressionModel
* NaiveBayesModel
* RFormula
* RFormulaModel
* RandomForestClassificationModel
* RandomForestRegressionModel

The question is: do we want to enforce that Identifiable types are 
identifiable by their toString?
It does make sense. The next question is: can we introduce a potentially 
API-breaking change in order to do it?

If the answer is yes, the easy way would be to make Identifiable.toString 
final and compose it with an overridable, empty-by-default suffix:

private[spark] trait Identifiable {

  /**
   * An immutable unique ID for the object and its derivatives.
   */
  val uid: String
  
  def toStringSuffix: String = ""

  override final def toString: String = uid + toStringSuffix
}

Is there a committer who could validate this proposal?




[jira] [Created] (SPARK-9883) Distance to each cluster given a point

2015-08-12 Thread Bertrand Dechoux (JIRA)
Bertrand Dechoux created SPARK-9883:
---

 Summary: Distance to each cluster given a point
 Key: SPARK-9883
 URL: https://issues.apache.org/jira/browse/SPARK-9883
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.4.1
Reporter: Bertrand Dechoux
Priority: Minor


Right now KMeansModel provides only a 'predict' method which returns the index 
of the closest cluster.

https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector)

It would be nice to have a method giving the distance to all clusters.







[jira] [Updated] (SPARK-9883) Distance to each cluster given a point (KMeansModel)

2015-10-26 Thread Bertrand Dechoux (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bertrand Dechoux updated SPARK-9883:

Summary: Distance to each cluster given a point (KMeansModel)  (was: 
Distance to each cluster given a point)

> Distance to each cluster given a point (KMeansModel)
> 
>
> Key: SPARK-9883
> URL: https://issues.apache.org/jira/browse/SPARK-9883
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Bertrand Dechoux
>Priority: Minor
>
> Right now KMeansModel provides only a 'predict' method which returns the 
> index of the closest cluster.
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector)
> It would be nice to have a method giving the distance to all clusters.






[jira] [Commented] (SPARK-9720) spark.ml Identifiable types should have UID in toString methods

2015-09-12 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14742161#comment-14742161
 ] 

Bertrand Dechoux commented on SPARK-9720:
-

The pull request can be merged.

> spark.ml Identifiable types should have UID in toString methods
> ---
>
> Key: SPARK-9720
> URL: https://issues.apache.org/jira/browse/SPARK-9720
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Bertrand Dechoux
>Priority: Minor
>  Labels: starter
>
> It would be nice to include the UID (instance name) in toString methods.  
> That's the default behavior for Identifiable, but some types override the 
> default toString and do not include the UID.






[jira] [Commented] (SPARK-9883) Distance to each cluster given a point

2015-09-25 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908598#comment-14908598
 ] 

Bertrand Dechoux commented on SPARK-9883:
-

The patch is now ready for MLlib and is waiting for a technical review.
I will look into the Pipelines API as a next step.

> Distance to each cluster given a point
> --
>
> Key: SPARK-9883
> URL: https://issues.apache.org/jira/browse/SPARK-9883
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Bertrand Dechoux
>Priority: Minor
>
> Right now KMeansModel provides only a 'predict' method which returns the 
> index of the closest cluster.
> https://spark.apache.org/docs/latest/api/java/org/apache/spark/mllib/clustering/KMeansModel.html#predict(org.apache.spark.mllib.linalg.Vector)
> It would be nice to have a method giving the distance to all clusters.






[jira] [Commented] (SPARK-16705) Kafka Direct Stream is not experimental anymore

2016-07-25 Thread Bertrand Dechoux (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15391506#comment-15391506
 ] 

Bertrand Dechoux commented on SPARK-16705:
--

See PR: https://github.com/apache/spark/pull/14343

> Kafka Direct Stream is not experimental anymore
> ---
>
> Key: SPARK-16705
> URL: https://issues.apache.org/jira/browse/SPARK-16705
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 1.6.2
>Reporter: Bertrand Dechoux
>Priority: Minor
>
> http://spark.apache.org/docs/latest/streaming-kafka-integration.html
> {quote}
> Note that this is an experimental feature introduced in Spark 1.3 for the 
> Scala and Java API, in Spark 1.4 for the Python API.
> {quote}
> The feature was indeed marked as experimental in Spark 1.3 but no longer 
> is.
> * 
> https://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html
> * 
> https://spark.apache.org/docs/1.6.2/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html






[jira] [Created] (SPARK-16705) Kafka Direct Stream is not experimental anymore

2016-07-25 Thread Bertrand Dechoux (JIRA)
Bertrand Dechoux created SPARK-16705:


 Summary: Kafka Direct Stream is not experimental anymore
 Key: SPARK-16705
 URL: https://issues.apache.org/jira/browse/SPARK-16705
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 1.6.2
Reporter: Bertrand Dechoux
Priority: Minor


http://spark.apache.org/docs/latest/streaming-kafka-integration.html

{quote}
Note that this is an experimental feature introduced in Spark 1.3 for the Scala 
and Java API, in Spark 1.4 for the Python API.
{quote}

The feature was indeed marked as experimental in Spark 1.3 but no longer is.

* 
https://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html
* 
https://spark.apache.org/docs/1.6.2/api/java/index.html?org/apache/spark/streaming/kafka/KafkaUtils.html


