[jira] [Commented] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label

2018-07-21 Thread Antoine Galataud (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-24875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16551640#comment-16551640 ]

Antoine Galataud commented on SPARK-24875:
--

True, I was proposing this not as a replacement but as an option (e.g. 
setUseApproxStats on MulticlassMetrics) that wouldn't be the default. 
Correctness is key, but an approximate result is better than no result at all.
However, there should be better solutions than using countByValueApprox. Open 
to suggestions!
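
To make the idea concrete, here is a minimal sketch of what such an opt-in switch could look like. The helper name labelCounts, the useApprox flag, and the timeoutMs and confidence defaults are all hypothetical and do not exist in MulticlassMetrics; only the RDD methods countByValue and countByValueApprox are existing Spark API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LabelCountSketch {
  // Hypothetical helper, not part of MulticlassMetrics.
  def labelCounts(
      predictionAndLabels: RDD[(Double, Double)],
      useApprox: Boolean = false, // hypothetical opt-in flag, off by default
      timeoutMs: Long = 10000L,   // assumed time budget for the approximate job
      confidence: Double = 0.95   // assumed confidence level
  ): Map[Double, Long] = {
    // The label is the second element of each (prediction, label) pair.
    val labels = predictionAndLabels.values
    if (useApprox) {
      // countByValueApprox returns a PartialResult; initialValue blocks until
      // the job completes or the timeout expires, whichever comes first.
      labels
        .countByValueApprox(timeoutMs, confidence)
        .initialValue
        .map { case (label, bounded) => label -> math.round(bounded.mean) }
        .toMap
    } else {
      // Exact path, equivalent to what MulticlassMetrics does today.
      labels.countByValue().toMap
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("label-count-sketch")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(
        Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)))
      println(s"exact:  ${labelCounts(data)}")
      println(s"approx: ${labelCounts(data, useApprox = true)}")
    } finally {
      sc.stop()
    }
  }
}
```

If the job finishes within the timeout, the approximate path should return the same counts as the exact one; otherwise each count is only an estimate, and the low and high fields of the BoundedDouble give the interval at the requested confidence.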

> MulticlassMetrics should offer a more efficient way to compute count by label
> -
>
>              Key: SPARK-24875
>              URL: https://issues.apache.org/jira/browse/SPARK-24875
>          Project: Spark
>       Issue Type: Improvement
>       Components: MLlib
> Affects Versions: 2.3.1
>         Reporter: Antoine Galataud
>         Priority: Minor
>
> Currently, _MulticlassMetrics_ calls _countByValue()_ to get the count by 
> class/label:
> {code:java}
> private lazy val labelCountByClass: Map[Double, Long] = 
> predictionAndLabels.values.countByValue()
> {code}
> If input _RDD[(Double, Double)]_ is huge (which can be the case with a large 
> test dataset), it will lead to poor execution performance.
> One option could be to allow using _countByValueApprox_ (could require adding 
> an extra configuration param for MulticlassMetrics).
> Note: since there is no equivalent of _MulticlassMetrics_ in the new ML 
> library, I don't know how this could be ported there.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24875) MulticlassMetrics should offer a more efficient way to compute count by label

2018-07-20 Thread Antoine Galataud (JIRA)
Antoine Galataud created SPARK-24875:


         Summary: MulticlassMetrics should offer a more efficient way to compute count by label
             Key: SPARK-24875
             URL: https://issues.apache.org/jira/browse/SPARK-24875
         Project: Spark
      Issue Type: Improvement
      Components: MLlib
Affects Versions: 2.3.1
        Reporter: Antoine Galataud


Currently, _MulticlassMetrics_ calls _countByValue()_ to get the count by class/label:
{code:java}
private lazy val labelCountByClass: Map[Double, Long] = 
predictionAndLabels.values.countByValue()
{code}
If input _RDD[(Double, Double)]_ is huge (which can be the case with a large 
test dataset), it will lead to poor execution performance.

One option could be to allow using _countByValueApprox_ (could require adding 
an extra configuration param for MulticlassMetrics).

Note: since there is no equivalent of _MulticlassMetrics_ in the new ML library, 
I don't know how this could be ported there.


