Hi Xiangrui & Spark people,

I recently got round to writing an evaluation framework for Spark that I was hoping to PR into MLlib, which would solve some of the aforementioned issues. I have put the code on GitHub in a separate repo for now, as I would like to get some sandboxed feedback. The repo, complete with detailed documentation, can be found here: https://github.com/samthebest/sceval.
Many thanks,
Sam

On Thu, Jun 18, 2015 at 11:00 AM, Sam <samthesav...@gmail.com> wrote:

> Firstly, apologies for the header of my email containing some junk; I
> believe it's due to a copy-and-paste error on a smartphone.
>
> Thanks for your response. I will indeed make the PR you suggest, though
> glancing at the code I realize it's not just a case of making these public,
> since the types are also private. There is also certain functionality I
> will be exposing, which then ought to be tested, e.g. that every bin except
> potentially the last will have an equal number of data points in it*. I'll
> get round to it at some point.
>
> As for BinaryClassificationMetrics using Double for labels, thanks for the
> explanation. If I were to make a PR to encapsulate the underlying
> implementation (which uses LabeledPoint) and change the type to Boolean,
> what would the impact be on versioning (since I'd be changing the public
> API)? An alternative would be to create a new wrapper class, say
> BinaryClassificationMeasures, and deprecate the old one with the intention
> of migrating all the code into the new class.
>
> * Maybe some other part of the code base tests this, since this assumption
> must hold in order to average across folds in cross-validation?
>
> On Thu, Jun 18, 2015 at 1:02 AM, Xiangrui Meng <men...@gmail.com> wrote:
>
>> LabeledPoint was used for both classification and regression, where the
>> label type is Double for simplicity. So in BinaryClassificationMetrics, we
>> still use Double for labels. We compute the confusion matrix at each
>> threshold internally, but this is not exposed to users (
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L127).
>> Feel free to submit a PR to make it public. -Xiangrui
>>
>> On Mon, Jun 15, 2015 at 7:13 AM, Sam <samthesav...@gmail.com> wrote:
>>
>>> According to
>>> https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
>>> the constructor takes `RDD[(Double, Double)]`, meaning labels are
>>> Doubles. This seems odd; shouldn't the label type be Boolean? Similarly
>>> for MultilabelMetrics (i.e. it should be RDD[(Array[Double],
>>> Array[Boolean])]), and for MulticlassMetrics shouldn't both types be
>>> generic?
>>>
>>> Additionally, it would be good if either the ROC output type was changed
>>> or another method was added that returned confusion matrices, so that the
>>> hard integer values can be obtained before the divisions. E.g.
>>>
>>> ```
>>> case class Confusion(tp: Int, fp: Int, fn: Int, tn: Int) {
>>>   // methods for each of the measures in the table here:
>>>   // https://en.wikipedia.org/wiki/Receiver_operating_characteristic
>>> }
>>> ...
>>> def confusions(): RDD[Confusion]
>>> ```
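
For concreteness, below is a minimal sketch (not MLlib code) of what an exposed confusion-count API could look like, building on the Confusion case class proposed in the quoted thread above. The object name ConfusionSketch, the confusions signature (taking an RDD of (score, label) pairs plus explicit thresholds), and the use of Long counts are my own assumptions for illustration:

```
import org.apache.spark.rdd.RDD

// Hard counts first; ratios are derived from them (NaN when a denominator is zero).
case class Confusion(tp: Long, fp: Long, fn: Long, tn: Long) {
  def precision: Double = tp.toDouble / (tp + fp)
  def recall: Double = tp.toDouble / (tp + fn)
  def falsePositiveRate: Double = fp.toDouble / (fp + tn)
}

object ConfusionSketch {
  // One Confusion per threshold: a score counts as a positive prediction
  // when it is >= the threshold. Counts are Long because RDD.count() is.
  def confusions(scoreAndLabel: RDD[(Double, Boolean)],
                 thresholds: Seq[Double]): Seq[(Double, Confusion)] = {
    scoreAndLabel.cache()  // reused once per threshold below
    val totalPos = scoreAndLabel.filter(_._2).count()
    val totalNeg = scoreAndLabel.count() - totalPos
    thresholds.map { t =>
      val predictedPos = scoreAndLabel.filter(_._1 >= t).cache()
      val tp = predictedPos.filter(_._2).count()
      val fp = predictedPos.count() - tp
      t -> Confusion(tp, fp, fn = totalPos - tp, tn = totalNeg - fp)
    }
  }
}
```

For example, ConfusionSketch.confusions(sc.parallelize(Seq((0.9, true), (0.4, false))), Seq(0.5)) would yield Seq((0.5, Confusion(1, 0, 0, 1))). This version deliberately re-filters the data once per threshold for readability; as I understand it, BinaryClassificationMetrics instead aggregates label counts per score and takes cumulative counts across thresholds in a single pass, which is the internal code Xiangrui linked to.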