OK, I've opened a jira. https://issues.apache.org/jira/browse/SPARK-17718
And OK, I forgot the loss is summed in the objective function provided. My mistake.

On a tangentially related topic, why is there a half in front of the squared loss? Similarly, the L2 regularizer has a half. It's just a constant, so the objective's minimum is not affected, but I'm still curious why the half wasn't left out.

On Mon, Sep 26, 2016 at 4:40 AM, Sean Owen <so...@cloudera.com> wrote:
> Yes, I think that footnote could be a lot more prominent, or pulled up
> right under the table.
>
> I also think it would be fine to present the {0,1} formulation. It's
> actually more recognizable, I think, for log-loss in that form. It's
> probably less recognizable for hinge loss, but consistency is more
> important. There's just an extra (2y-1) term, at worst.
>
> The loss here is per instance, and implicitly summed over all
> instances. I think that is probably not confusing for the reader; if
> they're reading this at all to double-check just what formulation is
> being used, I think they'd know that. But it's worth a note.
>
> The loss is summed in the case of log-loss, not multiplied (if that's
> what you're saying).
>
> Those are decent improvements; feel free to open a pull request / JIRA.
>
>
> On Mon, Sep 26, 2016 at 6:22 AM, Tobi Bosede <ani.to...@gmail.com> wrote:
> > The loss function listed here for logistic regression is confusing. It
> > seems to imply that Spark uses only -1 and 1 class labels. However, it
> > uses 0 and 1, as the very inconspicuous note quoted below (under
> > Classification) says. We need to make this point more visible to avoid
> > confusion.
> >
> > Better yet, we should replace the loss function listed with the one for
> > 0/1 labels, no matter how mathematically inconvenient, since that is
> > what is actually implemented in Spark.
> >
> > More problematic, the loss function (even in this "convenient" form) is
> > actually incorrect. This is because it is missing either a summation
> > (sigma) outside the log or a product (pi) inside the log, since the loss
> > for logistic regression is the negative log likelihood. So there are
> > multiple problems with the documentation. Please advise on steps to fix
> > this across the documentation for all versions, or whether some fixes
> > are already in place.
> >
> > "Note that, in the mathematical formulation in this guide, a binary
> > label y is denoted as either +1 (positive) or −1 (negative), which is
> > convenient for the formulation. However, the negative label is
> > represented by 0 in spark.mllib instead of −1, to be consistent with
> > multiclass labeling."
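For what it's worth, both points in the thread can be checked with a quick standalone sketch (plain Python, not Spark code; the toy values are made up): the 1/2 in front of the squared loss is there so the 2 from differentiating the square cancels, leaving the clean gradient (w·x − y)x, and the ±1-label log-loss agrees with the usual {0,1} cross-entropy via the (2y − 1) term.

```python
import math

def squared_loss(w, x, y):
    """L(w) = 1/2 * (w*x - y)^2 for a single scalar feature."""
    return 0.5 * (w * x - y) ** 2

def squared_loss_grad(w, x, y):
    """With the 1/2, the 2 from the chain rule cancels: dL/dw = (w*x - y)*x."""
    return (w * x - y) * x

# Finite-difference check that the analytic gradient has no stray factor of 2.
w, x, y = 0.7, 1.3, 2.0
eps = 1e-6
numeric = (squared_loss(w + eps, x, y) - squared_loss(w - eps, x, y)) / (2 * eps)
assert abs(numeric - squared_loss_grad(w, x, y)) < 1e-6

def logloss_pm1(margin, y01):
    """+/-1 formulation: log(1 + exp(-(2y-1) * margin)), with y in {0,1}."""
    return math.log(1.0 + math.exp(-(2 * y01 - 1) * margin))

def logloss_01(margin, y01):
    """{0,1} formulation: -y*log(p) - (1-y)*log(1-p) with p = sigmoid(margin)."""
    p = 1.0 / (1.0 + math.exp(-margin))
    return -(y01 * math.log(p) + (1 - y01) * math.log(1.0 - p))

# The two per-instance losses coincide for every margin and label.
for margin in (-2.0, 0.0, 1.5):
    for y01 in (0, 1):
        assert abs(logloss_pm1(margin, y01) - logloss_01(margin, y01)) < 1e-12
```

(The same cancellation argument explains the half on the L2 regularizer: the gradient of (λ/2)‖w‖² is simply λw.)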