Cross-posting as I got no response on the users mailing list last
week. Any response would be appreciated :)

Josh


---------- Forwarded message ----------
From: Josh Devins <j...@soundcloud.com>
Date: 9 February 2015 at 15:59
Subject: [MLlib] Performance problem in GeneralizedLinearAlgorithm
To: "u...@spark.apache.org" <u...@spark.apache.org>


I've been looking into a performance problem when using
LogisticRegressionWithLBFGS (and, in turn, GeneralizedLinearAlgorithm).
Here's an outline of what I've figured out so far; it would be great
to get some confirmation of the problem, some input on how widespread
it might be, and any ideas on a nice way to fix it.

Context:
- I will reference `branch-1.1` since we are currently on v1.1.1;
however, this still appears to be a problem on `master`
- The cluster runs on YARN, on bare-metal hardware (no VMs)
- I've not filed a Jira issue yet but can do so
- This problem affects all algorithms based on
GeneralizedLinearAlgorithm (GLA) that use feature scaling, e.g.
LogisticRegressionWithLBFGS (it is less severe without scaling, but
still a problem); a minimal usage sketch follows below
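
For reference, here's a minimal sketch of the user-facing call path
that ends up in GLA. The file path, app name and object name are
placeholders, and the dataset is assumed to be in LibSVM format; the
point is just to show where the only user-level persist happens:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
  import org.apache.spark.mllib.util.MLUtils

  object GlaCallPath {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("gla-call-path"))

      // The user-level persist -- this is the only RDD the caller can cache.
      val input = MLUtils.loadLibSVMFile(sc, "/path/to/data.libsvm").cache()

      // run() (from GeneralizedLinearAlgorithm) maps over `input` to apply
      // feature scaling (and the intercept, if enabled), producing an
      // internal, unpersisted `data` RDD that LBFGS then iterates over.
      val model = new LogisticRegressionWithLBFGS().run(input)
      println(s"weights: ${model.weights}")

      sc.stop()
    }
  }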

Problem Outline:
- Starting at GLA line 177
(https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L177),
a feature scaler is created using the `input` RDD
- Refer next to line 186, which maps over the `input` RDD and
produces a new `data` RDD
(https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L186)
- If feature scaling or intercept-adding is enabled, the user's
`input` RDD is therefore mapped over *after* the user has (hopefully)
persisted it and *before* the result goes into the (iterative)
optimizer on line 204
(https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala#L204)
- Since the `data` RDD iterated over in the optimizer is not
persisted, each time the cost function is evaluated in the optimizer
(e.g. LBFGS --
https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala#L198),
the map phase first goes back and reruns the feature scaling (map
tasks on `input`) and then maps with the cost function (two maps
pipelined into one stage)
- As a result, parts of the StandardScaler are actually run again
(perhaps only because the variable is `lazy`?), and this can be
costly; see line 84
(https://github.com/apache/spark/blob/branch-1.1/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala#L84)
- For small datasets and/or few iterations this is not much of a
problem, but we found that adding a `data.persist()` right before
running the optimizer brought the optimizer's map iterations down
from 5:30 to 0:45 (see the sketch after this list)
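
To make the workaround concrete, here's a self-contained sketch that
mirrors what GLA's run() does internally (fit a scaler, map `input` to
(label, features) pairs, then make repeated passes over the result).
It is illustrative only: the object name, toy data and storage level
are mine, and the real fix would be the equivalent persist/unpersist
around the `data` RDD inside GeneralizedLinearAlgorithm, right before
the optimizer.optimize call on line 204:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.feature.StandardScaler
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint
  import org.apache.spark.storage.StorageLevel

  object PersistScaledData {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("persist-scaled-data"))

      // The user-level cache on the raw input, as GLA expects today.
      val input = sc.parallelize(Seq(
        LabeledPoint(0.0, Vectors.dense(1.0, 2.0)),
        LabeledPoint(1.0, Vectors.dense(3.0, 4.0))
      )).cache()

      // Mirrors GLA lines ~177-186: fit a scaler on the features, then map
      // `input` to the (label, scaled features) pairs the optimizer consumes.
      val scaler = new StandardScaler(withMean = false, withStd = true)
        .fit(input.map(_.features))
      val data = input.map(lp => (lp.label, scaler.transform(lp.features)))

      // The proposed addition: without this persist, every pass below re-runs
      // the scaling map because `data` itself has no materialized copy.
      data.persist(StorageLevel.MEMORY_ONLY)

      // Stand-in for the optimizer's repeated passes over `data` (the cost
      // function evaluation at each LBFGS iteration).
      (1 to 5).foreach(_ => data.map(_._2).count())

      data.unpersist()
      sc.stop()
    }
  }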

I had a very tough time coming up with a nice way to describe my
debugging sessions in an email, so I hope this gets the main points
across. Happy to clarify anything if necessary (also over live
debugging/Skype/phone if that's helpful).

Thanks,

Josh
