Hi,

LogisticAggregator [1] scales every sample on every iteration. Without this per-sample scaling, binaryUpdateInPlace could be rewritten in terms of BLAS.dot, which should significantly improve performance. However, there is a comment [2] saying that standardizing and caching the dataset before training would "create a lot of overhead".
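To make the question concrete, here is a rough sketch of what I mean (this is not Spark's actual code; the names featuresStd, coefficients, and the simple loops standing in for BLAS.dot are just placeholders for illustration): today the margin is computed with on-the-fly scaling inside the loop, whereas with a pre-standardized dataset it would reduce to a single dot product that native BLAS could handle.

// Minimal sketch only; the real binaryUpdateInPlace lives in LogisticAggregator [1],
// and the names below (featuresStd, marginWith*) are placeholders, not Spark API.
object MarginSketch {

  // Conceptually what happens today: every feature value is divided by its
  // standard deviation on every pass over the data, so the inner loop cannot
  // be expressed as a single dense dot product.
  def marginWithOnTheFlyScaling(
      features: Array[Double],
      coefficients: Array[Double],
      featuresStd: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < features.length) {
      if (featuresStd(i) != 0.0) {
        sum += coefficients(i) * (features(i) / featuresStd(i))
      }
      i += 1
    }
    sum
  }

  // What it could look like if the dataset were standardized once before
  // training: the margin is a plain dot product, which could be delegated to
  // a native BLAS ddot call (here just a simple loop standing in for BLAS.dot).
  def marginWithPrescaledFeatures(
      scaledFeatures: Array[Double],
      coefficients: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < scaledFeatures.length) {
      sum += coefficients(i) * scaledFeatures(i)
      i += 1
    }
    sum
  }
}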
What kind of overhead does this refer to, and what is the rationale for not scaling the dataset prior to training?

Thanks,
Filipp

[1] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L229
[2] https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L40