[jira] [Created] (SPARK-22555) Possibly incorrect scaling of L2 regularization strength in LinearRegression
Andrew Crosby created SPARK-22555: - Summary: Possibly incorrect scaling of L2 regularization strength in LinearRegression Key: SPARK-22555 URL: https://issues.apache.org/jira/browse/SPARK-22555 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.2.0 Reporter: Andrew Crosby Priority: Minor

According to the Spark documentation, the linear regression estimator minimizes the regularized sum of squares:

1/N Sum(y - w x)^2^ + λ( (1-α) |w|~2~^2^ + α |w|~1~ )

Under the hood, in order to improve convergence, the optimization algorithms actually work in a scaled space using the variables y' = y / σ ~y~, x' = x / σ ~x~ and w' = w σ ~x~ / σ ~y~. In terms of these scaled variables, the above expression becomes:

σ ~y~^2^ ( 1/N Sum(y' - w' x')^2^ + λ( (1-α) / σ ~x~^2^ |w'|~2~^2^ + α / (σ ~x~ σ ~y~) |w'|~1~ ) )

The solution in scaled space is equivalent to that of the original problem, provided that the regularization strengths are suitably adjusted: the effective L1 regularization strength should be λ α / (σ ~x~ σ ~y~), and the effective L2 regularization strength should be λ (1-α) / σ ~x~^2^.

However, this doesn't match the regularization strengths that are actually used. While the factors of σ ~x~ are correctly included (or correctly omitted when the standardization parameter is set), the 1 / σ ~y~ scaling is applied to both the L1 and L2 regularization parameters instead of to the L1 parameter alone. Both LinearRegression.scala and WeightedLeastSquares.scala contain code along the following lines:

{code}
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
{code}

Admittedly, the unit tests confirm that the current behaviour matches that of R's glmnet; it just doesn't match the behaviour claimed in the documentation.
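The scaling argument can be checked numerically. The sketch below is not Spark's code: it is a hypothetical one-dimensional, no-intercept ridge regression (pure L2, α = 0) with a closed-form solution, written in plain NumPy. It confirms that solving in scaled space reproduces the original solution only when the effective L2 strength is λ / σ~x~^2^, with no 1/σ~y~ factor, and that dividing by σ~y~ as well (the behaviour described above) gives a different model:

```python
import numpy as np

def ridge_1d(x, y, pen):
    # Closed form for: minimize (1/N)*sum((y - w*x)^2) + pen*w^2
    n = len(x)
    return (x @ y / n) / (x @ x / n + pen)

rng = np.random.default_rng(0)
x = rng.normal(3.0, 2.5, 500)            # feature with sd far from 1
y = 1.7 * x + rng.normal(0.0, 4.0, 500)  # label with sd far from 1
lam = 0.8
sx, sy = x.std(), y.std()

w = ridge_1d(x, y, lam)
# Scaled space (x' = x/sx, y' = y/sy, w' = w*sx/sy): equivalence
# requires the effective penalty lam / sx**2 ...
w_correct = ridge_1d(x / sx, y / sy, lam / sx**2) * sy / sx
# ... whereas also dividing by sy, as the quoted code does, changes the answer
w_extra_sy = ridge_1d(x / sx, y / sy, lam / (sx**2 * sy)) * sy / sx

assert np.isclose(w, w_correct)
assert not np.isclose(w, w_extra_sy)
```

The two asserts hold for any non-unit σ~x~, σ~y~, which is exactly the discrepancy between the documented objective and the implemented one.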
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23537) Logistic Regression without standardization
[ https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634263#comment-16634263 ] Andrew Crosby commented on SPARK-23537: --- The different results for standardization=True vs standardization=False are to be expected. The reason for this difference is that the two settings lead to different effective regularization strengths: with standardization=True, the regularization is applied to the scaled model coefficients, whereas with standardization=False it is applied to the unscaled model coefficients. As implemented in Spark, the features actually get scaled regardless of whether standardization is set to true or false, but when standardization=False the strength of the regularization in the scaled space is altered to account for this. See the comment at [https://github.com/apache/spark/blob/a802c69b130b69a35b372ffe1b01289577f6fafb/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L685]. As an aside, your results show a very slow rate of convergence when standardization is set to false. I believe this to be an issue caused by the continued application of feature scaling when standardization=False, which can lead to very large gradients from the regularization terms in the solver. I've recently raised SPARK-25544 to cover this issue. > Logistic Regression without standardization > --- > > Key: SPARK-23537 > URL: https://issues.apache.org/jira/browse/SPARK-23537 > Project: Spark > Issue Type: Bug > Components: ML, Optimizer >Affects Versions: 2.0.2, 2.2.1 >Reporter: Jordi >Priority: Major > Attachments: non-standardization.log, standardization.log > > > I'm trying to train a Logistic Regression model, using Spark 2.2.1. I prefer > not to use standardization, since all my features are binary, built with the > hashing trick (2^20 sparse vector). 
> I trained two models to compare results; I expected to end up with two > similar models, since it seems that internally the optimizer performs > standardization and "de-standardization" (when it's deactivated) in order to > improve the convergence. > Here is the code I used: > {code:java} > val lr = new org.apache.spark.ml.classification.LogisticRegression() > .setRegParam(0.05) > .setElasticNetParam(0.0) > .setFitIntercept(true) > .setMaxIter(5000) > .setStandardization(false) > val model = lr.fit(data) > {code} > The results are disturbing: I end up with two significantly different models. > *Standardization:* > Training time: 8min. > Iterations: 37 > Intercept: -4.386090107224499 > Max weight: 4.724752299455218 > Min weight: -3.560570478164854 > Mean weight: -0.049325201841722795 > l1 norm: 116710.39522171849 > l2 norm: 402.2581552373957 > Non zero weights: 128084 > Non zero ratio: 0.12215042114257812 > Last 10 LBFGS Val and Grad Norms: > {code:java} > 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) > 0.000559057 > 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) > 0.000267527 > 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) > 0.000205888 > 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) > 0.000144173 > 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) > 0.000140296 > 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) > 0.000122709 > 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) > 3.08789e-05 > 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) > 2.23806e-05 > 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) > 1.47422e-05 > 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) > 2.37442e-05 > {code} > *No standardization:* > Training time: 7h 14 min. 
> Iterations: 4992 > Intercept: -4.216690468849263 > Max weight: 0.41930559767624725 > Min weight: -0.5949182537565524 > Mean weight: -1.2659769019012E-6 > l1 norm: 14.262025330648694 > l2 norm: 1.2508777025612263 > Non zero weights: 128955 > Non zero ratio: 0.12298107147216797 > Last 10 LBFGS Val and Grad Norms: > {code:java} > 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) > 0.217581 > 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) > 0.185812 > 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) > 0.214570 > 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) > 0.489464 > 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) > 0.178448 > 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) > 0.172527 > 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm:
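The gap between the two runs is not a bug in itself: penalizing the scaled coefficients and penalizing the raw coefficients are genuinely different problems whenever the features' standard deviations are far from 1. A minimal sketch of this (plain NumPy, a hypothetical 1-D no-intercept ridge, not Spark's solver):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.0, 5.0, 300)            # feature with sd far from 1
y = 0.6 * x + rng.normal(0.0, 1.0, 300)
lam, n = 1.0, len(x)

def ridge_1d(pen):
    # Closed form for: minimize (1/N)*sum((y - w*x)^2) + pen*w^2
    return (x @ y / n) / (x @ x / n + pen)

# standardization=True: the penalty effectively acts on w * sd(x)
w_standardized = ridge_1d(lam * x.std() ** 2)
# standardization=False: the penalty acts on the raw coefficient w
w_raw = ridge_1d(lam)
assert not np.isclose(w_standardized, w_raw)  # different effective regularization
```

So the two settings are expected to produce different coefficients and norms, as observed; only the extreme iteration count of the second run points at a separate problem.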
[jira] [Commented] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling
[ https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634268#comment-16634268 ] Andrew Crosby commented on SPARK-25544: --- SPARK-23537 contains what might be another occurrence of this issue. The model in that case contains only binary features, so standardization shouldn't really be used. However, turning standardization off causes the model to take 4992 iterations to converge, as opposed to 37 iterations when standardization is turned on. > Slow/failed convergence in Spark ML models due to internal predictor scaling > > > Key: SPARK-25544 > URL: https://issues.apache.org/jira/browse/SPARK-25544 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.2 > Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11 >Reporter: Andrew Crosby >Priority: Major > > The LinearRegression and LogisticRegression estimators in Spark ML can take a > large number of iterations to converge, or fail to converge altogether, when > trained using the l-bfgs method with standardization turned off. > *Details:* > LinearRegression and LogisticRegression standardize their input features by > default. In SPARK-8522 the option to disable standardization was added. This > is implemented internally by changing the effective strength of > regularization rather than disabling the feature scaling. Mathematically, > both changing the effective regularization strength and disabling feature > scaling should give the same solution, but they can have very different > convergence properties. > The usual justification given for scaling features is that it ensures that all > covariates are O(1), which should improve numerical convergence, but this > argument does not account for the regularization term. This doesn't cause any > issues if standardization is set to true, since all features will have an > O(1) regularization strength. 
But it does cause issues when standardization > is set to false, since the effective regularization strength of feature i is > now O(1/ sigma_i^2), where sigma_i is the standard deviation of the feature. > This means that predictors with small standard deviations (which can occur > legitimately, e.g. via one hot encoding) will have very large effective > regularization strengths and consequently lead to very large gradients and > thus poor convergence in the solver. > *Example code to recreate:* > To demonstrate just how bad these convergence issues can be, here is a very > simple test case which builds a linear regression model with a categorical > feature, a numerical feature and their interaction. When fed the specified > training data, this model will fail to converge before it hits the maximum > iteration limit. In this case, it is the interaction between category "2" and > the numeric feature that leads to a feature with a small standard deviation. > Training data: > ||category||numericFeature||label|| > |1|1.0|0.5| > |1|0.5|1.0| > |2|0.01|2.0| > > {code:java} > val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, > 2.0)).toDF("category", "numericFeature", "label") > val indexer = new StringIndexer().setInputCol("category") > .setOutputCol("categoryIndex") > val encoder = new > OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false) > val interaction = new Interaction().setInputCols(Array("categoryEncoded", > "numericFeature")).setOutputCol("interaction") > val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", > "interaction")).setOutputCol("features") > val model = new > LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100) > val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, > assembler, model)) > val pipelineModel = pipeline.fit(df) > val 
numIterations = > pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code} > *Possible fix:* > These convergence issues can be fixed by turning off feature scaling when > standardization is set to false, rather than using an effective regularization > strength. This can be hacked into LinearRegression.scala by simply replacing > line 423 > {code:java} > val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt) > {code} > with > {code:java} > val featuresStd = if ($(standardization)) > featuresSummarizer.variance.toArray.map(math.sqrt) else > featuresSummarizer.variance.toArray.map(x => 1.0) > {code} > Rerunning the above test code with that hack in place will lead to > convergence after just 4 iterations instead of hitting the max iterations > limit! > *Impact:* > I
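The O(1/sigma_i^2) blow-up is easy to quantify. A small illustrative calculation (NumPy; the numbers are chosen for illustration only, they are not taken from the example above):

```python
import numpy as np

# When standardization=False, the effective L2 strength of feature i in the
# scaled space is ~ lam / sigma_i^2, where sigma_i is the feature's std dev.
lam = 1.0
sigmas = np.array([1.0, 0.1, 0.01])  # 0.01 mimics a rare one-hot/interaction feature
effective = lam / sigmas ** 2        # roughly [1, 100, 10000]
# The regularization gradient on the last feature is inflated by the same
# ~10^4 factor, which is what stalls L-BFGS when standardization is off.
```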
[jira] [Created] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling
Andrew Crosby created SPARK-25544: - Summary: Slow/failed convergence in Spark ML models due to internal predictor scaling Key: SPARK-25544 URL: https://issues.apache.org/jira/browse/SPARK-25544 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.2 Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11 Reporter: Andrew Crosby The LinearRegression and LogisticRegression estimators in Spark ML can take a large number of iterations to converge, or fail to converge altogether, when trained using the l-bfgs method with standardization turned off. *Details:* LinearRegression and LogisticRegression standardize their input features by default. In SPARK-8522 the option to disable standardization was added. This is implemented internally by changing the effective strength of regularization rather than disabling the feature scaling. Mathematically, both changing the effective regularization strength and disabling feature scaling should give the same solution, but they can have very different convergence properties. The usual justification given for scaling features is that it ensures that all covariates are O(1), which should improve numerical convergence, but this argument does not account for the regularization term. This doesn't cause any issues if standardization is set to true, since all features will have an O(1) regularization strength. But it does cause issues when standardization is set to false, since the effective regularization strength of feature i is now O(1/ sigma_i^2), where sigma_i is the standard deviation of the feature. This means that predictors with small standard deviations will have very large effective regularization strengths and consequently lead to very large gradients and thus poor convergence in the solver. 
*Example code to recreate:* To demonstrate just how bad these convergence issues can be, here is a very simple test case which builds a linear regression model with a categorical feature, a numerical feature and their interaction. When fed the specified training data, this model will fail to converge before it hits the maximum iteration limit. Training data: ||category||numericFeature||label|| |1|1.0|0.5| |1|0.5|1.0| |2|0.01|2.0| {code:java} val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0)).toDF("category", "numericFeature", "label") val indexer = new StringIndexer().setInputCol("category") .setOutputCol("categoryIndex") val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryEncoded").setDropLast(false) val interaction = new Interaction().setInputCols(Array("categoryEncoded", "numericFeature")).setOutputCol("interaction") val assembler = new VectorAssembler().setInputCols(Array("categoryEncoded", "interaction")).setOutputCol("features") val model = new LinearRegression().setFeaturesCol("features").setLabelCol("label").setPredictionCol("prediction").setStandardization(false).setSolver("l-bfgs").setRegParam(1.0).setMaxIter(100) val pipeline = new Pipeline().setStages(Array(indexer, encoder, interaction, assembler, model)) val pipelineModel = pipeline.fit(df) val numIterations = pipelineModel.stages(4).asInstanceOf[LinearRegressionModel].summary.totalIterations{code} *Possible fix:* These convergence issues can be fixed by turning off feature scaling when standardization is set to false, rather than using an effective regularization strength. 
This can be hacked into LinearRegression.scala by simply replacing line 423 {code:java} val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt) {code} with {code:java} val featuresStd = if ($(standardization)) featuresSummarizer.variance.toArray.map(math.sqrt) else featuresSummarizer.variance.toArray.map(x => 1.0) {code} Rerunning the above test code with that hack in place will lead to convergence after just 4 iterations instead of hitting the max iterations limit! *Impact:* I can't speak for other people, but I've personally encountered these convergence issues several times when building production scale Spark ML models, and have resorted to writing my own implementation of LinearRegression with the above hack in place. The issue is made worse by the fact that Spark does not raise an error when the maximum number of iterations is hit, so the first time you encounter the issue it can take a while to figure out what is going on.
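Since Spark is silent when the iteration budget runs out, a cheap defensive check is to compare the fitted model's iteration count against the configured maximum. The helper below is a hypothetical sketch of just the threshold logic (in Spark one would feed it `summary.totalIterations` and the estimator's `getMaxIter()`); the 0.99 slack is an arbitrary choice so that runs that stop just short of the cap, like the 4992-of-5000 run above, are still flagged:

```python
def likely_hit_cap(total_iterations: int, max_iter: int, slack: float = 0.99) -> bool:
    """Heuristic: flag runs that consumed (almost) the entire iteration budget,
    since Spark raises no error when maxIter is exhausted."""
    return total_iterations >= slack * max_iter

# The two runs reported in SPARK-23537, both with maxIter=5000:
assert not likely_hit_cap(37, 5000)   # standardization=True: clearly converged
assert likely_hit_cap(4992, 5000)     # standardization=False: suspect
```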
[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer
[ https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Crosby updated SPARK-26970: -- Description: The Interaction transformer ([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) is missing from the set of pyspark feature transformers ([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]). This means that it is impossible to create a model that includes an Interaction transformer with pyspark. It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code} 
It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code} > Can't load PipelineModel that was created in Scala with Python due to missing > Interaction transformer > - > > Key: SPARK-26970 > URL: https://issues.apache.org/jira/browse/SPARK-26970 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Andrew Crosby >Priority: Major > > The Interaction transformer > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala] > is missing from the set of pyspark feature transformers > [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py] > > This means that it is impossible to create a model that includes an > Interaction transformer with pyspark. It also means that attempting to load a > PipelineModel created in Scala that includes an Interaction transformer with > pyspark fails with the following error: > {code:java} > AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
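The load failure is the standard Python attribute-lookup error: when a saved pipeline stage names a class that the pyspark.ml.feature module does not define, resolving it raises AttributeError. A minimal sketch of that resolution step (simplified and hypothetical; the real reader logic lives in pyspark's ML persistence code):

```python
import types

# Stand-in for pyspark.ml.feature: has some transformers, but no Interaction.
feature_module = types.SimpleNamespace(Binarizer=object)

def resolve_stage(module, class_name):
    """Resolve a stage's Python class by name, as a pipeline loader must."""
    return getattr(module, class_name)  # raises AttributeError if missing

assert resolve_stage(feature_module, "Binarizer") is object
try:
    resolve_stage(feature_module, "Interaction")
except AttributeError:
    pass  # this is the failure mode reported above
else:
    raise AssertionError("expected AttributeError for missing Interaction")
```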
[jira] [Created] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer
Andrew Crosby created SPARK-26970: - Summary: Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer Key: SPARK-26970 URL: https://issues.apache.org/jira/browse/SPARK-26970 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.4.0 Reporter: Andrew Crosby The Interaction transformer ([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) is missing from the set of pyspark feature transformers ([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]). This means that it is impossible to create a model that includes an Interaction transformer with pyspark. It also means that attempting to load a PipelineModel created in Scala that includes an Interaction transformer with pyspark fails with the following error: {code:java} AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' {code}
[jira] [Created] (SPARK-28062) HuberAggregator copies coefficients vector every time an instance is added
Andrew Crosby created SPARK-28062: - Summary: HuberAggregator copies coefficients vector every time an instance is added Key: SPARK-28062 URL: https://issues.apache.org/jira/browse/SPARK-28062 Project: Spark Issue Type: Bug Components: ML Affects Versions: 3.0.0 Reporter: Andrew Crosby Every time an instance is added to the HuberAggregator, a copy of the coefficients vector is created (see code snippet below). This causes a performance degradation, which is particularly severe when the instances have long sparse feature vectors.
{code:scala}
def add(instance: Instance): HuberAggregator = {
  instance match {
    case Instance(label, weight, features) =>
      require(numFeatures == features.size, s"Dimensions mismatch when adding new sample." +
        s" Expecting $numFeatures but got ${features.size}.")
      require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")

      if (weight == 0.0) return this
      val localFeaturesStd = bcFeaturesStd.value
      val localCoefficients = bcParameters.value.toArray.slice(0, numFeatures)
      val localGradientSumArray = gradientSumArray
      // Snip
  }
}
{code}
The LeastSquaresAggregator class avoids this performance issue via the use of transient lazy class variables to store such reused values. Applying a similar approach to HuberAggregator gives a significant speed boost.
Running the script below locally on my machine gives the following timing results:
{noformat}
Current implementation:
Time(s): 540.1439919471741
Iterations: 26
Intercept: 0.518109382890512
Coefficients: [0.0, -0.2516936902000245, 0.0, 0.0, -0.19633887469839809, 0.0, -0.39565545053893925, 0.0, -0.18617574426698882, 0.0478922416670529]

Modified implementation to match LeastSquaresAggregator:
Time(s): 46.82946586608887
Iterations: 26
Intercept: 0.5181093828893774
Coefficients: [0.0, -0.25169369020031357, 0.0, 0.0, -0.1963388746927919, 0.0, -0.3956554505389966, 0.0, -0.18617574426702874, 0.04789224166878518]
{noformat}
{code:python}
from random import random, randint, seed
import time

from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

seed(0)
spark = SparkSession.builder.appName('huber-speed-test').getOrCreate()
df = spark.createDataFrame([[randint(0, 10), random()] for i in range(10)], ["category", "target"])
ohe = OneHotEncoder(inputCols=["category"], outputCols=["encoded_category"]).fit(df)
lr = LinearRegression(featuresCol="encoded_category", labelCol="target", loss="huber", regParam=1.0)

start = time.time()
model = lr.fit(ohe.transform(df))
end = time.time()

print("Time(s): " + str(end - start))
print("Iterations: " + str(model.summary.totalIterations))
print("Intercept: " + str(model.intercept))
print("Coefficients: " + str(list(model.coefficients)[0:10]))
{code}
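The caching pattern described above (materialize the broadcast coefficients once, analogous to the transient lazy vals in LeastSquaresAggregator, instead of copying them on every add) can be sketched in pure Python; the class and method names here are hypothetical stand-ins, not Spark APIs:

```python
class SlowAggregator:
    """Copies the coefficients on every add() call, like the reported bug."""

    def __init__(self, coefficients):
        self._coefficients = coefficients

    def add(self, features):
        local = list(self._coefficients)  # O(numFeatures) copy per instance
        return sum(local[i] * v for i, v in features)

class FastAggregator:
    """Materializes the coefficients once and reuses them across add() calls."""

    def __init__(self, coefficients):
        self._local = list(coefficients)  # one-time copy, cached

    def add(self, features):
        return sum(self._local[i] * v for i, v in features)

coeffs = [0.5, 0.0, 1.5]
sparse = [(0, 2.0), (2, 4.0)]  # (index, value) pairs of a sparse vector
# Both variants compute the same dot product; only the per-row cost differs.
assert SlowAggregator(coeffs).add(sparse) == FastAggregator(coeffs).add(sparse) == 7.0
```

The fix changes per-instance cost from O(numFeatures) to O(active features), which matches the roughly 10x speed-up reported above for long sparse vectors.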
[jira] [Commented] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer
[ https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16822548#comment-16822548 ] Andrew Crosby commented on SPARK-26970: --- The code changes required for this looked relatively straightforward, so I've had a go at creating a pull request myself (https://github.com/apache/spark/pull/24426). [~huaxingao] apologies if I've duplicated work that you've already done. > Can't load PipelineModel that was created in Scala with Python due to missing > Interaction transformer > - > > Key: SPARK-26970 > URL: https://issues.apache.org/jira/browse/SPARK-26970 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Andrew Crosby >Priority: Minor > > The Interaction transformer > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala] > is missing from the set of pyspark feature transformers > [https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py|https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py] > > This means that it is impossible to create a model that includes an > Interaction transformer with pyspark. It also means that attempting to load a > PipelineModel created in Scala that includes an Interaction transformer with > pyspark fails with the following error: > {code:java} > AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction' > {code}