[jira] [Created] (SPARK-22555) Possibly incorrect scaling of L2 regularization strength in LinearRegression

2017-11-19 Thread Andrew Crosby (JIRA)
Andrew Crosby created SPARK-22555:
-

 Summary: Possibly incorrect scaling of L2 regularization strength 
in LinearRegression
 Key: SPARK-22555
 URL: https://issues.apache.org/jira/browse/SPARK-22555
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: Andrew Crosby
Priority: Minor


According to the Spark documentation, the linear regression estimator minimizes 
the regularized sum of squares:

1/N Sum(y - w x)^2^ + λ( (1-α) |w|~2~ + α |w|~1~ )

Under the hood, in order to improve convergence, the optimization algorithms 
actually work in scaled space using the variables y' = y / σ~y~, x' = x / σ~x~ 
and w' = w σ~x~ / σ~y~. In terms of these scaled variables, the above 
expression becomes:

σ~y~^2^ ( 1/N Sum(y' - w' x')^2^ + λ( (1-α) / σ~x~^2^ |w'|~2~ + α / (σ~x~ σ~y~) |w'|~1~ ) )

The solution in scaled space is equivalent to the original problem, provided 
that the regularization strengths are suitably adjusted. The effective L1 
regularization strength should be λα / (σ~x~ σ~y~) and the effective L2 
regularization strength should be λ(1-α) / σ~x~^2^.
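
The same bookkeeping in standard notation (a sketch; |w|~2~ is read here as the 
squared L2 norm, which is what the scaling factors above require):

{noformat}
With y = \sigma_y y',  x = \sigma_x x',  w = (\sigma_y / \sigma_x) w':

\frac{1}{N}\sum (y - w x)^2 + \lambda\big((1-\alpha)\|w\|_2^2 + \alpha\|w\|_1\big)
  = \sigma_y^2 \Big(\frac{1}{N}\sum (y' - w' x')^2
      + \lambda\big(\tfrac{1-\alpha}{\sigma_x^2}\|w'\|_2^2
      + \tfrac{\alpha}{\sigma_x \sigma_y}\|w'\|_1\big)\Big)
{noformat}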

However, this doesn't quite match the regularization strengths that are 
actually used. While the factors of σ~x~ are correctly included (or correctly 
omitted if the standardization parameter is set), it appears that the 1/σ~y~ 
scaling is applied to both the L1 and L2 regularization parameters instead of 
just to the L1 regularization parameter. Both LinearRegression.scala and 
WeightedLeastSquares.scala contain code along the following lines:

{code}
// Note: the 1/σ_y factor is folded into the combined regParam, so it reaches
// both the L1 and the L2 term:
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
{code}
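
For contrast, here is a minimal self-contained sketch of the split that the 
documented objective would imply (my reading of the analysis above, not a 
tested patch; the σ~x~ factors are handled elsewhere in the code and omitted 
here):

{code:scala}
val regParam = 0.1          // λ
val elasticNetParam = 0.5   // α
val yStd = 2.0              // σ_y

// Only the L1 strength picks up the 1/σ_y factor:
val effectiveL1RegParam = elasticNetParam * regParam / yStd    // λα / σ_y
val effectiveL2RegParam = (1.0 - elasticNetParam) * regParam   // λ(1-α)
{code}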

Admittedly, the unit tests confirm that the current behaviour matches that of 
R's glmnet; it just doesn't seem to match the behaviour claimed in the 
documentation.






[jira] [Commented] (SPARK-23537) Logistic Regression without standardization

2018-10-01 Thread Andrew Crosby (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634263#comment-16634263
 ] 

Andrew Crosby commented on SPARK-23537:
---

The different results for standardization=True vs standardization=False are to 
be expected. The reason for this difference is that the two settings lead to 
different effective regularization strengths. With standardization=True, the 
regularization is applied to the scaled model coefficients, whereas with 
standardization=False, the regularization is applied to the unscaled model 
coefficients.

As it's implemented in Spark, the features actually get scaled regardless of 
whether standardization is set to true or false, but when standardization=False 
the strength of the regularization in the scaled space is altered to account 
for this. See the comment at 
[https://github.com/apache/spark/blob/a802c69b130b69a35b372ffe1b01289577f6fafb/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L685].
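
As a rough numerical illustration (made-up values, not Spark internals): 
penalizing the scaled coefficient w'~i~ = w~i~ σ~i~ with strength λ is 
equivalent to penalizing the unscaled w~i~ with strength λ σ~i~^2^, so 
emulating standardization=False inside the scaled space amounts to dividing λ 
by σ~i~^2^ per feature:

{code:scala}
val regParam = 0.05
val featureStd = Array(1.0, 0.5, 0.01)  // hypothetical per-feature std devs
val effectiveL2 = featureStd.map(s => regParam / (s * s))
// => 0.05, 0.2, 500.0 (a low-variance feature gets a huge effective penalty)
{code}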

 

As an aside, your results show a very slow rate of convergence when 
standardization is set to false. I believe this to be an issue caused by the 
continued application of feature scaling when standardization=False, which can 
lead to very large gradients from the regularization terms in the solver. I've 
recently raised SPARK-25544 to cover this issue.

> Logistic Regression without standardization
> ---
>
> Key: SPARK-23537
> URL: https://issues.apache.org/jira/browse/SPARK-23537
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.2, 2.2.1
>Reporter: Jordi
>Priority: Major
> Attachments: non-standardization.log, standardization.log
>
>
> I'm trying to train a Logistic Regression model using Spark 2.2.1. I prefer 
> not to use standardization since all my features are binary, using the 
> hashing trick (2^20 sparse vector).
> I trained two models to compare results. I was expecting to end up with two 
> similar models, since it seems that internally the optimizer performs 
> standardization and "de-standardization" (when it's deactivated) in order to 
> improve the convergence.
> Here is the code I used:
> {code:java}
> val lr = new org.apache.spark.ml.classification.LogisticRegression()
>   .setRegParam(0.05)
>   .setElasticNetParam(0.0)
>   .setFitIntercept(true)
>   .setMaxIter(5000)
>   .setStandardization(false)
> val model = lr.fit(data)
> {code}
> The results are disturbing: I end up with two significantly different models.
> *Standardization:*
> Training time: 8min.
> Iterations: 37
> Intercept: -4.386090107224499
> Max weight: 4.724752299455218
> Min weight: -3.560570478164854
> Mean weight: -0.049325201841722795
> l1 norm: 116710.39522171849
> l2 norm: 402.2581552373957
> Non zero weights: 128084
> Non zero ratio: 0.12215042114257812
> Last 10 LBFGS Val and Grad Norms:
> {code:java}
> 18/02/27 17:14:45 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 8.00e-07) 0.000559057
> 18/02/27 17:14:50 INFO LBFGS: Val and Grad Norm: 0.430740 (rel: 3.94e-07) 0.000267527
> 18/02/27 17:14:54 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.62e-07) 0.000205888
> 18/02/27 17:14:59 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.36e-07) 0.000144173
> 18/02/27 17:15:04 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 7.74e-08) 0.000140296
> 18/02/27 17:15:09 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.52e-08) 0.000122709
> 18/02/27 17:15:13 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 1.78e-08) 3.08789e-05
> 18/02/27 17:15:18 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 2.66e-09) 2.23806e-05
> 18/02/27 17:15:23 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 4.31e-09) 1.47422e-05
> 18/02/27 17:15:28 INFO LBFGS: Val and Grad Norm: 0.430739 (rel: 9.17e-10) 2.37442e-05
> {code}
> *No standardization:*
> Training time: 7h 14 min.
> Iterations: 4992
> Intercept: -4.216690468849263
> Max weight: 0.41930559767624725
> Min weight: -0.5949182537565524
> Mean weight: -1.2659769019012E-6
> l1 norm: 14.262025330648694
> l2 norm: 1.2508777025612263
> Non zero weights: 128955
> Non zero ratio: 0.12298107147216797
> Last 10 LBFGS Val and Grad Norms:
> {code:java}
> 18/02/28 00:28:56 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 2.17e-07) 0.217581
> 18/02/28 00:29:01 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.88e-07) 0.185812
> 18/02/28 00:29:06 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.33e-07) 0.214570
> 18/02/28 00:29:11 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 8.62e-08) 0.489464
> 18/02/28 00:29:16 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 1.90e-07) 0.178448
> 18/02/28 00:29:21 INFO LBFGS: Val and Grad Norm: 0.559320 (rel: 7.91e-08) 0.172527
> 18/02/28 00:29:26 INFO LBFGS: Val and Grad Norm: 

[jira] [Commented] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-10-01 Thread Andrew Crosby (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16634268#comment-16634268
 ] 

Andrew Crosby commented on SPARK-25544:
---

SPARK-23537 contains what might be another occurrence of this issue. The model 
in that case contains only binary features, so standardization shouldn't really 
be used. However, turning standardization off causes the model to take 4992 
iterations to converge as opposed to 37 iterations when standardization is 
turned on.

> Slow/failed convergence in Spark ML models due to internal predictor scaling
> 
>
> Key: SPARK-25544
> URL: https://issues.apache.org/jira/browse/SPARK-25544
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.2
> Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11
>Reporter: Andrew Crosby
>Priority: Major
>
> The LinearRegression and LogisticRegression estimators in Spark ML can take a 
> large number of iterations to converge, or fail to converge altogether, when 
> trained using the l-bfgs method with standardization turned off.
> *Details:*
> LinearRegression and LogisticRegression standardize their input features by 
> default. In SPARK-8522 the option to disable standardization was added. This 
> is implemented internally by changing the effective strength of 
> regularization rather than disabling the feature scaling. Mathematically, 
> both changing the effective regularization strength and disabling feature 
> scaling should give the same solution, but they can have very different 
> convergence properties.
> The normal justification given for scaling features is that it ensures that 
> all covariances are O(1) and should improve numerical convergence, but this 
> argument does not account for the regularization term. This doesn't cause any 
> issues if standardization is set to true, since all features will have an 
> O(1) regularization strength. But it does cause issues when standardization 
> is set to false, since the effective regularization strength of feature i is 
> now O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. 
> This means that predictors with small standard deviations (which can occur 
> legitimately, e.g. via one hot encoding) will have very large effective 
> regularization strengths and consequently lead to very large gradients and 
> thus poor convergence in the solver.
> *Example code to recreate:*
> To demonstrate just how bad these convergence issues can be, here is a very 
> simple test case which builds a linear regression model with a categorical 
> feature, a numerical feature and their interaction. When fed the specified 
> training data, this model will fail to converge before it hits the maximum 
> iteration limit. In this case, it is the interaction between category "2" and 
> the numeric feature that leads to a feature with a small standard deviation.
> Training data:
> ||category||numericFeature||label||
> |1|1.0|0.5|
> |1|0.5|1.0|
> |2|0.01|2.0|
>  
> {code:java}
> val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0))
>   .toDF("category", "numericFeature", "label")
> val indexer = new StringIndexer()
>   .setInputCol("category")
>   .setOutputCol("categoryIndex")
> val encoder = new OneHotEncoder()
>   .setInputCol("categoryIndex")
>   .setOutputCol("categoryEncoded")
>   .setDropLast(false)
> val interaction = new Interaction()
>   .setInputCols(Array("categoryEncoded", "numericFeature"))
>   .setOutputCol("interaction")
> val assembler = new VectorAssembler()
>   .setInputCols(Array("categoryEncoded", "interaction"))
>   .setOutputCol("features")
> val model = new LinearRegression()
>   .setFeaturesCol("features")
>   .setLabelCol("label")
>   .setPredictionCol("prediction")
>   .setStandardization(false)
>   .setSolver("l-bfgs")
>   .setRegParam(1.0)
>   .setMaxIter(100)
> val pipeline = new Pipeline()
>   .setStages(Array(indexer, encoder, interaction, assembler, model))
> val pipelineModel = pipeline.fit(df)
> val numIterations = pipelineModel.stages(4)
>   .asInstanceOf[LinearRegressionModel].summary.totalIterations
> {code}
>  *Possible fix:*
> These convergence issues can be fixed by turning off feature scaling when 
> standardization is set to false rather than using an effective regularization 
> strength. This can be hacked into LinearRegression.scala by simply replacing 
> line 423
> {code:java}
> val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
> {code}
> with
> {code:java}
> val featuresStd = if ($(standardization)) {
>   featuresSummarizer.variance.toArray.map(math.sqrt)
> } else {
>   featuresSummarizer.variance.toArray.map(_ => 1.0)
> }
> {code}
> Rerunning the above test code with that hack in place leads to convergence 
> after just 4 iterations instead of hitting the max iterations limit!
> *Impact:*
> I 

[jira] [Created] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-09-26 Thread Andrew Crosby (JIRA)
Andrew Crosby created SPARK-25544:
-

 Summary: Slow/failed convergence in Spark ML models due to 
internal predictor scaling
 Key: SPARK-25544
 URL: https://issues.apache.org/jira/browse/SPARK-25544
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.2
 Environment: Databricks runtime 4.2: Spark 2.3.1, Scala 2.11
Reporter: Andrew Crosby


The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength and disabling feature scaling should give the 
same solution, but they can have very different convergence properties.

The normal justification given for scaling features is that it ensures that all 
covariances are O(1) and should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues if standardization is set to true, since all features will have an O(1) 
regularization strength. But it does cause issues when standardization is set 
to false, since the effective regularization strength of feature i is now 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This 
means that predictors with small standard deviations will have very large 
effective regularization strengths and consequently lead to very large 
gradients and thus poor convergence in the solver.
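
To put rough numbers on that (assumed standard deviations for illustration, not 
values computed by Spark):

{code:scala}
val regParam = 1.0
val featureStd = Array(1.0, 0.25, 0.005)  // hypothetical per-feature std devs
val effectiveStrength = featureStd.map(s => regParam / (s * s))
// => 1.0, 16.0, 40000.0: the low-variance feature dominates the gradient
{code}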

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0))
  .toDF("category", "numericFeature", "label")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryEncoded")
  .setDropLast(false)
val interaction = new Interaction()
  .setInputCols(Array("categoryEncoded", "numericFeature"))
  .setOutputCol("interaction")
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryEncoded", "interaction"))
  .setOutputCol("features")
val model = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setStandardization(false)
  .setSolver("l-bfgs")
  .setRegParam(1.0)
  .setMaxIter(100)
val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, interaction, assembler, model))

val pipelineModel = pipeline.fit(df)

val numIterations = pipelineModel.stages(4)
  .asInstanceOf[LinearRegressionModel].summary.totalIterations
{code}
 *Possible fix:*

These convergence issues can be fixed by turning off feature scaling when 
standardization is set to false rather than using an effective regularization 
strength. This can be hacked into LinearRegression.scala by simply replacing 
line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd = if ($(standardization)) {
  featuresSummarizer.variance.toArray.map(math.sqrt)
} else {
  featuresSummarizer.variance.toArray.map(_ => 1.0)
}
{code}
Rerunning the above test code with that hack in place leads to convergence 
after just 4 iterations instead of hitting the max iterations limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.

 






[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-09-26 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-25544:
--
Description: 
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength and disabling feature scaling should give the 
same solution, but they can have very different convergence properties.

The normal justification given for scaling features is that it ensures that all 
covariances are O(1) and should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues if standardization is set to true, since all features will have an O(1) 
regularization strength. But it does cause issues when standardization is set 
to false, since the effective regularization strength of feature i is now 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This 
means that predictors with small standard deviations will have very large 
effective regularization strengths and consequently lead to very large 
gradients and thus poor convergence in the solver.

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0))
  .toDF("category", "numericFeature", "label")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryEncoded")
  .setDropLast(false)
val interaction = new Interaction()
  .setInputCols(Array("categoryEncoded", "numericFeature"))
  .setOutputCol("interaction")
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryEncoded", "interaction"))
  .setOutputCol("features")
val model = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setStandardization(false)
  .setSolver("l-bfgs")
  .setRegParam(1.0)
  .setMaxIter(100)
val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, interaction, assembler, model))

val pipelineModel = pipeline.fit(df)

val numIterations = pipelineModel.stages(4)
  .asInstanceOf[LinearRegressionModel].summary.totalIterations
{code}
 *Possible fix:*

These convergence issues can be fixed by turning off feature scaling when 
standardization is set to false rather than using an effective regularization 
strength. This can be hacked into LinearRegression.scala by simply replacing 
line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd = if ($(standardization)) {
  featuresSummarizer.variance.toArray.map(math.sqrt)
} else {
  featuresSummarizer.variance.toArray.map(_ => 1.0)
}
{code}
Rerunning the above test code with that hack in place leads to convergence 
after just 4 iterations instead of hitting the max iterations limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.

 

  was:
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength and disabling feature scaling should give 
the same solution, but they can have very different convergence properties.

The normal 

[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-09-26 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-25544:
--
Description: 
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength and disabling feature scaling should give the 
same solution, but they can have very different convergence properties.

The normal justification given for scaling features is that it ensures that all 
covariances are O(1) and should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues if standardization is set to true, since all features will have an O(1) 
regularization strength. But it does cause issues when standardization is set 
to false, since the effective regularization strength of feature i is now 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This 
means that predictors with small standard deviations will have very large 
effective regularization strengths and consequently lead to very large 
gradients and thus poor convergence in the solver.

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit. In this case, it is the interaction between category "2" and 
the numeric feature that leads to a feature with a small standard deviation.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0))
  .toDF("category", "numericFeature", "label")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryEncoded")
  .setDropLast(false)
val interaction = new Interaction()
  .setInputCols(Array("categoryEncoded", "numericFeature"))
  .setOutputCol("interaction")
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryEncoded", "interaction"))
  .setOutputCol("features")
val model = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setStandardization(false)
  .setSolver("l-bfgs")
  .setRegParam(1.0)
  .setMaxIter(100)
val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, interaction, assembler, model))

val pipelineModel = pipeline.fit(df)

val numIterations = pipelineModel.stages(4)
  .asInstanceOf[LinearRegressionModel].summary.totalIterations
{code}
 *Possible fix:*

These convergence issues can be fixed by turning off feature scaling when 
standardization is set to false rather than using an effective regularization 
strength. This can be hacked into LinearRegression.scala by simply replacing 
line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd = if ($(standardization)) {
  featuresSummarizer.variance.toArray.map(math.sqrt)
} else {
  featuresSummarizer.variance.toArray.map(_ => 1.0)
}
{code}
Rerunning the above test code with that hack in place leads to convergence 
after just 4 iterations instead of hitting the max iterations limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.

 

  was:
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization 

[jira] [Updated] (SPARK-25544) Slow/failed convergence in Spark ML models due to internal predictor scaling

2018-09-26 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-25544:
--
Description: 
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. Mathematically, both changing the 
effective regularization strength and disabling feature scaling should give the 
same solution, but they can have very different convergence properties.

The normal justification given for scaling features is that it ensures that all 
covariances are O(1) and should improve numerical convergence, but this 
argument does not account for the regularization term. This doesn't cause any 
issues if standardization is set to true, since all features will have an O(1) 
regularization strength. But it does cause issues when standardization is set 
to false, since the effective regularization strength of feature i is now 
O(1/sigma_i^2), where sigma_i is the standard deviation of the feature. This 
means that predictors with small standard deviations (which can occur 
legitimately, e.g. via one hot encoding) will have very large effective 
regularization strengths and consequently lead to very large gradients and thus 
poor convergence in the solver.

*Example code to recreate:*

To demonstrate just how bad these convergence issues can be, here is a very 
simple test case which builds a linear regression model with a categorical 
feature, a numerical feature and their interaction. When fed the specified 
training data, this model will fail to converge before it hits the maximum 
iteration limit. In this case, it is the interaction between category "2" and 
the numeric feature that leads to a feature with a small standard deviation.

Training data:
||category||numericFeature||label||
|1|1.0|0.5|
|1|0.5|1.0|
|2|0.01|2.0|

 
{code:java}
val df = Seq(("1", 1.0, 0.5), ("1", 0.5, 1.0), ("2", 1e-2, 2.0))
  .toDF("category", "numericFeature", "label")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
val encoder = new OneHotEncoder()
  .setInputCol("categoryIndex")
  .setOutputCol("categoryEncoded")
  .setDropLast(false)
val interaction = new Interaction()
  .setInputCols(Array("categoryEncoded", "numericFeature"))
  .setOutputCol("interaction")
val assembler = new VectorAssembler()
  .setInputCols(Array("categoryEncoded", "interaction"))
  .setOutputCol("features")
val model = new LinearRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setStandardization(false)
  .setSolver("l-bfgs")
  .setRegParam(1.0)
  .setMaxIter(100)
val pipeline = new Pipeline()
  .setStages(Array(indexer, encoder, interaction, assembler, model))

val pipelineModel = pipeline.fit(df)

val numIterations = pipelineModel.stages(4)
  .asInstanceOf[LinearRegressionModel].summary.totalIterations
{code}
 *Possible fix:*

These convergence issues can be fixed by turning off feature scaling when 
standardization is set to false rather than using an effective regularization 
strength. This can be hacked into LinearRegression.scala by simply replacing 
line 423
{code:java}
val featuresStd = featuresSummarizer.variance.toArray.map(math.sqrt)
{code}
with
{code:java}
val featuresStd = if ($(standardization)) {
  featuresSummarizer.variance.toArray.map(math.sqrt)
} else {
  featuresSummarizer.variance.toArray.map(_ => 1.0)
}
{code}
Rerunning the above test code with that hack in place leads to convergence 
after just 4 iterations instead of hitting the max iterations limit!

*Impact:*

I can't speak for other people, but I've personally encountered these 
convergence issues several times when building production scale Spark ML 
models, and have resorted to writing my own implementation of LinearRegression 
with the above hack in place. The issue is made worse by the fact that Spark 
does not raise an error when the maximum number of iterations is hit, so the 
first time you encounter the issue it can take a while to figure out what is 
going on.

 

  was:
The LinearRegression and LogisticRegression estimators in Spark ML can take a 
large number of iterations to converge, or fail to converge altogether, when 
trained using the l-bfgs method with standardization turned off.

*Details:*

LinearRegression and LogisticRegression standardize their input features by 
default. In SPARK-8522 the option to disable standardization was added. This is 
implemented internally by changing the effective strength of regularization 
rather than disabling the feature scaling. 

[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-22 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-26970:
--
Description: 
The Interaction transformer 
([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) 
is missing from the set of pyspark feature transformers 
([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]).

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}

  was:
The Interaction transformer 
([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) 
is missing from the set of pyspark feature transformers 
([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]).

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}


> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Major
>
> The Interaction transformer 
> ([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) 
> is missing from the set of pyspark feature transformers 
> ([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]).
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}






[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-22 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-26970:
--
Description: 
The Interaction transformer 
([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) 
is missing from the set of pyspark feature transformers 
([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]).

 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}

  was:
The Interaction transformer 
([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) 
is missing from the set of pyspark feature transformers 
([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]).

 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}


> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Major
>
> The Interaction transformer 
> ([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) 
> is missing from the set of pyspark feature transformers 
> ([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]).
>  
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}






[jira] [Created] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-22 Thread Andrew Crosby (JIRA)
Andrew Crosby created SPARK-26970:
-

 Summary: Can't load PipelineModel that was created in Scala with 
Python due to missing Interaction transformer
 Key: SPARK-26970
 URL: https://issues.apache.org/jira/browse/SPARK-26970
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Andrew Crosby


The Interaction transformer 
([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) 
is missing from the set of pyspark feature transformers 
([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]).

 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}






[jira] [Updated] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-02-22 Thread Andrew Crosby (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Crosby updated SPARK-26970:
--
Description: 
The Interaction transformer 
([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) 
is missing from the set of pyspark feature transformers 
([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]).

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}

  was:
The Interaction transformer 
([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) 
is missing from the set of pyspark feature transformers 
([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]).

 

This means that it is impossible to create a model that includes an Interaction 
transformer with pyspark. It also means that attempting to load a PipelineModel 
created in Scala that includes an Interaction transformer with pyspark fails 
with the following error:
{code:java}
AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
{code}


> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Major
>
> The Interaction transformer 
> ([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) 
> is missing from the set of pyspark feature transformers 
> ([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]).
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}






[jira] [Created] (SPARK-28062) HuberAggregator copies coefficients vector every time an instance is added

2019-06-15 Thread Andrew Crosby (JIRA)
Andrew Crosby created SPARK-28062:
-

 Summary: HuberAggregator copies coefficients vector every time an 
instance is added
 Key: SPARK-28062
 URL: https://issues.apache.org/jira/browse/SPARK-28062
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 3.0.0
Reporter: Andrew Crosby


Every time an instance is added to the HuberAggregator, a copy of the 
coefficients vector is created (see code snippet below). This causes a 
performance degradation, which is particularly severe when the instances have 
long sparse feature vectors.

{code:scala}
def add(instance: Instance): HuberAggregator = {
  instance match { case Instance(label, weight, features) =>
    require(numFeatures == features.size, s"Dimensions mismatch when adding new sample." +
      s" Expecting $numFeatures but got ${features.size}.")
    require(weight >= 0.0, s"instance weight, $weight has to be >= 0.0")

    if (weight == 0.0) return this
    val localFeaturesStd = bcFeaturesStd.value
    // A fresh copy of the coefficients vector is made on every call:
    val localCoefficients = bcParameters.value.toArray.slice(0, numFeatures)
    val localGradientSumArray = gradientSumArray

    // Snip

  }
}
{code}

The LeastSquaresAggregator class avoids this performance issue via the use of 
transient lazy class variables to store such reused values. Applying a similar 
approach to HuberAggregator gives a significant speed boost. Running the script 
below locally on my machine gives the following timing results:

{noformat}
Current implementation: 
Time(s): 540.1439919471741
Iterations: 26
Intercept: 0.518109382890512
Coefficients: [0.0, -0.2516936902000245, 0.0, 0.0, -0.19633887469839809, 
0.0, -0.39565545053893925, 0.0, -0.18617574426698882, 0.0478922416670529]

Modified implementation to match LeastSquaresAggregator:
Time(s): 46.82946586608887
Iterations: 26
Intercept: 0.5181093828893774
Coefficients: [0.0, -0.25169369020031357, 0.0, 0.0, -0.1963388746927919, 
0.0, -0.3956554505389966, 0.0, -0.18617574426702874, 0.04789224166878518]
{noformat}




{code:python}
from random import random, randint, seed
import time

from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.regression import LinearRegression
from pyspark.sql import SparkSession

seed(0)

spark = SparkSession.builder.appName('huber-speed-test').getOrCreate()
df = spark.createDataFrame(
    [[randint(0, 10), random()] for i in range(10)],
    ["category", "target"])
ohe = OneHotEncoder(
    inputCols=["category"],
    outputCols=["encoded_category"]).fit(df)
lr = LinearRegression(
    featuresCol="encoded_category",
    labelCol="target",
    loss="huber",
    regParam=1.0)

start = time.time()
model = lr.fit(ohe.transform(df))
end = time.time()

print("Time(s): " + str(end - start))
print("Iterations: " + str(model.summary.totalIterations))
print("Intercept: " + str(model.intercept))
print("Coefficients: " + str(list(model.coefficients)[0:10]))
{code}
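
For reference, a minimal sketch of the transient-lazy-val approach described 
above (class shape and names are my own simplification, not the actual Spark 
source):

{code:scala}
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.Vector

// The expensive bcParameters.value.toArray.slice(...) now runs once per
// deserialized aggregator instead of once per add() call.
class HuberAggregatorSketch(
    numFeatures: Int,
    bcParameters: Broadcast[Vector]) extends Serializable {

  @transient private lazy val localCoefficients: Array[Double] =
    bcParameters.value.toArray.slice(0, numFeatures)

  def add(features: Vector): this.type = {
    val coefs = localCoefficients  // no per-instance copy
    // ... accumulate the gradient contribution using coefs (omitted)
    this
  }
}
{code}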








[jira] [Commented] (SPARK-26970) Can't load PipelineModel that was created in Scala with Python due to missing Interaction transformer

2019-04-20 Thread Andrew Crosby (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16822548#comment-16822548
 ] 

Andrew Crosby commented on SPARK-26970:
---

The code changes required for this looked relatively straightforward, so I've 
had a go at creating a pull request myself 
(https://github.com/apache/spark/pull/24426).

[~huaxingao] apologies if I've duplicated work that you've already done.

> Can't load PipelineModel that was created in Scala with Python due to missing 
> Interaction transformer
> -
>
> Key: SPARK-26970
> URL: https://issues.apache.org/jira/browse/SPARK-26970
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Andrew Crosby
>Priority: Minor
>
> The Interaction transformer 
> ([https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/Interaction.scala]) 
> is missing from the set of pyspark feature transformers 
> ([https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py]).
> This means that it is impossible to create a model that includes an 
> Interaction transformer with pyspark. It also means that attempting to load a 
> PipelineModel created in Scala that includes an Interaction transformer with 
> pyspark fails with the following error:
> {code:java}
> AttributeError: module 'pyspark.ml.feature' has no attribute 'Interaction'
> {code}


