[jira] [Commented] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm
[ https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157027#comment-16157027 ]

Yanbo Liang commented on SPARK-21919:
-------------------------------------

[~srowen] You are right, that is caused by the line search bug. The error log in 2.2.0 tells us what happened. Thanks for digging into it.

> inconsistent behavior of AFTsurvivalRegression algorithm
> --------------------------------------------------------
>
>                 Key: SPARK-21919
>                 URL: https://issues.apache.org/jira/browse/SPARK-21919
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, PySpark
>    Affects Versions: 2.2.0
>         Environment: Spark Version: 2.2.0
>                      Cluster setup: Standalone single node
>                      Python version: 3.5.2
>            Reporter: Ashish Chopra
>
> Took the direct example from the Spark ML documentation:
> {code}
> from pyspark.ml.linalg import Vectors
> from pyspark.ml.regression import AFTSurvivalRegression
>
> training = spark.createDataFrame([
>     (1.218, 1.0, Vectors.dense(1.560, -0.605)),
>     (2.949, 0.0, Vectors.dense(0.346, 2.158)),
>     (3.627, 0.0, Vectors.dense(1.380, 0.231)),
>     (0.273, 1.0, Vectors.dense(0.520, 1.151)),
>     (4.199, 0.0, Vectors.dense(0.795, -0.226))],
>     ["label", "censor", "features"])
> quantileProbabilities = [0.3, 0.6]
> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
>                             quantilesCol="quantiles")
> model = aft.fit(training)
>
> # Print the coefficients, intercept and scale parameter for AFT survival regression
> print("Coefficients: " + str(model.coefficients))
> print("Intercept: " + str(model.intercept))
> print("Scale: " + str(model.scale))
> model.transform(training).show(truncate=False)
> {code}
> The result is:
> {code}
> Coefficients: [-0.496304411053,0.198452172529]
> Intercept: 2.6380898963056327
> Scale: 1.5472363533632303
> {code}
> ||label||censor||features||prediction||quantiles||
> |1.218|1.0|[1.56,-0.605]|5.718985621018951|[1.160322990805951,4.99546058340675]|
> |2.949|0.0|[0.346,2.158]|18.07678210850554|[3.66759199449632,15.789837303662042]|
> |3.627|0.0|[1.38,0.231]|7.381908879359964|[1.4977129086101573,6.4480027195054905]|
> |0.273|1.0|[0.52,1.151]|13.577717814884505|[2.754778414791513,11.859962351993202]|
> |4.199|0.0|[0.795,-0.226]|9.013087597344805|[1.828662187733188,7.8728164067854856]|
>
> But if we add 20 to every label value:
> {code}
> training = spark.createDataFrame([
>     (21.218, 1.0, Vectors.dense(1.560, -0.605)),
>     (22.949, 0.0, Vectors.dense(0.346, 2.158)),
>     (23.627, 0.0, Vectors.dense(1.380, 0.231)),
>     (20.273, 1.0, Vectors.dense(0.520, 1.151)),
>     (24.199, 0.0, Vectors.dense(0.795, -0.226))],
>     ["label", "censor", "features"])
> quantileProbabilities = [0.3, 0.6]
> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
>                             quantilesCol="quantiles")
> model = aft.fit(training)
>
> print("Coefficients: " + str(model.coefficients))
> print("Intercept: " + str(model.intercept))
> print("Scale: " + str(model.scale))
> model.transform(training).show(truncate=False)
> {code}
> the result changes to:
> {code}
> Coefficients: [23.9932020748,3.18105314757]
> Intercept: 7.35052273751137
> Scale: 7698609960.724161
> {code}
> ||label||censor||features||prediction||quantiles||
> |21.218|1.0|[1.56,-0.605]|4.0912442688237169E18|[0.0,0.0]|
> |22.949|0.0|[0.346,2.158]|6.011158613411288E9|[0.0,0.0]|
> |23.627|0.0|[1.38,0.231]|7.7835948690311181E17|[0.0,0.0]|
> |20.273|1.0|[0.52,1.151]|1.5880852723124176E10|[0.0,0.0]|
> |24.199|0.0|[0.795,-0.226]|1.4590190884193677E11|[0.0,0.0]|
>
> Can someone please explain this exponential blow-up in the predictions? As I understand it, the prediction in AFT is the predicted time at which the failure event will occur, so I don't see why it should change exponentially with the value of the label.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
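As to the question quoted above: the exponential relationship is expected once you note that Spark's AFT prediction is exp(coefficients · features + intercept), i.e. log-linear in the features. When the optimizer diverges and returns huge coefficients, the exponent explodes. A minimal sketch in plain Python (the `aft_predict` helper is illustrative, not a Spark API; the coefficient values are the ones printed above) reproduces both the sane and the blown-up predictions for the first training row:

```python
import math

def aft_predict(coefficients, intercept, features):
    # AFT survival regression prediction: exp(coefficients . features + intercept)
    return math.exp(sum(w * x for w, x in zip(coefficients, features)) + intercept)

# Original dataset, well-behaved fit, first row [1.56, -0.605]:
ok = aft_predict([-0.496304411053, 0.198452172529], 2.6380898963056327, [1.56, -0.605])
print(ok)   # ~5.7190, matching the prediction in the first table

# Shifted dataset, diverged fit, same row: huge coefficients blow up the exponent
bad = aft_predict([23.9932020748, 3.18105314757], 7.35052273751137, [1.56, -0.605])
print(bad)  # ~4.09e18, matching the blown-up prediction in the second table
```

So the predictions do not scale with the label directly; they scale exponentially with whatever coefficients the (here, failed) optimization produces.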
[jira] [Commented] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm
[ https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157013#comment-16157013 ]

Sean Owen commented on SPARK-21919:
-----------------------------------

Hm, yeah, I suppose I should have tried it too. On {{master}}, and in Scala, I get:

{code}
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.regression._

val training = spark.createDataFrame(Seq(
  (21.218, 1.0, Vectors.dense(1.560, -0.605)),
  (22.949, 0.0, Vectors.dense(0.346, 2.158)),
  (23.627, 0.0, Vectors.dense(1.380, 0.231)),
  (20.273, 1.0, Vectors.dense(0.520, 1.151)),
  (24.199, 0.0, Vectors.dense(0.795, -0.226))
)).toDF("label", "censor", "features")

val aft = new AFTSurvivalRegression().
  setQuantileProbabilities(Array(0.3, 0.6)).
  setQuantilesCol("quantiles")
val model = aft.fit(training)

println(s"Coefficients: ${model.coefficients}")
println(s"Intercept: ${model.intercept}")
println(s"Scale: ${model.scale}")
model.transform(training).show(truncate=false)
{code}

{code}
17/09/07 15:30:14 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.5
17/09/07 15:30:14 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.25
17/09/07 15:30:14 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.5
17/09/07 15:30:14 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.25
17/09/07 15:30:14 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.125
...
+------+------+--------------+------------------+---------------------------------------+
|label |censor|features      |prediction        |quantiles                              |
+------+------+--------------+------------------+---------------------------------------+
|21.218|1.0   |[1.56,-0.605] |24.20972861807431 |[21.617443110471118,23.97833624826161] |
|22.949|0.0   |[0.346,2.158] |26.461225875981285|[23.627858619625105,26.208314087493857]|
|23.627|0.0   |[1.38,0.231]  |24.565240805031497|[21.934888406858644,24.330450511651165]|
|20.273|1.0   |[0.52,1.151]  |26.074003958175602|[23.28209894956245,25.82479316934075]  |
|24.199|0.0   |[0.795,-0.226]|25.491396901107077|[22.761875236582238,25.247754569057985]|
+------+------+--------------+------------------+---------------------------------------+
{code}

But in 2.2.0, I get:

{code}
ERROR optimize.LBFGS: Failure! Resetting history: breeze.optimize.FirstOrderException: Line search failed
17/09/07 14:32:35 ERROR optimize.LBFGS: Failure again! Giving up and returning. Maybe the objective is just poorly behaved?
...
+------+------+--------------+---------------------+---------+
|label |censor|features      |prediction           |quantiles|
+------+------+--------------+---------------------+---------+
|21.218|1.0   |[1.56,-0.605] |4.091244268823746E18 |[0.0,0.0]|
|22.949|0.0   |[0.346,2.158] |6.011158613411288E9  |[0.0,0.0]|
|23.627|0.0   |[1.38,0.231]  |7.7835948690311731E17|[0.0,0.0]|
|20.273|1.0   |[0.52,1.151]  |1.5880852723124233E10|[0.0,0.0]|
|24.199|0.0   |[0.795,-0.226]|1.459019088419373E11 |[0.0,0.0]|
+------+------+--------------+---------------------+---------+
{code}

So I'm almost sure this is just another symptom of the Breeze / strong Wolfe line search bug: https://issues.apache.org/jira/browse/SPARK-21523
[jira] [Commented] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm
[ https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156979#comment-16156979 ]

Yanbo Liang commented on SPARK-21919:
-------------------------------------

[~ashishchopra0308] [~srowen] I can't reproduce this issue; I get the correct result, which is consistent with R {{survreg}}.

{code}
>>> from pyspark.ml.regression import AFTSurvivalRegression
>>> from pyspark.ml.linalg import Vectors
>>> training = spark.createDataFrame([
...     (21.218, 1.0, Vectors.dense(1.560, -0.605)),
...     (22.949, 0.0, Vectors.dense(0.346, 2.158)),
...     (23.627, 0.0, Vectors.dense(1.380, 0.231)),
...     (20.273, 1.0, Vectors.dense(0.520, 1.151)),
...     (24.199, 0.0, Vectors.dense(0.795, -0.226))],
...     ["label", "censor", "features"])
>>> quantileProbabilities = [0.3, 0.6]
>>> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
...                             quantilesCol="quantiles")
>>> model = aft.fit(training)
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.5
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.25
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.5
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.25
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in function evaluation. Decreasing step size to 0.125
>>> print("Coefficients: " + str(model.coefficients))
Coefficients: [-0.065814695216,0.00326705958509]
>>> print("Intercept: " + str(model.intercept))
Intercept: 3.29140205698
>>> print("Scale: " + str(model.scale))
Scale: 0.109856123692
>>> model.transform(training).show(truncate=False)
17/09/07 21:55:05 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
17/09/07 21:55:05 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
+------+------+--------------+------------------+---------------------------------------+
|label |censor|features      |prediction        |quantiles                              |
+------+------+--------------+------------------+---------------------------------------+
|21.218|1.0   |[1.56,-0.605] |24.20972861807431 |[21.617443110471118,23.97833624826161] |
|22.949|0.0   |[0.346,2.158] |26.461225875981285|[23.627858619625105,26.208314087493857]|
|23.627|0.0   |[1.38,0.231]  |24.565240805031497|[21.934888406858644,24.330450511651165]|
|20.273|1.0   |[0.52,1.151]  |26.074003958175602|[23.28209894956245,25.82479316934075]  |
|24.199|0.0   |[0.795,-0.226]|25.491396901107077|[22.761875236582238,25.247754569057985]|
+------+------+--------------+------------------+---------------------------------------+
{code}
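The healthy fit above can be checked by hand: Spark's AFT prediction is exp(coefficients · features + intercept), and the quantiles follow the standard Weibull AFT quantile formula, prediction × (−ln(1−p))^scale, which (as far as I can tell) is what predictQuantiles computes, since it reproduces the table above. A minimal sketch in plain Python, with illustrative `predict`/`quantile` helpers that are not Spark APIs:

```python
import math

# Values printed by the successful PySpark fit above
coefficients = [-0.065814695216, 0.00326705958509]
intercept = 3.29140205698
scale = 0.109856123692

def predict(features):
    # AFT prediction: exp(coefficients . features + intercept)
    return math.exp(sum(w * x for w, x in zip(coefficients, features)) + intercept)

def quantile(features, p):
    # Weibull AFT quantile: prediction * (-ln(1 - p))**scale
    return predict(features) * (-math.log(1.0 - p)) ** scale

row = [1.56, -0.605]  # first training row
pred = predict(row)
q30, q60 = quantile(row, 0.3), quantile(row, 0.6)
print(pred, q30, q60)  # ~24.2097, ~21.6174, ~23.9783 -- matching the first table row
```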
[jira] [Commented] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm
[ https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156579#comment-16156579 ]

Yanbo Liang commented on SPARK-21919:
-------------------------------------

[~srowen] I will take a look at it. Thanks.
[jira] [Commented] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm
[ https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153367#comment-16153367 ]

Sean Owen commented on SPARK-21919:
-----------------------------------

It does look like a problem. From R's survreg I get:

{code}
survreg(formula = Surv(data$label, data$censor) ~ data$feature1 +
    data$feature2, dist = "weibull")
                 Value Std. Error       z        p
(Intercept)    3.29140      0.295 11.1737 5.49e-29
data$feature1 -0.06581      0.245 -0.2688 7.88e-01
data$feature2  0.00327      0.123  0.0265 9.79e-01
Log(scale)    -2.20858      0.642 -3.4390 5.84e-04

Scale= 0.11
{code}

[~yanboliang] I think you originally created this; does it ring any bells?
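The survreg output above agrees with the PySpark result reported elsewhere in this thread; one detail worth noting is that survreg prints Log(scale), whereas Spark prints scale itself. A quick check in plain Python (Spark's printed values are taken from the successful PySpark run; survreg's are rounded to the precision it prints):

```python
import math

# Spark's fitted parameters (from the successful PySpark run in this thread)
spark_scale = 0.109856123692
spark_coefficients = [-0.065814695216, 0.00326705958509]
spark_intercept = 3.29140205698

# survreg reports Log(scale) = -2.20858; log of Spark's scale should agree
print(math.log(spark_scale))   # ~-2.2086, matching survreg's Log(scale)

# Coefficients and intercept also line up with survreg's Value column
# ((Intercept) 3.29140, data$feature1 -0.06581, data$feature2 0.00327)
print(spark_intercept, spark_coefficients)
```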