Hi Stuti,

The features should be standardized before training the model. Currently
AFTSurvivalRegression does not support standardization. Here is a
workaround for this issue; I will send a PR to fix it soon.

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.regression.AFTSurvivalRegression
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val ovarian = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true") // Use first line of all files as header
  .option("inferSchema", "true") // Automatically infer data types
  .load("......")
  .toDF("label", "censor", "age", "resid_ds", "rx", "ecog_ps")

// Assemble the four covariates into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
  .setOutputCol("features")

// Cast censor and label to double.
val ovarian2 = assembler.transform(ovarian)
  .select(col("censor").cast(DoubleType),
    col("label").cast(DoubleType), col("features"))

// Scale each feature by its standard deviation (StandardScaler's default).
val standardScaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("standardized_features")
val ssModel = standardScaler.fit(ovarian2)
val ovarian3 = ssModel.transform(ovarian2)

// Train on the standardized features.
val aft = new AFTSurvivalRegression()
  .setFeaturesCol("standardized_features")
val model = aft.fit(ovarian3)

// Convert the coefficients back to the original feature scale.
val newCoefficients =
  model.coefficients.toArray.zip(ssModel.std.toArray).map { x =>
    x._1 / x._2
  }

println(newCoefficients.toSeq.mkString(","))
println(model.intercept)
println(model.scale)

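
To see why dividing by ssModel.std is the right conversion, here is a
minimal plain-Scala sketch (no Spark needed). The object CoefficientRescale,
its helpers rescale and dot, and the numbers for std and the coefficients are
made up for illustration; only the feature row comes from your sample data.
It assumes StandardScaler's default settings (withStd = true, withMean =
false), so features are only divided by their standard deviation and the
intercept needs no adjustment. The zero-std guard is my own assumption about
how you would want to treat a constant feature.

```scala
object CoefficientRescale {
  // Map coefficients fitted on x / std back to the original feature scale.
  // Assumption: a feature with std == 0 is constant, so its coefficient is
  // set to 0.0 here instead of dividing by zero.
  def rescale(scaledCoef: Array[Double], std: Array[Double]): Array[Double] = {
    require(scaledCoef.length == std.length, "coefficient and std lengths differ")
    scaledCoef.zip(std).map { case (c, s) => if (s == 0.0) 0.0 else c / s }
  }

  // Plain dot product of two equally sized arrays.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def main(args: Array[String]): Unit = {
    // Hypothetical values, for illustration only.
    val std = Array(10.0, 0.5, 0.5, 0.5)
    val scaledCoef = Array(-0.8, 0.3, -0.2, 0.1)
    // First feature row from the sample data in this thread.
    val x = Array(72.3315, 2.0, 1.0, 1.0)

    // What the scaler does with its defaults: divide each feature by its std.
    val xScaled = x.zip(std).map { case (v, s) => v / s }
    val origCoef = rescale(scaledCoef, std)

    // Both parameterizations give the same linear predictor.
    val onScaled = dot(scaledCoef, xScaled)
    val onRaw = dot(origCoef, x)
    assert(math.abs(onScaled - onRaw) < 1e-9)
    println(s"linear predictor: scaled = $onScaled, raw = $onRaw")
  }
}
```

Since (c / s) * x equals c * (x / s) for every feature, the rescaled
coefficients applied to the raw features reproduce the predictions of the
model that was fitted on the standardized features.
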
Yanbo
2016-02-15 16:07 GMT+08:00 Yanbo Liang <[email protected]>:
> Hi Stuti,
>
> This is a bug in AFTSurvivalRegression: we did not handle "lossSum ==
> infinity" properly.
> I have opened https://issues.apache.org/jira/browse/SPARK-13322 to track
> this issue and will send a PR.
> Thanks for reporting this issue.
>
> Yanbo
>
> 2016-02-12 15:03 GMT+08:00 Stuti Awasthi <[email protected]>:
>
>> Hi All,
>>
>> I wanted to try survival analysis on Spark 1.6. I am able to run the
>> provided AFT example successfully. I then tried to train the model on the
>> Ovarian data, a standard data set that comes with the survival library in R.
>>
>> Default column names: Futime, fustat, age, resid_ds, rx, ecog_ps
>>
>> Here are the steps I have done:
>>
>> · Loaded the data from CSV into a DataFrame with the columns labeled as:
>>
>>     val ovarian_data = sqlContext.read
>>       .format("com.databricks.spark.csv")
>>       .option("header", "true") // Use first line of all files as header
>>       .option("inferSchema", "true") // Automatically infer data types
>>       .load("Ovarian.csv")
>>       .toDF("label", "censor", "age", "resid_ds", "rx", "ecog_ps")
>>
>> · Used VectorAssembler() to create features from "age", "resid_ds", "rx",
>>   "ecog_ps":
>>
>>     val assembler = new VectorAssembler()
>>       .setInputCols(Array("age", "resid_ds", "rx", "ecog_ps"))
>>       .setOutputCol("features")
>>
>> · Then I created a new DataFrame with only 3 columns:
>>
>>     val training = finalDf.select("label", "censor", "features")
>>
>> · Finally I passed it to AFT:
>>
>>     val model = aft.fit(training)
>>
>> I am getting this error:
>>
>> java.lang.AssertionError: assertion failed: AFTAggregator loss sum is infinity. Error for unknown reason.
>>   at scala.Predef$.assert(Predef.scala:179)
>>   at org.apache.spark.ml.regression.AFTAggregator.add(AFTSurvivalRegression.scala:480)
>>   at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:522)
>>   at org.apache.spark.ml.regression.AFTCostFun$$anonfun$5.apply(AFTSurvivalRegression.scala:521)
>>   at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>>   at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>
>> I have tried to print the schema:
>>
>> root
>>  |-- label: double (nullable = true)
>>  |-- censor: double (nullable = true)
>>  |-- features: vector (nullable = true)
>>
>> Sample rows from the training data look like:
>>
>> [59.0,1.0,[72.3315,2.0,1.0,1.0]]
>> [115.0,1.0,[74.4932,2.0,1.0,1.0]]
>> [156.0,1.0,[66.4658,2.0,1.0,2.0]]
>> [421.0,0.0,[53.3644,2.0,2.0,1.0]]
>> [431.0,1.0,[50.3397,2.0,1.0,1.0]]
>>
>> I am not able to understand the error: if I use the same data and create
>> the dense vectors as in the AFT sample example, the code works completely
>> fine. But I would like to read the data from a CSV file and then proceed.
>>
>> Please suggest.
>>
>> Thanks & Regards
>>
>> Stuti Awasthi
>>
>
>