Hello Andrew, a few years ago I had the same need, and I found this SO answer <https://stackoverflow.com/a/36306784/898154> to be the way to go.
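Since the question uses PySpark, here is a minimal Python sketch of the same idea: read the ML attribute metadata attached to the features column and turn it into a list of names ordered by vector index. It assumes that the column metadata is available as a plain dict with an "ml_attr" -> "attrs" entry, where each group is a list of {"idx", "name"} records. The function name and the sample dict are hypothetical and only for illustration; I have not run this against a real RFormula output.

```python
from itertools import chain


def feature_names_from_metadata(metadata):
    """Return feature names ordered by their index in the feature vector."""
    attrs = metadata["ml_attr"]["attrs"]
    # "attrs" groups entries by attribute type (e.g. "numeric", "binary");
    # flatten the groups and key every entry by its vector index.
    lookup = {a["idx"]: a["name"] for a in chain.from_iterable(attrs.values())}
    return [lookup[i] for i in sorted(lookup)]


# Hand-written stand-in for the metadata dict; the names are hypothetical.
sample_metadata = {
    "ml_attr": {
        "attrs": {
            "numeric": [{"idx": 0, "name": "log_treatment"}],
            "binary": [
                {"idx": 1, "name": "hour_of_day_a"},
                {"idx": 2, "name": "hour_of_day_b"},
            ],
        },
        "num_attrs": 3,
    }
}

print(feature_names_from_metadata(sample_metadata))
```

With a real DataFrame, the dict would presumably come from something like rform_regression_input.schema["features"].metadata, and the resulting names could then be paired with lr_model.coefficients. The intercept has no entry in the metadata, so it would need to be labelled separately.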
Here is an extract of my (Scala) code (which was doing other things on top). I have removed the irrelevant parts without testing the result, so it might not work out of the box; nonetheless, it should help you get started:

    import org.apache.spark.sql.DataFrame

    // Maps each index of the feature vector to its attribute name, as
    // recorded in the ML attribute metadata of the features column.
    private def getEncodedVectorLookupTable(df: DataFrame, featuresColName: String): Map[Long, String] = {
      val meta = df.select(featuresColName)
        .schema.fields.head.metadata
        .getMetadata("ml_attr")
        .getMetadata("attrs")

      /* REFLECTION START */
      // Read the key set from the private "map" field via reflection.
      val field = meta.getClass.getDeclaredField("map")
      field.setAccessible(true)
      val keys = field.get(meta).asInstanceOf[Map[String, Any]].keySet
      field.setAccessible(false)
      /* REFLECTION END */

      // Each key holds an array of entries carrying "idx" and "name".
      keys.flatMap(
        meta.getMetadataArray(_)
          .map(m => m.getLong("idx") -> m.getString("name"))
      ).toMap
    }

It looks like there is some support now for achieving this, but I have never tried it:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/r/RWrapperUtils.html

Best regards,
Alessandro

On Mon, 28 Oct 2019 at 21:01, Andrew Redd <andrewwr...@gmail.com> wrote:
>
> Hi All!
>
> I'm performing an econometric analysis over several billion rows of data
> and would like to use the Pyspark SparkML implementation of linear
> regression. In the example below I'm trying to interact hour of day and
> month of year indicators. The StringIndexer documentation tells you what
> it's doing when it's one hot encoding string/factor columns (i.e. taking
> out the most/least common value or first/last when sorted alphabetically)
> but doesn't allow you to recover your coefficient names. This feels like
> such a general case that I must be missing something. How can I get my
> column names back post regression to map to coefficient values? Do I need
> to basically rebuild the RFormula logic in if this isn't already
> implemented? Would be happy to use a different Spark language (Scala/Java
> etc.) if implemented there.
>
> Thanks in advance
>
> Andrew
>
> rform = RFormula(formula="log_outcome ~ log_treatment + hour_of_day + month_of_year + hour_of_day:month_of_year + additional_column",
>                  featuresCol="features",
>                  labelCol="label")
>
> rform_regression_input = rform.fit(regression_input).transform(regression_input)
>
> lr = LinearRegression(featuresCol='features',
>                       labelCol='label',
>                       solver='normal')
>
> lr_model = lr.fit(rform_regression_input)
> coefs = [*lr_model.coefficients, lr_model.intercept]
>
> return pd.DataFrame(
>     {"pvalues": lr_model.summary.pValues,
>      "tvalues": lr_model.summary.tValues,
>      "std_errs": lr_model.summary.coefficientStandardErrors,
>      "coefs": coefs}
> )