Hello Andrew,
A few years ago I had the same need, and I found this Stack Overflow answer
<https://stackoverflow.com/a/36306784/898154> to be the way to go.

Here is an extract of my (Scala) code, which was doing other things on top. I
have removed the irrelevant parts without testing the result, so it might not
work out of the box, but it should help you get started:

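    import org.apache.spark.sql.DataFrame

    // Builds a lookup table from vector index to feature name, read from
    // the ML attribute metadata attached to the features column.
    private def getEncodedVectorLookupTable(df: DataFrame,
                                            featuresColName: String): Map[Long, String] = {
      val meta = df.select(featuresColName)
        .schema.fields.head.metadata
        .getMetadata("ml_attr")
        .getMetadata("attrs")

      /* REFLECTION START */
      // Read the key set from Metadata's private "map" field.
      val field = meta.getClass.getDeclaredField("map")
      field.setAccessible(true)
      val keys = field.get(meta).asInstanceOf[Map[String, Any]].keySet
      field.setAccessible(false)
      /* REFLECTION END */

      // Each key (attribute group) holds an array of {idx, name} entries.
      keys.flatMap(
        meta.getMetadataArray(_)
          .map(m => m.getLong("idx") -> m.getString("name"))
      ).toMap
    }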
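
Since you are on PySpark, the same idea can be sketched in plain Python. As far as I know, PySpark exposes column metadata as an ordinary dict (something like `df.schema["features"].metadata["ml_attr"]`), so no reflection should be needed there. The helper names below (`feature_names_by_index`, `label_coefficients`) and the sample metadata are made up for illustration, and I have not run this against a real model:

```python
from typing import Any, Dict, List, Mapping, Sequence, Tuple


def feature_names_by_index(ml_attr: Mapping[str, Any]) -> Dict[int, str]:
    """Flatten every attribute group under "attrs" into {vector index: name}."""
    lookup: Dict[int, str] = {}
    for group in ml_attr.get("attrs", {}).values():
        for attr in group:
            lookup[int(attr["idx"])] = attr["name"]
    return lookup


def label_coefficients(ml_attr: Mapping[str, Any],
                       coefficients: Sequence[float],
                       intercept: float) -> List[Tuple[str, float]]:
    """Pair each coefficient with its feature name, intercept last."""
    lookup = feature_names_by_index(ml_attr)
    # Fall back to a positional placeholder if an index has no metadata entry.
    names = [lookup.get(i, f"feature_{i}") for i in range(len(coefficients))]
    return list(zip(names + ["(Intercept)"], list(coefficients) + [intercept]))


# Hypothetical metadata, shaped like the "ml_attr" dict; the names are invented.
example_ml_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "log_treatment"}],
        "binary": [
            {"idx": 1, "name": "hour_of_day_1"},
            {"idx": 2, "name": "hour_of_day_2"},
        ],
    },
    "num_attrs": 3,
}

labeled = label_coefficients(example_ml_attr, [0.5, -1.2, 0.3], 2.0)
for name, value in labeled:
    print(name, value)
```

With your model you would pass the metadata of the "features" column of `rform_regression_input`, together with `lr_model.coefficients` and `lr_model.intercept`. The intercept goes last to match the `coefs` list in your snippet.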


It looks like there is some support now for achieving this, but I have
never tried it:

Best regards,

On Mon, 28 Oct 2019 at 21:01, Andrew Redd <andrewwr...@gmail.com> wrote:

> Hi All!
> I'm performing an econometric analysis over several billion rows of data
> and would like to use the Pyspark SparkML implementation of linear
> regression. In the example below I'm trying to interact hour of day and
> month of year indicators. The StringIndexer documentation tells you what
> it's doing when it's one hot encoding string/factor columns (i.e. taking
> out the most/least common value or first/last when sorted alphabetically)
> but doesn't allow you to recover your coefficient names. This feels like
> such a general case that I must be missing something. How can I get my
> column names back post regression to map to coefficient values? Do I need
> to basically rebuild the RFormula logic if this isn't already
> implemented? I would be happy to use a different Spark language (Scala/Java
> etc.) if it is implemented there.
> Thanks in advance
> Andrew
>     rform = RFormula(formula="log_outcome ~ log_treatment + hour_of_day + "
>                              "month_of_year + hour_of_day:month_of_year + "
>                              "additional_column",
>                      featuresCol="features",
>                      labelCol="label")
>     rform_regression_input = rform.fit(regression_input).transform(regression_input)
>     lr = LinearRegression(featuresCol='features',
>                           labelCol='label',
>                           solver='normal')
>     lr_model = lr.fit(rform_regression_input)
>     coefs = [*lr_model.coefficients, lr_model.intercept]
>     return pd.DataFrame(
>         {"pvalues": lr_model.summary.pValues,
>          "tvalues": lr_model.summary.tValues,
>          "std_errs": lr_model.summary.coefficientStandardErrors,
>          "coefs": coefs}
>     )

