Since you use two steps (StringIndexer and OneHotEncoder) to encode
categories into a Vector, I guess you want to decode the resulting vector back
into the original categories.
Suppose you have a DataFrame with a single column named "name" that contains
three categories: "b", "a", "c" (ranked by frequency). You can refer to the
following code snippet to encode and decode:
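import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

// Toy data: "b" occurs three times, "a" twice, "c" once.
val df = spark.createDataFrame(
  Seq("a", "b", "c", "b", "a", "b").map(Tuple1.apply)
).toDF("name")

// StringIndexer assigns indices by descending frequency: b -> 0, a -> 1, c -> 2.
val si = new StringIndexer().setInputCol("name").setOutputCol("indexedName")
val siModel = si.fit(df)
val df2 = siModel.transform(df)

// Keep all categories (dropLast = false) so each one gets its own vector slot.
// Note: here OneHotEncoder is used as a Transformer (Spark 1.x/2.x). In Spark 3.x
// it is an Estimator, so you would call encoder.fit(df2).transform(df2) instead.
val encoder = new OneHotEncoder()
  .setDropLast(false)
  .setInputCol("indexedName")
  .setOutputCol("encodedName")
val df3 = encoder.transform(df2)
df3.show()

// Decode to get the original categories from the column's ML attribute metadata.
val group = AttributeGroup.fromStructField(df3.schema("encodedName"))
val categories = group.attributes.get.map(_.name.get)
println(categories.mkString(","))
// Output: b,a,c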
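If you need the category for each individual row rather than the list of
categories, the idea is to find the position of the 1.0 in the encoded vector
and look it up in the ordered labels. Below is a minimal plain-Scala sketch of
that lookup. The object OneHotDecode and the helper decodeOneHot are
hypothetical names used only for illustration, not part of Spark, and the
sketch assumes dropLast = false so that every vector contains exactly one 1.0.
In Spark you could probably feed it siModel.labels and the vector's toArray
from inside a UDF, but I have not tested that wiring here.

```scala
// Hypothetical helper (not a Spark API): map a one-hot row back to its label.
object OneHotDecode {
  // `labels` must be ordered by index, e.g. Array("b", "a", "c").
  // `encoded` is the one-hot vector as a plain array.
  def decodeOneHot(labels: Array[String], encoded: Array[Double]): Option[String] = {
    // Position of the single 1.0 entry, or -1 if there is none.
    val idx = encoded.indexWhere(_ == 1.0)
    // An all-zero vector (possible when dropLast = true) has no set position,
    // so return None instead of guessing a category.
    if (idx >= 0 && idx < labels.length) Some(labels(idx)) else None
  }
}
```

For example, OneHotDecode.decodeOneHot(Array("b", "a", "c"), Array(0.0, 1.0, 0.0))
returns Some("a").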
Thanks
Yanbo
2016-07-14 6:46 GMT-07:00 rachmaninovquartet <[email protected]>:
> or would it be common practice to just retain the original categories in
> another df?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Dense-Vectors-outputs-in-feature-engineering-tp27331p27337.html