William Zhang created SPARK-22974:
-------------------------------------

             Summary: CountVectorModel does not attach attributes to output column
                 Key: SPARK-22974
                 URL: https://issues.apache.org/jira/browse/SPARK-22974
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.2.1
            Reporter: William Zhang

When CountVectorizerModel transforms a column, the output column has no ML attributes attached to it. If that column is later used as an input to the Interaction transformer, an exception is thrown:
{quote}"org.apache.spark.SparkException: Vector attributes must be defined for interaction."
{quote}
To reproduce it:

{{import org.apache.spark.ml.feature._
import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.{SparseVector, Vector}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c"), Array("1", "2")),
  (1, Array("a", "b", "b", "c", "a", "d"), Array("1", "2", "3"))
)).toDF("id", "words", "nums")

// Model fitted from data on the "nums" column
val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("nums")
  .setOutputCol("features2")
  .setVocabSize(4)
  .setMinDF(0)
  .fit(df)

// Model built directly from a fixed vocabulary on the "words" column
val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features1")

val df1 = cvm.transform(df)
val df2 = cvModel.transform(df1)

val interaction = new Interaction()
  .setInputCols(Array("features1", "features2"))
  .setOutputCol("features")

// Throws the SparkException quoted above
val df3 = interaction.transform(df2)}}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
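
As a supplementary note to the report above, the missing attributes can be checked directly on the output column's schema, and a caller can attach them by hand until the model does so itself. The following is only a minimal sketch under stated assumptions, not the fix adopted by the project: it assumes one unnamed-type numeric attribute per vocabulary term is an acceptable description of the vector, and the names {{before}}, {{after}}, {{fixed}}, and the app name are illustrative only. The {{local[1]}} master is likewise an assumption for running outside spark-shell.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.CountVectorizerModel
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Reuses the shell's session if one already exists; the local master is an
// assumption for standalone runs.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SPARK-22974-check")
  .getOrCreate()

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c")),
  (1, Array("a", "b", "b", "c", "a", "d"))
)).toDF("id", "words")

val vocab = Array("a", "b", "c")
val cvm = new CountVectorizerModel(vocab)
  .setInputCol("words")
  .setOutputCol("features1")
val df1 = cvm.transform(df)

// Diagnostic: read the ML attribute group from the column metadata.
// On the affected version the report implies no attributes are defined here.
val before = AttributeGroup.fromStructField(df1.schema("features1"))
println(s"attributes defined before workaround: ${before.attributes.isDefined}")

// Workaround sketch (an assumption, not the project's fix): describe each
// vocabulary slot as a numeric attribute and re-alias the column with
// that metadata.
val attrs: Array[Attribute] =
  vocab.map(term => NumericAttribute.defaultAttr.withName(term): Attribute)
val metadata = new AttributeGroup("features1", attrs).toMetadata()
val fixed = df1.withColumn("features1", col("features1").as("features1", metadata))

val after = AttributeGroup.fromStructField(fixed.schema("features1"))
println(s"attributes defined after workaround: ${after.attributes.isDefined}")
```

The same re-aliasing would need to be applied to every CountVectorizerModel output column that feeds Interaction, using the corresponding model's {{vocabulary}} array for the attribute names.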