William Zhang created SPARK-22974:
-------------------------------------

             Summary: CountVectorModel does not attach attributes to output column
                 Key: SPARK-22974
                 URL: https://issues.apache.org/jira/browse/SPARK-22974
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.2.1
            Reporter: William Zhang


When a CountVectorizerModel transforms a column, the output column does not 
have ML attributes attached to it. If that column is later used as an input to 
the Interaction transformer, an exception is thrown:
{quote}"org.apache.spark.SparkException: Vector attributes must be defined for 
interaction."
{quote}

To reproduce it:
{{import org.apache.spark.ml.feature._
import org.apache.spark.sql.functions._
import org.apache.spark.ml.linalg.{SparseVector, Vector}

val df = spark.createDataFrame(Seq(
  (0, Array("a", "b", "c"), Array("1", "2")),
  (1, Array("a", "b", "b", "c", "a", "d"),  Array("1", "2", "3"))
)).toDF("id", "words", "nums")

val cvModel: CountVectorizerModel = new CountVectorizer()
  .setInputCol("nums")
  .setOutputCol("features2")
  .setVocabSize(4)
  .setMinDF(0)
  .fit(df)

val cvm = new CountVectorizerModel(Array("a", "b", "c"))
  .setInputCol("words")
  .setOutputCol("features1")
  

val df1 = cvm.transform(df)
val df2 = cvModel.transform(df1)

val interaction = new Interaction()
  .setInputCols(Array("features1", "features2"))
  .setOutputCol("features")
val df3 = interaction.transform(df2)}}
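
Until the model attaches attributes itself, one possible workaround is to add them by hand before calling Interaction. The sketch below is an assumption about how that could be done, not the fix adopted for this issue: the helper names (vocabMetadata, withVocabAttributes, hasVectorAttributes) are hypothetical, and it assumes that one numeric attribute per vocabulary term is an acceptable description of the count vector.

```scala
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.Metadata

// Hypothetical helper (not part of Spark): builds ML attribute metadata with
// one numeric attribute per vocabulary term, named after the term.
def vocabMetadata(colName: String, vocabulary: Array[String]): Metadata = {
  val attrs = vocabulary.map(term => NumericAttribute.defaultAttr.withName(term): Attribute)
  new AttributeGroup(colName, attrs).toMetadata()
}

// Hypothetical helper: replaces the vector column with a copy of itself
// that carries the metadata built above.
def withVocabAttributes(df: DataFrame, colName: String, vocabulary: Array[String]): DataFrame =
  df.withColumn(colName, col(colName).as(colName, vocabMetadata(colName, vocabulary)))

// Hypothetical helper: reports whether a vector column has attributes defined,
// which is the condition Interaction complains about.
def hasVectorAttributes(df: DataFrame, colName: String): Boolean =
  AttributeGroup.fromStructField(df.schema(colName)).attributes.isDefined
```

With the reproduction above, this would be applied as withVocabAttributes(df1, "features1", cvm.vocabulary) and, after the second transform, withVocabAttributes(df2, "features2", cvModel.vocabulary); hasVectorAttributes can be used to confirm that both columns report attributes before they are passed to Interaction.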



