Xinyong Tian created SPARK-25441:
------------------------------------

             Summary: calculate term frequency in CountVectorizer()
                 Key: SPARK-25441
                 URL: https://issues.apache.org/jira/browse/SPARK-25441
             Project: Spark
          Issue Type: New Feature
          Components: ML
    Affects Versions: 2.3.1
            Reporter: Xinyong Tian


Currently, CountVectorizer() cannot output TF (term frequency). I hope such
an option can be added.

TF is defined as in https://en.m.wikipedia.org/wiki/Tf–idf

 

Example:

>>> from pyspark.ml.feature import CountVectorizer

>>> df = spark.createDataFrame(
...     [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
...     ["label", "raw"])

>>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")

>>> model = cv.fit(df)

>>> model.transform(df).limit(1).show(truncate=False)

label        raw             vectors
0            [a, b, c]       (3,[0,1,2],[1.0,1.0,1.0])

 

Instead, I would like:

0            [a, b, c]       (3,[0,1,2],[0.33,0.33,0.33])

That is, each vector is divided by its sum (here 3), so that the new vector
sums to 1 for every row (document).
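
The requested normalization could be sketched in plain Python roughly as
follows. This is only an illustration of the arithmetic, not how
CountVectorizer is implemented: the helper name term_frequencies is
hypothetical, and the (size, indices, values) triple simply mirrors the way
a sparse vector is printed above.

```python
from typing import List, Tuple


def term_frequencies(
    size: int, indices: List[int], counts: List[float]
) -> Tuple[int, List[int], List[float]]:
    """Scale a sparse count vector so that its values sum to 1.

    Hypothetical helper for illustration; not part of Spark.
    """
    total = sum(counts)
    if total == 0:
        # Empty document: return the counts unchanged instead of dividing by zero.
        return size, list(indices), list(counts)
    return size, list(indices), [c / total for c in counts]


# Row 0, ["a", "b", "c"]: every term occurs once.
print(term_frequencies(3, [0, 1, 2], [1.0, 1.0, 1.0]))

# Row 1, ["a", "b", "b", "c", "a"]: counts 2, 2, 1
# (which index maps to which term is assumed here).
print(term_frequencies(3, [0, 1, 2], [2.0, 2.0, 1.0]))
```

For row 0 each value comes out at about 0.33, matching the desired output
above. As a possible workaround in the meantime, passing the "vectors" column
through pyspark.ml.feature.Normalizer with p=1.0 should give the same result
for non-negative counts, although that is a separate stage rather than the
option requested here.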

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

