[ https://issues.apache.org/jira/browse/SPARK-25441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-25441.
-------------------------------
    Resolution: Won't Fix

What you have there is already term frequency. If you want to normalize it to some kind of term fraction, you can just make that transformation yourself.

> calculate term frequency in CountVectorizer()
> ---------------------------------------------
>
>                 Key: SPARK-25441
>                 URL: https://issues.apache.org/jira/browse/SPARK-25441
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>    Affects Versions: 2.3.1
>            Reporter: Xinyong Tian
>            Priority: Major
>
> Currently CountVectorizer() cannot output TF (term frequency). I hope there
> will be such an option.
> TF is defined as in https://en.m.wikipedia.org/wiki/Tf–idf
>
> Example:
> >>> df = spark.createDataFrame(
> ...     [(0, ["a", "b", "c"]), (1, ["a", "b", "b", "c", "a"])],
> ...     ["label", "raw"])
> >>> cv = CountVectorizer(inputCol="raw", outputCol="vectors")
> >>> model = cv.fit(df)
> >>> model.transform(df).limit(1).show(truncate=False)
> label  raw        vectors
> 0      [a, b, c]  (3,[0,1,2],[1.0,1.0,1.0])
>
> Instead I want
> 0      [a, b, c]  (3,[0,1,2],[0.33,0.33,0.33])
> # i.e., each vector divided by its sum (here 3), so that the sum of the
> # new vector will be 1 for every row (document).
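The transformation the resolution refers to, dividing each count vector by its own sum, can be sketched in plain Python. This is only an illustration, not part of the ticket: the `(size, indices, values)` tuple mimics how the sparse vectors are printed above, and the name `to_term_fraction` and the index order assumed for the second document are hypothetical. In Spark itself, `pyspark.ml.feature.Normalizer` with `p=1.0` applied to the `vectors` column should give the same result for non-negative counts; that pairing is a suggestion, not something the thread states.

```python
from typing import List, Tuple

# (size, indices, values), mirroring the printed form (3,[0,1,2],[1.0,1.0,1.0]).
SparseVector = Tuple[int, List[int], List[float]]


def to_term_fraction(vec: SparseVector) -> SparseVector:
    """Divide each count by the row total so the values sum to 1.

    Hypothetical helper: for non-negative counts this is L1 normalization.
    An all-zero vector is returned unchanged to avoid dividing by zero.
    """
    size, indices, values = vec
    total = sum(values)
    if total == 0:
        return (size, list(indices), list(values))
    return (size, list(indices), [v / total for v in values])


# Row 0 from the ticket: each of a, b, c appears once.
doc0: SparseVector = (3, [0, 1, 2], [1.0, 1.0, 1.0])
# Row 1 from the ticket: a and b appear twice, c once (index order assumed).
doc1: SparseVector = (3, [0, 1, 2], [2.0, 2.0, 1.0])

frac0 = to_term_fraction(doc0)
frac1 = to_term_fraction(doc1)

print(frac0)
print(frac1)
```

Each output vector keeps its size and indices; only the values change, and they sum to 1 per document, which is what the reporter asked for.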