Re: has any one implemented TF_IDF using ML transformers?
Hi Andy,

I will take a look at your code after you share it. Thanks!

Yanbo

2016-01-23 0:18 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:
> Hi Yanbo
>
> I recently coded up the trivial example from
> http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
> and I do not get the same results. I'll put my code up on GitHub over the
> weekend if anyone is interested.
>
> Andy
>
> From: Yanbo Liang <yblia...@gmail.com>
> Date: Tuesday, January 19, 2016 at 1:11 AM
> To: Andrew Davidson <a...@santacruzintegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: has any one implemented TF_IDF using ML transformers?
>
> Hi Andy,
>
> The equation used to calculate IDF is:
>
>     idf = log((m + 1) / (d(t) + 1))
>
> where m is the number of documents and d(t) is the number of documents
> containing term t. You can refer here:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L150
>
> The equation to calculate TF-IDF is:
>
>     TFIDF = TF * IDF
>
> you can refer:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L226
>
> Thanks
> Yanbo
>
> 2016-01-19 7:05 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:
>
>> Hi Yanbo
>>
>> I am using 1.6.0. I am having a hard time trying to figure out what
>> the exact equation is. I do not know Scala.
>>
>> I took a look at the source code URL you provided:
>>
>>     override def transform(dataset: DataFrame): DataFrame = {
>>       transformSchema(dataset.schema, logging = true)
>>       val idf = udf { vec: Vector => idfModel.transform(vec) }
>>       dataset.withColumn($(outputCol), idf(col($(inputCol))))
>>     }
>>
>> You mentioned the doc is out of date:
>> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
>>
>> Based on my understanding of the subject matter, the equations in the Java
>> doc are correct. I could not find anything like the equations in the source
>> code:
>>
>> IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)),
>>
>> TFIDF(t, d, D) = TF(t, d) * IDF(t, D).
>>
>> I found the Spark unit test org.apache.spark.mllib.feature.JavaTfIdfSuite;
>> its results do not match the equation. (In general the unit-test asserts
>> seem incomplete.)
>>
>> I have created several small test examples to try and figure out how to
>> use NaiveBayes, HashingTF, and IDF. The values of TFIDF, theta,
>> probabilities, ... that Spark produces do not match the published results at
>> http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
>>
>> Kind regards
>>
>> Andy
>>
>> private DataFrame createTrainingData() {
>>     // make sure we only use dictionarySize words
>>     JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(
>>             // label 0 is Chinese
>>             // label 1 is notChinese
>>             RowFactory.create(0, 0.0, Arrays.asList("Chinese", "Beijing", "Chinese")),
>>             RowFactory.create(1, 0.0, Arrays.asList("Chinese", "Chinese", "Shanghai")),
>>             RowFactory.create(2, 0.0, Arrays.asList("Chinese", "Macao")),
>>             RowFactory.create(3, 1.0, Arrays.asList("Tokyo", "Japan", "Chinese"))));
>>
>>     return createData(rdd);
>> }
>>
>> private DataFrame createData(JavaRDD<Row> rdd) {
>>     StructField id = new StructField("id", DataTypes.IntegerType, false, Metadata.empty());
>>     StructField label = new StructField("label", DataTypes.DoubleType, false, Metadata.empty());
>>     StructField words = new StructField("words",
>>             DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty());
>>
>>     StructType schema = new StructType(new StructField[] { id, label, words });
>>     DataFrame ret = sqlContext.createDataFrame(rdd, schema);
>>     return ret;
>> }
>>
>> private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
>>     HashingTF hashingTF = new HashingTF()
>>             .setInputCol("words")
>>             .setOutputCol("tf")
>>             .setNumFeatures(dictionarySize);
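[Editor's note: the smoothed IDF formula quoted above can be sanity-checked against the idf values shown later in this thread (0.0 for "Chinese", 0.9162907318741551 for every other term) with a few lines of plain Java, no Spark required. This is a sketch of the arithmetic only; `idf` is a hypothetical helper, not a Spark API.]

```java
public class IdfCheck {
    // Spark's smoothed IDF: log((m + 1) / (df + 1)), natural log.
    static double idf(long m, long df) {
        return Math.log((double) (m + 1) / (df + 1));
    }

    public static void main(String[] args) {
        long m = 4; // the 4-document "Chinese"/"notChinese" training corpus

        // "Chinese" appears in all 4 documents -> log(5/5) is exactly 0,
        // which is why column index 1 is always 0.0 in the idf vectors.
        System.out.println(idf(m, 4)); // prints 0.0

        // every other term appears in exactly 1 document -> log(5/2)
        System.out.println(idf(m, 1)); // prints 0.9162907318741551
    }
}
```

So the "surprising" zeros in the idf column are expected: a term occurring in every document carries no discriminating information under the smoothed formula.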
Re: has any one implemented TF_IDF using ML transformers?
+---+-----+----------------------------+-------------------------+--------------------------------------------------------+
|id |label|words                       |tf                       |idf                                                     |
+---+-----+----------------------------+-------------------------+--------------------------------------------------------+
|0  |0.0  |[Chinese, Beijing, Chinese] |(7,[1,2],[2.0,1.0])      |(7,[1,2],[0.0,0.9162907318741551])                      |
|1  |0.0  |[Chinese, Chinese, Shanghai]|(7,[1,4],[2.0,1.0])      |(7,[1,4],[0.0,0.9162907318741551])                      |
|2  |0.0  |[Chinese, Macao]            |(7,[1,6],[1.0,1.0])      |(7,[1,6],[0.0,0.9162907318741551])                      |
|3  |1.0  |[Tokyo, Japan, Chinese]     |(7,[1,3,5],[1.0,1.0,1.0])|(7,[1,3,5],[0.0,0.9162907318741551,0.9162907318741551])|
+---+-----+----------------------------+-------------------------+--------------------------------------------------------+

Here is the Spark test case:

@Test
public void tfIdf() {
    // The tests are to check Java compatibility.
    HashingTF tf = new HashingTF();
    @SuppressWarnings("unchecked")
    JavaRDD<List<String>> documents = sc.parallelize(Arrays.asList(
            Arrays.asList("this is a sentence".split(" ")),
            Arrays.asList("this is another sentence".split(" ")),
            Arrays.asList("this is still a sentence".split(" "))), 2);
    JavaRDD<Vector> termFreqs = tf.transform(documents);
    termFreqs.collect();
    IDF idf = new IDF();
    JavaRDD<Vector> tfIdfs = idf.fit(termFreqs).transform(termFreqs);
    List<Vector> localTfIdfs = tfIdfs.collect();

    int indexOfThis = tf.indexOf("this");
    System.err.println("AEDWIP: indexOfThis: " + indexOfThis);

    int indexOfSentence = tf.indexOf("sentence");
    System.err.println("AEDWIP: indexOfSentence: " + indexOfSentence);

    int indexOfAnother = tf.indexOf("another");
    System.err.println("AEDWIP: indexOfAnother: " + indexOfAnother);

    for (Vector v : localTfIdfs) {
        System.err.println("AEDWIP: V.toString() " + v.toString());
        Assert.assertEquals(0.0, v.apply(indexOfThis), 1e-15);
    }
}

$ mvn test -DwildcardSuites=none -Dtest=org.apache.spark.mllib.feature.JavaTfIdfSuite

AEDWIP: indexOfThis: 413342
AEDWIP: indexOfSentence: 251491
AEDWIP: indexOfAnother: 263939
AEDWIP: V.toString() (1048576,[97,3370,251491,413342],[0.28768207245178085,0.0,0.0,0.0])
AEDWIP: V.toString() (1048576,[3370,251491,263939,413342],[0.0,0.0,0.6931471805599453,0.0])
AEDWIP: V.toString() (1048576,[97,3370,251491,413342,713128],[0.28768207245178085,0.0,0.0,0.0,0.6931471805599453])

Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.908 sec - in org.apache.spark.mllib.feature.JavaTfIdfSuite

From: Yanbo Liang <yblia...@gmail.com>
Date: Sunday, January 17, 2016 at 12:34 AM
To: Andrew Davidson <a...@santacruzintegration.com>
Cc: "user @spark" <user@spark.apache.org>
Subject: Re: has any one implemented TF_IDF using ML transformers?

> Hi Andy,
>
> Actually, the output of the ML IDF model is the TF-IDF vector of each
> instance rather than the IDF vector, so it is unnecessary to do member-wise
> multiplication to calculate the TF-IDF value. You can refer to the code here:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala#L121
> I found the documentation of IDF is not very clear; we need to update it.
>
> Thanks
> Yanbo
>
> 2016-01-16 6:10 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:
>
>> I wonder if I am missing something? TF-IDF is very popular, and Spark ML
>> has a lot of transformers, however TF-IDF is not supported directly.
>>
>> Spark provides a HashingTF and an IDF transformer. The Java doc
>> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
>> mentions you can implement TF-IDF as follows:
>>
>> TFIDF(t, d, D) = TF(t, d) * IDF(t, D).
>>
>> The problem I am running into is that both HashingTF and IDF return a
>> sparse vector.
>>
>> Ideally the Spark code to implement TF-IDF would be one line:
>>
>>     DataFrame ret = tmp.withColumn("features",
>>         tmp.col("tf").multiply(tmp.col("idf")));
>>
>> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
>> data type mismatch: '(tf * idf)' requires numeric type, not vector;
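[Editor's note: the numbers in the test output above are consistent with the smoothed IDF formula. With m = 3 documents, "this", "is", and "sentence" appear in all three (idf = log(4/4) = 0, hence the asserted zeros), "a" appears in two (log(4/3)), and "another"/"still" each appear in one (log(4/2) = ln 2). A plain-Java sanity check, no Spark needed:]

```java
public class SuiteValuesCheck {
    // Spark's smoothed IDF: log((m + 1) / (df + 1)), natural log.
    static double idf(int m, int df) {
        return Math.log((m + 1.0) / (df + 1.0));
    }

    public static void main(String[] args) {
        int m = 3; // three documents in JavaTfIdfSuite
        System.out.println(idf(m, 3)); // "this"/"is"/"sentence": prints 0.0
        System.out.println(idf(m, 2)); // "a": prints 0.28768207245178085
        System.out.println(idf(m, 1)); // "another"/"still": prints 0.6931471805599453
    }
}
```

Since each of these terms has raw tf = 1 per document, the tf-idf entries in the vectors equal the idf values directly, which is exactly what the `V.toString()` lines show.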
Re: has any one implemented TF_IDF using ML transformers?
Hi Andy,

Actually, the output of the ML IDF model is the TF-IDF vector of each instance rather than the IDF vector, so it is unnecessary to do member-wise multiplication to calculate the TF-IDF value. You can refer to the code here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala#L121
I found the documentation of IDF is not very clear; we need to update it.

Thanks
Yanbo

2016-01-16 6:10 GMT+08:00 Andy Davidson:
> I wonder if I am missing something? TF-IDF is very popular, and Spark ML
> has a lot of transformers, however TF-IDF is not supported directly.
>
> Spark provides a HashingTF and an IDF transformer. The Java doc
> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
> mentions you can implement TF-IDF as follows:
>
> TFIDF(t, d, D) = TF(t, d) * IDF(t, D).
>
> The problem I am running into is that both HashingTF and IDF return a
> sparse vector.
>
> Ideally the Spark code to implement TF-IDF would be one line:
>
>     DataFrame ret = tmp.withColumn("features",
>         tmp.col("tf").multiply(tmp.col("idf")));
>
> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
> data type mismatch: '(tf * idf)' requires numeric type, not vector;
>
> I could implement my own UDF to do member-wise multiplication, however
> given how common TF-IDF is, I wonder if this code already exists somewhere.
>
> I found org.apache.spark.util.Vector.Multiplier. There is no documentation,
> however given the argument is a double, my guess is it just does scalar
> multiplication.
>
> I guess I could do something like
>
>     Double[] v = mySparkVector.toArray();
>
> and then use jblas to do member-wise multiplication.
>
> I assume sparse vectors are not distributed, so there would not be any
> additional communication cost.
>
> If this code is truly missing, I would be happy to write it and donate it.
>
> Andy
>
> From: Andrew Davidson
> Date: Wednesday, January 13, 2016 at 2:52 PM
> To: "user @spark"
> Subject: trouble calculating TF-IDF data type mismatch: '(tf * idf)' requires numeric type, not vector;
>
> Below is a little snippet of my Java test code. Any idea how I implement
> member-wise vector multiplication?
>
> Kind regards
>
> Andy
>
> transformed df printSchema()
>
> root
>  |-- id: integer (nullable = false)
>  |-- label: double (nullable = false)
>  |-- words: array (nullable = false)
>  |    |-- element: string (containsNull = true)
>  |-- tf: vector (nullable = true)
>  |-- idf: vector (nullable = true)
>
> +---+-----+----------------------------+-------------------------+--------------------------------------------------------+
> |id |label|words                       |tf                       |idf                                                     |
> +---+-----+----------------------------+-------------------------+--------------------------------------------------------+
> |0  |0.0  |[Chinese, Beijing, Chinese] |(7,[1,2],[2.0,1.0])      |(7,[1,2],[0.0,0.9162907318741551])                      |
> |1  |0.0  |[Chinese, Chinese, Shanghai]|(7,[1,4],[2.0,1.0])      |(7,[1,4],[0.0,0.9162907318741551])                      |
> |2  |0.0  |[Chinese, Macao]            |(7,[1,6],[1.0,1.0])      |(7,[1,6],[0.0,0.9162907318741551])                      |
> |3  |1.0  |[Tokyo, Japan, Chinese]     |(7,[1,3,5],[1.0,1.0,1.0])|(7,[1,3,5],[0.0,0.9162907318741551,0.9162907318741551])|
> +---+-----+----------------------------+-------------------------+--------------------------------------------------------+
>
> @Test
> public void test() {
>     DataFrame rawTrainingDF = createTrainingData();
>     DataFrame trainingDF = runPipleLineTF_IDF(rawTrainingDF);
>     . . .
> }
>
> private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
>     HashingTF hashingTF = new HashingTF()
>             .setInputCol("words")
>             .setOutputCol("tf")
>             .setNumFeatures(dictionarySize);
>
>     DataFrame termFrequenceDF = hashingTF.transform(rawDF);
>
>     termFrequenceDF.cache(); // idf needs to make 2 passes over the data set
>     IDFModel idf = new IDF()
>             //.setMinDocFreq(1) // our vocabulary has 6 words we hash into 7
>             .setInputCol(hashingTF.getOutputCol())
>             .setOutputCol("idf")
>             .fit(termFrequenceDF);
>
>     DataFrame tmp = idf.transform(termFrequenceDF);
>
>     DataFrame ret = tmp.withColumn("features", tmp.col("tf").multiply(tmp.col("idf")));
>     logger.warn("\ntransformed df printSchema()");
>     ret.printSchema();
>     ret.show(false);
>
>     return ret;
> }
>
> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
> data type mismatch: '(tf * idf)' requires numeric type, not vector;
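[Editor's note: as the thread establishes, `Column.multiply` does not work on vector columns, and the `ml` IDF transformer already outputs TF-IDF, so the multiplication is unnecessary. For completeness, the member-wise product Andy describes can be sketched in plain Java over parallel index/value arrays, the same layout Spark's SparseVector uses. This is an illustration under that assumption, not Spark API code; `mulIndices` is a made-up helper name.]

```java
import java.util.ArrayList;
import java.util.List;

public class SparseMultiply {
    /**
     * Member-wise product of two sparse vectors given as sorted
     * (index, value) pairs. Only indices present in both inputs
     * survive, since x * 0 == 0; the merge walks both index arrays
     * once, like a sorted-list intersection.
     */
    static int[] mulIndices(int[] ia, double[] va, int[] ib, double[] vb,
                            List<Double> outVals) {
        List<Integer> outIdx = new ArrayList<>();
        int i = 0, j = 0;
        while (i < ia.length && j < ib.length) {
            if (ia[i] == ib[j]) {
                outIdx.add(ia[i]);
                outVals.add(va[i] * vb[j]);
                i++;
                j++;
            } else if (ia[i] < ib[j]) {
                i++;
            } else {
                j++;
            }
        }
        return outIdx.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        // row 0 of the table above: tf = (7,[1,2],[2.0,1.0]),
        // idf = (7,[1,2],[0.0,0.9162907318741551])
        int[] tfIdx = {1, 2};
        double[] tfVal = {2.0, 1.0};
        int[] idfIdx = {1, 2};
        double[] idfVal = {0.0, 0.9162907318741551};

        List<Double> vals = new ArrayList<>();
        int[] idx = mulIndices(tfIdx, tfVal, idfIdx, idfVal, vals);
        for (int k = 0; k < idx.length; k++) {
            System.out.println(idx[k] + " -> " + vals.get(k));
        }
        // prints: 1 -> 0.0 then 2 -> 0.9162907318741551
    }
}
```

Wrapping logic like this in a UDF would make Andy's one-liner work, but since `IDFModel.transform` already returns tf * idf, the simpler fix is to use the `idf` output column directly as the feature vector.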
has any one implemented TF_IDF using ML transformers?
I wonder if I am missing something? TF-IDF is very popular, and Spark ML has a lot of transformers, however TF-IDF is not supported directly.

Spark provides a HashingTF and an IDF transformer. The Java doc
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
mentions you can implement TF-IDF as follows:

TFIDF(t, d, D) = TF(t, d) * IDF(t, D).

The problem I am running into is that both HashingTF and IDF return a sparse vector.

Ideally the Spark code to implement TF-IDF would be one line:

    DataFrame ret = tmp.withColumn("features", tmp.col("tf").multiply(tmp.col("idf")));

org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to data type mismatch: '(tf * idf)' requires numeric type, not vector;

I could implement my own UDF to do member-wise multiplication, however given how common TF-IDF is, I wonder if this code already exists somewhere.

I found org.apache.spark.util.Vector.Multiplier. There is no documentation, however given the argument is a double, my guess is it just does scalar multiplication.

I guess I could do something like

    Double[] v = mySparkVector.toArray();

and then use jblas to do member-wise multiplication.

I assume sparse vectors are not distributed, so there would not be any additional communication cost.

If this code is truly missing, I would be happy to write it and donate it.

Andy

From: Andrew Davidson
Date: Wednesday, January 13, 2016 at 2:52 PM
To: "user @spark"
Subject: trouble calculating TF-IDF data type mismatch: '(tf * idf)' requires numeric type, not vector;

> Below is a little snippet of my Java test code. Any idea how I implement
> member-wise vector multiplication?
>
> Kind regards
>
> Andy
>
> transformed df printSchema()
>
> root
>  |-- id: integer (nullable = false)
>  |-- label: double (nullable = false)
>  |-- words: array (nullable = false)
>  |    |-- element: string (containsNull = true)
>  |-- tf: vector (nullable = true)
>  |-- idf: vector (nullable = true)
>
> +---+-----+----------------------------+-------------------------+--------------------------------------------------------+
> |id |label|words                       |tf                       |idf                                                     |
> +---+-----+----------------------------+-------------------------+--------------------------------------------------------+
> |0  |0.0  |[Chinese, Beijing, Chinese] |(7,[1,2],[2.0,1.0])      |(7,[1,2],[0.0,0.9162907318741551])                      |
> |1  |0.0  |[Chinese, Chinese, Shanghai]|(7,[1,4],[2.0,1.0])      |(7,[1,4],[0.0,0.9162907318741551])                      |
> |2  |0.0  |[Chinese, Macao]            |(7,[1,6],[1.0,1.0])      |(7,[1,6],[0.0,0.9162907318741551])                      |
> |3  |1.0  |[Tokyo, Japan, Chinese]     |(7,[1,3,5],[1.0,1.0,1.0])|(7,[1,3,5],[0.0,0.9162907318741551,0.9162907318741551])|
> +---+-----+----------------------------+-------------------------+--------------------------------------------------------+
>
> @Test
> public void test() {
>     DataFrame rawTrainingDF = createTrainingData();
>     DataFrame trainingDF = runPipleLineTF_IDF(rawTrainingDF);
>     . . .
> }
>
> private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
>     HashingTF hashingTF = new HashingTF()
>             .setInputCol("words")
>             .setOutputCol("tf")
>             .setNumFeatures(dictionarySize);
>
>     DataFrame termFrequenceDF = hashingTF.transform(rawDF);
>
>     termFrequenceDF.cache(); // idf needs to make 2 passes over the data set
>     IDFModel idf = new IDF()
>             //.setMinDocFreq(1) // our vocabulary has 6 words we hash into 7
>             .setInputCol(hashingTF.getOutputCol())
>             .setOutputCol("idf")
>             .fit(termFrequenceDF);
>
>     DataFrame tmp = idf.transform(termFrequenceDF);
>
>     DataFrame ret = tmp.withColumn("features", tmp.col("tf").multiply(tmp.col("idf")));
>     logger.warn("\ntransformed df printSchema()");
>     ret.printSchema();
>     ret.show(false);
>
>     return ret;
> }
>
> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
> data type mismatch: '(tf * idf)' requires numeric type, not vector;
>
> private DataFrame createTrainingData() {
>     // make sure we only use dictionarySize words
>     JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(
>             // label 0 is Chinese
>             // label 1 is notChinese
>             RowFactory.create(0, 0.0, Arrays.asList("Chinese", "Beijing", "Chinese")),
>             RowFactory.create(1, 0.0, Arrays.asList("Chinese", "Chinese", "Shanghai")),