Re: has any one implemented TF_IDF using ML transformers?

2016-01-24 Thread Yanbo Liang
Hi Andy,
I will take a look at your code after you share it.
Thanks!
Yanbo

2016-01-23 0:18 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:

> Hi Yanbo
>
> I recently coded up the trivial example from
> http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
> but I do not get the same results. I’ll put my code up on GitHub over the
> weekend if anyone is interested.
>
> Andy
>
> From: Yanbo Liang <yblia...@gmail.com>
> Date: Tuesday, January 19, 2016 at 1:11 AM
>
> To: Andrew Davidson <a...@santacruzintegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: has any one implemented TF_IDF using ML transformers?
>
> Hi Andy,
>
> The equation used to calculate IDF is:
> idf = log((m + 1) / (d(t) + 1))
> where m is the total number of documents and d(t) is the number of
> documents containing term t. You can see the implementation here:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L150
>
> The equation to calculate TF-IDF is:
> TFIDF = TF * IDF
> You can see it applied here:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L226
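[Editor's note] These two formulas can be sanity-checked outside Spark. Below is a minimal sketch in plain Java (not Spark code); the counts come from the training set shown later in this thread, where m = 4 documents, "Chinese" appears in all 4 documents, and "Beijing" appears in only 1:

```java
public class IdfCheck {
    // idf = log((m + 1) / (d(t) + 1)), where m is the number of documents
    // and d(t) is the number of documents containing term t
    static double idf(long m, long dt) {
        return Math.log((m + 1.0) / (dt + 1.0));
    }

    public static void main(String[] args) {
        // "Chinese" appears in all 4 documents, so its IDF (and hence its
        // TF-IDF) is log(5/5) = 0 regardless of term frequency
        System.out.println(idf(4, 4)); // 0.0
        // "Beijing" appears in 1 of 4 documents: log(5/2)
        System.out.println(idf(4, 1)); // 0.9162907318741551
    }
}
```

These values match the 0.0 and 0.9162907318741551 entries in the idf column of the DataFrame shown later in the thread.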
>
>
> Thanks
> Yanbo
>
> 2016-01-19 7:05 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:
>
>> Hi Yanbo
>>
>> I am using 1.6.0. I am having a hard time trying to figure out what
>> the exact equation is; I do not know Scala.
>>
>> I took a look at the source code URL you provided:
>>
>> override def transform(dataset: DataFrame): DataFrame = {
>>   transformSchema(dataset.schema, logging = true)
>>   val idf = udf { vec: Vector => idfModel.transform(vec) }
>>   dataset.withColumn($(outputCol), idf(col($(inputCol))))
>> }
>>
>>
>> You mentioned the doc is out of date.
>> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
>>
>> Based on my understanding of the subject matter, the equations in the
>> Java doc are correct, but I could not find anything like them in the
>> source code:
>>
>> IDF(t,D) = log((|D| + 1) / (DF(t,D) + 1))
>>
>> TFIDF(t,d,D) = TF(t,d) * IDF(t,D)
>>
>>
>> I found the Spark unit test org.apache.spark.mllib.feature.JavaTfIdfSuite;
>> its results do not match the equation. (In general the unit test asserts
>> seem incomplete.)
>>
>>
>> I have created several small test examples to try and figure out how to
>> use NaiveBayes, HashingTF, and IDF. The values of TFIDF, theta,
>> probabilities, … produced by Spark do not match the published results at
>> http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
>>
>>
>> Kind regards
>>
>> Andy
>>
>> private DataFrame createTrainingData() {
>>
>> // make sure we only use dictionarySize words
>>
>> JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(
>>
>> // 0 is Chinese
>>
>> // 1 is notChinese
>>
>> RowFactory.create(0, 0.0, Arrays.asList("Chinese",
>> "Beijing", "Chinese")),
>>
>> RowFactory.create(1, 0.0, Arrays.asList("Chinese",
>> "Chinese", "Shanghai")),
>>
>> RowFactory.create(2, 0.0, Arrays.asList("Chinese",
>> "Macao")),
>>
>> RowFactory.create(3, 1.0, Arrays.asList("Tokyo", "Japan",
>> "Chinese"))));
>>
>>
>>
>> return createData(rdd);
>>
>> }
>>
>>
>> private DataFrame createData(JavaRDD<Row> rdd) {
>>
>> StructField id = null;
>>
>> id = new StructField("id", DataTypes.IntegerType, false,
>> Metadata.empty());
>>
>>
>> StructField label = null;
>>
>> label = new StructField("label", DataTypes.DoubleType, false,
>> Metadata.empty());
>>
>>
>>
>> StructField words = null;
>>
>> words = new StructField("words",
>> DataTypes.createArrayType(DataTypes.StringType), false,
>> Metadata.empty());
>>
>>
>> StructType schema = new StructType(new StructField[] { id, label,
>> words });
>>
>> DataFrame ret = sqlContext.createDataFrame(rdd, schema);
>>
>>
>>
>> return ret;
>>
>> }
>>
>>
>>private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
>>
>> HashingTF hashingTF = new Has

Re: has any one implemented TF_IDF using ML transformers?

2016-01-22 Thread Andy Davidson
Hi Yanbo

I recently coded up the trivial example from
http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
but I do not get the same results. I’ll put my code up on GitHub over the
weekend if anyone is interested.

Andy

From:  Yanbo Liang <yblia...@gmail.com>
Date:  Tuesday, January 19, 2016 at 1:11 AM
To:  Andrew Davidson <a...@santacruzintegration.com>
Cc:  "user @spark" <user@spark.apache.org>
Subject:  Re: has any one implemented TF_IDF using ML transformers?

> Hi Andy,
> 
> The equation used to calculate IDF is:
> idf = log((m + 1) / (d(t) + 1))
> You can see the implementation here:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L150
> 
> The equation to calculate TF-IDF is:
> TFIDF = TF * IDF
> You can see it applied here:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/IDF.scala#L226
> 
> 
> Thanks
> Yanbo
> 
> 2016-01-19 7:05 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:
>> Hi Yanbo
>> 
>> I am using 1.6.0. I am having a hard time trying to figure out what the
>> exact equation is; I do not know Scala.
>> 
>> I took a look at the source code URL you provided:
>> 
>>   override def transform(dataset: DataFrame): DataFrame = {
>>     transformSchema(dataset.schema, logging = true)
>>     val idf = udf { vec: Vector => idfModel.transform(vec) }
>>     dataset.withColumn($(outputCol), idf(col($(inputCol))))
>>   }
>> 
>> 
>> You mentioned the doc is out of date.
>> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
>> 
>> Based on my understanding of the subject matter, the equations in the Java
>> doc are correct, but I could not find anything like them in the source code:
>> 
>> IDF(t,D) = log((|D| + 1) / (DF(t,D) + 1))
>> 
>> TFIDF(t,d,D) = TF(t,d) * IDF(t,D)
>> 
>> 
>> I found the Spark unit test org.apache.spark.mllib.feature.JavaTfIdfSuite;
>> its results do not match the equation. (In general the unit test asserts
>> seem incomplete.)
>> 
>> 
>> I have created several small test examples to try and figure out how to use
>> NaiveBayes, HashingTF, and IDF. The values of TFIDF, theta, probabilities, …
>> produced by Spark do not match the published results at
>> http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
>> 
>> 
>> Kind regards
>> 
>> Andy 
>> 
>> private DataFrame createTrainingData() {
>> 
>> // make sure we only use dictionarySize words
>> 
>> JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(
>> 
>> // 0 is Chinese
>> 
>> // 1 is notChinese
>> 
>> RowFactory.create(0, 0.0, Arrays.asList("Chinese", "Beijing",
>> "Chinese")),
>> 
>> RowFactory.create(1, 0.0, Arrays.asList("Chinese", "Chinese",
>> "Shanghai")),
>> 
>> RowFactory.create(2, 0.0, Arrays.asList("Chinese", "Macao")),
>> 
>> RowFactory.create(3, 1.0, Arrays.asList("Tokyo", "Japan",
>> "Chinese"))));
>> 
>>
>> 
>> return createData(rdd);
>> 
>> }
>> 
>> 
>> 
>> private DataFrame createData(JavaRDD<Row> rdd) {
>> 
>> StructField id = null;
>> 
>> id = new StructField("id", DataTypes.IntegerType, false,
>> Metadata.empty());
>> 
>> 
>> 
>> StructField label = null;
>> 
>> label = new StructField("label", DataTypes.DoubleType, false,
>> Metadata.empty());
>> 
>>
>> 
>> StructField words = null;
>> 
>> words = new StructField("words",
>> DataTypes.createArrayType(DataTypes.StringType), false, Metadata.empty());
>> 
>> 
>> 
>> StructType schema = new StructType(new StructField[] { id, label,
>> words });
>> 
>> DataFrame ret = sqlContext.createDataFrame(rdd, schema);
>> 
>> 
>> 
>> return ret;
>> 
>> }
>> 
>> 
>> 
>>private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
>> 
>> HashingTF hashingTF = new HashingTF()
>> 
>> .setInputCol("words")
>> 
>>

Re: has any one implemented TF_IDF using ML transformers?

2016-01-19 Thread Yanbo Liang
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
> |id |label|words                       |tf                       |idf                                                    |
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
> |0  |0.0  |[Chinese, Beijing, Chinese] |(7,[1,2],[2.0,1.0])      |(7,[1,2],[0.0,0.9162907318741551])                     |
> |1  |0.0  |[Chinese, Chinese, Shanghai]|(7,[1,4],[2.0,1.0])      |(7,[1,4],[0.0,0.9162907318741551])                     |
> |2  |0.0  |[Chinese, Macao]            |(7,[1,6],[1.0,1.0])      |(7,[1,6],[0.0,0.9162907318741551])                     |
> |3  |1.0  |[Tokyo, Japan, Chinese]     |(7,[1,3,5],[1.0,1.0,1.0])|(7,[1,3,5],[0.0,0.9162907318741551,0.9162907318741551])|
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
>
>
> Here is the spark test case
>
>
>  @Test
>
>   public void tfIdf() {
>
> // The tests are to check Java compatibility.
>
> HashingTF tf = new HashingTF();
>
> @SuppressWarnings("unchecked")
>
> JavaRDD<List<String>> documents = sc.parallelize(Arrays.asList(
>
>   Arrays.asList("this is a sentence".split(" ")),
>
>   Arrays.asList("this is another sentence".split(" ")),
>
>   Arrays.asList("this is still a sentence".split(" "))), 2);
>
> JavaRDD<Vector> termFreqs = tf.transform(documents);
>
> termFreqs.collect();
>
> IDF idf = new IDF();
>
> JavaRDD<Vector> tfIdfs = idf.fit(termFreqs).transform(termFreqs);
>
> List<Vector> localTfIdfs = tfIdfs.collect();
>
> int indexOfThis = tf.indexOf("this");
>
> System.err.println("AEDWIP: indexOfThis: " + indexOfThis);
>
>
>
> int indexOfSentence = tf.indexOf("sentence");
>
> System.err.println("AEDWIP: indexOfSentence: " + indexOfSentence);
>
>
> int indexOfAnother = tf.indexOf("another");
>
> System.err.println("AEDWIP: indexOfAnother: " + indexOfAnother);
>
>
> for (Vector v: localTfIdfs) {
>
>     System.err.println("AEDWIP: V.toString() " + v.toString());
>
>   Assert.assertEquals(0.0, v.apply(indexOfThis), 1e-15);
>
> }
>
>   }
>
>
> $ mvn test -DwildcardSuites=none
> -Dtest=org.apache.spark.mllib.feature.JavaTfIdfSuite
>
> AEDWIP: indexOfThis: 413342
>
> AEDWIP: indexOfSentence: 251491
>
> AEDWIP: indexOfAnother: 263939
>
> AEDWIP: V.toString()
> (1048576,[97,3370,251491,413342],[0.28768207245178085,0.0,0.0,0.0])
>
> AEDWIP: V.toString()
> (1048576,[3370,251491,263939,413342],[0.0,0.0,0.6931471805599453,0.0])
>
> AEDWIP: V.toString()
> (1048576,[97,3370,251491,413342,713128],[0.28768207245178085,0.0,0.0,0.0,0.6931471805599453])
>
> Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.908 sec
> - in org.apache.spark.mllib.feature.JavaTfIdfSuite
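[Editor's note] The numbers in the test output above are consistent with idf = log((m + 1) / (d(t) + 1)). With m = 3 documents, "this", "is", and "sentence" appear in all 3 documents, so their IDF (and TF-IDF) is log(4/4) = 0; "a" appears in 2 documents, giving log(4/3); "another" and "still" appear in 1 document each, giving log(4/2). A quick check in plain Java (not Spark code):

```java
public class SuiteValuesCheck {
    public static void main(String[] args) {
        double m = 3.0; // three documents in JavaTfIdfSuite
        // "this", "is", "sentence": present in all 3 docs
        System.out.println(Math.log((m + 1) / (3 + 1))); // 0.0
        // "a": present in 2 of 3 docs
        System.out.println(Math.log((m + 1) / (2 + 1))); // 0.28768207245178085
        // "another", "still": present in 1 of 3 docs
        System.out.println(Math.log((m + 1) / (1 + 1))); // 0.6931471805599453
    }
}
```

These are exactly the values printed by the test, so the suite does match the equation once document frequencies are taken into account.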
>
> From: Yanbo Liang <yblia...@gmail.com>
> Date: Sunday, January 17, 2016 at 12:34 AM
> To: Andrew Davidson <a...@santacruzintegration.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: has any one implemented TF_IDF using ML transformers?
>
> Hi Andy,
>
> Actually, the output of the ML IDF model is the TF-IDF vector of each
> instance rather than the IDF vector, so it is unnecessary to do member-wise
> multiplication to calculate the TF-IDF value yourself. You can refer to the
> code here:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala#L121
> I found the documentation of IDF is not very clear; we need to update it.
>
> Thanks
> Yanbo
>
> 2016-01-16 6:10 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:
>
>> I wonder if I am missing something? TF-IDF is very popular. Spark ML has
>> a lot of transformers, however TF-IDF is not supported directly.
>>
>> Spark provides a HashingTF and an IDF transformer. The Java doc
>> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
>>
>> mentions you can implement TF-IDF as follows:
>>
>> TFIDF(t,d,D) = TF(t,d) * IDF(t,D)
>>
>> The problem I am running into is that both HashingTF and IDF return a
>> sparse vector.
>>
>> *Ideally the Spark code to implement TF-IDF would be one line:*
>>
>> * DataFrame ret = tmp.withColumn("features",
>> tmp.col("tf").multiply(tmp.col("idf")));*
>>

Re: has any one implemented TF_IDF using ML transformers?

2016-01-18 Thread Andy Davidson
@SuppressWarnings("unchecked")

JavaRDD<List<String>> documents = sc.parallelize(Arrays.asList(

  Arrays.asList("this is a sentence".split(" ")),

  Arrays.asList("this is another sentence".split(" ")),

  Arrays.asList("this is still a sentence".split(" "))), 2);

JavaRDD<Vector> termFreqs = tf.transform(documents);

termFreqs.collect();

IDF idf = new IDF();

JavaRDD<Vector> tfIdfs = idf.fit(termFreqs).transform(termFreqs);

List<Vector> localTfIdfs = tfIdfs.collect();

int indexOfThis = tf.indexOf("this");

System.err.println("AEDWIP: indexOfThis: " + indexOfThis);



int indexOfSentence = tf.indexOf("sentence");

System.err.println("AEDWIP: indexOfSentence: " + indexOfSentence);



int indexOfAnother = tf.indexOf("another");

System.err.println("AEDWIP: indexOfAnother: " + indexOfAnother);



for (Vector v: localTfIdfs) {

System.err.println("AEDWIP: V.toString() " + v.toString());

  Assert.assertEquals(0.0, v.apply(indexOfThis), 1e-15);

}

  }



$ mvn test -DwildcardSuites=none
-Dtest=org.apache.spark.mllib.feature.JavaTfIdfSuite


AEDWIP: indexOfThis: 413342

AEDWIP: indexOfSentence: 251491

AEDWIP: indexOfAnother: 263939

AEDWIP: V.toString()
(1048576,[97,3370,251491,413342],[0.28768207245178085,0.0,0.0,0.0])

AEDWIP: V.toString()
(1048576,[3370,251491,263939,413342],[0.0,0.0,0.6931471805599453,0.0])

AEDWIP: V.toString()
(1048576,[97,3370,251491,413342,713128],[0.28768207245178085,0.0,0.0,0.0,0.6931471805599453])

Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.908 sec -
in org.apache.spark.mllib.feature.JavaTfIdfSuite


From:  Yanbo Liang <yblia...@gmail.com>
Date:  Sunday, January 17, 2016 at 12:34 AM
To:  Andrew Davidson <a...@santacruzintegration.com>
Cc:  "user @spark" <user@spark.apache.org>
Subject:  Re: has any one implemented TF_IDF using ML transformers?

> Hi Andy,
> 
> Actually, the output of the ML IDF model is the TF-IDF vector of each
> instance rather than the IDF vector, so it is unnecessary to do member-wise
> multiplication to calculate the TF-IDF value yourself. You can refer to the
> code here:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala#L121
> I found the documentation of IDF is not very clear; we need to update it.
> 
> Thanks
> Yanbo
> 
> 2016-01-16 6:10 GMT+08:00 Andy Davidson <a...@santacruzintegration.com>:
>> I wonder if I am missing something? TF-IDF is very popular. Spark ML has a
>> lot of transformers, however TF-IDF is not supported directly.
>> 
>> Spark provides a HashingTF and an IDF transformer. The Java doc
>> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
>> 
>> mentions you can implement TF-IDF as follows:
>> 
>> TFIDF(t,d,D) = TF(t,d) * IDF(t,D)
>> 
>> The problem I am running into is that both HashingTF and IDF return a
>> sparse vector.
>> 
>> Ideally the Spark code to implement TF-IDF would be one line.
>> 
>>  DataFrame ret = tmp.withColumn("features",
>> tmp.col("tf").multiply(tmp.col("idf")));
>> 
>> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
>> data type mismatch: '(tf * idf)' requires numeric type, not vector;
>> 
>> I could implement my own UDF to do member-wise multiplication, however
>> given how common TF-IDF is I wonder if this code already exists somewhere.
>> 
>> I found org.apache.spark.util.Vector.Multiplier. There is no documentation,
>> however given the argument is a double, my guess is it just does scalar
>> multiplication.
>> 
>> I guess I could do something like
>> 
>> double[] v = mySparkVector.toArray();
>> then use jblas to do member-wise multiplication.
>> 
>> I assume sparse vectors are not distributed so there would not be any
>> additional communication cost.
>> 
>> 
>> If this code is truly missing, I would be happy to write it and donate it.
>> 
>> Andy
>> 
>> 
>> From:  Andrew Davidson <a...@santacruzintegration.com>
>> Date:  Wednesday, January 13, 2016 at 2:52 PM
>> To:  "user @spark" <user@spark.apache.org>
>> Subject:  trouble calculating TF-IDF data type mismatch: '(tf * idf)'
>> requires numeric type, not vector;
>> 
>>> Below is a little snippet of my Java test code. Any idea how I implement
>>> member-wise vector multiplication?
>>> 
>>> Kind regards
>>> 
>>> Andy
>>> 
>>> transformed df printSchema()
>>> 
>>> root
>>> 
>>>  |-- id: integer (nullable = false)

Re: has any one implemented TF_IDF using ML transformers?

2016-01-17 Thread Yanbo Liang
Hi Andy,

Actually, the output of the ML IDF model is the TF-IDF vector of each
instance rather than the IDF vector, so it is unnecessary to do member-wise
multiplication to calculate the TF-IDF value yourself. You can refer to the
code here:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/IDF.scala#L121
I found the documentation of IDF is not very clear; we need to update it.

Thanks
Yanbo

2016-01-16 6:10 GMT+08:00 Andy Davidson :

> I wonder if I am missing something? TF-IDF is very popular. Spark ML has a
> lot of transformers, however TF-IDF is not supported directly.
>
> Spark provides a HashingTF and an IDF transformer. The Java doc
> http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf
>
> mentions you can implement TF-IDF as follows:
>
> TFIDF(t,d,D) = TF(t,d) * IDF(t,D)
>
> The problem I am running into is that both HashingTF and IDF return a
> sparse vector.
>
> *Ideally the Spark code to implement TF-IDF would be one line.*
>
>
> * DataFrame ret = tmp.withColumn("features", 
> tmp.col("tf").multiply(tmp.col("idf")));*
>
> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
> data type mismatch: '(tf * idf)' requires numeric type, not vector;
>
> I could implement my own UDF to do member-wise multiplication, however
> given how common TF-IDF is I wonder if this code already exists somewhere.
>
> I found org.apache.spark.util.Vector.Multiplier. There is no
> documentation, however given the argument is a double, my guess is it
> just does scalar multiplication.
>
> I guess I could do something like
>
> double[] v = mySparkVector.toArray();
> then use jblas to do member-wise multiplication.
>
> I assume sparse vectors are not distributed so there would not be any
> additional communication cost.
>
>
> If this code is truly missing, I would be happy to write it and donate it.
>
> Andy
>
>
> From: Andrew Davidson 
> Date: Wednesday, January 13, 2016 at 2:52 PM
> To: "user @spark" 
> Subject: trouble calculating TF-IDF data type mismatch: '(tf * idf)'
> requires numeric type, not vector;
>
> Below is a little snippet of my Java test code. Any idea how I implement
> member-wise vector multiplication?
>
> Kind regards
>
> Andy
>
> transformed df printSchema()
>
> root
>
>  |-- id: integer (nullable = false)
>
>  |-- label: double (nullable = false)
>
>  |-- words: array (nullable = false)
>
>  ||-- element: string (containsNull = true)
>
>  |-- tf: vector (nullable = true)
>
>  |-- idf: vector (nullable = true)
>
>
>
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
> |id |label|words                       |tf                       |idf                                                    |
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
> |0  |0.0  |[Chinese, Beijing, Chinese] |(7,[1,2],[2.0,1.0])      |(7,[1,2],[0.0,0.9162907318741551])                     |
> |1  |0.0  |[Chinese, Chinese, Shanghai]|(7,[1,4],[2.0,1.0])      |(7,[1,4],[0.0,0.9162907318741551])                     |
> |2  |0.0  |[Chinese, Macao]            |(7,[1,6],[1.0,1.0])      |(7,[1,6],[0.0,0.9162907318741551])                     |
> |3  |1.0  |[Tokyo, Japan, Chinese]     |(7,[1,3,5],[1.0,1.0,1.0])|(7,[1,3,5],[0.0,0.9162907318741551,0.9162907318741551])|
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
>
> @Test
>
> public void test() {
>
> DataFrame rawTrainingDF = createTrainingData();
>
> DataFrame trainingDF = runPipleLineTF_IDF(rawTrainingDF);
>
> . . .
>
> }
>
>private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
>
> HashingTF hashingTF = new HashingTF()
>
> .setInputCol("words")
>
> .setOutputCol("tf")
>
> .setNumFeatures(dictionarySize);
>
>
>
> DataFrame termFrequenceDF = hashingTF.transform(rawDF);
>
>
>
> termFrequenceDF.cache(); // idf needs to make 2 passes over data
> set
>
> IDFModel idf = new IDF()
>
> //.setMinDocFreq(1) // our vocabulary has 6 words
> we hash into 7
>
> .setInputCol(hashingTF.getOutputCol())
>
> .setOutputCol("idf")
>
> .fit(termFrequenceDF);
>
>
> DataFrame tmp = idf.transform(termFrequenceDF);
>
>
>
> DataFrame ret = tmp.withColumn("features", tmp.col("tf").multiply(
> tmp.col("idf")));
>
> logger.warn("\ntransformed df printSchema()");
>
> ret.printSchema();
>
> ret.show(false);
>
>
>
> return ret;
>
> }
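[Editor's note] An aside on the HashingTF step in the pipeline above: HashingTF does not build a dictionary; each term is hashed to a bucket index modulo numFeatures, which is why a 6-word vocabulary can be mapped into 7 buckets, and why distinct words can in principle collide in the same bucket. A rough sketch of the idea, assuming a simple hashCode-based hash (Spark's actual hash function may differ, so the indices below will not match HashingTF's):

```java
public class HashingTrickSketch {
    // Map a term to a non-negative bucket index in [0, numFeatures)
    static int indexOf(String term, int numFeatures) {
        int mod = term.hashCode() % numFeatures;
        return mod < 0 ? mod + numFeatures : mod;
    }

    public static void main(String[] args) {
        int numFeatures = 7;
        String[] vocabulary = {"Chinese", "Beijing", "Shanghai",
                               "Macao", "Tokyo", "Japan"};
        for (String w : vocabulary) {
            System.out.println(w + " -> " + indexOf(w, numFeatures));
        }
    }
}
```

Because of this hashing, the index of a given word in the tf vector depends only on its hash, not on any learned vocabulary, and collisions silently merge the counts of the colliding words.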
>
>
> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
> data type mismatch: '(tf * idf)' 

has any one implemented TF_IDF using ML transformers?

2016-01-15 Thread Andy Davidson
I wonder if I am missing something? TF-IDF is very popular. Spark ML has a
lot of transformers, however TF-IDF is not supported directly.

Spark provides a HashingTF and an IDF transformer. The Java doc
http://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf

mentions you can implement TF-IDF as follows:

TFIDF(t,d,D) = TF(t,d) * IDF(t,D)

The problem I am running into is that both HashingTF and IDF return a
sparse vector.

Ideally the Spark code to implement TF-IDF would be one line.

 DataFrame ret = tmp.withColumn("features",
tmp.col("tf").multiply(tmp.col("idf")));

org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
data type mismatch: '(tf * idf)' requires numeric type, not vector;

I could implement my own UDF to do member-wise multiplication, however given
how common TF-IDF is I wonder if this code already exists somewhere.

I found org.apache.spark.util.Vector.Multiplier. There is no documentation,
however given the argument is a double, my guess is it just does scalar
multiplication.

I guess I could do something like

double[] v = mySparkVector.toArray();
then use jblas to do member-wise multiplication.

I assume sparse vectors are not distributed so there would not be any
additional communication cost.
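[Editor's note] The member-wise multiplication described above is simple to sketch. This is a plain-Java illustration (a hypothetical helper, not an existing Spark API) of the operation that would be wrapped in a UDF; tf.toArray() and idf.toArray() both yield double[]:

```java
import java.util.Arrays;

public class MemberWiseMultiply {
    // Member-wise (Hadamard) product of two equal-length dense arrays,
    // e.g. the results of tf.toArray() and idf.toArray()
    static double[] multiply(double[] tf, double[] idf) {
        if (tf.length != idf.length) {
            throw new IllegalArgumentException("vectors must be the same size");
        }
        double[] out = new double[tf.length];
        for (int i = 0; i < tf.length; i++) {
            out[i] = tf[i] * idf[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // tf and idf values for a toy 3-bucket example
        double[] tf  = {2.0, 1.0, 0.0};
        double[] idf = {0.0, 0.9162907318741551, 0.9162907318741551};
        System.out.println(Arrays.toString(multiply(tf, idf)));
        // [0.0, 0.9162907318741551, 0.0]
    }
}
```

Note, though, that as Yanbo points out elsewhere in this thread, the ML IDF model's output column is already the TF-IDF vector, so this step is not actually needed.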


If this code is truly missing, I would be happy to write it and donate it.

Andy


From:  Andrew Davidson 
Date:  Wednesday, January 13, 2016 at 2:52 PM
To:  "user @spark" 
Subject:  trouble calculating TF-IDF data type mismatch: '(tf * idf)'
requires numeric type, not vector;

> Below is a little snippet of my Java test code. Any idea how I implement
> member-wise vector multiplication?
> 
> Kind regards
> 
> Andy
> 
> transformed df printSchema()
> 
> root
> 
>  |-- id: integer (nullable = false)
> 
>  |-- label: double (nullable = false)
> 
>  |-- words: array (nullable = false)
> 
>  ||-- element: string (containsNull = true)
> 
>  |-- tf: vector (nullable = true)
> 
>  |-- idf: vector (nullable = true)
> 
> 
> 
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
> |id |label|words                       |tf                       |idf                                                    |
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
> |0  |0.0  |[Chinese, Beijing, Chinese] |(7,[1,2],[2.0,1.0])      |(7,[1,2],[0.0,0.9162907318741551])                     |
> |1  |0.0  |[Chinese, Chinese, Shanghai]|(7,[1,4],[2.0,1.0])      |(7,[1,4],[0.0,0.9162907318741551])                     |
> |2  |0.0  |[Chinese, Macao]            |(7,[1,6],[1.0,1.0])      |(7,[1,6],[0.0,0.9162907318741551])                     |
> |3  |1.0  |[Tokyo, Japan, Chinese]     |(7,[1,3,5],[1.0,1.0,1.0])|(7,[1,3,5],[0.0,0.9162907318741551,0.9162907318741551])|
> +---+-----+----------------------------+-------------------------+-------------------------------------------------------+
> 
> 
> @Test
> 
> public void test() {
> 
> DataFrame rawTrainingDF = createTrainingData();
> 
> DataFrame trainingDF = runPipleLineTF_IDF(rawTrainingDF);
> 
> . . .
> 
> }
> 
>private DataFrame runPipleLineTF_IDF(DataFrame rawDF) {
> 
> HashingTF hashingTF = new HashingTF()
> 
> .setInputCol("words")
> 
> .setOutputCol("tf")
> 
> .setNumFeatures(dictionarySize);
> 
> 
> 
> DataFrame termFrequenceDF = hashingTF.transform(rawDF);
> 
> 
> 
> termFrequenceDF.cache(); // idf needs to make 2 passes over data set
> 
> IDFModel idf = new IDF()
> 
> //.setMinDocFreq(1) // our vocabulary has 6 words we
> hash into 7
> 
> .setInputCol(hashingTF.getOutputCol())
> 
> .setOutputCol("idf")
> 
> .fit(termFrequenceDF);
> 
> 
> 
> DataFrame tmp = idf.transform(termFrequenceDF);
> 
> 
> 
> DataFrame ret = tmp.withColumn("features",
> tmp.col("tf").multiply(tmp.col("idf")));
> 
> logger.warn("\ntransformed df printSchema()");
> 
> ret.printSchema();
> 
> ret.show(false);
> 
> 
> 
> return ret;
> 
> }
> 
> 
> 
> org.apache.spark.sql.AnalysisException: cannot resolve '(tf * idf)' due to
> data type mismatch: '(tf * idf)' requires numeric type, not vector;
> 
> 
> 
> 
> 
> private DataFrame createTrainingData() {
> 
> // make sure we only use dictionarySize words
> 
> JavaRDD<Row> rdd = javaSparkContext.parallelize(Arrays.asList(
> 
> // 0 is Chinese
> 
> // 1 is notChinese
> 
> RowFactory.create(0, 0.0, Arrays.asList("Chinese", "Beijing",
> "Chinese")),
> 
> RowFactory.create(1, 0.0, Arrays.asList("Chinese", "Chinese",
> "Shanghai")),
> 
>