Hello everyone,
I'm having a usage issue with the HashingTF class from Spark MLlib.
I'm computing TF.IDF on a set of terms/documents, which I later use to
identify the most important terms in each input document.
Below is a short code snippet which outlines the example (2 documents with
2 words each, executed on Spark 2.0).
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val documentsToEvaluate = sc.parallelize(Array(Seq("Mars", "Jupiter"), Seq("Venus", "Mars")))
val hashingTF = new HashingTF()
val tf = hashingTF.transform(documentsToEvaluate)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)
documentsToEvaluate.zip(tfidf).saveAsTextFile("/tmp/tfidf")
The computation yields the following result:
(List(Mars, Jupiter),(1048576,[593437,962819],[0.4054651081081644,0.0]))
(List(Venus, Mars),(1048576,[798918,962819],[0.4054651081081644,0.0]))
My concern is that I can't map the TF.IDF weights back to the original
terms (i.e. Mars : 0.0, Jupiter : 0.4, Venus : 0.4). You may notice that
the weight indices and term positions do not correspond after zipping the
two sequences. I can only obtain hash-to-weight mappings (e.g. 593437 : 0.4).
I know HashingTF uses the hashing trick to compute TF. My question is: is
it possible to retrieve the term/weight mapping, or was HashingTF not
designed to handle this use case? If the latter, what other implementation
of TF.IDF would you recommend?
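One workaround I've been sketching (untested, so please correct me if I'm
misusing the API) is to re-hash each term with HashingTF.indexOf and read
its weight directly out of the TF.IDF vector, accepting that hash
collisions could merge two terms into one index:

    // Untested sketch: recover (term, weight) pairs by re-hashing each term.
    // indexOf applies the same hash function used by transform, so the
    // returned index should point at that term's slot in the vector.
    // Caveat: colliding terms share an index and thus a weight.
    val termWeights = documentsToEvaluate.zip(tfidf).map { case (terms, vector) =>
      terms.map(term => term -> vector(hashingTF.indexOf(term)))
    }
    termWeights.saveAsTextFile("/tmp/termWeights")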
I could continue the computation with the (*hash:weight*) tuples, though
having the initial (*term:weight*) pairs would make debugging much easier
later down the pipeline.
Any response will be greatly appreciated!
Regards, Sergiu Ciumac