Thanks Yanbo!

On Sun, Oct 23, 2016 at 1:57 PM, Yanbo Liang <[email protected]> wrote:
> HashingTF was not designed to handle your case; you can try
> CountVectorizer, which keeps the original terms as the vocabulary
> for retrieval. Note that CountVectorizer computes a global
> term-to-index map, which can be expensive for a large corpus and
> carries a risk of OOM. IDF can accept feature vectors generated by
> either HashingTF or CountVectorizer.
> FYI: http://spark.apache.org/docs/latest/ml-features.html#tf-idf
>
> Thanks
> Yanbo
>
> On Thu, Oct 20, 2016 at 10:00 AM, Ciumac Sergiu <[email protected]> wrote:
>
>> Hello everyone,
>>
>> I'm having a usage issue with the HashingTF class from Spark MLlib.
>>
>> I'm computing TF-IDF on a set of terms/documents, which I later use to
>> identify the most important terms in each input document.
>>
>> Below is a short code snippet which outlines the example (2 documents
>> with 2 words each, executed on Spark 2.0).
>>
>> val documentsToEvaluate = sc.parallelize(Array(Seq("Mars", "Jupiter"), Seq("Venus", "Mars")))
>> val hashingTF = new HashingTF()
>> val tf = hashingTF.transform(documentsToEvaluate)
>> tf.cache()
>> val idf = new IDF().fit(tf)
>> val tfidf: RDD[Vector] = idf.transform(tf)
>> documentsToEvaluate.zip(tfidf).saveAsTextFile("/tmp/tfidf")
>>
>> The computation yields the following result:
>>
>> (List(Mars, Jupiter),(1048576,[593437,962819],[0.4054651081081644,0.0]))
>> (List(Venus, Mars),(1048576,[798918,962819],[0.4054651081081644,0.0]))
>>
>> My concern is that I can't get a mapping from the TF-IDF weights back to
>> the initial terms (i.e. Mars : 0.0, Jupiter : 0.4, Venus : 0.4; you may
>> notice that the weights and term indices do not correspond after zipping
>> the two sequences). I can only identify the hash-to-weight (i.e.
>> 593437 : 0.4) mappings.
>>
>> I know HashingTF uses the hashing trick to compute TF. My question is:
>> is it possible to retrieve the term/weight mapping, or was HashingTF not
>> designed to handle this use case? If the latter, which other
>> implementation of TF-IDF would you recommend?
>>
>> I may continue the computation with the (*hash:weight*) tuple, though
>> getting the initial (*term:weight*) would make debugging much easier
>> later down the pipeline.
>>
>> Any response will be greatly appreciated!
>>
>> Regards,
>> Sergiu Ciumac
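For reference, one way to get a (*term:weight*) mapping while staying with HashingTF is to re-hash each term with `HashingTF.indexOf` and look its weight up in the corresponding TF-IDF vector, since the hashing trick is deterministic. This is a sketch against the snippet above (run in spark-shell, so `sc` is assumed to exist; the name `termWeights` is mine), not code from the thread:

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Same pipeline as in the original snippet.
val documentsToEvaluate = sc.parallelize(Array(Seq("Mars", "Jupiter"), Seq("Venus", "Mars")))
val hashingTF = new HashingTF()
val tf = hashingTF.transform(documentsToEvaluate)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

// indexOf(term) returns the same bucket index the transform used,
// so each term's weight can be read back out of its document's vector.
val termWeights: RDD[Seq[(String, Double)]] =
  documentsToEvaluate.zip(tfidf).map { case (terms, vector) =>
    terms.map(term => term -> vector(hashingTF.indexOf(term)))
  }

termWeights.collect().foreach(println)
// Expected per-term weights, matching the output quoted above:
// Mars -> 0.0 (appears in both documents), Jupiter/Venus -> 0.4054651081081644
```

One caveat: hash collisions can map two distinct terms to the same bucket, in which case both would report the same (summed) weight.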
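And a minimal sketch of Yanbo's CountVectorizer suggestion using the DataFrame-based `spark.ml` API, where the fitted model's `vocabulary` array maps vector indices back to the original terms (column names like "terms" and "tfidf" are my own choices, not from the thread):

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// Same two toy documents as in the original snippet; `spark` is the
// SparkSession available in spark-shell.
val df = spark.createDataFrame(Seq(
  (0, Seq("Mars", "Jupiter")),
  (1, Seq("Venus", "Mars"))
)).toDF("id", "terms")

val cvModel = new CountVectorizer()
  .setInputCol("terms")
  .setOutputCol("tf")
  .fit(df)

val idfModel = new IDF()
  .setInputCol("tf")
  .setOutputCol("tfidf")
  .fit(cvModel.transform(df))

val tfidf = idfModel.transform(cvModel.transform(df))

// cvModel.vocabulary(i) is the term for index i of each tfidf vector,
// so weights can be labeled with their original terms directly.
tfidf.select("terms", "tfidf").show(truncate = false)
```

This trades the fixed memory footprint of the hashing trick for an exact, collision-free index-to-term mapping, with the global-vocabulary cost Yanbo mentions.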
