Thanks Yanbo!

On Sun, Oct 23, 2016 at 1:57 PM, Yanbo Liang <[email protected]> wrote:
> HashingTF was not designed to handle your case; you can try
> CountVectorizer, which keeps the original terms as the vocabulary
> for retrieval. Note that CountVectorizer computes a global
> term-to-index map, which can be expensive for a large corpus and
> carries a risk of OOM. IDF can accept feature vectors generated by
> either HashingTF or CountVectorizer.
> FYI: http://spark.apache.org/docs/latest/ml-features.html#tf-idf
>
> Thanks
> Yanbo
>
> On Thu, Oct 20, 2016 at 10:00 AM, Ciumac Sergiu <[email protected]> wrote:
>
>> Hello everyone,
>>
>> I'm having a usage issue with the HashingTF class from Spark MLlib.
>>
>> I'm computing TF-IDF on a set of terms/documents, which I later use to
>> identify the most important terms in each input document.
>>
>> Below is a short code snippet which outlines the example (2 documents
>> with 2 words each, executed on Spark 2.0).
>>
>> val documentsToEvaluate = sc.parallelize(Array(Seq("Mars", "Jupiter"), Seq("Venus", "Mars")))
>> val hashingTF = new HashingTF()
>> val tf = hashingTF.transform(documentsToEvaluate)
>> tf.cache()
>> val idf = new IDF().fit(tf)
>> val tfidf: RDD[Vector] = idf.transform(tf)
>> documentsToEvaluate.zip(tfidf).saveAsTextFile("/tmp/tfidf")
>>
>> The computation yields the following result:
>>
>> (List(Mars, Jupiter),(1048576,[593437,962819],[0.4054651081081644,0.0]))
>> (List(Venus, Mars),(1048576,[798918,962819],[0.4054651081081644,0.0]))
>>
>> My concern is that I can't get a mapping from the TF-IDF weights back to
>> the initial terms (i.e. Mars : 0.0, Jupiter : 0.4, Venus : 0.4; you may
>> notice that the weights and term indices do not correspond after zipping
>> the two sequences). I can only identify the hash-to-weight (i.e.
>> 593437 : 0.4) mappings.
>>
>> I know HashingTF uses the hashing trick to compute TF. My question is:
>> is it possible to retrieve the term/weight mapping, or was HashingTF not
>> designed to handle this use case? If the latter, which other
>> implementation of TF-IDF would you recommend?
>>
>> I may continue the computation with the (*hash:weight*) tuple, though
>> getting the initial (*term:weight*) would make debugging much easier
>> later down the pipeline.
>>
>> Any response will be greatly appreciated!
>>
>> Regards,
>> Sergiu Ciumac
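For reference, one way to get a (*term:weight*) mapping while staying with HashingTF is to re-hash each term with `HashingTF.indexOf` and look its weight up in the corresponding TF-IDF vector, since the hashing trick is deterministic. This is a sketch against the snippet above (run in spark-shell, so `sc` is assumed to exist; the name `termWeights` is mine), not code from the thread:

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Same pipeline as in the original snippet.
val documentsToEvaluate = sc.parallelize(Array(Seq("Mars", "Jupiter"), Seq("Venus", "Mars")))
val hashingTF = new HashingTF()
val tf = hashingTF.transform(documentsToEvaluate)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

// indexOf(term) returns the same bucket index the transform used,
// so each term's weight can be read back out of its document's vector.
val termWeights: RDD[Seq[(String, Double)]] =
  documentsToEvaluate.zip(tfidf).map { case (terms, vector) =>
    terms.map(term => term -> vector(hashingTF.indexOf(term)))
  }

termWeights.collect().foreach(println)
// Expected per-term weights, matching the output quoted above:
// Mars -> 0.0 (appears in both documents), Jupiter/Venus -> 0.4054651081081644
```

One caveat: hash collisions can map two distinct terms to the same bucket, in which case both would report the same (summed) weight.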
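And a minimal sketch of Yanbo's CountVectorizer suggestion using the DataFrame-based `spark.ml` API, where the fitted model's `vocabulary` array maps vector indices back to the original terms (column names like "terms" and "tfidf" are my own choices, not from the thread):

```scala
import org.apache.spark.ml.feature.{CountVectorizer, IDF}

// Same two toy documents as in the original snippet; `spark` is the
// SparkSession available in spark-shell.
val df = spark.createDataFrame(Seq(
  (0, Seq("Mars", "Jupiter")),
  (1, Seq("Venus", "Mars"))
)).toDF("id", "terms")

val cvModel = new CountVectorizer()
  .setInputCol("terms")
  .setOutputCol("tf")
  .fit(df)

val idfModel = new IDF()
  .setInputCol("tf")
  .setOutputCol("tfidf")
  .fit(cvModel.transform(df))

val tfidf = idfModel.transform(cvModel.transform(df))

// cvModel.vocabulary(i) is the term for index i of each tfidf vector,
// so weights can be labeled with their original terms directly.
tfidf.select("terms", "tfidf").show(truncate = false)
```

This trades the fixed memory footprint of the hashing trick for an exact, collision-free index-to-term mapping, with the global-vocabulary cost Yanbo mentions.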
