Hi Xiangrui,
I am trying to implement TF-IDF as per the instructions you sent in
your response to Jatin.
I am getting an error in the IDF step. Here are my steps; they run until
the last line, where compilation
fails.
val labeledDocs = sc.textFile(title_subcategory)
val stopwords =
Did some digging in the documentation. It looks like IDFModel.transform only
accepts an RDD as input,
and not individual elements. Is this a bug? I ask because
HashingTF.transform accepts both an RDD and individual vector elements as its
input.
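Not a fix, but a workaround until single-vector support exists: wrap the vector in a one-element RDD. A minimal sketch, assuming Spark 1.1's mllib.feature API, an existing SparkContext `sc`, and a training RDD `tfVectors` of term-frequency vectors (both assumed, not from this thread):

```scala
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector

// HashingTF.transform accepts a single document directly...
val hashingTF = new HashingTF()
val tf: Vector = hashingTF.transform(Seq("spark", "mllib", "tfidf"))

// ...but IDFModel.transform only accepts RDD[Vector], so a single
// vector has to go through a one-element RDD and back:
val idfModel = new IDF().fit(tfVectors) // tfVectors: RDD[Vector] (assumed)
val tfidf: Vector = idfModel.transform(sc.parallelize(Seq(tf))).first()
```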
From your post replying to Jatin, it looks like
Thanks Xiangrui and RJ for the responses.
RJ, I have created a JIRA for this; it would be great if you could look
into it. Here is the link to the improvement task:
https://issues.apache.org/jira/browse/SPARK-3614
Let me know if I can be of any help and please keep me posted!
Thanks,
Jatin,
If you file the JIRA and don't want to work on it, I'd be happy to step in
and take a stab at it.
RJ
On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote:
Hi Jatin,
HashingTF should be able to solve the memory problem if you use a
small feature dimension in
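For what it's worth, a minimal sketch of that suggestion, assuming Spark 1.1's HashingTF constructor (which takes the feature dimension; the default is 2^20) and the `labeledDocs` RDD of raw lines from the earlier snippet:

```scala
import org.apache.spark.mllib.feature.HashingTF

// A smaller feature dimension bounds every TF vector's size, trading
// memory for more hash collisions: 2^16 buckets instead of the 2^20 default.
val hashingTF = new HashingTF(numFeatures = 1 << 16)
val tf = labeledDocs.map(line => hashingTF.transform(line.split(" ").toSeq))
```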
Hi,
I have been running into memory overflow issues while creating TF-IDF vectors
to be used for document classification with MLlib's Naive Bayes
implementation.
http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
Memory
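For reference, the end-to-end pipeline this thread discusses (hashed TF, then IDF, then Naive Bayes) can be sketched roughly as below; `docs` and its (label, tokens) shape are an assumption for illustration, not code from the thread:

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// docs: RDD[(Double, Seq[String])] of (label, tokenized document) -- assumed input
val docs: RDD[(Double, Seq[String])] = ???

val hashingTF = new HashingTF(numFeatures = 1 << 16)
val tf = docs.map { case (_, terms) => hashingTF.transform(terms) }
tf.cache() // IDF.fit and IDFModel.transform both traverse tf

val tfidf = new IDF().fit(tf).transform(tf)

// zip is safe here: both RDDs derive from docs with no shuffle in between,
// so partitioning and ordering line up element for element.
val training = docs.map(_._1).zip(tfidf).map { case (label, vec) =>
  LabeledPoint(label, vec)
}
val model = NaiveBayes.train(training)
```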