Re: New API for TFIDF generation in Spark 1.1.0

2014-10-09 Thread nilesh
Hi Xiangrui, I am trying to implement TF-IDF following the instructions you sent in your response to Jatin, but I am getting an error at the IDF step. Here are my steps; everything runs until the last line, where compilation fails.
val labeledDocs = sc.textFile(title_subcategory)
val stopwords =
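[For reference, a minimal sketch of the HashingTF + IDF pipeline against the Spark 1.1.0 MLlib API is below. The input path, record format, tokenization scheme, and stopword list are placeholders for illustration and are not taken from the thread.]

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local[*]", "tfidf-sketch")

    // Assumed input: one document per line in a plain-text file.
    val labeledDocs = sc.textFile("title_subcategory")
    // Assumed stopword list, for illustration only.
    val stopwords = Set("a", "an", "the", "of", "and")

    // Whitespace tokenization with stopword removal (the tokenization scheme is an assumption).
    val docs: RDD[Seq[String]] = labeledDocs.map { line =>
      line.toLowerCase.split("\\s+").filterNot(stopwords.contains).toSeq
    }

    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(docs)
    tf.cache() // reused by both IDF.fit and IDFModel.transform

    val idfModel = new IDF().fit(tf)
    // In 1.1.0, IDFModel.transform expects an RDD[Vector], not a single Vector.
    val tfidf: RDD[Vector] = idfModel.transform(tf)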

Re: New API for TFIDF generation in Spark 1.1.0

2014-10-09 Thread nilesh
I did some digging in the documentation. It looks like IDFModel.transform only accepts an RDD as input, not individual elements. Is this a bug? I ask because HashingTF.transform accepts both an RDD and individual elements as its input. From your post replying to Jatin, it looks like
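[A sketch of the asymmetry being described, against the Spark 1.1.0 API; method availability in later releases may differ, and the wrapping trick at the end is only a workaround, not necessarily the intended usage.]

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local[*]", "idf-transform-sketch")
    val hashingTF = new HashingTF()

    // HashingTF.transform works on a single document...
    val single: Vector = hashingTF.transform(Seq("spark", "tfidf", "example"))
    // ...and on an RDD of documents.
    val docs: RDD[Seq[String]] = sc.parallelize(Seq(Seq("spark", "tfidf"), Seq("mllib")))
    val tf: RDD[Vector] = hashingTF.transform(docs)

    val idfModel = new IDF().fit(tf)
    // IDFModel.transform in 1.1.0 only takes an RDD[Vector], so a lone vector
    // has to be wrapped in a one-element RDD first.
    val scaledSingle: Vector = idfModel.transform(sc.parallelize(Seq(single))).first()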

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-20 Thread jatinpreet
Thanks Xiangrui and RJ for the responses. RJ, I have created a JIRA for this; it would be great if you could look into it. Here is the link to the improvement task: https://issues.apache.org/jira/browse/SPARK-3614 Let me know if I can be of any help and please keep me posted! Thanks,

Re: New API for TFIDF generation in Spark 1.1.0

2014-09-19 Thread RJ Nowling
Jatin, If you file the JIRA and don't want to work on it, I'd be happy to step in and take a stab at it. RJ
On Thu, Sep 18, 2014 at 4:08 PM, Xiangrui Meng men...@gmail.com wrote: Hi Jatin, HashingTF should be able to solve the memory problem if you use a small feature dimension in
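[A minimal sketch of the suggestion quoted above: constrain the hashed feature dimension when constructing HashingTF. The value 20000 and the sample documents are arbitrary illustrative choices, not recommendations from the thread.]

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.feature.HashingTF
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext("local[*]", "hashingtf-dim-sketch")
    val docs: RDD[Seq[String]] = sc.parallelize(Seq(Seq("spark", "mllib"), Seq("tfidf")))

    // Cap the hashed feature dimension; the default (2^20) can be too large
    // when memory is tight.
    val hashingTF = new HashingTF(numFeatures = 20000)
    val tf: RDD[Vector] = hashingTF.transform(docs)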

New API for TFIDF generation in Spark 1.1.0

2014-09-18 Thread jatinpreet
Hi, I have been running into memory overflow issues while creating TF-IDF vectors to be used in document classification with MLlib's Naive Bayes implementation. http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/ Memory
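[A minimal sketch, in the spirit of the linked blog post, of feeding hashed TF-IDF vectors into MLlib's NaiveBayes on Spark 1.1.0. The labels, sample documents, feature dimension, and smoothing parameter are placeholders for illustration only.]

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.mllib.regression.LabeledPoint

    val sc = new SparkContext("local[*]", "tfidf-naivebayes-sketch")

    // (label, tokenized document) pairs; a real job would build these from sc.textFile.
    val data = sc.parallelize(Seq(
      (0.0, Seq("spark", "rdd", "shuffle")),
      (1.0, Seq("naive", "bayes", "classifier"))
    ))

    val hashingTF = new HashingTF(numFeatures = 20000) // a bounded dimension keeps vectors small
    val tf = hashingTF.transform(data.map(_._2))
    tf.cache()

    val idfModel = new IDF().fit(tf)
    val tfidf = idfModel.transform(tf)

    // Zip the labels back with the TF-IDF vectors and train the classifier.
    val training = data.map(_._1).zip(tfidf).map { case (label, features) =>
      LabeledPoint(label, features)
    }
    val model = NaiveBayes.train(training, lambda = 1.0)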