Hi Jatin,

HashingTF should be able to solve the memory problem if you use a small feature dimension. Please do not cache the input documents; cache the output from HashingTF and IDF instead. We don't have a label indexer yet, so you need a label-to-index map to convert each document type to a double value, e.g., D1 -> 0.0, D2 -> 1.0, etc. Assuming that the input is an RDD[(label: String, doc: Seq[String])], the code should look like the following:
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint

val docTypeToLabel = Map("D1" -> 0.0, ...)  // one double label per document type
// HashingTF takes a numFeatures argument; a smaller dimension,
// e.g. new HashingTF(1 << 18), reduces memory usage.
val tf = new HashingTF()
val freqs = input.map(x => (docTypeToLabel(x._1), tf.transform(x._2))).cache()
val idf = new IDF()
val idfModel = idf.fit(freqs.values)
val vectors = freqs.map(x => LabeledPoint(x._1, idfModel.transform(x._2)))
val nbModel = NaiveBayes.train(vectors)

IDF doesn't currently provide a filter on minimum document frequency, but it would be nice to add that option. Please create a JIRA and someone may work on it. (A rough workaround is sketched at the end of this message, below the quoted text.)

Best,
Xiangrui

On Thu, Sep 18, 2014 at 3:46 AM, jatinpreet <jatinpr...@gmail.com> wrote:
> Hi,
>
> I have been running into memory overflow issues while creating TF-IDF
> vectors to be used in document classification using MLlib's Naive Bayes
> classification implementation.
>
> http://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
>
> Memory overflow and GC issues occur while collecting IDFs for all the
> terms. To give an idea of scale, I am reading around 615,000 (around 4 GB
> of text data) small-sized documents from HBase and running the Spark
> program with 8 cores and 6 GB of executor memory. I have tried increasing
> the parallelism level and shuffle memory fraction but to no avail.
>
> The new TF-IDF generation APIs caught my eye in the latest Spark version,
> 1.1.0. The example given in the official documentation mentions creation
> of TF-IDF vectors based on the hashing trick. I want to know if it will
> solve the mentioned problem by benefiting from reduced memory consumption.
>
> Also, the example does not state how to create labeled points for a corpus
> of pre-classified document data. For example, my training input looks
> something like this:
>
> DocumentType | Content
> -----------------------------------------------------------------
> D1 | This is Doc1 sample.
> D1 | This also belongs to Doc1.
> D1 | Yet another Doc1 sample.
> D2 | Doc2 sample.
> D2 | Sample content for Doc2.
> D3 | The only sample for Doc3.
> D4 | Doc4 sample looks like this.
> D4 | This is Doc4 sample content.
>
> I want to create labeled points from this sample data for training. And
> once the Naive Bayes model is created, I generate TF-IDFs for the test
> documents and predict the document type.
>
> If the new API can solve my issue, how can I generate labeled points using
> the new APIs? An example would be great.
>
> Also, I have a special requirement of ignoring terms that occur in fewer
> than two documents. This has important implications for the accuracy of my
> use case and needs to be accommodated while generating TF-IDFs.
>
> Thanks,
> Jatin
>
> -----
> Novice Big Data Programmer
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/New-API-for-TFIDF-generation-in-Spark-1-1-0-tp14543.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
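Regarding the minimum document frequency requirement: until that option exists in IDF, a rough workaround is to compute document frequencies yourself and drop the rare terms before hashing. The sketch below is only that, a sketch; the helper name filterRareTerms is a placeholder, input is the same RDD[(String, Seq[String])] as above, and collecting the surviving vocabulary to the driver assumes it fits in memory (otherwise you would join by term instead):

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Keep only terms that appear in at least `minDocFreq` distinct documents.
def filterRareTerms(
    input: RDD[(String, Seq[String])],
    minDocFreq: Int): RDD[(String, Seq[String])] = {
  // Document frequency: count each term at most once per document.
  val df = input.flatMap { case (_, doc) => doc.toSet }
                .map(term => (term, 1))
                .reduceByKey(_ + _)
  // Surviving vocabulary, broadcast to the executors.
  val kept = input.context.broadcast(
    df.filter { case (_, count) => count >= minDocFreq }.keys.collect().toSet)
  input.mapValues(doc => doc.filter(term => kept.value.contains(term)))
       .filter { case (_, doc) => doc.nonEmpty }  // drop documents left empty
}

val filteredInput = filterRareTerms(input, minDocFreq = 2)

You could then run the snippet above on filteredInput instead of input. For prediction, transform each test document with the same tf and idfModel (tf.transform followed by idfModel.transform) and pass the result to nbModel.predict.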