Hi, I read this document, http://spark.apache.org/docs/1.2.1/mllib-feature-extraction.html, and tried to build a TF-IDF model of my documents.
I have a list of documents. Each word is represented as an Int, and each document is listed on one line:

    doc_name, int1, int2...
    doc_name, int3, int4...

This is how I load my documents:

    val documents: RDD[Seq[Int]] =
      sc.objectFile[(String, Seq[Int])](s"$sparkStore/documents").map(_._2).cache()

Then I did:

    val hashingTF = new HashingTF()
    val tf: RDD[Vector] = hashingTF.transform(documents)
    val idf = new IDF().fit(tf)
    val tfidf = idf.transform(tf)

I wrote the tfidf result to a text file to try to understand its structure:

    FileUtils.writeLines(new File("tfidf.out"), tfidf.collect().toList.asJavaCollection)

What I get is something like:

    (1048576,[0,4,7,8,10,13,17,21....],[...some float numbers...])
    ...

I think it's a tuple with 3 elements.
- I have no idea what the 1st element is...
- I think the 2nd element is a list of the words
- I think the 3rd element is a list of the tf-idf values of the words in the previous list

Please help me understand this structure.

Thanks,
David
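
To make the three guesses concrete, here is a minimal sketch of one way to read a line of that output. It assumes the line is a (size, indices, values) triple in which the i-th value belongs to the i-th index and any index not listed counts as 0. SparseTriple and SparseTripleDemo are hypothetical names, not MLlib classes, and the numbers are made up. The only firm observation is that 1048576 equals 2^20, which would fit the first element being the length of the vector rather than part of the data.

```scala
// Hypothetical stand-in for one printed line: (size, [indices], [values]).
// This is NOT an MLlib class; it only mirrors the three printed elements.
final case class SparseTriple(size: Int, indices: Array[Int], values: Array[Double]) {
  require(indices.length == values.length, "each index needs exactly one value")

  // Pair every listed index with the weight stored at the same position.
  def pairs: Seq[(Int, Double)] = indices.zip(values).toSeq

  // Weight at index i; indices that are not listed are read as 0.0 (assumption).
  def weightAt(i: Int): Double = {
    val pos = indices.indexOf(i)
    if (pos >= 0) values(pos) else 0.0
  }
}

object SparseTripleDemo {
  // Made-up numbers, only shaped like the output in the question.
  val sample: SparseTriple = SparseTriple(1 << 20, Array(0, 4, 7), Array(0.5, 1.25, 2.0))

  def main(args: Array[String]): Unit = {
    // 1 << 20 is 2^20, the same number as the first printed element.
    println(sample.size == 1048576)
    // Walk the second and third elements in lockstep.
    sample.pairs.foreach { case (i, w) => println(s"index $i -> weight $w") }
    // Index 5 is not listed, so it is read as zero under the assumption above.
    println(sample.weightAt(5))
  }
}
```

If this reading is right, the second element would hold positions in a vector of length 1048576 rather than the words themselves, and whether such a position equals the original Int word id depends on how HashingTF maps a term to an index.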