Re: Using TF-IDF from MLlib
FWIW, the JIRA I was thinking about is https://issues.apache.org/jira/browse/SPARK-3098

On Mon, Mar 16, 2015 at 6:10 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
> I vaguely remember that JIRA, and AFAIK Matei's point was that the order is not guaranteed *after* a shuffle. If you only use operations like map, which preserve partitioning, ordering should be guaranteed, from what I know.
Re: Using TF-IDF from MLlib
This was brought up again in https://issues.apache.org/jira/browse/SPARK-6340, so I'll answer one item that was asked about: the reliability of zipping RDDs. Basically, it should be reliable, and if it is not, then it should be reported as a bug. This general approach should work (with explicit types to make it clear):

  val data: RDD[(Double, Seq[String])] = ...   // (label, terms) pairs
  val labels: RDD[Double] = data.map(_._1)
  val terms: RDD[Seq[String]] = data.map(_._2) // HashingTF.transform consumes term sequences
  val features2: RDD[Vector] = new HashingTF(numFeatures = 100).transform(terms)
  val features3: RDD[Vector] = idfModel.transform(features2)
  val finalData: RDD[LabeledPoint] =
    labels.zip(features3).map { case (label, features) => LabeledPoint(label, features) }

If you run into problems with zipping like this, please report them!

Thanks,
Joseph
Re: Using TF-IDF from MLlib
Dang, I can't seem to find the JIRA now, but I am sure we had a discussion with Matei about this, and the conclusion was that RDD order is not guaranteed unless a sort is involved.
Re: Using TF-IDF from MLlib
I vaguely remember that JIRA, and AFAIK Matei's point was that the order is not guaranteed *after* a shuffle. If you only use operations like map, which preserve partitioning, ordering should be guaranteed, from what I know.

On Mon, Mar 16, 2015 at 6:06 PM, Sean Owen so...@cloudera.com wrote:
> Dang I can't seem to find the JIRA now but I am sure we had a discussion with Matei about this and the conclusion was that RDD order is not guaranteed unless a sort is involved.
Re: Using TF-IDF from MLlib
Given (label, terms) pairs, you can just transform the values to a TF vector, then a TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can make a LabeledPoint from the (label, vector) pairs. Is that what you're looking for?
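The recipe above can be sketched end to end in plain Scala, with Seq standing in for RDD and hand-rolled hashing and smoothed-IDF formulas standing in for MLlib's HashingTF and IDF. All names here are illustrative, not the MLlib API:

```scala
object TfIdfRecipe {
  // Hash a term to one of numFeatures buckets (non-negative),
  // a stand-in for what a hashing TF does.
  def bucket(term: String, numFeatures: Int): Int = {
    val h = term.hashCode % numFeatures
    if (h < 0) h + numFeatures else h
  }

  // Term-frequency vector for one document: dense bucket counts.
  def tf(terms: Seq[String], numFeatures: Int): Array[Double] = {
    val v = Array.fill(numFeatures)(0.0)
    terms.foreach(t => v(bucket(t, numFeatures)) += 1.0)
    v
  }

  // Smoothed inverse document frequency per bucket:
  // log((numDocs + 1) / (docFreq + 1)).
  def idf(tfs: Seq[Array[Double]], numFeatures: Int): Array[Double] = {
    val m = tfs.size.toDouble
    val df = Array.fill(numFeatures)(0.0)
    for (v <- tfs; j <- 0 until numFeatures if v(j) > 0) df(j) += 1.0
    df.map(d => math.log((m + 1.0) / (d + 1.0)))
  }

  // (label, terms) pairs in, (label, tf-idf vector) pairs out:
  // the same shape you would feed into LabeledPoint.
  def labeledTfIdf(data: Seq[(Double, Seq[String])],
                   numFeatures: Int): Seq[(Double, Array[Double])] = {
    val tfs  = data.map { case (_, terms) => tf(terms, numFeatures) }
    val idfs = idf(tfs, numFeatures)
    data.map(_._1).zip(tfs.map(v => v.indices.map(j => v(j) * idfs(j)).toArray))
  }
}
```

One consequence worth noticing: a term appearing in every document gets idf = log((m + 1) / (m + 1)) = 0, so its tf-idf weight vanishes, which is the point of the IDF step.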
Re: Using TF-IDF from MLlib
Here is what I did for this case: https://github.com/andypetrella/tf-idf
Re: Using TF-IDF from MLlib
Hopefully the new pipeline API addresses this problem. We have a code example here: https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala

-Xiangrui
Re: Using TF-IDF from MLlib
I found the TF-IDF feature extraction, and all the MLlib code that works with pure Vector RDDs, very difficult to work with, due to the lack of ability to associate a vector back to the original data. Why can't Spark MLlib support LabeledPoint?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
RE: Using TF-IDF from MLlib
Thanks for the info, Andy. A big help.

One thing - I think you can figure out which document is responsible for which vector without checking in more code. Start with a PairRDD of (doc_id, doc_string) for each document and split that into one RDD for each column. The values in the doc_string RDD get split, turned into a Seq, and fed to TF-IDF. You can then take the resulting RDD[Vector]s and zip them with the doc_id RDD. Presto!

Best regards,
Ron
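The column-splitting idea above can be sketched in plain Scala, with Seq standing in for RDD and a simple term-count map standing in for the TF-IDF vectors (all names are illustrative). The property being relied on is that splitting a pair collection into columns and zipping the transformed values back preserves element alignment, provided nothing reorders either side:

```scala
val docs: Seq[(String, String)] = Seq(
  "doc-1" -> "spark mllib tf idf",
  "doc-2" -> "spark sql"
)

// Split the pair collection into one column per field.
val docIds: Seq[String]      = docs.map(_._1)
val terms:  Seq[Seq[String]] = docs.map(_._2.split(" ").toSeq)

// Stand-in for the HashingTF/IDF transform: per-document term counts.
val vectors: Seq[Map[String, Int]] =
  terms.map(_.groupBy(identity).map { case (t, ts) => t -> ts.size })

// Zip the ids back with the transformed vectors: same order, same length.
val byDoc: Seq[(String, Map[String, Int])] = docIds.zip(vectors)
println(byDoc.map { case (id, v) => s"$id has ${v.size} distinct terms" }.mkString("; "))
// prints: doc-1 has 4 distinct terms; doc-2 has 2 distinct terms
```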
Re: Using TF-IDF from MLlib
Yeah, I initially used zip, but I was wondering how reliable it is. I mean, is the order guaranteed? What if some node fails, and the data is pulled from different nodes? And even if it can work, I find this implicit semantic quite uncomfortable, don't you?

My 0.2c

On Fri, Nov 21, 2014 at 15:26, Daniel, Ronald (ELS-SDG) r.dan...@elsevier.com wrote:
> Thanks for the info Andy. A big help. One thing - I think you can figure out which document is responsible for which vector without checking in more code.
Using TF-IDF from MLlib
Hi all,

I want to try the TF-IDF functionality in MLlib. I can feed it words and generate the tf and idf RDD[Vector]s using the code below. But how do I get this back to words and their counts and tf-idf values, for presentation?

  val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")
  val documents: RDD[Seq[String]] = sentsTmp.map(_.toString.split(" ").toSeq)
  val hashingTF = new HashingTF()
  val tf: RDD[Vector] = hashingTF.transform(documents)
  tf.cache()
  val idf = new IDF().fit(tf)
  val tfidf: RDD[Vector] = idf.transform(tf)

It looks like I can get the indices of the terms using something like

  val J = wordListRDD.map(w => hashingTF.indexOf(w))

where wordListRDD is an RDD holding the distinct words from the sequence of words used to come up with tf. But how do I do the equivalent of Counts = J.map(j => tf.counts(j))?

Thanks,
Ron
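As a sketch of what indexOf gives you: HashingTF maps each term to a bucket via a non-negative hash modulo the feature count, and a document's TF vector holds the term's count at that bucket. This plain-Scala stand-in (illustrative names and hashing scheme, not the MLlib API) shows the count lookup being asked about:

```scala
// Illustrative stand-in for HashingTF.indexOf: non-negative hash mod numFeatures.
def indexOf(term: String, numFeatures: Int = 1 << 20): Int = {
  val h = term.hashCode % numFeatures
  if (h < 0) h + numFeatures else h
}

// Sparse "TF vector" for one document: bucket index -> count.
def tfVector(doc: Seq[String], numFeatures: Int = 1 << 20): Map[Int, Double] =
  doc.groupBy(t => indexOf(t, numFeatures))
     .map { case (i, ts) => i -> ts.size.toDouble }

val doc = Seq("spark", "mllib", "spark")
val v   = tfVector(doc)

// The count for a term is the vector's value at the term's index.
println(v(indexOf("spark")))  // 2.0
```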
Re: Using TF-IDF from MLlib
/Someone will correct me if I'm wrong./

Actually, TF-IDF scores terms for a given document, and specifically so does TF. Internally, these things hold a Vector (hopefully sparse) representing all the possible words (up to 2²⁰) per document. So each document, after applying TF, will be transformed into a Vector. `indexOf` gives the index into that Vector. So you can ask for the frequency of all the terms in *a doc* by looping over the doc's terms and asking for the value held in the vector at the position returned by `indexOf`.

The problem you'll face in this case is that, with the current implementation, it's hard to retrieve the document back, because the result you'll have is only an RDD[Vector]... so which item in your RDD is actually the document you want? I faced the same problem (for a demo I did at Devoxx on the Wikipedia data), hence I've updated in a repo the code of TF-IDF to allow it to hold a reference to the original document: https://github.com/andypetrella/TF-IDF

If you use this impl (which I need to find some time to integrate into Spark :-/), you can build a pair RDD consisting of (Path, Vector), for instance. Then this pair RDD can be searched (filter + take) for the doc you need, finally asking for the freq (or even, afterwards, the tf-idf score).

HTH
andy