Re: Using TF-IDF from MLlib

2015-03-16 Thread Shivaram Venkataraman
the values to a TF vector, then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can make a LabeledPoint from (label, vector) pairs. Is that what you're looking for?
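
The HashingTF, then IDF, then LabeledPoint route described above could look roughly like the sketch below. It is only an illustration, not code from the thread: the object name, the method name, and the (label, tokens) input shape are made up for the example, and it assumes a Spark version (1.3 or later, as far as I know) in which IDFModel can transform a single Vector.

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed before Spark 1.3
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object TfIdfLabeledPoints {
  // `labeledDocs` is a hypothetical input: one (label, tokens) pair per document.
  def toLabeledPoints(labeledDocs: RDD[(Double, Seq[String])]): RDD[LabeledPoint] = {
    val hashingTF = new HashingTF()

    // Term-frequency vectors, computed per record so each label stays attached.
    val tf = labeledDocs.mapValues(tokens => hashingTF.transform(tokens)).cache()

    // IDF is a corpus-level statistic, so it is fit on the vectors alone.
    val idfModel = new IDF().fit(tf.values)

    // Rescale each TF vector and wrap it with its label.
    // Assumption: IDFModel.transform(v: Vector) is available (Spark 1.3+).
    tf.map { case (label, v) => LabeledPoint(label, idfModel.transform(v)) }
  }
}
```

Because the label travels with its vector through every step, this version never has to re-associate two separate RDDs.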

Re: Using TF-IDF from MLlib

2015-03-16 Thread Shivaram Venkataraman
IDF / IDFModel. Then you can make a LabeledPoint from (label, vector) pairs. Is that what you're looking for? On Mon, Dec 29, 2014 at 3:37 AM, Yao wrote: I found the TF-IDF

Re: Using TF-IDF from MLlib

2015-03-16 Thread Sean Owen
On Mon, Dec 29, 2014 at 3:37 AM, Yao wrote: I found the TF-IDF feature extraction and all the MLlib code that work with pure Vector RDD very difficult to work with due to the lack of ability

Re: Using TF-IDF from MLlib

2015-03-16 Thread Joseph Bradley
ode that work with pure Vector RDD very difficult to work with due to the lack of ability to associate vector back to the original data. Why can't Spark MLlib support LabeledPoi

Re: Using TF-IDF from MLlib

2014-12-29 Thread Xiangrui Meng
ery difficult to work with due to the lack of ability to associate vector back to the original data. Why can't Spark MLlib support LabeledPoint?

Re: Using TF-IDF from MLlib

2014-12-29 Thread andy petrella
ty to associate vector back to the original data. Why can't Spark MLlib support LabeledPoint?

Re: Using TF-IDF from MLlib

2014-12-29 Thread Sean Owen
http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html

Re: Using TF-IDF from MLlib

2014-12-28 Thread Yao
spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html

Re: Using TF-IDF from MLlib

2014-11-21 Thread andy petrella
Yeah, I initially used zip, but I was wondering how reliable it is. I mean, is the order guaranteed? What if some node fails and the data is pulled from different nodes? And even if it can work, I find this implicit semantic quite uncomfortable, don't you? My 0.2c. On Fri, Nov 21, 2014 at 15:26,

RE: Using TF-IDF from MLlib

2014-11-21 Thread Daniel, Ronald (ELS-SDG)
Thanks for the info Andy. A big help. One thing - I think you can figure out which document is responsible for which vector without checking in more code. Start with a PairRDD of [doc_id, doc_string] for each document and split that into one RDD for each column. The values in the doc_string RDD
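
A minimal sketch of the split-and-zip idea described above, for illustration only. The object name, the method name, and the assumption that each doc_string has already been tokenized into a Seq[String] are mine, not from the thread. As far as I know, RDD.zip requires both sides to have the same number of partitions and the same number of elements in each partition; that should hold here because the ids and the vectors are both derived from the same parent RDD by per-element transformations.

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed before Spark 1.3
import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

object TfIdfWithDocIds {
  // `docs` is a hypothetical input: (doc_id, tokens) pairs, one per document.
  def tfidfById(docs: RDD[(String, Seq[String])]): RDD[(String, Vector)] = {
    // Split the pair RDD into its two columns.
    val ids = docs.keys
    val tokens = docs.values

    // TF and TF-IDF are computed on the token column alone.
    val tf = new HashingTF().transform(tokens).cache()
    val tfidf = new IDF().fit(tf).transform(tf)

    // Re-attach ids by position; this relies on the zip precondition noted above.
    ids.zip(tfidf)
  }
}
```

If the implicit ordering feels too fragile, carrying the id alongside the vector in a pair RDD throughout is the more explicit alternative.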

Re: Using TF-IDF from MLlib

2014-11-20 Thread andy petrella
/Someone will correct me if I'm wrong./ Actually, TF-IDF scores terms for a given document, and specifically TF. Internally, these things are holding a Vector (hopefully sparse) representing all the possible words (up to 2²⁰) per document. So each document, after applying TF, will be transformed in
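
To make the term-to-index mapping concrete, here is a small sketch that recovers a weight per word for one document by asking HashingTF for each term's index. The object and method names are hypothetical, and the code assumes the vector returned by HashingTF is normally a SparseVector. Since terms are hashed into a fixed number of buckets, two different words can land on the same index, in which case they would report the same combined weight.

```scala
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.SparseVector

object TermIndexLookup {
  // Returns term -> weight for a single tokenized document.
  def termWeights(hashingTF: HashingTF, tokens: Seq[String]): Map[String, Double] = {
    val vec = hashingTF.transform(tokens)

    // Collect the non-zero entries as index -> weight.
    val weightByIndex: Map[Int, Double] = vec match {
      case sv: SparseVector => sv.indices.zip(sv.values).toMap
      case other =>
        other.toArray.zipWithIndex.collect { case (w, i) if w != 0.0 => i -> w }.toMap
    }

    // Look each distinct term up through the same hash that built the vector.
    tokens.distinct.map(t => t -> weightByIndex.getOrElse(hashingTF.indexOf(t), 0.0)).toMap
  }
}
```

The same lookup should work on a TF-IDF vector, because the IDF step rescales values without moving them to other indices.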

Using TF-IDF from MLlib

2014-11-20 Thread Daniel, Ronald (ELS-SDG)
Hi all, I want to try the TF-IDF functionality in MLlib. I can feed it words and generate the tf and idf RDD[Vector]s, using the code below. But how do I get this back to words and their counts and tf-idf values for presentation? val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")