Re: Using TF-IDF from MLlib

2015-03-16 Thread Shivaram Venkataraman
FWIW the JIRA I was thinking about is
https://issues.apache.org/jira/browse/SPARK-3098


Re: Using TF-IDF from MLlib

2015-03-16 Thread Joseph Bradley
This was brought up again in
https://issues.apache.org/jira/browse/SPARK-6340, so I'll answer one item
that was asked about: the reliability of zipping RDDs. Basically, it should
be reliable, and if it is not, it should be reported as a bug. This general
approach should work (with explicit types to make it clear):

val data: RDD[LabeledPoint] = ...
val labels: RDD[Double] = data.map(_.label)
val features1: RDD[Vector] = data.map(_.features)
val features2: RDD[Vector] = new HashingTF(numFeatures = 100).transform(features1)
val features3: RDD[Vector] = idfModel.transform(features2)
val finalData: RDD[LabeledPoint] =
  labels.zip(features3).map { case (label, features) => LabeledPoint(label, features) }

If you run into problems with zipping like this, please report them!

Thanks,
Joseph




Re: Using TF-IDF from MLlib

2015-03-16 Thread Sean Owen
Dang, I can't seem to find the JIRA now, but I am sure we had a discussion
with Matei about this, and the conclusion was that RDD order is not
guaranteed unless a sort is involved.



Re: Using TF-IDF from MLlib

2015-03-16 Thread Shivaram Venkataraman
I vaguely remember that JIRA and AFAIK Matei's point was that the order is
not guaranteed *after* a shuffle. If you only use operations like map which
preserve partitioning, ordering should be guaranteed from what I know.
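One way to sidestep the ordering question entirely (my suggestion here, not something proposed in the thread) is to key both sides with an explicit index via zipWithIndex and join on that key, instead of relying on zip's positional alignment. A minimal sketch, with plain Scala collections standing in for RDDs (RDDs offer zipWithIndex and pair-RDD join as well):

```scala
// Key labels and features by an explicit index; a join on that key
// is insensitive to element order, unlike a positional zip.
val labels   = Seq(1.0, 0.0, 1.0)
val features = Seq("f0", "f1", "f2")  // stand-ins for feature vectors

val labelsById   = labels.zipWithIndex.map { case (l, i) => i -> l }.toMap
val featuresById = features.zipWithIndex.map { case (f, i) => i -> f }.toMap

// Join on the index key: each label is paired with the feature
// row carrying the same index, regardless of ordering.
val joined: Map[Int, (Double, String)] =
  labelsById.map { case (i, l) => i -> (l, featuresById(i)) }
```

On real RDDs the same shape is labels.zipWithIndex.map(_.swap).join(features.zipWithIndex.map(_.swap)).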






Re: Using TF-IDF from MLlib

2014-12-29 Thread Sean Owen
Given (label, terms) you can just transform the values to a TF vector,
then TF-IDF vector, with HashingTF and IDF / IDFModel. Then you can
make a LabeledPoint from (label, vector) pairs. Is that what you're
looking for?


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Using TF-IDF from MLlib

2014-12-29 Thread andy petrella
Here is what I did for this case: https://github.com/andypetrella/tf-idf





Re: Using TF-IDF from MLlib

2014-12-29 Thread Xiangrui Meng
Hopefully the new pipeline API addresses this problem. We have a code
example here:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/SimpleTextClassificationPipeline.scala

-Xiangrui




Re: Using TF-IDF from MLlib

2014-12-28 Thread Yao
I found the TF-IDF feature extraction, and all the MLlib code that works
with pure Vector RDDs, very difficult to use because there is no way to
associate a vector back to the original data. Why can't Spark MLlib support
LabeledPoint?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Using-TF-IDF-from-MLlib-tp19429p20876.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: Using TF-IDF from MLlib

2014-11-21 Thread Daniel, Ronald (ELS-SDG)
Thanks for the info Andy. A big help.

One thing - I think you can figure out which document is responsible for
which vector without checking in more code. Start with a PairRDD of
[doc_id, doc_string] for each document and split that into one RDD for each
column. The values in the doc_string RDD get split and turned into a Seq
and fed to TFIDF. You can take the resulting RDD[Vector]s and zip them with
the doc_id RDD. Presto!
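That recipe, sketched with plain Scala collections standing in for the RDDs (the names are mine, and a trivial term count stands in for the real HashingTF/IDF transform):

```scala
// Start from (doc_id, doc_string) pairs and split into two aligned columns.
val docs   = Seq("d1" -> "spark and mllib", "d2" -> "tf idf")
val docIds = docs.map(_._1)
val bodies = docs.map(_._2)

// Split each doc_string into a Seq of terms and feed it to the transform.
// Stand-in transform: just the term count per document.
val terms   = bodies.map(_.split(" ").toSeq)
val vectors = terms.map(_.length)

// Zip the resulting "vectors" back with the doc_id column.
val byDoc: Seq[(String, Int)] = docIds.zip(vectors)
```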

Best regards,
Ron





Re: Using TF-IDF from MLlib

2014-11-21 Thread andy petrella
Yeah, I initially used zip, but I was wondering how reliable it is. I mean,
is the order guaranteed? What if some node fails and the data is pulled
from different nodes?
And even if it can work, I find this implicit semantic quite
uncomfortable, don't you?

My 0.2c







Using TF-IDF from MLlib

2014-11-20 Thread Daniel, Ronald (ELS-SDG)
Hi all,

I want to try the TF-IDF functionality in MLlib.
I can feed it words and generate the tf and idf RDD[Vector]s, using the
code below.
But how do I get this back to words and their counts and tf-idf values for
presentation?


val sentsTmp = sqlContext.sql("SELECT text FROM sentenceTable")
val documents: RDD[Seq[String]] = sentsTmp.map(_.toString.split(" ").toSeq)
val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents)
tf.cache()
val idf = new IDF().fit(tf)
val tfidf: RDD[Vector] = idf.transform(tf)

It looks like I can get the indices of the terms using something like

val J = wordListRDD.map(w => hashingTF.indexOf(w))

where wordListRDD is an RDD holding the distinct words from the sequence of
words used to come up with tf.
But how do I do the equivalent of

val counts = J.map(j => tf.counts(j))  ?

Thanks,
Ron



Re: Using TF-IDF from MLlib

2014-11-20 Thread andy petrella
/Someone will correct me if I'm wrong./

Actually, TF-IDF scores terms for a given document, and specifically TF.
Internally, these things hold a Vector (hopefully sparse) representing all
the possible words (up to 2²⁰) per document. So each document, after
applying TF, will be transformed into a Vector. indexOf gives the index
into that Vector.

So you can ask for the frequency of all the terms in *a doc* by looping
over the doc's terms and asking for the value held in the vector at the
position returned by indexOf.

The problem you'll face in this case is that with the current
implementation it's hard to retrieve the document back, 'cause the result
you'll have is only an RDD[Vector]... so which item in your RDD is actually
the document you want?
I faced the same problem (for a demo I did at Devoxx on the Wikipedia
data), hence I've updated the TF-IDF code in a repo to allow it to hold a
reference to the original document:
https://github.com/andypetrella/TF-IDF

If you use this impl (which I need to find some time to integrate into
Spark :-/ ) you can build a pair RDD of (Path, Vector), for instance. Then
this pair RDD can be searched (filter + take) for the doc you need, and
finally you can ask for the frequency (or even, after that, the tfidf
score).
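A tiny pure-Scala sketch of that lookup (this simulates the hashing trick with a small array and plain String.hashCode; the real HashingTF uses a much larger feature space and its own non-negative hash, so the index function here is an illustrative assumption, not MLlib's):

```scala
// Simulated hashing trick: each term hashes to a bucket in a
// fixed-size "TF vector"; frequencies accumulate per bucket.
val numFeatures = 16
def indexOf(term: String): Int = math.abs(term.hashCode % numFeatures)

val doc = Seq("spark", "mllib", "spark")
val tfVector = new Array[Double](numFeatures)
doc.foreach(t => tfVector(indexOf(t)) += 1.0)

// Read a term's frequency back via the same index function.
val sparkFreq = tfVector(indexOf("spark"))  // 2.0
val mllibFreq = tfVector(indexOf("mllib"))  // 1.0
```

Note that hashing can collide: two terms may land in the same bucket, which is the price of not storing a vocabulary.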

HTH

andy




