Cosine Similarity Implementation in Spark

2017-01-30 Thread Manish Tripathi
I have a data frame with two columns (id, vector (tf-idf)). The first column is the Id of the document, while the second column is a Vector of tf-idf values. I want to use DIMSUM for cosine similarity, but unfortunately I have Spark 1.x and it looks like these methods are implemented only in
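
For reference, the cosine similarity the question asks about can be computed pairwise without DIMSUM. The sketch below is plain Python, not Spark's API: it assumes each tf-idf vector is held as a hypothetical {index: weight} dict keyed by document id, and it compares every pair exactly, so it illustrates the quantity rather than a scalable replacement for DIMSUM.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as {index: weight} dicts."""
    # Iterate over the smaller vector; only shared indices contribute to the dot product.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b[i] for i, w in a.items() if i in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def all_pairs(docs):
    """Exact similarity for every unordered pair of document ids (quadratic cost)."""
    ids = sorted(docs)
    return {
        (p, q): cosine_similarity(docs[p], docs[q])
        for i, p in enumerate(ids)
        for q in ids[i + 1:]
    }

# Hypothetical toy data: document id -> sparse tf-idf vector.
docs = {
    1: {0: 1.0, 1: 1.0},
    2: {0: 2.0, 1: 2.0},
    3: {2: 5.0},
}
similarities = all_pairs(docs)
```

Documents 1 and 2 point in the same direction, so their similarity is 1 up to rounding, while document 3 shares no terms with either and scores 0.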

Latent Dirichlet Allocation in Spark

2017-02-16 Thread Manish Tripathi
Hi, I am trying to do topic modeling in Spark using Spark's LDA package, with Spark 2.0.2 and the pyspark API. I ran the code below:

    from pyspark.ml.clustering import LDA
    lda = LDA(featuresCol="tf_features", k=10, seed=1, optimizer="online")
    ldaModel = lda.fit(tf_df)

Spark Float to VectorUDT for ML evaluator lib

2016-11-04 Thread Manish Tripathi
Hi, I am trying to run the ML Binary Evaluation Classifier metrics to compare the rating with predicted values and get the AreaROC. My dataframe has two columns: rating, which is an int (I have binarized it), and predictions, which is a float. When I pass it to the ML evaluator method I get an error as

Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
Hi, I ran the ALS model for implicit feedback. Then I used the .transform method of the model to predict the ratings for the original dataset. My dataset is of the form (user, item, rating). I see something like the following: predictions.show(5, truncate=False) Why is the last prediction value

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
..@cloudera.com> wrote: >> >> No, you can't interpret the output as probabilities at all. In particular >> they may be negative. It is not predicting rating but interaction. Negative >> means very strongly not predicted to interact. No, implicit ALS *is*

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
a 0/1 matrix. Most values will be in [0,1], but, it's possible to get > values outside that range. > > On Thu, Dec 15, 2016 at 10:21 PM Manish Tripathi <tr.man...@gmail.com> > wrote: > >> Hi >> >> ran the ALS model for implicit feedback thing. Then I us
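
A small illustration of the point made in this thread: a factorization model scores a (user, item) pair as the dot product of two latent factor vectors, and nothing in a least-squares fit confines that product to [0, 1]. The factor values below are invented for illustration and are not output from any Spark model.

```python
def predict(user_factors, item_factors):
    """Score a (user, item) pair as the dot product of their latent factors."""
    return sum(u * v for u, v in zip(user_factors, item_factors))

# Hypothetical rank-2 factors, chosen by hand for this example.
user = [0.9, -0.4]
items = {
    "a": [1.0, 0.1],
    "b": [0.2, 0.9],
    "c": [1.3, -0.5],
}

scores = {name: predict(user, factors) for name, factors in items.items()}

# The scores are still useful for ranking items for this user,
# even though they are not probabilities.
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these factors, item "a" lands inside [0, 1], item "b" comes out negative, and item "c" exceeds 1, while the ranking of the three items remains well defined.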

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
o, implicit ALS *is* > factoring the 0/1 matrix. > > On Thu, Dec 15, 2016, 23:31 Manish Tripathi <tr.man...@gmail.com> wrote: > >> Ok. So we can kind of interpret the output as probabilities even though >> it is not modeling probabilities. This is to be able to use it fo

Re: Negative values of predictions in ALS.tranform

2016-12-16 Thread Manish Tripathi
Thanks a bunch. That's very helpful. On Friday, December 16, 2016, Sean Owen <so...@cloudera.com> wrote: > That all looks correct. > > On Thu, Dec 15, 2016 at 11:54 PM Manish Tripathi <tr.man...@gmail.com> w

Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
I used Spark's word2vec algorithm to compute document vectors for a text. I then used the findSynonyms function of the model object to get synonyms of a few words. I see something like this: I do not understand why the cosine similarity is being calculated as more than 1. Cosine similarity

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
nvest in improving the docs rather than saying 'this isn't > what I expected'. > > (No, our book isn't a reference for MLlib, more like worked examples) > > On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi <tr.man...@gmail.com> > wrote: > >> I used a word2vec algori

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
e back-ported because the behavior was intended > in 1.x, just wrongly documented, and we don't want to change the behavior > in 1.x. The results are still correctly ordered anyway. > > On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi <tr.man...@gmail.com> > wrote: > >
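
The remark that the results are "still correctly ordered" can be illustrated with a short plain-Python sketch. It assumes, as a guess at the cause rather than a description of Spark's code, that the reported score divides the dot product by the candidate word's norm but not by the query vector's norm. Every score is then the true cosine multiplied by the same positive constant, so values can exceed 1 while the ranking is unchanged. All vectors are made up.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(v):
    return math.sqrt(dot(v, v))

def true_cosine(query, word):
    """Cosine similarity, normalized by both vector norms; always in [-1, 1]."""
    return dot(query, word) / (norm(query) * norm(word))

def partial_score(query, word):
    """Assumed variant: normalized by the candidate word's norm only."""
    return dot(query, word) / norm(word)

# Hypothetical 2-d embeddings; the query vector has norm 5.
query = [3.0, 4.0]
words = {"x": [3.0, 4.0], "y": [4.0, 3.0], "z": [-1.0, 0.5]}

cosines = {w: true_cosine(query, v) for w, v in words.items()}
partials = {w: partial_score(query, v) for w, v in words.items()}

rank_by_cosine = sorted(words, key=cosines.get, reverse=True)
rank_by_partial = sorted(words, key=partials.get, reverse=True)
```

Here the partial score for "x" is about 5 while its true cosine is about 1, yet both rankings list the words in the same order.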