Cosine Similarity Implementation in Spark

2017-01-30 Thread Manish Tripathi
I have a data frame which has two columns (id, vector (tf-idf)). The first
column signifies the Id of the document while the second column is a
Vector(tf-idf) values.

I want to use DIMSUM for cosine similarity but unfortunately I have Spark
1.x and looks like these methods are implemented only in Spark 2.x onwards
and hence the corresponding cosineSimilarity method for RowMatrix is not
there.

So I thought maybe I can use the cosineSimilarity method of
IndexedRowMatrix object as I see a corresponding cosine similarity method
for IndexedRowMatrix docs.

So here the couple of questions on the same.

1). So how do I first convert my spark data frame to IndexedRowMatrix
format?

2) Does cosine similarity method in IndexedRowMatrix also uses DIMSUM as
cosineSimilarity method of RowMatrix?

3). In RowMatrix, if I use Scala then I do have access to cosine similarity
method there. However , it gives a matrix of similarities with no row
indices (since RowMatrix is a index less matrix). So how do I infer the
cosine similarity of each doc id with other from the output of RowMatrix?

Please advise.

Link to docs on IndexedRowMatrix.

http://spark.apache.org/docs/latest/api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix.columnSimilarities
ᐧ


Latent Dirichlet Allocation in Spark

2017-02-16 Thread Manish Tripathi
Hi

I am trying to do topic modeling in Spark using Spark's LDA package. Using
Spark 2.0.2 and pyspark API.

I ran the code as below:

*from pyspark.ml.clustering import LDA*
*lda = LDA(featuresCol="tf_features",k=10, seed=1, optimizer="online")*
*ldaModel=lda.fit(tf_df)*

*lda_df=ldaModel.transform(tf_df)*

I went through the docs to understand the output (the form of data) Spark
generates for LDA.

I understand the ldaModel.describeTopics() method. Gives topics with list
of terms and weights.

But I am not sure I understand the method ldamodel.topicsMatrix().

It gives me this:


​

if the doc says it is the distribution of words for each topic (1184 words
as rows, 10 topics as columns and the values of these cells. But then these
values are not probabilities which is what one would expect for topic-word
distribution.

These have random values more than 1 (132.76, 3.00 and so on).

Any jdea on this?

Thanks
ᐧ


Spark Float to VectorUDT for ML evaluator lib

2016-11-04 Thread Manish Tripathi
Hi

I am trying to run the ML Binary Evaluation Classifier metrics to compare
the rating with predicted values and get the AreaROC.

My dataframe has two columns with rating as int (I have binarized it) and
predicitions which is a float.

When I pass it to the ML evaluator method I get an error as shown below:

Can someone help me with gettng this sorted out?. Appreciate all the help

Stackoverflow post:

http://stackoverflow.com/questions/40408898/converting-the-float-column-in-spark-dataframe-to-vectorudt


I was trying to use the pyspark.ml.evaluation Binary classification metric
like below

evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")

print evaluator.evaluate(predictions)

My Predictions data frame looks like this:

predictions.select('rating','prediction')

predictions.show()

+--++

|rating|  prediction|

+--++

| 1|  0.14829934|

| 1|-0.017862909|

| 1|   0.4951505|

| 1|0.0074382657|

| 1|-0.002562912|

| 1|   0.0208337|

| 1| 0.049362548|

| 1|  0.0969|

| 1|  0.17998546|

| 1| 0.019649783|

| 1| 0.031353004|

| 1|  0.03657037|

| 1|  0.23280995|

| 1| 0.033190556|

| 1|  0.35569906|

| 1| 0.030974165|

| 1|   0.1422375|

| 1|  0.19786166|

| 1|  0.07740938|

| 1|  0.33970386|

+--++

only showing top 20 rows

The datatype of each column is as follows:

predictions.printSchema()

root

 |-- rating: integer (nullable = true)

 |-- prediction: float (nullable = true)

Now I get an error with above Ml code saying prediction column is Float and
expected a VectorUDT.

/Users/i854319/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in
__call__(self, *args)

811 answer = self.gateway_client.send_command(command)

812 return_value = get_return_value(

--> 813 answer, self.gateway_client, self.target_id, self.name)

814

815 for temp_arg in temp_args:


/Users/i854319/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)

 51 raise AnalysisException(s.split(': ', 1)[1],
stackTrace)

 52 if s.startswith('java.lang.IllegalArgumentException: '):

---> 53 raise IllegalArgumentException(s.split(': ', 1)[1],
stackTrace)

 54 raise

 55 return deco


IllegalArgumentException: u'requirement failed: Column prediction must be
of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually
FloatType.'

So I thought of converting the predictions column from float to VectorUDT
as below:

*Applying the schema to the dataframe to convert the float column type to
VectorUDT*

from pyspark.sql.types import IntegerType, StructType,StructField


schema = StructType([

StructField("rating", IntegerType, True),

StructField("prediction", VectorUDT(), True)

])



predictions_dtype=sqlContext.createDataFrame(prediction,schema)

But Now I get this error.

---

AssertionErrorTraceback (most recent call last)

 in ()

  4

  5 schema = StructType([

> 6 StructField("rating", IntegerType, True),

  7 StructField("prediction", VectorUDT(), True)

  8 ])


/Users/i854319/spark/python/pyspark/sql/types.pyc in __init__(self, name,
dataType, nullable, metadata)

401 False

402 """

--> 403 assert isinstance(dataType, DataType), "dataType should be
DataType"

404 if not isinstance(name, str):

405 name = name.encode('utf-8')


AssertionError: dataType should be DataType

It takes so much time to run an ml algo in spark libraries with so many
weird errors. Even I tried Mllib with RDD data. That is giving the
ValueError: Null pointer exception.



ᐧ


Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
Hi

ran the ALS model for implicit feedback thing. Then I used the .transform
method of the model to predict the ratings for the original dataset. My
dataset is of the form (user,item,rating)

I see something like below:

predictions.show(5,truncate=False)


Why is the last prediction value negative ?. Isn't the transform method
giving the prediction(probability) of seeing the rating as 1?. I had counts
data for rating (implicit feedback) and for validation dataset I binarized
the rating (1 if >0 else 0). My training data has rating positive (it's
basically the count of views to a video).

I used following to train:

* als = ALS(rank=x, maxIter=15, regParam=y, implicitPrefs=True,alpha=40.0)*

*model=als.fit(self.train)*

What does negative prediction mean here and is it ok to have that?
ᐧ


Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
ok. Thanks. So here is what I understood.

Input data to Als.fit(implicitPrefs=True) is the actual strengths (count
data). So if I have a matrix of (user,item,views/purchases) I pass that as
the input and not the binarized one (preference). This signifies the
strength.

2) Since we also pass the alpha parameter to this Als.fit() method, Spark
internally creates the confidence matrix +1+alpha*input_data or some other
alpha factor.

3). The output which it gives is basically a factorization of 0/1 matrix
(binarized matrix from initial input data), hence the output also resembles
the preference matrix (0/1) suggesting the interaction. So typically it
should be between 0-1but if it is negative it means very less
preference/interaction

*Does all the above sound correct?.*

If yes, then one last question-

1). *For explicit dataset where we don't use implicitPref=True,* the
predicted ratings would be actual ratings like it can be 2.3,4.5 etc and
not the interaction measure. That is because in explicit we are not using
the confidence matrix and preference matrix concept and use the actual
rating data. So any output from Spark ALS for explicit data would be a
rating prediction.
ᐧ

On Thu, Dec 15, 2016 at 3:46 PM, Sean Owen <so...@cloudera.com> wrote:

> No, input are weights or strengths. The output is a factorization of the
> binarization of that to 0/1, not probs or a factorization of the input.
> This explains the range of the output.
>
>
> On Thu, Dec 15, 2016, 23:43 Manish Tripathi <tr.man...@gmail.com> wrote:
>
>> when you say *implicit ALS *is* factoring the 0/1 matrix. , are you
>> saying for implicit feedback algorithm we need to pass the input data as
>> the preference matrix i.e a matrix of 0 and 1?. *
>>
>> Then how will they calculate the confidence matrix which is basically
>> =1+alpha*count matrix. If we don't pass the actual count of values (views
>> etc) then how does Spark calculates the confidence matrix?.
>>
>> I was of the understanding that input data for als.fit(implicitPref=True)
>> is the actual count matrix of the views/purchases?. Am I going wrong here
>> if yes, then how is Spark calculating the confidence matrix if it doesn't
>> have the actual count data.
>>
>> The original paper on which Spark algo is based needs the actual count
>> data to create a confidence matrix and also needs the 0/1 matrix since the
>> objective functions uses both the confidence matrix and 0/1 matrix to find
>> the user and item factors.
>> ᐧ
>>
>> On Thu, Dec 15, 2016 at 3:38 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> No, you can't interpret the output as probabilities at all. In particular
>> they may be negative. It is not predicting rating but interaction. Negative
>> means very strongly not predicted to interact. No, implicit ALS *is*
>> factoring the 0/1 matrix.
>>
>> On Thu, Dec 15, 2016, 23:31 Manish Tripathi <tr.man...@gmail.com> wrote:
>>
>> Ok. So we can kind of interpret the output as probabilities even though
>> it is not modeling probabilities. This is to be able to use it for
>> binaryclassification evaluator.
>>
>> So the way I understand is and as per the algo, the predicted matrix is
>> basically a dot product of user factor and item factor matrix.
>>
>> but in what circumstances the ratings predicted can be negative. I can
>> understand if the individual user factor vector and item factor vector is
>> having negative factor terms, then it can be negative. But practically does
>> negative make any sense? AS per algorithm the dot product is the predicted
>> rating. So rating shouldnt be negative for it to make any sense. Also
>> rating just between 0-1 is normalised rating? Typically rating we expect to
>> be like any real value 2.3,4.5 etc.
>>
>> Also please note, for implicit feedback ALS, we don't feed 0/1 matrix. We
>> feed the count matrix (discrete count values) and am assuming spark
>> internally converts it into a preference matrix (1/0) and a confidence
>> matrix =1+alpha*count_matrix
>>
>>
>>
>>
>> ᐧ
>>
>> On Thu, Dec 15, 2016 at 2:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> No, ALS is not modeling probabilities. The outputs are reconstructions of
>> a 0/1 matrix. Most values will be in [0,1], but, it's possible to get
>> values outside that range.
>>
>> On Thu, Dec 15, 2016 at 10:21 PM Manish Tripathi <tr.man...@gmail.com>
>> wrote:
>>
>> Hi
>>
>> ran the ALS model for implicit feedback thing. Then I used the .transform
>> method of the model to predict the ratings for the original dataset. My
>> dataset is of

Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
Ok. So we can kind of interpret the output as probabilities even though it
is not modeling probabilities. This is to be able to use it for
binaryclassification evaluator.

So the way I understand is and as per the algo, the predicted matrix is
basically a dot product of user factor and item factor matrix.

but in what circumstances the ratings predicted can be negative. I can
understand if the individual user factor vector and item factor vector is
having negative factor terms, then it can be negative. But practically does
negative make any sense? AS per algorithm the dot product is the predicted
rating. So rating shouldnt be negative for it to make any sense. Also
rating just between 0-1 is normalised rating? Typically rating we expect to
be like any real value 2.3,4.5 etc.

Also please note, for implicit feedback ALS, we don't feed 0/1 matrix. We
feed the count matrix (discrete count values) and am assuming spark
internally converts it into a preference matrix (1/0) and a confidence
matrix =1+alpha*count_matrix




ᐧ

On Thu, Dec 15, 2016 at 2:56 PM, Sean Owen <so...@cloudera.com> wrote:

> No, ALS is not modeling probabilities. The outputs are reconstructions of
> a 0/1 matrix. Most values will be in [0,1], but, it's possible to get
> values outside that range.
>
> On Thu, Dec 15, 2016 at 10:21 PM Manish Tripathi <tr.man...@gmail.com>
> wrote:
>
>> Hi
>>
>> ran the ALS model for implicit feedback thing. Then I used the .transform
>> method of the model to predict the ratings for the original dataset. My
>> dataset is of the form (user,item,rating)
>>
>> I see something like below:
>>
>> predictions.show(5,truncate=False)
>>
>>
>> Why is the last prediction value negative ?. Isn't the transform method
>> giving the prediction(probability) of seeing the rating as 1?. I had counts
>> data for rating (implicit feedback) and for validation dataset I binarized
>> the rating (1 if >0 else 0). My training data has rating positive (it's
>> basically the count of views to a video).
>>
>> I used following to train:
>>
>> * als = ALS(rank=x, maxIter=15, regParam=y,
>> implicitPrefs=True,alpha=40.0)*
>>
>> *model=als.fit(self.train)*
>>
>> What does negative prediction mean here and is it ok to have that?
>> ᐧ
>>
>


Re: Negative values of predictions in ALS.tranform

2016-12-15 Thread Manish Tripathi
when you say *implicit ALS *is* factoring the 0/1 matrix. , are you saying
for implicit feedback algorithm we need to pass the input data as the
preference matrix i.e a matrix of 0 and 1?. *

Then how will they calculate the confidence matrix which is basically
=1+alpha*count matrix. If we don't pass the actual count of values (views
etc) then how does Spark calculates the confidence matrix?.

I was of the understanding that input data for als.fit(implicitPref=True)
is the actual count matrix of the views/purchases?. Am I going wrong here
if yes, then how is Spark calculating the confidence matrix if it doesn't
have the actual count data.

The original paper on which Spark algo is based needs the actual count data
to create a confidence matrix and also needs the 0/1 matrix since the
objective functions uses both the confidence matrix and 0/1 matrix to find
the user and item factors.
ᐧ

On Thu, Dec 15, 2016 at 3:38 PM, Sean Owen <so...@cloudera.com> wrote:

> No, you can't interpret the output as probabilities at all. In particular
> they may be negative. It is not predicting rating but interaction. Negative
> means very strongly not predicted to interact. No, implicit ALS *is*
> factoring the 0/1 matrix.
>
> On Thu, Dec 15, 2016, 23:31 Manish Tripathi <tr.man...@gmail.com> wrote:
>
>> Ok. So we can kind of interpret the output as probabilities even though
>> it is not modeling probabilities. This is to be able to use it for
>> binaryclassification evaluator.
>>
>> So the way I understand is and as per the algo, the predicted matrix is
>> basically a dot product of user factor and item factor matrix.
>>
>> but in what circumstances the ratings predicted can be negative. I can
>> understand if the individual user factor vector and item factor vector is
>> having negative factor terms, then it can be negative. But practically does
>> negative make any sense? AS per algorithm the dot product is the predicted
>> rating. So rating shouldnt be negative for it to make any sense. Also
>> rating just between 0-1 is normalised rating? Typically rating we expect to
>> be like any real value 2.3,4.5 etc.
>>
>> Also please note, for implicit feedback ALS, we don't feed 0/1 matrix. We
>> feed the count matrix (discrete count values) and am assuming spark
>> internally converts it into a preference matrix (1/0) and a confidence
>> matrix =1+alpha*count_matrix
>>
>>
>>
>>
>> ᐧ
>>
>> On Thu, Dec 15, 2016 at 2:56 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> No, ALS is not modeling probabilities. The outputs are reconstructions of
>> a 0/1 matrix. Most values will be in [0,1], but, it's possible to get
>> values outside that range.
>>
>> On Thu, Dec 15, 2016 at 10:21 PM Manish Tripathi <tr.man...@gmail.com>
>> wrote:
>>
>> Hi
>>
>> ran the ALS model for implicit feedback thing. Then I used the .transform
>> method of the model to predict the ratings for the original dataset. My
>> dataset is of the form (user,item,rating)
>>
>> I see something like below:
>>
>> predictions.show(5,truncate=False)
>>
>>
>> Why is the last prediction value negative ?. Isn't the transform method
>> giving the prediction(probability) of seeing the rating as 1?. I had counts
>> data for rating (implicit feedback) and for validation dataset I binarized
>> the rating (1 if >0 else 0). My training data has rating positive (it's
>> basically the count of views to a video).
>>
>> I used following to train:
>>
>> * als = ALS(rank=x, maxIter=15, regParam=y,
>> implicitPrefs=True,alpha=40.0)*
>>
>> *model=als.fit(self.train)*
>>
>> What does negative prediction mean here and is it ok to have that?
>> ᐧ
>>
>>
>>


Re: Negative values of predictions in ALS.tranform

2016-12-16 Thread Manish Tripathi
Thanks a bunch. That's very helpful.

On Friday, December 16, 2016, Sean Owen <so...@cloudera.com> wrote:

> That all looks correct.
>
> On Thu, Dec 15, 2016 at 11:54 PM Manish Tripathi <tr.man...@gmail.com
> <javascript:_e(%7B%7D,'cvml','tr.man...@gmail.com');>> wrote:
>
>> ok. Thanks. So here is what I understood.
>>
>> Input data to Als.fit(implicitPrefs=True) is the actual strengths (count
>> data). So if I have a matrix of (user,item,views/purchases) I pass that as
>> the input and not the binarized one (preference). This signifies the
>> strength.
>>
>> 2) Since we also pass the alpha parameter to this Als.fit() method, Spark
>> internally creates the confidence matrix +1+alpha*input_data or some other
>> alpha factor.
>>
>> 3). The output which it gives is basically a factorization of 0/1 matrix
>> (binarized matrix from initial input data), hence the output also resembles
>> the preference matrix (0/1) suggesting the interaction. So typically it
>> should be between 0-1but if it is negative it means very less
>> preference/interaction
>>
>> *Does all the above sound correct?.*
>>
>> If yes, then one last question-
>>
>> 1). *For explicit dataset where we don't use implicitPref=True,* the
>> predicted ratings would be actual ratings like it can be 2.3,4.5 etc and
>> not the interaction measure. That is because in explicit we are not using
>> the confidence matrix and preference matrix concept and use the actual
>> rating data. So any output from Spark ALS for explicit data would be a
>> rating prediction.
>> ᐧ
>>
>> On Thu, Dec 15, 2016 at 3:46 PM, Sean Owen <so...@cloudera.com
>> <javascript:_e(%7B%7D,'cvml','so...@cloudera.com');>> wrote:
>>
>>> No, input are weights or strengths. The output is a factorization of the
>>> binarization of that to 0/1, not probs or a factorization of the input.
>>> This explains the range of the output.
>>>
>>>
>>> On Thu, Dec 15, 2016, 23:43 Manish Tripathi <tr.man...@gmail.com
>>> <javascript:_e(%7B%7D,'cvml','tr.man...@gmail.com');>> wrote:
>>>
>>>> when you say *implicit ALS *is* factoring the 0/1 matrix. , are you
>>>> saying for implicit feedback algorithm we need to pass the input data as
>>>> the preference matrix i.e a matrix of 0 and 1?. *
>>>>
>>>> Then how will they calculate the confidence matrix which is basically
>>>> =1+alpha*count matrix. If we don't pass the actual count of values (views
>>>> etc) then how does Spark calculates the confidence matrix?.
>>>>
>>>> I was of the understanding that input data for
>>>> als.fit(implicitPref=True) is the actual count matrix of the
>>>> views/purchases?. Am I going wrong here if yes, then how is Spark
>>>> calculating the confidence matrix if it doesn't have the actual count data.
>>>>
>>>> The original paper on which Spark algo is based needs the actual count
>>>> data to create a confidence matrix and also needs the 0/1 matrix since the
>>>> objective functions uses both the confidence matrix and 0/1 matrix to find
>>>> the user and item factors.
>>>> ᐧ
>>>>
>>>> On Thu, Dec 15, 2016 at 3:38 PM, Sean Owen <so...@cloudera.com
>>>> <javascript:_e(%7B%7D,'cvml','so...@cloudera.com');>> wrote:
>>>>
>>>>> No, you can't interpret the output as probabilities at all. In
>>>>> particular they may be negative. It is not predicting rating but
>>>>> interaction. Negative means very strongly not predicted to interact. No,
>>>>> implicit ALS *is* factoring the 0/1 matrix.
>>>>>
>>>>> On Thu, Dec 15, 2016, 23:31 Manish Tripathi <tr.man...@gmail.com
>>>>> <javascript:_e(%7B%7D,'cvml','tr.man...@gmail.com');>> wrote:
>>>>>
>>>>>> Ok. So we can kind of interpret the output as probabilities even
>>>>>> though it is not modeling probabilities. This is to be able to use it for
>>>>>> binaryclassification evaluator.
>>>>>>
>>>>>> So the way I understand is and as per the algo, the predicted matrix
>>>>>> is basically a dot product of user factor and item factor matrix.
>>>>>>
>>>>>> but in what circumstances the ratings predicted can be negative. I
>>>>>> can understand if the individual user factor vector and item factor 
>>>>>> v

Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
I used a word2vec algorithm of spark to compute documents vector of a text.

I then used the findSynonyms function of the model object to get synonyms
of few words.

I see something like this:


​

I do not understand why the cosine similarity is being calculated as more
than 1. Cosine similarity should be between 0 and 1 or max -1 and +1
(taking negative angles).

Why it is more than 1 here? What's going wrong here?.

Please note, normalization of the vectors should not be changing the cosine
similarity values since the formula remains the same. If you normalise it's
just a dot product then, if you don't it's dot product/ (normA)*(normB).

I am facing lot of issues with respect to understanding or interpreting the
output of Spark's ml algos. The documentation is not very clear and there
is hardly anything mentioned with respect to how and what is being
returned.

For ex. word2vec algorithm is to convert word to vector form. So I would
expect .transform method would give me vector of each word in the text.

However .transform basically returns doc2vec (averages all word vectors of
a text). This is confusing since nothing of this is mentioned in the docs
and I keep thinking why I have only one word vector instead of word vectors
for all words.

I do understand by returning doc2vec it is helpful since now one doesn't
have to average out each word vector for the whole text. But the docs don't
help or explicitly say that.

This ends up wasting lot of time in just figuring out what is being
returned from an algorithm from Spark.

Does someone have a better solution for this?

I have read the Spark book. That is not about Mllib. I am not sure if
Sean's book would cover all the documentation aspect better than what we
have currently on the docs page.

Thanks



ᐧ


Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
Sean,

Thanks for answer. I am using Spark 1.6 so are you saying the output I am
getting is cos(A,B)=dot(A,B)/norm(A) ?

My point with respect to normalization was that if you normalise or don't
normalize both vectors A,B, the output would be same. Since if I normalize
A and B, then

Cos(A,B)= dot(A,B)/norm(A)*norm(B). since norm=1 it is just dot(A,B). If we
don't normalize it would have a norm in the denominator so output is same.

But I understand you are saying in Spark 1.x, one vector was not
normalized. If that is the case then it makes sense.

Any idea how to fix this (get the right cosine similarity) in Spark 1.x? .
If the input word in findSynonyms is not normalized while calculating
cosine, then doing w2vmodel.transform(input_word) to get a vector
representation and then diving the current result by the norm of this
vector should be correct?

Also, I am very open to editing the docs on things I find not properly
documented or wrong, but I need to know if that is allowed (is it like a
Wiki)?.
ᐧ

On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen <so...@cloudera.com> wrote:

> It should be the cosine similarity, yes. I think this is what was fixed in
> https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was
> really just outputting the 'unnormalized' similarity (dot / norm(a) only)
> but the docs said cosine similarity. Now it's cosine similarity in Spark 2.
> The normalization most certainly matters here, and it's the opposite:
> dividing the dot by vec norms gives you the cosine.
>
> Although docs can always be better (and here was a case where it was
> wrong) all of this comes with javadoc and examples. Right now at least,
> .transform() describes the operation as you do, so it is documented. I'd
> propose you invest in improving the docs rather than saying 'this isn't
> what I expected'.
>
> (No, our book isn't a reference for MLlib, more like worked examples)
>
> On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi <tr.man...@gmail.com>
> wrote:
>
>> I used a word2vec algorithm of spark to compute documents vector of a
>> text.
>>
>> I then used the findSynonyms function of the model object to get
>> synonyms of few words.
>>
>> I see something like this:
>>
>>
>> ​
>>
>> I do not understand why the cosine similarity is being calculated as more
>> than 1. Cosine similarity should be between 0 and 1 or max -1 and +1
>> (taking negative angles).
>>
>> Why it is more than 1 here? What's going wrong here?.
>>
>> Please note, normalization of the vectors should not be changing the
>> cosine similarity values since the formula remains the same. If you
>> normalise it's just a dot product then, if you don't it's dot product/
>> (normA)*(normB).
>>
>> I am facing lot of issues with respect to understanding or interpreting
>> the output of Spark's ml algos. The documentation is not very clear and
>> there is hardly anything mentioned with respect to how and what is being
>> returned.
>>
>> For ex. word2vec algorithm is to convert word to vector form. So I would
>> expect .transform method would give me vector of each word in the text.
>>
>> However .transform basically returns doc2vec (averages all word vectors
>> of a text). This is confusing since nothing of this is mentioned in the
>> docs and I keep thinking why I have only one word vector instead of word
>> vectors for all words.
>>
>> I do understand by returning doc2vec it is helpful since now one doesn't
>> have to average out each word vector for the whole text. But the docs don't
>> help or explicitly say that.
>>
>> This ends up wasting lot of time in just figuring out what is being
>> returned from an algorithm from Spark.
>>
>> Does someone have a better solution for this?
>>
>> I have read the Spark book. That is not about Mllib. I am not sure if
>> Sean's book would cover all the documentation aspect better than what we
>> have currently on the docs page.
>>
>> Thanks
>>
>>
>>
>> ᐧ
>>
>


Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
ok got that. I understand that the ordering won't change. I just wanted to
make sure I am getting the right thing or I understand what I am getting
since it didn't make sense going by the cosine calculation.

One last confirmation and I appreciate all the time you are spending to
reply:

In the github link issue , it's mentioned fVector is not normalised. By
fVector it's meant the word vector we want to find synonyms for?. So in my
example, it would be vector for 'science' word which I passed to the
method?.

if yes, then I guess, solution should be simple. Just divide the current
cosine output by the norm of this vector. And this vector we can get by
doing model.transform('science') if I am right?

Lastly, I would be very happy to update to docs if it is editable for all
the things I encounter as not mentioned or not very clear.
ᐧ

On Thu, Dec 29, 2016 at 2:28 PM, Sean Owen <so...@cloudera.com> wrote:

> Yes, the vectors are not otherwise normalized.
> You are basically getting the cosine similarity, but times the norm of the
> word vector you supplied, because it's not divided through. You could just
> divide the results yourself.
> I don't think it will be back-ported because the the behavior was intended
> in 1.x, just wrongly documented, and we don't want to change the behavior
> in 1.x. The results are still correctly ordered anyway.
>
> On Thu, Dec 29, 2016 at 10:11 PM Manish Tripathi <tr.man...@gmail.com>
> wrote:
>
>> Sean,
>>
>> Thanks for answer. I am using Spark 1.6 so are you saying the output I am
>> getting is cos(A,B)=dot(A,B)/norm(A) ?
>>
>> My point with respect to normalization was that if you normalise or don't
>> normalize both vectors A,B, the output would be same. Since if I normalize
>> A and B, then
>>
>> Cos(A,B)= dot(A,B)/norm(A)*norm(B). since norm=1 it is just dot(A,B). If
>> we don't normalize it would have a norm in the denominator so output is
>> same.
>>
>> But I understand you are saying in Spark 1.x, one vector was not
>> normalized. If that is the case then it makes sense.
>>
>> Any idea how to fix this (get the right cosine similarity) in Spark 1.x?
>> . If the input word in findSynonyms is not normalized while calculating
>> cosine, then doing w2vmodel.transform(input_word) to get a vector
>> representation and then diving the current result by the norm of this
>> vector should be correct?
>>
>> Also, I am very open to editing the docs on things I find not properly
>> documented or wrong, but I need to know if that is allowed (is it like a
>> Wiki)?.
>> ᐧ
>>
>> On Thu, Dec 29, 2016 at 1:59 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>> It should be the cosine similarity, yes. I think this is what was fixed
>> in https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was
>> really just outputting the 'unnormalized' similarity (dot / norm(a) only)
>> but the docs said cosine similarity. Now it's cosine similarity in Spark 2.
>> The normalization most certainly matters here, and it's the opposite:
>> dividing the dot by vec norms gives you the cosine.
>>
>> Although docs can always be better (and here was a case where it was
>> wrong) all of this comes with javadoc and examples. Right now at least,
>> .transform() describes the operation as you do, so it is documented. I'd
>> propose you invest in improving the docs rather than saying 'this isn't
>> what I expected'.
>>
>> (No, our book isn't a reference for MLlib, more like worked examples)
>>
>> On Thu, Dec 29, 2016 at 9:49 PM Manish Tripathi <tr.man...@gmail.com>
>> wrote:
>>
>> I used a word2vec algorithm of spark to compute documents vector of a
>> text.
>>
>> I then used the findSynonyms function of the model object to get
>> synonyms of few words.
>>
>> I see something like this:
>>
>>
>> ​
>>
>> I do not understand why the cosine similarity is being calculated as more
>> than 1. Cosine similarity should be between 0 and 1 or max -1 and +1
>> (taking negative angles).
>>
>> Why it is more than 1 here? What's going wrong here?.
>>
>> Please note, normalization of the vectors should not be changing the
>> cosine similarity values since the formula remains the same. If you
>> normalise it's just a dot product then, if you don't it's dot product/
>> (normA)*(normB).
>>
>> I am facing lot of issues with respect to understanding or interpreting
>> the output of Spark's ml algos. The documentation is not very clear and
>> there is hardly anything mentioned with respect to how and what is being
>> returned.
>>
>&g