Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
Yes, you're just dividing by the norm of the vector you passed in. You can
look at the change on that JIRA and probably see how this was added into
the method itself.
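
For anyone hitting this on 1.x, here is a minimal PySpark sketch of that
division, assuming a fitted pyspark.mllib.feature.Word2VecModel named
`model` (the helper name is made up, not from this thread):

import numpy as np

def cosine_synonyms(model, word, num):
    # On 1.x, findSynonyms returns cosine similarity times the norm of the
    # query vector (the query is not normalized), so divide that norm out.
    query_norm = np.linalg.norm(model.transform(word).toArray())
    return [(w, sim / query_norm) for w, sim in model.findSynonyms(word, num)]

The rescaled scores should then fall back into [-1, 1].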


Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
OK, got that. I understand that the ordering won't change. I just wanted to
make sure I am getting the right thing, or that I understand what I am
getting, since it didn't make sense going by the cosine calculation.

One last confirmation, and I appreciate all the time you are spending to
reply:

In the linked GitHub issue, it's mentioned that fVector is not normalised.
By fVector, is the word vector we want to find synonyms for meant? So in my
example, it would be the vector for the word 'science' which I passed to
the method?

If yes, then I guess the solution should be simple: just divide the current
cosine output by the norm of this vector. And we can get this vector by
doing model.transform('science'), if I am right?

Lastly, I would be very happy to update the docs, if they are editable, for
all the things I encounter that are not mentioned or not very clear.

Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
Yes, the vectors are not otherwise normalized.
You are basically getting the cosine similarity, but times the norm of the
word vector you supplied, because it's not divided through. You could just
divide the results yourself.
I don't think it will be back-ported, because the behavior was intended in
1.x, just wrongly documented, and we don't want to change the behavior in
1.x. The results are still correctly ordered anyway.



Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
Sean,

Thanks for the answer. I am using Spark 1.6, so are you saying the output I
am getting is cos(A,B) = dot(A,B) / norm(A)?

My point with respect to normalization was that normalizing both vectors A
and B, or neither, gives the same output. If I normalize A and B, then

cos(A,B) = dot(A,B) / (norm(A) * norm(B)) = dot(A,B), since each norm is 1.
If we don't normalize, the norms simply stay in the denominator, so the
output is the same.
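
A quick numerical check of that invariance, with made-up vectors:

import numpy as np

def cosine(a, b):
    return a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
a_unit = a / np.linalg.norm(a)  # rescale both to unit norm
b_unit = b / np.linalg.norm(b)

print(cosine(a, b))        # ~0.9746
print(a_unit.dot(b_unit))  # same ~0.9746: for unit vectors, cosine is the plain dot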

But I understand you are saying that in Spark 1.x, one vector was not
normalized. If that is the case, then it makes sense.

Any idea how to fix this (get the right cosine similarity) in Spark 1.x?
If the input word in findSynonyms is not normalized while calculating the
cosine, then doing w2vmodel.transform(input_word) to get a vector
representation and then dividing the current result by the norm of this
vector should be correct?

Also, I am very open to editing the docs on things I find not properly
documented or wrong, but I need to know if that is allowed (is it like a
wiki)?


Re: Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Sean Owen
It should be the cosine similarity, yes. I think this is what was fixed in
https://issues.apache.org/jira/browse/SPARK-7617 ; previously it was really
just outputting the 'unnormalized' similarity (dot / norm(a) only) but the
docs said cosine similarity. Now it's cosine similarity in Spark 2. The
normalization most certainly matters here, and it's the opposite: dividing
the dot by vec norms gives you the cosine.
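
A toy illustration of that difference (numbers invented, not from any model):

import numpy as np

a = np.array([3.0, 4.0])  # pretend query word vector, norm 5
b = np.array([1.0, 0.0])  # pretend model word vector, already unit-norm

dot = a.dot(b)
print(dot / np.linalg.norm(b))                        # 3.0: the 1.x-style score, > 1
print(dot / (np.linalg.norm(a) * np.linalg.norm(b)))  # 0.6: the true cosine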

Although docs can always be better (and here was a case where they were
wrong), all of this comes with javadoc and examples. Right now at least,
.transform() describes the operation as you do, so it is documented. I'd
propose you invest in improving the docs rather than saying 'this isn't
what I expected'.

(No, our book isn't a reference for MLlib, more like worked examples)



Cosine Similarity of Word2Vec algo more than 1?

2016-12-29 Thread Manish Tripathi
I used Spark's word2vec algorithm to compute the document vector of a text.

I then used the findSynonyms function of the model object to get synonyms
of a few words.

I see something like this:

[image: findSynonyms output with similarity scores above 1, not preserved
in the archive]

I do not understand why the cosine similarity is being calculated as more
than 1. Cosine similarity should be between 0 and 1, or at most between -1
and +1 (allowing negative angles).

Why is it more than 1 here? What's going wrong?

Please note, normalization of the vectors should not change the cosine
similarity values, since the formula remains the same. If you normalise,
it's just a dot product; if you don't, it's dot(A,B) / (norm(A) * norm(B)).

I am facing a lot of issues with respect to understanding or interpreting
the output of Spark's ML algos. The documentation is not very clear, and
there is hardly anything mentioned about how and what is being returned.

For example, the word2vec algorithm converts words to vector form, so I
would expect the .transform method to give me a vector for each word in the
text.

However, .transform basically returns doc2vec (it averages all the word
vectors of a text). This is confusing, since none of this is mentioned in
the docs, and I kept wondering why I had only one vector instead of word
vectors for all the words.
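
For concreteness, here is a minimal sketch of that averaging behaviour in
the DataFrame-based ml API (the vector size, column names, and the `spark`
session are assumptions for illustration):

from pyspark.ml.feature import Word2Vec

# One "document" of five words; the column names are arbitrary.
doc = spark.createDataFrame([("hi i heard about spark".split(" "),)], ["text"])

w2v = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = w2v.fit(doc)

# transform() emits one vector per row: the average of that row's word
# vectors, not one vector per word. Per-word vectors are in model.getVectors().
model.transform(doc).show(truncate=False)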

I do understand that returning doc2vec is helpful, since one doesn't have
to average out each word vector for the whole text. But the docs don't help
or explicitly say that.

This ends up wasting a lot of time in just figuring out what is being
returned by a Spark algorithm.

Does someone have a better solution for this?

I have read the Spark book; that is not about MLlib. I am not sure if
Sean's book covers the documentation aspects better than what we currently
have on the docs page.

Thanks