Github user flyjy commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-216047890
@srowen my JIRA username is "flysjy", thanks!
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-216017374
LGTM thanks all for the patch & reviews!
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215948829
@flyjy if you tell me your JIRA handle I'll assign it to you
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/11812
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215948535
Merged to master
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215819301
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215819305
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215819124
**[Test build #57346 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57346/consoleFull)**
for PR 11812 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215798743
**[Test build #57346 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57346/consoleFull)**
for PR 11812 at commit
Github user jyshen15 commented on a diff in the pull request:
https://github.com/apache/spark/pull/11812#discussion_r61606066
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
@@ -108,5 +108,26 @@ class Word2VecSuite extends SparkFunSuite with
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215664477
I confirmed the test case fails on master without the changes in this PR.
LGTM.
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/11812#discussion_r61549783
--- Diff:
mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
@@ -108,5 +108,26 @@ class Word2VecSuite extends SparkFunSuite with
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215661660
@jkbradley are you OK with the test here?
Github user flyjy commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215632760
@srowen The PR, including the unit test, passes after rebasing onto master.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215632322
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215632323
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215632229
**[Test build #57311 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57311/consoleFull)**
for PR 11812 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215628418
**[Test build #57311 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57311/consoleFull)**
for PR 11812 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215601512
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215601488
**[Test build #57290 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57290/consoleFull)**
for PR 11812 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215601509
Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-215598306
**[Test build #57290 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57290/consoleFull)**
for PR 11812 at commit
Github user flyjy commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-214970336
Yes, I am working on it. Will finish tomorrow.
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-214676646
@flyjy are you updating this? It's almost done, I think.
Github user flyjy commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-213528482
@srowen, I agree with you. It is a good idea to skip the word2vec
iteration step and directly initialize the `Word2VecModel` class. I will go
with this approach.
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-213295833
It depends on the license of this corpus. Is it sufficient to test behavior
with a very large input vector? Or are we not so sure that's the issue?
Github user flyjy commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-213232923
That is a good idea about the unit test. I actually first included
@MLnick's unit tests on March 22, using the Lee corpus from Gensim, but later
did not include them
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-213203872
+1 for @MLnick's suggestion of adding a unit test to mllib/tests.py which
fails before your fix
Github user PhoenixDai commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-212969679
Yes, it's reproducible as mentioned in the third comment at
https://issues.apache.org/jira/browse/SPARK-13289
I thought this PR would solve the issue. Isn't
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-212921574
The result here is a similarity rather than a distance. It should never be
more than 1, unless there's a bug, because it's a cosine similarity. I can see
here a case
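The bound described in the comment above can be turned into a cheap sanity check. The following is a minimal sketch in plain Python, not Spark code; the helper names `cosine_similarity` and `looks_like_valid_similarity` and the tolerance `tol` are assumptions made for illustration only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def looks_like_valid_similarity(value, tol=1e-9):
    """True if value is finite and within [-1, 1], allowing for rounding error."""
    return math.isfinite(value) and -1.0 - tol <= value <= 1.0 + tol

# Parallel vectors give a value at (or within rounding of) 1,
# anti-parallel vectors give a value at -1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
opposite = cosine_similarity([1.0, 0.0], [-3.0, 0.0])
```

A value that fails this check (larger than 1, `inf`, or `nan`) would point to a bug in how the similarity was computed rather than to a property of the data.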
Github user PhoenixDai commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-212915342
My observation (of the current implementation of word2vec) is that the
distances between synonyms are getting larger and larger with more iterations
and finally to
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-212828428
@srowen was about to ping you on this. Yup, that is basically the idea. I
would prefer to add a test case here, where it fails without the changes in the
PR.
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-212822658
Is the problem that the input vector may have a very large norm, causing
the dot product with other vectors to be Infinity? There's a little, opposite
problem: dividing
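To make the overflow question above concrete, here is a small sketch in plain Python. It is an illustration under stated assumptions, not the code from this PR: Python floats are 64-bit, so the example uses magnitudes near 1e200 to provoke the overflow that 32-bit floats would hit at far smaller values, and the helper names `raw_dot` and `normalized_dot` are hypothetical.

```python
import math

def raw_dot(u, v):
    """Plain dot product with no rescaling; large components can overflow."""
    return sum(a * b for a, b in zip(u, v))

def unit(x):
    """Return x scaled to unit length, dividing by the largest component first
    so that squaring the components cannot overflow."""
    scale = max(abs(a) for a in x)
    if scale == 0.0:
        return list(x)  # zero vector: nothing to normalize
    scaled = [a / scale for a in x]
    norm = math.sqrt(sum(a * a for a in scaled))
    return [a / norm for a in scaled]

def normalized_dot(u, v):
    """Dot product of the unit vectors, i.e. a cosine similarity."""
    return raw_dot(unit(u), unit(v))

big = [1e200, 2e200, 3e200]    # very large norm
small = [1.0, 2.0, 3.0]        # same direction, ordinary norm

overflowed = raw_dot(big, big)        # each product exceeds the float range
stable = normalized_dot(big, small)   # stays finite, close to 1
```

Normalizing both vectors before taking the dot product keeps every intermediate value at or below 1 in magnitude, which is one way to avoid both the overflow and the inverse problem of dividing by an overflowed norm.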
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-212175243
Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-212175249
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-212175003
**[Test build #56290 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56290/consoleFull)**
for PR 11812 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-212165957
**[Test build #56290 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56290/consoleFull)**
for PR 11812 at commit
Github user jyshen15 commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-211756190
I will handle the Python style issue.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-211750751
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-211750733
**[Test build #56198 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56198/consoleFull)**
for PR 11812 at commit
Github user flyjy commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-211750648
Thanks. Have updated the PR.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-211750023
**[Test build #56198 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56198/consoleFull)**
for PR 11812 at commit
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-211240792
You need to do `model.findSynonyms("a", 2).select("word", fmt("similarity",
5).alias("similarity")).show()` in order to truncate the `similarity` col to 5
significant
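The idea behind the suggestion above is that a doctest compares printed text, so tiny floating-point differences between runs break it unless the value is rendered at a fixed precision first. The sketch below shows that idea in plain Python rather than PySpark. It assumes `fmt` in the comment is an alias for a number-formatting column function; the helper `format_similarity` and both sample values are made up for illustration, and the sketch fixes decimal places rather than significant digits.

```python
def format_similarity(value, decimals=5):
    """Render a float with a fixed number of decimal places."""
    return f"{value:.{decimals}f}"

# Two hypothetical results that differ only far beyond the fifth decimal
# place, as could happen between runs or platforms.
run_a = 0.015001960843801498
run_b = 0.015001960843899999

printed_a = format_similarity(run_a)
printed_b = format_similarity(run_b)
```

The raw floats are unequal, but their formatted forms are identical, which is what a text-comparing doctest needs.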
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-210930706
**[Test build #56028 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56028/consoleFull)**
for PR 11812 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-210930845
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-210930838
Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-210926153
**[Test build #56028 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56028/consoleFull)**
for PR 11812 at commit
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-208894112
The test failure looks like a small precision issue. Perhaps you can do the
following in the docstring test:
```
>>> from pyspark.sql.functions import
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-207856560
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-207856559
Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-207856553
**[Test build #55446 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55446/consoleFull)**
for PR 11812 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-207855658
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-207855659
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-207855649
**[Test build #55445 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55445/consoleFull)**
for PR 11812 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-207854058
**[Test build #55446 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55446/consoleFull)**
for PR 11812 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-207853515
**[Test build #55445 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55445/consoleFull)**
for PR 11812 at commit
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-204128446
Also for some reason there are a huge number of files changed in the GitHub
view. Perhaps an issue with rebase / merge with current master?
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-204126393
Yeah I did think about that too. There is a `TODO` to adjust the learning
rate by iteration. But I think it makes this PR easier to analyze if the only
change is
Github user PhoenixDai commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-204120379
How about keeping the learning-rate-related code unchanged?
Github user flyjy commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-203516285
Looks like some of the PySpark unit tests expect to have
+----+-----------+
|word| similarity|
+----+-----------+
|
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-203512904
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-203512900
Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-203512838
**[Test build #54529 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54529/consoleFull)**
for PR 11812 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-203499113
**[Test build #54529 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54529/consoleFull)**
for PR 11812 at commit
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-203389060
@flyjy perhaps try rebasing to current master just in case?
Github user PhoenixDai commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-202892564
Is this caused by the changes made to Word2Vec.scala after this PR was
opened? Maybe those changes introduced a conflict with this PR. (This is just
my naive guess. I
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-202831224
**[Test build #54432 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54432/consoleFull)**
for PR 11812 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-202831264
Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-202831267
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-202824019
**[Test build #54432 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54432/consoleFull)**
for PR 11812 at commit
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-202823312
ok to test
Github user PhoenixDai commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-202224483
I tested this commit on the "One Billion Words Language Modeling" dataset
with 72 partitions and 15 iterations. It works well.
Github user flyjy commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-201971518
@MLnick This bug has been fixed without changing existing interfaces. I have
tested it with your test script on the Lee corpus from Gensim.
I am not sure whether you
Github user jyshen15 commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-200416321
@MLnick cool! It actually comes down to the question of what
`getVectors` should output. If the equation `getVectors("Paris") -
getVectors("France") +
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-199698551
Here is my test case - I can replicate the `Infinity` similarities on a
small test dataset. It only occurs when the number of partitions and the
number of iterations are very high.
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-19474
Yes, please don't change any existing behavior of public methods.
Ok - I also managed to create a small test case that replicates the issue.
I verified that
Github user flyjy commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-198852800
Thanks. I have checked that the problem still exists with only the adaptive
learning rate change.
So, I will fix this bug without changing the existing interface.
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/11812#discussion_r56619432
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -463,12 +465,17 @@ class Word2VecModel private[spark] (
//
GitHub user flyjy opened a pull request:
https://github.com/apache/spark/pull/11812
[SPARK-13289][MLLIB] Fix infinite distances between word vectors in
Word2VecModel
## What changes were proposed in this pull request?
This PR fixes the bug that generates infinite distances
Github user MLnick commented on a diff in the pull request:
https://github.com/apache/spark/pull/11812#discussion_r56619995
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -532,28 +539,14 @@ class Word2VecModel private[spark] (
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-198242957
Thanks for this. While I see that normalizing the vectors internally may be
useful, it does change behaviour in the `getVectors` and `findSynonyms`
methods. See my
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-198194974
Can one of the admins verify this patch?
Github user MLnick commented on the pull request:
https://github.com/apache/spark/pull/11812#issuecomment-198270501
It would also be ideal to create a test case that can replicate the issue
with the old code, and pass with the new code, for regression testing going
forward.