[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15105 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r79158292 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala --- @@ -227,7 +227,7 @@ class Word2VecModel private[ml] ( */ @Since("1.5.0") def findSynonyms(word: String, num: Int): DataFrame = { -findSynonyms(wordVectors.transform(word), num) +findSynonyms(wordVectors.transform(word), num, Some(word)) --- End diff -- I see. I agree that something gets a little bit duplicated no matter what. Given your change, it seems easiest to pass the string vs vector argument all the way down, even if that means in these other two classes you duplicate a little code to transform the dataframe, etc. If it's nontrivial, sure, a little helper method seems reasonable. That would help keep this layered more cleanly IMHO. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r79157601 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala --- @@ -227,7 +227,7 @@ class Word2VecModel private[ml] ( */ @Since("1.5.0") def findSynonyms(word: String, num: Int): DataFrame = { -findSynonyms(wordVectors.transform(word), num) +findSynonyms(wordVectors.transform(word), num, Some(word)) --- End diff -- In this case (and similarly in [`Word2VecModelWrapper`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala)) I opted to call the three-argument version because the wrappers both explicitly convert their argument to a vector before calling `findSynonyms` on the underlying model (and so `wordOpt` would not be defined if the wrapper were invoked with a word). If we were to make the three-argument `findSynonyms` private we wouldn't be able to share a code path in the wrapper classes and would need to duplicate the code to tidy and reformat results in both methods (data frame creation in this case, unzipping and `asJava` in the Python model wrapper) or factor it out to a separate method. Let me know how you want me to proceed here. I agree that updating the docs makes sense and will make it clearer to future maintainers as well. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r79139237 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala --- @@ -227,7 +227,7 @@ class Word2VecModel private[ml] ( */ @Since("1.5.0") def findSynonyms(word: String, num: Int): DataFrame = { -findSynonyms(wordVectors.transform(word), num) +findSynonyms(wordVectors.transform(word), num, Some(word)) --- End diff -- I think you don't need to or want to change this file or the wrapper class below. They can continue to plumb through the API calls as before because in the underlying class you handle both cases. You might update docs in these classes, however, to match your change to the .mllib class. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r79145091 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -563,11 +580,16 @@ class Word2VecModel private[spark] ( ind += 1 } +// NB: This code (and the documented behavior of findSynonyms --- End diff -- Oh, is the problem that several words, including the word itself, may have 1.0 similarity? and so the word itself may not sort as the top result among the 1.0 results? Yeah, then the code below doesn't quite handle that case. (But below you wouldn't need the `take(num + 1)` now I think?) What about just ... ``` val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2) val filtered = wordOpt match { Some(word) => scored.take(num + 1).filter(p => word != p._1) None => scored } filtered.take(num).toArray ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r79139615 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -518,25 +518,42 @@ class Word2VecModel private[spark] ( } /** - * Find synonyms of a word + * Find synonyms of a word; do not include the word itself in results. * @param word a word * @param num number of synonyms to find * @return array of (word, cosineSimilarity) */ @Since("1.1.0") def findSynonyms(word: String, num: Int): Array[(String, Double)] = { val vector = transform(word) -findSynonyms(vector, num) +findSynonyms(vector, num, Some(word)) } /** - * Find synonyms of the vector representation of a word + * Find synonyms of the vector representation of a word, possibly + * including any words in the model vocabulary whose vector respresentation + * is the supplied vector. * @param vector vector representation of a word * @param num number of synonyms to find * @return array of (word, cosineSimilarity) */ @Since("1.1.0") def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = { +findSynonyms(vector, num, None) + } + + /** + * Find synonyms of the vector representation of a word, rejecting + * words identical to the value of wordOpt, if one is supplied. + * @param vector vector representation of a word + * @param num number of synonyms to find + * @param wordOpt optionally, a word to reject from the results list + * @return array of (word, cosineSimilarity) + */ + private[spark] def findSynonyms( --- End diff -- This can be private then, I think. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r79006879 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -566,8 +582,8 @@ class Word2VecModel private[spark] ( wordList.zip(cosVec) .toSeq .sortBy(-_._2) - .take(num + 1) - .tail + .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d) --- End diff -- Oops, "tail" is wrong, yeah, because it would give you the bottom n out of n+1, when you just want the first n out of the n+1. Otherwise I think this works? Anyway, agree with the change you propose. If specifying a vector I would not expect any filtering of the first element. We're changing the behavior no matter what here but it's a fix. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r79005872 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -566,8 +582,8 @@ class Word2VecModel private[spark] ( wordList.zip(cosVec) .toSeq .sortBy(-_._2) - .take(num + 1) - .tail + .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d) --- End diff -- Consider the case where you're passing in a vector that is extremely similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary. (A concrete example is (in this test I added)[https://github.com/willb/spark/blob/fix/findSynonyms/mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala#L72].) In this case, `wordOpt` is not defined (because you are querying for words whose vector representations are similar to a given vector) but you nonetheless are not interested in discarding the best match because it is not a trivial match (that is, it is not going to be your query term). Related to your other comment (and discussion elsewhere on this PR), I think we could make a case for changing the documented behavior (especially since it is only documented as such in pyspark) in the case where `findSynonyms` is invoked with the vector representation of a word that is _in_ the vocabulary. Instead of rejecting the best match in that case, we could return it. The argument there is that such a result is telling you something you didn't necessarily know (otherwise, you'd probably be querying for a word and not a vector) and that there is an easy way to identify that such a match is trivially the best match. I recognize that changing documented behavior is a big deal, but in this case it seems like it could be the best way to address a minor bug. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78999711 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -566,8 +582,8 @@ class Word2VecModel private[spark] ( wordList.zip(cosVec) .toSeq .sortBy(-_._2) - .take(num + 1) - .tail + .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d) --- End diff -- It should be equivalent to what you suggest. Something like... ``` val topn1 = wordList.zip(cosVec).toSeq.sortBy(-_._2).take(num + 1) if (wordOpt.isDefined) { topn1.filter(tup => !wordOpt.get.equals(tup._1)).take(num) } else { topn1.tail } ``` OK that's a bit different than what you suggested but does that help a bit? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78995496 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -566,8 +582,8 @@ class Word2VecModel private[spark] ( wordList.zip(cosVec) .toSeq .sortBy(-_._2) - .take(num + 1) - .tail + .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d) --- End diff -- @srowen So actually reverting to the old code but filtering only if `wordOpt` is defined doesn't handle the original case I was considering here, where you pass in a vector that is very similar to the representation of a word in the vocabulary but that is not itself the representation of a word in the vocabulary. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78983451 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -566,8 +582,8 @@ class Word2VecModel private[spark] ( wordList.zip(cosVec) .toSeq .sortBy(-_._2) - .take(num + 1) - .tail + .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d) --- End diff -- I wasn't sure if the todo was referring to the sorting or to earlier optimizations. I'll get it started as a separate issue; thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78982095 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -566,8 +582,8 @@ class Word2VecModel private[spark] ( wordList.zip(cosVec) .toSeq .sortBy(-_._2) - .take(num + 1) - .tail + .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d) --- End diff -- Yeah it's not efficient, and there's a todo about it above. I think you could treat it as a separate issue. I'm pretty certain it's a win. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78981374 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala --- @@ -68,6 +69,21 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext { assert(syms(1)._1 == "japan") } + test("findSynonyms doesn't reject similar word vectors when called with a vector") { +val num = 2 +val word2VecMap = Map( + ("china", Array(0.50f, 0.50f, 0.50f, 0.50f)), + ("japan", Array(0.40f, 0.50f, 0.50f, 0.50f)), + ("taiwan", Array(0.60f, 0.50f, 0.50f, 0.50f)), + ("korea", Array(0.45f, 0.60f, 0.60f, 0.60f)) +) +val model = new Word2VecModel(word2VecMap) +val syms = model.findSynonyms(Vectors.dense(Array(0.52d, 0.50d, 0.50d, 0.50d)), num) --- End diff -- Yes to both (the two significant digits was simply following the style earlier in the test). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78981066 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -566,8 +582,8 @@ class Word2VecModel private[spark] ( wordList.zip(cosVec) .toSeq .sortBy(-_._2) - .take(num + 1) - .tail + .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d) --- End diff -- @srowen This code in general kind of bothers me (I'd rather see a single pass through the tuples with a bounded priority queue keeping track of the `num + 1` candidates than converting to a sequence and then allocating an array to sort in place). But I'm inclined to get some numbers showing that that is a good idea and make it a separate PR unless this is a good time to fold it in (so to speak). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user willb commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78973934 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -566,8 +582,8 @@ class Word2VecModel private[spark] ( wordList.zip(cosVec) .toSeq .sortBy(-_._2) - .take(num + 1) - .tail + .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d) --- End diff -- @hhbyyh So [one of the python doctests](https://github.com/apache/spark/blob/branch-2.0/python/pyspark/mllib/feature.py#L575) depends on a word not being a synonym of its vector representation. I think since this is the documented behavior now, that's the direction the fix should go as well, but I'll use `Vector.equals` instead of checking similarity in any case. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78905623 --- Diff: mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala --- @@ -68,6 +69,21 @@ class Word2VecSuite extends SparkFunSuite with MLlibTestSparkContext { assert(syms(1)._1 == "japan") } + test("findSynonyms doesn't reject similar word vectors when called with a vector") { +val num = 2 +val word2VecMap = Map( + ("china", Array(0.50f, 0.50f, 0.50f, 0.50f)), + ("japan", Array(0.40f, 0.50f, 0.50f, 0.50f)), + ("taiwan", Array(0.60f, 0.50f, 0.50f, 0.50f)), + ("korea", Array(0.45f, 0.60f, 0.60f, 0.60f)) +) +val model = new Word2VecModel(word2VecMap) +val syms = model.findSynonyms(Vectors.dense(Array(0.52d, 0.50d, 0.50d, 0.50d)), num) --- End diff -- Nit, but I think "d" is redundant. Also "0.5" seems clearer than "0.50" but this is truly up to taste. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/15105#discussion_r78905531 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala --- @@ -566,8 +582,8 @@ class Word2VecModel private[spark] ( wordList.zip(cosVec) .toSeq .sortBy(-_._2) - .take(num + 1) - .tail + .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) && tup._2 != 1.0d) --- End diff -- It's probably easier still to leave the current code in place, up to "tail". If wordOpt is defined, then apply filter and take(num). If not, apply tail. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...
GitHub user willb opened a pull request: https://github.com/apache/spark/pull/15105 [SPARK-17548] [MLlib] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector ## What changes were proposed in this pull request? This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector. ## How was this patch tested? I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected. You can merge this pull request into a Git repository by running: $ git pull https://github.com/willb/spark fix/findSynonyms Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15105.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15105 commit 5a7cf31e1c76c2fe4e172c2889fa23deb2152af9 Author: William BentonDate: 2016-09-14T17:07:14Z test for spurious rejection of similar vectors in findSynonyms commit 757ce7c73e76f751395d55de3b5b31fa4a05cde8 Author: William Benton Date: 2016-09-14T21:35:29Z `Word2VecModel.findSynonyms` no longer spuriously rejects the best match Previously, the `findSynonyms` method in `Word2VecModel` rejected the closest-matching vector. This was typically correct in cases where we were searching for synonyms of a word, but was incorrect in cases where we were searching for words most similar to a given vector, since the given vector might not correspond to a word in the model's vocabulary. With this commit, `findSynonyms` will not discard the best matching term unless the best matching word is also the query word (or is maximally similar to the query vector). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org