[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-17 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15105


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-16 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r79158292
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -227,7 +227,7 @@ class Word2VecModel private[ml] (
*/
   @Since("1.5.0")
   def findSynonyms(word: String, num: Int): DataFrame = {
-findSynonyms(wordVectors.transform(word), num)
+findSynonyms(wordVectors.transform(word), num, Some(word))
--- End diff --

I see. I agree that something gets a little bit duplicated no matter what. 
Given your change, it seems easiest to pass the string vs vector argument all 
the way down, even if that means in these other two classes you duplicate a 
little code to transform the dataframe, etc. If it's nontrivial, sure, a little 
helper method seems reasonable. That would help keep this layered more cleanly 
IMHO.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-16 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r79157601
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -227,7 +227,7 @@ class Word2VecModel private[ml] (
*/
   @Since("1.5.0")
   def findSynonyms(word: String, num: Int): DataFrame = {
-findSynonyms(wordVectors.transform(word), num)
+findSynonyms(wordVectors.transform(word), num, Some(word))
--- End diff --

In this case (and similarly in 
[`Word2VecModelWrapper`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/Word2VecModelWrapper.scala))
 I opted to call the three-argument version because the wrappers both 
explicitly convert their argument to a vector before calling `findSynonyms` on 
the underlying model (and so `wordOpt` would not be defined if the wrapper were 
invoked with a word).  If we were to make the three-argument `findSynonyms` 
private we wouldn't be able to share a code path in the wrapper classes and 
would need to duplicate the code to tidy and reformat results in both methods 
(data frame creation in this case, unzipping and `asJava` in the Python model 
wrapper) or factor it out to a separate method.  Let me know how you want me to 
proceed here.

I agree that updating the docs makes sense and will make it clearer to 
future maintainers as well.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-16 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r79139237
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Word2Vec.scala 
---
@@ -227,7 +227,7 @@ class Word2VecModel private[ml] (
*/
   @Since("1.5.0")
   def findSynonyms(word: String, num: Int): DataFrame = {
-findSynonyms(wordVectors.transform(word), num)
+findSynonyms(wordVectors.transform(word), num, Some(word))
--- End diff --

I think you don't need to or want to change this file or the wrapper class 
below. They can continue to plumb through the API calls as before because in 
the underlying class you handle both cases. You might update docs in these 
classes, however, to match your change to the .mllib class.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-16 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r79145091
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -563,11 +580,16 @@ class Word2VecModel private[spark] (
   ind += 1
 }
 
+// NB: This code (and the documented behavior of findSynonyms
--- End diff --

Oh, is the problem that several words, including the word itself, may have 
1.0 similarity? and so the word itself may not sort as the top result among the 
1.0 results? Yeah, then the code below doesn't quite handle that case. (But 
below you wouldn't need the `take(num + 1)` now I think?)

What about just ...

```
val scored = wordList.zip(cosVec).toSeq.sortBy(-_._2)
val filtered = wordOpt match {
  Some(word) => scored.take(num + 1).filter(p => word != p._1)
  None => scored
}
filtered.take(num).toArray
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-16 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r79139615
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -518,25 +518,42 @@ class Word2VecModel private[spark] (
   }
 
   /**
-   * Find synonyms of a word
+   * Find synonyms of a word; do not include the word itself in results.
* @param word a word
* @param num number of synonyms to find
* @return array of (word, cosineSimilarity)
*/
   @Since("1.1.0")
   def findSynonyms(word: String, num: Int): Array[(String, Double)] = {
 val vector = transform(word)
-findSynonyms(vector, num)
+findSynonyms(vector, num, Some(word))
   }
 
   /**
-   * Find synonyms of the vector representation of a word
+   * Find synonyms of the vector representation of a word, possibly
+   * including any words in the model vocabulary whose vector 
respresentation
+   * is the supplied vector.
* @param vector vector representation of a word
* @param num number of synonyms to find
* @return array of (word, cosineSimilarity)
*/
   @Since("1.1.0")
   def findSynonyms(vector: Vector, num: Int): Array[(String, Double)] = {
+findSynonyms(vector, num, None)
+  }
+
+  /**
+   * Find synonyms of the vector representation of a word, rejecting
+   * words identical to the value of wordOpt, if one is supplied.
+   * @param vector vector representation of a word
+   * @param num number of synonyms to find
+   * @param wordOpt optionally, a word to reject from the results list
+   * @return array of (word, cosineSimilarity)
+   */
+  private[spark] def findSynonyms(
--- End diff --

This can be private then, I think.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r79006879
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) 
&& tup._2 != 1.0d)
--- End diff --

Oops, "tail" is wrong, yeah, because it would give you the bottom n out of 
n+1, when you just want the first n out of the n+1. Otherwise I think this 
works?

Anyway, agree with the change you propose. If specifying a vector I would 
not expect any filtering of the first element. We're changing the behavior no 
matter what here but it's a fix.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r79005872
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) 
&& tup._2 != 1.0d)
--- End diff --

Consider the case where you're passing in a vector that is extremely 
similar to the representation of a word in the vocabulary but that is not 
itself the representation of a word in the vocabulary.  (A concrete example is 
(in this test I 
added)[https://github.com/willb/spark/blob/fix/findSynonyms/mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala#L72].)
  In this case, `wordOpt` is not defined (because you are querying for words 
whose vector representations are similar to a given vector) but you nonetheless 
are not interested in discarding the best match because it is not a trivial 
match (that is, it is not going to be your query term).

Related to your other comment (and discussion elsewhere on this PR), I 
think we could make a case for changing the documented behavior (especially 
since it is only documented as such in pyspark) in the case where 
`findSynonyms` is invoked with the vector representation of a word that is _in_ 
the vocabulary.  Instead of rejecting the best match in that case, we could 
return it. The argument there is that such a result is telling you something 
you didn't necessarily know (otherwise, you'd probably be querying for a word 
and not a vector) and that there is an easy way to identify that such a match 
is trivially the best match.  I recognize that changing documented behavior is 
a big deal, but in this case it seems like it could be the best way to address 
a minor bug.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78999711
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) 
&& tup._2 != 1.0d)
--- End diff --

It should be equivalent to what you suggest. Something like...

```
val topn1 = wordList.zip(cosVec).toSeq.sortBy(-_._2).take(num + 1)
if (wordOpt.isDefined) {
  topn1.filter(tup => !wordOpt.get.equals(tup._1)).take(num)
} else {
  topn1.tail
}
```

OK that's a bit different than what you suggested but does that help a bit?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78995496
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) 
&& tup._2 != 1.0d)
--- End diff --

@srowen So actually reverting to the old code but filtering only if 
`wordOpt` is defined doesn't handle the original case I was considering here, 
where you pass in a vector that is very similar to the representation of a word 
in the vocabulary but that is not itself the representation of a word in the 
vocabulary.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78983451
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) 
&& tup._2 != 1.0d)
--- End diff --

I wasn't sure if the todo was referring to the sorting or to earlier 
optimizations.  I'll get it started as a separate issue; thanks!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78982095
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) 
&& tup._2 != 1.0d)
--- End diff --

Yeah it's not efficient, and there's a todo about it above. I think you 
could treat it as a separate issue. I'm pretty certain it's a win.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78981374
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
@@ -68,6 +69,21 @@ class Word2VecSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(syms(1)._1 == "japan")
   }
 
+  test("findSynonyms doesn't reject similar word vectors when called with 
a vector") {
+val num = 2
+val word2VecMap = Map(
+  ("china", Array(0.50f, 0.50f, 0.50f, 0.50f)),
+  ("japan", Array(0.40f, 0.50f, 0.50f, 0.50f)),
+  ("taiwan", Array(0.60f, 0.50f, 0.50f, 0.50f)),
+  ("korea", Array(0.45f, 0.60f, 0.60f, 0.60f))
+)
+val model = new Word2VecModel(word2VecMap)
+val syms = model.findSynonyms(Vectors.dense(Array(0.52d, 0.50d, 0.50d, 
0.50d)), num)
--- End diff --

Yes to both (the two significant digits was simply following the style 
earlier in the test).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78981066
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) 
&& tup._2 != 1.0d)
--- End diff --

@srowen This code in general kind of bothers me (I'd rather see a single 
pass through the tuples with a bounded priority queue keeping track of the `num 
+ 1` candidates than converting to a sequence and then allocating an array to 
sort in place).  But I'm inclined to get some numbers showing that that is a 
good idea and make it a separate PR unless this is a good time to fold it in 
(so to speak).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread willb
Github user willb commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78973934
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) 
&& tup._2 != 1.0d)
--- End diff --

@hhbyyh So [one of the python 
doctests](https://github.com/apache/spark/blob/branch-2.0/python/pyspark/mllib/feature.py#L575)
 depends on a word not being a synonym of its vector representation.  I think 
since this is the documented behavior now, that's the direction the fix should 
go as well, but I'll use `Vector.equals` instead of checking similarity in any 
case.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78905623
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/feature/Word2VecSuite.scala ---
@@ -68,6 +69,21 @@ class Word2VecSuite extends SparkFunSuite with 
MLlibTestSparkContext {
 assert(syms(1)._1 == "japan")
   }
 
+  test("findSynonyms doesn't reject similar word vectors when called with 
a vector") {
+val num = 2
+val word2VecMap = Map(
+  ("china", Array(0.50f, 0.50f, 0.50f, 0.50f)),
+  ("japan", Array(0.40f, 0.50f, 0.50f, 0.50f)),
+  ("taiwan", Array(0.60f, 0.50f, 0.50f, 0.50f)),
+  ("korea", Array(0.45f, 0.60f, 0.60f, 0.60f))
+)
+val model = new Word2VecModel(word2VecMap)
+val syms = model.findSynonyms(Vectors.dense(Array(0.52d, 0.50d, 0.50d, 
0.50d)), num)
--- End diff --

Nit, but I think "d" is redundant. Also "0.5" seems clearer than "0.50" but 
this is truly up to taste.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-15 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15105#discussion_r78905531
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -566,8 +582,8 @@ class Word2VecModel private[spark] (
 wordList.zip(cosVec)
   .toSeq
   .sortBy(-_._2)
-  .take(num + 1)
-  .tail
+  .filter(tup => wordOpt.map(w => !w.equals(tup._1)).getOrElse(true) 
&& tup._2 != 1.0d)
--- End diff --

It's probably easier still to leave the current code in place, up to 
"tail". If wordOpt is defined, then apply filter and take(num). If not, apply 
tail.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15105: [SPARK-17548] [MLlib] Word2VecModel.findSynonyms ...

2016-09-14 Thread willb
GitHub user willb opened a pull request:

https://github.com/apache/spark/pull/15105

[SPARK-17548] [MLlib] Word2VecModel.findSynonyms no longer spuriously 
rejects the best match when invoked with a vector

## What changes were proposed in this pull request?

This pull request changes the behavior of `Word2VecModel.findSynonyms` so 
that it will not spuriously reject the best match when invoked with a vector 
that does not correspond to a word in the model's vocabulary.  Instead of 
blindly discarding the best match, the changed implementation discards a match 
that corresponds to the query word (in cases where `findSynonyms` is invoked 
with a word) or that has an identical angle to the query vector.

## How was this patch tested?

I added a test to `Word2VecSuite` to ensure that the word with the most 
similar vector from a supplied vector would not be spuriously rejected.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/willb/spark fix/findSynonyms

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15105.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15105


commit 5a7cf31e1c76c2fe4e172c2889fa23deb2152af9
Author: William Benton 
Date:   2016-09-14T17:07:14Z

test for spurious rejection of similar vectors in findSynonyms

commit 757ce7c73e76f751395d55de3b5b31fa4a05cde8
Author: William Benton 
Date:   2016-09-14T21:35:29Z

`Word2VecModel.findSynonyms` no longer spuriously rejects the best match

Previously, the `findSynonyms` method in `Word2VecModel` rejected the
closest-matching vector.  This was typically correct in cases where we
were searching for synonyms of a word, but was incorrect in cases
where we were searching for words most similar to a given vector,
since the given vector might not correspond to a word in the model's
vocabulary.

With this commit, `findSynonyms` will not discard the best matching
term unless the best matching word is also the query word (or is
maximally similar to the query vector).




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org