GitHub user willb opened a pull request:

    https://github.com/apache/spark/pull/15150

    Use a bounded priority queue to find synonyms in Word2VecModel

    ## What changes were proposed in this pull request?
    
    The code in `Word2VecModel.findSynonyms` to choose the vocabulary elements 
with the highest similarity to the query vector currently sorts the collection 
of similarities for every vocabulary element. This involves making multiple 
copies of the collection of similarities while doing a (relatively) expensive 
sort. It would be more efficient to find the best matches by maintaining a 
bounded priority queue and populating it with a single pass over the 
vocabulary, and that is exactly what this patch does.
    
    ## How was this patch tested?
    
    This patch adds no user-visible functionality and its correctness should be 
exercised by existing tests.  To ensure that this approach is actually faster, 
I made a microbenchmark for `findSynonyms`:
    
    ```
    object W2VTiming {
      import org.apache.spark.{SparkContext, SparkConf}
      import org.apache.spark.mllib.feature.Word2VecModel
      def run(modelPath: String, scOpt: Option[SparkContext] = None) {
        val sc = scOpt.getOrElse(new SparkContext(new 
SparkConf(true).setMaster("local[*]").setAppName("test")))
        val model = Word2VecModel.load(sc, modelPath)
        val keys = model.getVectors.keys
        val start = System.currentTimeMillis
        for(key <- keys) {
          model.findSynonyms(key, 5)
          model.findSynonyms(key, 10)
          model.findSynonyms(key, 25)
          model.findSynonyms(key, 50)
        }
        val finish = System.currentTimeMillis
        println("run completed in " + (finish - start) + "ms")
      }
    }
    ```
    
    I ran this test on a model generated from the complete works of Jane Austen 
and found that the new approach was over 3x faster than the old approach.  (As 
the `num` argument to `findSynonyms` approaches the vocabulary size, the new 
approach will have less of an advantage over the old one.)
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/willb/spark SPARK-17595

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15150.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15150
    
----
commit ddba6577a6ece67f77b3757da859c88fa9065c04
Author: William Benton <wi...@redhat.com>
Date:   2016-09-19T14:35:57Z

    Use a bounded priority queue to find synonyms in Word2VecModel

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to