YongGang Cao created SPARK-12153:
------------------------------------

             Summary: Word2Vec uses a fixed length for sentences which is not 
reasonable for reality, and similarity functions and fields are not accessible
                 Key: SPARK-12153
                 URL: https://issues.apache.org/jira/browse/SPARK-12153
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.5.2
            Reporter: YongGang Cao
            Priority: Minor


Sentence boundaries matter for the sliding window: we shouldn't train the 
model on a window that spans sentences. The current hard split of sentences 
at 100 words doesn't really make sense.
Also, the cosineSimilarity function is private, which makes it useless to 
callers. We may need to access the vocabulary and the word index table as 
well, so those need getters.
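
To illustrate the windowing concern, here is a minimal sketch in plain Python. 
It is not Spark's implementation: the helper names (context_pairs, 
fixed_length_chunks) are hypothetical, and max_len=4 stands in for the 
100-word limit described above. It assumes the hard split simply 
concatenates all words and cuts every max_len words.

```python
def context_pairs(sentence, window):
    """Yield (center, context) pairs from a sliding window inside one word list."""
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sentence[j]


def fixed_length_chunks(sentences, max_len):
    """Concatenate all words and cut every max_len words, ignoring sentence boundaries."""
    words = [w for s in sentences for w in s]
    return [words[k:k + max_len] for k in range(0, len(words), max_len)]


# Two real sentences (toy data for illustration only).
sentences = [["spark", "is", "fast"], ["cats", "sleep"]]

# Hard split: max_len=4 is a stand-in for the 100-word limit.
chunked = fixed_length_chunks(sentences, max_len=4)
pairs_chunked = {p for chunk in chunked for p in context_pairs(chunk, window=1)}

# Sentence-bounded: the window never leaves the sentence it started in.
pairs_bounded = {p for s in sentences for p in context_pairs(s, window=1)}

print(("fast", "cats") in pairs_chunked)   # pair that crosses a sentence boundary
print(("cats", "sleep") in pairs_bounded)  # pair from within one sentence
```

In this toy input the hard split both creates a pair across the sentence 
boundary ("fast", "cats") and loses a genuine pair ("cats", "sleep"), because 
the second sentence is cut in the middle.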

I have made changes to address the issues above and will send out a pull 
request for your review.
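
For reference, the kind of caller-side use that a public similarity function 
and a word index getter would enable can be sketched as follows. This is plain 
Python with made-up vectors, not the MLlib API: word_index, vectors, and 
similarity are hypothetical names standing in for whatever the getters would 
return.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical stand-ins for the vocabulary / word index table and the word vectors.
word_index = {"king": 0, "queen": 1, "apple": 2}
vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]


def similarity(word_a, word_b):
    """Look both words up through the word index and compare their vectors."""
    return cosine_similarity(vectors[word_index[word_a]], vectors[word_index[word_b]])


print(similarity("king", "queen"))
print(similarity("king", "apple"))
```

With the index and the similarity function exposed, a caller can compare 
arbitrary word pairs directly instead of going through a fixed lookup method.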



