Antonio Murgia created SPARK-10105:
--------------------------------------

             Summary: Adding most k frequent words parameter to Word2Vec implementation
                 Key: SPARK-10105
                 URL: https://issues.apache.org/jira/browse/SPARK-10105
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Antonio Murgia
            Priority: Minor

When training Word2Vec on a very large dataset, it is hard to choose the right minCount value. It would help to have a parameter that sets how many words should be in the vocabulary. Furthermore, the original Word2Vec paper states that the authors took into account the first 30k words.
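
To illustrate the request, here is a minimal sketch in plain Python (not Spark code, and not the MLlib implementation) of what a "most k frequent words" option could do: count the words, keep the k most frequent, and report the count of the least frequent word that was kept. The function name top_k_vocabulary and the sample sentences are hypothetical and exist only for this example.

```python
from collections import Counter


def top_k_vocabulary(sentences, k):
    """Return the k most frequent words and the count of the rarest kept word.

    Hypothetical helper for illustration; it is not part of Spark MLlib.
    """
    # Count every word across all tokenized sentences.
    counts = Counter(word for sentence in sentences for word in sentence)
    # most_common(k) returns up to k (word, count) pairs, highest count first.
    most_common = counts.most_common(k)
    vocab = [word for word, _ in most_common]
    # The count of the last kept word is the minCount value that comes
    # closest to producing the same vocabulary.
    effective_min_count = most_common[-1][1] if most_common else 0
    return vocab, effective_min_count


# Made-up sample data, already tokenized.
sentences = [
    ["spark", "mllib", "word2vec"],
    ["spark", "word2vec"],
    ["spark", "jira"],
]

vocab, effective_min_count = top_k_vocabulary(sentences, k=2)
print(vocab, effective_min_count)
```

Note that passing the reported count back as minCount may keep more than k words when several words share that count, which is one reason a dedicated vocabulary-size parameter is easier to reason about than a frequency threshold.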