Louis Liu created SPARK-13103:
---------------------------------

             Summary: HashTF doesn't count TF correctly
                 Key: SPARK-13103
                 URL: https://issues.apache.org/jira/browse/SPARK-13103
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.6.0
         Environment: Ubuntu 14.04, Python 3.4.3
            Reporter: Louis Liu

I wrote a Python program to calculate the frequencies of n-gram sequences with HashTF, but it generates strange output: it finds more occurrences of "一一下嗎" than of "一一下". That should not be possible, because every occurrence of the longer sequence contains the shorter one.

HashTF gets a word's index with hash(), but the hashes of some Chinese words are negative. For example:

>>> hash('一一下嗎')
-6433835193350070115
>>> hash('一一下')
-5938108283593463272
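
To help narrow this down, here is a minimal sketch in plain Python (it is not Spark's code). It assumes the bucket index is computed as hash(term) % numFeatures, and checks what that does to the two negative hash values quoted above. It also counts n-grams exactly with collections.Counter, which gives a reference to compare hashed counts against. The names NUM_FEATURES, index_of, ngrams, exact_counts and the sample string are made up for illustration, and the table size is an assumption.

```python
from collections import Counter

NUM_FEATURES = 1 << 20  # assumed table size, not taken from the report


def index_of(term_hash, num_features=NUM_FEATURES):
    """Map a (possibly negative) hash value to a bucket index."""
    # In Python, % with a positive divisor never returns a negative result.
    return term_hash % num_features


def ngrams(text, n):
    """Return all contiguous character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def exact_counts(text, sizes=(3, 4)):
    """Count n-grams exactly, without hashing, as a reference."""
    counts = Counter()
    for n in sizes:
        counts.update(ngrams(text, n))
    return counts


# The two hash values quoted in the report.
reported = {
    '一一下嗎': -6433835193350070115,
    '一一下': -5938108283593463272,
}
indices = {term: index_of(h) for term, h in reported.items()}

# Made-up sample text, only to show what the exact counts should look like.
sample = '一一下嗎一一下'
reference = exact_counts(sample)
```

If the assumption about the modulo holds, both values in indices fall inside the table even though the hashes are negative, so it may be worth also checking for bucket collisions or for hash() returning different values in different Python processes. Comparing the hashed counts against reference on a small input should show which terms are affected.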