Vincent created SPARK-25412:
-------------------------------

             Summary: FeatureHasher would change the value of output feature
                 Key: SPARK-25412
                 URL: https://issues.apache.org/jira/browse/SPARK-25412
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.1
            Reporter: Vincent


In the current implementation of FeatureHasher.transform, the vector index is 
determined by a simple modulo on the hashed value, and users are advised to set 
the numFeatures parameter to a large integer.
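
The indexing scheme described above can be sketched in a few lines. This is 
only an illustration, not Spark's code: the helper name feature_index is 
hypothetical, and zlib.crc32 is used as a stand-in hash because it ships with 
Python.

```python
import zlib


def feature_index(name: str, num_features: int) -> int:
    """Map a feature name to a vector slot: hash the name, then take a modulo."""
    # Assumption: zlib.crc32 is a stand-in hash chosen for illustration only.
    hashed = zlib.crc32(name.encode("utf-8"))
    # Python's % is non-negative for a positive divisor, so the result is
    # always a valid index in [0, num_features).
    return hashed % num_features


for name in ["age", "income", "city=Paris"]:
    print(name, "->", feature_index(name, 16))
```

Because the slot depends only on the hash and on num_features, a smaller 
num_features leaves fewer distinct slots for the same set of feature names.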

We found several issues with the current implementation: 
 # The feature name cannot be recovered from its index after the FeatureHasher 
transform, for example when reading feature importances from a decision tree 
trained after a FeatureHasher stage.
 # When indices collide, which is very likely when numFeatures is relatively 
small, the stored value is replaced by a new value (the sum of the current and 
old values).
 # To avoid collisions, numFeatures has to be set to a large number, but the 
resulting highly sparse vectors increase the computational complexity of model 
training.
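
The first two issues can be reproduced with a small, self-contained sketch. 
It is a simplified model of the hashing trick under stated assumptions, not 
the FeatureHasher implementation: slot and hash_features are hypothetical 
helpers, zlib.crc32 is a stand-in hash, and the row values are made up for 
illustration.

```python
import zlib
from collections import defaultdict


def slot(name, num_features):
    # Assumption: zlib.crc32 is a stand-in hash chosen for illustration only.
    return zlib.crc32(name.encode("utf-8")) % num_features


def hash_features(row, num_features):
    """Return ({index: value}, {index: [feature names]}) for one input row."""
    vector = defaultdict(float)
    names = defaultdict(list)
    for name, value in row.items():
        i = slot(name, num_features)
        vector[i] += value      # colliding features are summed into one slot
        names[i].append(name)   # several names can share the same index
    return dict(vector), dict(names)


# Three features but only two slots, so at least two features must collide.
row = {"clicks": 3.0, "views": 10.0, "purchases": 1.0}
vector, names = hash_features(row, num_features=2)
print(vector)
print(names)
```

With three features hashed into two slots, at least one slot holds the sum of 
two original values, and the index alone no longer identifies a single feature 
name. Raising num_features makes collisions less likely at the cost of a wider, 
sparser vector, which is the trade-off described in the third item.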



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

