[jira] [Commented] (SPARK-25365) a better way to handle vector index and sparsity in FeatureHasher implementation ?

2018-09-10 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16608903#comment-16608903
 ] 

Hyukjin Kwon commented on SPARK-25365:
--

Questions should go to the mailing list. Please see 
https://spark.apache.org/community.html. I believe you would get a better 
answer there.

> a better way to handle vector index and sparsity in FeatureHasher 
> implementation ?
> --
>
> Key: SPARK-25365
> URL: https://issues.apache.org/jira/browse/SPARK-25365
> Project: Spark
>  Issue Type: Question
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Vincent
>Priority: Major
>
> In the current implementation of FeatureHasher.transform, a simple modulo on 
> the hashed value is used to determine the vector index, and users are advised 
> to pass a large integer value as the numFeatures parameter.
> We found several issues with the current implementation:
>  # The feature name cannot be recovered from its index after the FeatureHasher 
> transform, for example when reading feature importances from a decision tree 
> trained on top of a FeatureHasher.
>  # When indices collide, which is very likely when 'numFeatures' is relatively 
> small, the colliding values are summed, i.e. the value of the conflicting 
> feature vector entry is silently changed by this module.
>  # To avoid collisions, 'numFeatures' must be set to a large number, but the 
> resulting highly sparse vectors increase the computational cost of model 
> training.
> We are working on fixing these problems for our own business needs; since this 
> may or may not be an issue for others, we'd like to hear from the 
> community.
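
(Illustration, not part of the original report.) A minimal sketch of the collision 
behaviour described above, assuming Spark 2.3.x: with a deliberately tiny 
numFeatures, hash(feature) % numFeatures maps several inputs to the same index, 
and the colliding values are summed in the output vector. The column names, 
sample rows and the numFeatures value below are made up for the example; the 
calls used (setInputCols, setOutputCol, setNumFeatures, transform) are the 
standard Spark ML FeatureHasher API available since 2.3.0.

{code:scala}
import org.apache.spark.ml.feature.FeatureHasher
import org.apache.spark.sql.SparkSession

object FeatureHasherCollisionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("FeatureHasherCollisionSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical toy data; column names are made up for this sketch.
    val df = Seq(
      (2.2, true, "1", "foo"),
      (3.3, false, "2", "bar")
    ).toDF("real", "bool", "stringNum", "string")

    // numFeatures defaults to 2^18; forcing it down to 4 makes
    // hash(feature) % numFeatures collisions almost certain, and
    // colliding entries are summed rather than kept separate.
    val hasher = new FeatureHasher()
      .setInputCols("real", "bool", "stringNum", "string")
      .setOutputCol("features")
      .setNumFeatures(4)

    // Inspect the resulting 4-dimensional vectors; with only 4 slots
    // there is no way to map an index back to the originating feature.
    hasher.transform(df).select("features").show(truncate = false)

    spark.stop()
  }
}
{code}

Raising numFeatures back toward the default avoids most collisions, but produces 
the very sparse, high-dimensional vectors the report complains about.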



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25365) a better way to handle vector index and sparsity in FeatureHasher implementation ?

2018-09-07 Thread Vincent (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16606746#comment-16606746
 ] 

Vincent commented on SPARK-25365:
-

[~nick.pentre...@gmail.com] Thanks.
