Chen Lin created SPARK-21005:
--------------------------------

             Summary: VectorIndexerModel does not prepare output column field 
correctly
                 Key: SPARK-21005
                 URL: https://issues.apache.org/jira/browse/SPARK-21005
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 2.1.1
            Reporter: Chen Lin


From my understanding of the documentation, VectorIndexer decides which features should be categorical based on the number of distinct values: features with at most maxCategories distinct values are declared categorical, while features that exceed maxCategories are declared continuous.
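The decision rule described above can be sketched as follows. This is an illustrative simplification in Python, not Spark's actual implementation; the function name is hypothetical.

```python
# Illustrative sketch (not Spark's internal code) of VectorIndexer's rule:
# a feature is treated as categorical iff its number of distinct values
# is at most maxCategories; otherwise it is treated as continuous.

def is_categorical(values, max_categories):
    """Return True if the feature would be indexed as categorical."""
    return len(set(values)) <= max_categories

feature_a = [0.0, 1.0, 0.0, 2.0, 1.0]        # 3 distinct values
feature_b = [float(i) for i in range(100)]   # 100 distinct values

print(is_categorical(feature_a, max_categories=20))  # True  -> categorical
print(is_categorical(feature_b, max_categories=20))  # False -> continuous
```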

Currently, VectorIndexerModel works correctly on a dataset with an empty schema (no column metadata). However, when VectorIndexerModel transforms a dataset carrying `ML_ATTR` metadata, it may not produce the expected output. For example, a feature with a nominal attribute whose number of distinct values exceeds maxCategories will not be treated as a continuous feature, as expected, but remains a categorical feature. This can cause all the tree-based algorithms (Decision Tree, Random Forest, GBDT, etc.) to fail with errors such as: "DecisionTree requires maxBins (= $maxPossibleBins) to be at least as large as the number of values in each categorical feature, but categorical feature $maxCategory has $maxCategoriesPerFeature values. Considering remove this and other categorical features with a large number of values, or add more training examples.".
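The precondition that triggers the quoted error can be sketched as below. This is a hedged Python illustration of the check, not Spark's internal API; the function and parameter names are assumptions for the example.

```python
# Hedged sketch of the precondition behind the quoted DecisionTree error:
# maxBins must be at least as large as the number of values (arity) of
# every categorical feature. Names here are illustrative only.

def check_max_bins(categorical_arity, max_bins):
    """Raise ValueError if any categorical feature has more values than max_bins."""
    for feature, arity in categorical_arity.items():
        if arity > max_bins:
            raise ValueError(
                f"DecisionTree requires maxBins (= {max_bins}) to be at least "
                f"as large as the number of values in each categorical feature, "
                f"but categorical feature {feature} has {arity} values."
            )

# A feature mis-flagged as categorical with 1000 distinct values fails
# the check when maxBins is, say, the common default of 32:
try:
    check_max_bins({0: 3, 1: 1000}, max_bins=32)
except ValueError as e:
    print(e)
```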

Correct me if my understanding is wrong.
I will submit a PR soon to solve this issue.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
