[
https://issues.apache.org/jira/browse/SPARK-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415600#comment-16415600
]
Barry Becker commented on SPARK-6162:
-
If we all agree that this is something that would be very nice to have, why is it
closed as won't fix instead of just being deferred to a future release?
This seems like a big limitation of tree models in Spark.
> Handle missing values in GBM
>
>
> Key: SPARK-6162
> URL: https://issues.apache.org/jira/browse/SPARK-6162
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.2.1
> Reporter: Devesh Parekh
> Priority: Major
>
> We build a lot of predictive models over data combined from multiple sources,
> where some entries may not have all sources of data and so some values are
> missing in each feature vector. Another place this might come up is if you
> have features from slightly heterogeneous items (or items composed of
> heterogeneous subcomponents) that share many features in common but may have
> extra features for different types, and you don't want to manually train
> models for every different type.
> R's GBM library, which is what we are currently using, deals with this type
> of data nicely by making "missing" nodes in the decision tree (a surrogate
> split) for features that can have missing values. We'd like to do the same
> with MLlib, but LabeledPoint would need to support missing values, and
> GradientBoostedTrees would need to be modified to deal with them.
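
To illustrate the "missing" node idea from the issue description, here is a minimal Python sketch of a decision-tree node with a third branch for missing values. It is not MLlib's API and not the R gbm implementation. The names Leaf, Split, is_missing, and predict are hypothetical, and treating both None and NaN as "missing" is an assumption made for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence, Union


@dataclass
class Leaf:
    value: float  # prediction contributed by this leaf


@dataclass
class Split:
    feature: int      # index into the feature vector
    threshold: float  # go left when x[feature] <= threshold
    left: "Node"
    right: "Node"
    missing: "Node"   # taken when x[feature] is missing (assumed: None or NaN)


Node = Union[Leaf, Split]


def is_missing(v: Optional[float]) -> bool:
    """Assumption for this sketch: None and NaN both mean 'missing'."""
    return v is None or math.isnan(v)


def predict(node: Node, x: Sequence[Optional[float]]) -> float:
    """Walk the tree, routing missing feature values down the 'missing' branch."""
    while isinstance(node, Split):
        v = x[node.feature]
        if is_missing(v):
            node = node.missing
        elif v <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.value


# A one-split tree: the leaf values are made up for illustration.
tree = Split(feature=0, threshold=2.5,
             left=Leaf(-1.0), right=Leaf(1.0), missing=Leaf(0.2))
```

In a boosted ensemble, each tree would route a row this way and the per-tree outputs would be summed. Training would also have to choose the value of each "missing" branch, which this sketch does not cover.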