On Sun, Jan 11, 2015 at 9:46 PM, Christopher Thom
<christopher.t...@quantium.com.au> wrote:
> Is there any plan to extend the data types that would be accepted by the Tree 
> models in Spark? e.g. Many models that we build contain a large number of 
> string-based categorical factors. Currently the only strategy is to map these 
> string values to integers, and store the mapping so the data can be remapped 
> when the model is scored. A viable solution, but cumbersome for models with 
> hundreds of these kinds of factors.

I don't think there is anything on the roadmap, except that the newer ML
API (the bits under spark.ml) has fuller support for the idea of a
pipeline of transformations, and performing this encoding could be one
step in such a pipeline.

Since it's directly relevant, I don't mind mentioning that we did
build this sort of logic on top of MLlib and PMML. There's nothing
hard about it, just a pile of translation and counting code, such as
in 
https://github.com/OryxProject/oryx/blob/master/oryx-app-common/src/main/java/com/cloudera/oryx/app/rdf/RDFPMMLUtils.java

So there are bits out there you can reuse, especially if your goal is
to get to PMML, which will want to represent the actual categorical
values in its DataDictionary, not the encodings.


> Concerning missing data, I haven't been able to figure out how to use NULL 
> values in LabeledPoints, and I'm not sure whether DecisionTrees correctly 
> handle the case of missing data. The only thing I've been able to work out is 
> to use a placeholder value,

Yes, I don't think that's supported. In training, you can simply
ignore examples that can't be routed past a node because they lack the
feature used in its decision rule. This is OK as long as not too much
data is missing.
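As a sketch of that training-time behavior (illustrative Python, not MLlib code; `partition` is a made-up helper), splitting the data at a node just drops the rows that are missing the split feature:

```python
def partition(rows, feature, threshold):
    """Split rows (dicts of feature -> value) on a numeric feature.
    Rows missing the feature are dropped, i.e. ignored in training,
    rather than being sent down either branch."""
    left, right = [], []
    for row in rows:
        value = row.get(feature)  # None means the feature is missing
        if value is None:
            continue
        (left if value <= threshold else right).append(row)
    return left, right

rows = [{"x": 1.0}, {"x": 5.0}, {"y": 2.0}]  # third row has no "x"
left, right = partition(rows, "x", 3.0)      # the row missing "x" is dropped
```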

In scoring, though, you can't decline to answer. Again, if you refer
to PMML, you can see some ideas about how to handle this:
http://www.dmg.org/v4-2-1/TreeModel.html#xsdType_MISSING-VALUE-STRATEGY

- Make no prediction
- Just copy the last prediction
- Use a model-supplied default for the node
- Use some confidence-weighted combination of the answers you'd get by
following both paths

I have opted, in the past, for simply defaulting to the subtree with
more training examples. All of these strategies are approximations,
yes.
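The default-to-the-bigger-subtree strategy is simple to implement. A toy sketch in Python (again illustrative, not MLlib internals; the `Node` class and counts are made up):

```python
class Node:
    """A tree node: internal nodes split on a numeric feature; leaves predict.
    n_training counts the training examples that reached this node."""
    def __init__(self, feature=None, threshold=None, left=None, right=None,
                 n_training=0, prediction=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.n_training = n_training
        self.prediction = prediction

def score(node, example):
    """Walk the tree; on a missing feature, default to the subtree
    that saw more training examples."""
    while node.prediction is None:
        value = example.get(node.feature)
        if value is None:
            node = (node.left if node.left.n_training >= node.right.n_training
                    else node.right)
        else:
            node = node.left if value <= node.threshold else node.right
    return node.prediction

# Tiny example: split on "age" at 30; the left leaf saw 80 training
# examples, the right leaf 20, so missing "age" falls left.
tree = Node(feature="age", threshold=30.0,
            left=Node(n_training=80, prediction="A"),
            right=Node(n_training=20, prediction="B"))
assert score(tree, {"age": 25.0}) == "A"
assert score(tree, {}) == "A"  # missing feature defaults to the bigger subtree
```

This corresponds roughly to PMML's defaultChild strategy, with the default chosen by training count.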

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
