I think it's because with fewer than K bins you would not be able to
consider at least K-1 splits among the K values of a categorical feature,
and you want to be able to do that. There may be more technical reasons in
the code why this is strictly enforced, but it seems like a reasonable rule.
Agreed, more than K bins doesn't seem to help for that feature, but it
doesn't hurt either: you'll still get K-1 possible splits. The value is
global to the whole tree, so it may need to be higher for other categorical
features, or of course for continuous features.
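
To make the K-1 point concrete, here is a rough sketch in plain Python. It
assumes the approach the Spark MLlib decision tree guide describes for binary
classification and regression, where the categories are ordered by average
label and only splits along that ordering are considered. The function names
(ordered_category_splits, min_max_bins) are made up for illustration and are
not Spark APIs, and this is not Spark's actual implementation.

```python
from collections import defaultdict


def ordered_category_splits(categories, labels):
    """Return the K-1 candidate (left, right) splits for one categorical feature.

    Assumption: categories are ordered by mean label, and each split cuts
    that ordering at one position.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    # One accumulator ("bin") per distinct category value.
    for c, y in zip(categories, labels):
        sums[c] += y
        counts[c] += 1
    # Order by mean label; ties are broken by category name for determinism.
    ordered = sorted(counts, key=lambda c: (sums[c] / counts[c], c))
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]


def min_max_bins(category_counts):
    """Smallest value meeting the documented rule: >= 2 and >= every K."""
    return max([2, *category_counts])


cats = ["a", "b", "c", "a", "b", "c", "d"]
ys = [1, 0, 1, 1, 0, 0, 0]
splits = ordered_category_splits(cats, ys)
for left, right in splits:
    print(left, "|", right)
```

With K = 4 distinct values this gives 3 candidate splits. Because the
setting is global, a dataset whose categorical features have 4, 32 and 7
values would need min_max_bins([4, 32, 7]), i.e. at least 32.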

I don't think this relates to preprocessing. It's a property of the tree.

On Wed, Jun 16, 2021 at 1:33 AM Reed Villanueva <villanuevar...@gmail.com>
wrote:

> Why does Spark ML's random forest classifier not support maxBins
> <https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html#pyspark.ml.classification.RandomForestClassifier.maxBins>
> (M) less than the total number of categorical values (K)?
>
> My understanding of decision tree bins is that...
>
>> Statistical data binning is basically a form of quantization where you map
>> a set of numbers with continuous values into *smaller*, more manageable
>> “bins.”
>
>
> https://clevertap.com/blog/numerical-vs-categorical-variables-decision-trees/
>
> ...which makes it seem like you wouldn't ever really want to use M > K in
> any case, yet the docs seem to imply that is not the case.
>
> Must be >=2 and >= number of categories for any categorical feature
>
> Plus, when I use the random forest implementation in H2O
> <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html>, I
> do have the option of using fewer bins than the total number of distinct
> categorical values.
>
> Could anyone explain the reason for this restriction in Spark? Is there
> some kind of particular data preprocessing / feature engineering users are
> expected to have done beforehand? Am I misunderstanding something about
> decision trees (e.g., is it that categorical features don't really ever
> *need* to be binned in the first place and the setting is just for
> numerical values or something)?
>
