I think it's because otherwise you would not be able to consider at least K-1 splits among the K values of a categorical feature, and you want to be able to do that. There may be more technical reasons in the code for enforcing this strictly, but it seems like a decent idea. Agreed, more than K bins doesn't seem to help for that feature, but that won't matter much: you'll still get K-1 possible splits. The value is global to the whole tree, so it may need to be higher for other categorical features, or of course for continuous features.
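To make the K-1 count concrete, here is a rough sketch in plain Python. It is not Spark's code: `ordered_category_splits` is a made-up helper, and it assumes the common trick of ordering categories by mean label (binary classification or regression) and then splitting that ordering at each position. Each category needs its own slot for this to work, which is roughly why fewer than K bins would not be enough.

```python
from collections import defaultdict


def ordered_category_splits(values, labels):
    """Return the K-1 candidate (left, right) splits for one categorical feature.

    Hypothetical helper for illustration only: categories are ordered by
    their mean label, and every cut point in that ordering is a candidate.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, label in zip(values, labels):
        sums[value] += label
        counts[value] += 1

    # Order categories by mean label; break ties by name so the result is stable.
    ordered = sorted(counts, key=lambda c: (sums[c] / counts[c], c))

    # K categories give K-1 cut points between neighbours in the ordering.
    return [
        (set(ordered[:i]), set(ordered[i:]))
        for i in range(1, len(ordered))
    ]


# Toy data: one categorical feature with K = 4 distinct values.
values = ["a", "b", "c", "d", "a", "b", "c", "d"]
labels = [0, 1, 1, 0, 0, 1, 0, 1]

splits = ordered_category_splits(values, labels)
for left, right in splits:
    print(sorted(left), "|", sorted(right))
```

With K = 4 categories this yields 3 candidate splits. A continuous feature in the same tree has no such natural limit, which is why the single global setting may need to be larger than K.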
I don't think this relates to preprocessing. It's a property of the tree.

On Wed, Jun 16, 2021 at 1:33 AM Reed Villanueva <villanuevar...@gmail.com> wrote:

> Why does sparkml's random forest classifier not support maxBins
> <https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassifier.html#pyspark.ml.classification.RandomForestClassifier.maxBins>
> (M) < (K) number of total categorical values?
>
> My understanding of decision tree bins is that...
>
>> Statistical data binning is basically a form of quantization where you map
>> a set of numbers with continuous values into *smaller*, more manageable
>> “bins.”
>
> https://clevertap.com/blog/numerical-vs-categorical-variables-decision-trees/
>
> ...which makes it seem like you wouldn't ever really want to use M > K in
> any case, yet the docs seem to imply that is not the case.
>
>> Must be >=2 and >= number of categories for any categorical feature
>
> Plus, when I use the random forest implementation in H2O
> <https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html>, I
> do have the option of using less bins that the total number of distinct
> categorical values.
>
> Could anyone explain the reason for this restriction in spark? Is there
> some kind of particular data preprocessing / feature engineering users are
> expected to have done beforehand? Am I misunderstanding something about
> decision trees (eg. is it categorical don't really ever *need* to be
> binned in the first place and the setting is just for numerical values or
> something)?