I *think* I solved the issue.
Will update w/ details after further testing / inspection.

On Mon, Jun 14, 2021 at 8:50 PM Reed Villanueva <villanuevar...@gmail.com>
wrote:

> What happens if a random forest's "max bins" hyperparameter is set too high?
>
> When training a Spark ML random forest (
> https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier
> ) with maxBins set roughly equal to the maximum number of distinct
> categorical values for any given feature, I see OK performance metrics. But
> when I set it closer to 2x or 3x that number of distinct values,
> performance is terrible: e.g., for a binary classifier, accuracy is no
> better than the base rate of responses in the dataset, and the feature
> importances (
> https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassificationModel.html?highlight=featureimportances#pyspark.ml.classification.RandomForestClassificationModel.featureImportances
> ) are all zeros (whereas the lower initial maxBins value at least shows
> *something* for the importances).
>
> I would not have expected such a huge difference from a change in maxBins
> alone, especially the difference between seeing *something* vs. absolutely
> nothing (all zeros) for the feature importances.
>
> What could be happening under the hood of the algorithm that causes such
> different outcomes when this parameter is changed like this?
>
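For reference, a minimal sketch of the kind of setup being described above (this is not the code from the thread; the data, column names, and parameter values are made up) for training a PySpark RandomForestClassifier with an explicit maxBins and then reading featureImportances:

    # Minimal sketch (hypothetical data and columns): train a PySpark
    # RandomForestClassifier with an explicit maxBins and print the
    # resulting featureImportances.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier

    spark = SparkSession.builder.getOrCreate()

    # Toy dataset with one categorical feature and one numeric feature.
    df = spark.createDataFrame(
        [("a", 1.0, 0.0), ("b", 2.0, 1.0), ("c", 3.0, 0.0), ("a", 4.0, 1.0)],
        ["category", "amount", "label"],
    )

    # StringIndexer attaches categorical metadata, so the trees treat
    # category_idx as a categorical split rather than a continuous one.
    indexer = StringIndexer(inputCol="category", outputCol="category_idx")
    assembler = VectorAssembler(inputCols=["category_idx", "amount"],
                                outputCol="features")

    # maxBins must be at least the number of distinct values of the largest
    # categorical feature; here it is kept close to that count rather than
    # 2-3x larger.
    rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                                numTrees=50, maxBins=32, seed=42)

    model = Pipeline(stages=[indexer, assembler, rf]).fit(df)
    print(model.stages[-1].featureImportances)

Note that Spark refuses to train at all when maxBins is smaller than the largest categorical feature's number of distinct values, so in practice the useful range for this setting on categorical data is fairly narrow.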
