I *think* I solved the issue. Will update with details after further testing / inspection.
On Mon, Jun 14, 2021 at 8:50 PM Reed Villanueva <villanuevar...@gmail.com> wrote:

> What happens if a random forest's "max bins" hyperparameter is set too high?
>
> When training a sparkml random forest
> (https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier)
> with maxBins set roughly equal to the max number of distinct categorical
> values for any given feature, I see OK performance metrics. But when I set
> it closer to 2x or 3x the number of distinct categorical values, performance
> is terrible (e.g., accuracy, in the case of a binary classifier, no better
> than the base distribution of responses in the dataset) and the feature
> importances
> (https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.classification.RandomForestClassificationModel.html?highlight=featureimportances#pyspark.ml.classification.RandomForestClassificationModel.featureImportances)
> come back all zeros (whereas with the lower initial maxBins value they at
> least show *something*).
>
> I would not have thought there would be such a huge difference from a
> change in max bins like this (especially the difference between seeing
> *something* vs. absolutely nothing / all zeros for the feature importances).
>
> What could be happening under the hood of the algorithm that causes such
> different outcomes when this parameter is changed like this?
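For reference, the setup being described looks roughly like this (a minimal
sketch, not the actual job: train_df and the column names are hypothetical,
and the categorical features are assumed to already be indexed and assembled
into a single vector column):

    from pyspark.ml.classification import RandomForestClassifier

    # maxBins must be >= the number of distinct values of the largest
    # categorical feature; the question is why a value near that count
    # works while ~2-3x that count does not.
    rf = RandomForestClassifier(
        featuresCol="features",
        labelCol="label",
        maxBins=32,  # vs. e.g. 64 or 96 in the failing runs
        seed=42,
    )
    model = rf.fit(train_df)
    print(model.featureImportances)  # reportedly all zeros in the failing case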