Hi,

I am running into OOM problems while training a Spark ML
RandomForestClassifier (maxDepth = 30, maxBins = 32, numTrees = 100).
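
For reference, this is roughly what the training code looks like (a
minimal sketch; "training" and the column names are placeholders for
the actual pipeline):

  import org.apache.spark.ml.classification.RandomForestClassifier

  // Forest configuration described above; "training" stands in
  // for the real input DataFrame.
  val rf = new RandomForestClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setMaxDepth(30)
    .setMaxBins(32)
    .setNumTrees(100)

  val rfModel = rf.fit(training)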

My dataset is arguably pretty big for the available resources
(8 executors with 5 GB of memory each): approximately 20M rows and
130 features.

The "fun fact" is that a single DecisionTreeClassifier with the same specs
(same maxDepth and maxBins) is able to train without problems in a couple
of minutes.
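
The single-tree run is essentially the same, minus the forest-specific
settings (again a sketch, with the same placeholder names):

  import org.apache.spark.ml.classification.DecisionTreeClassifier

  // Same maxDepth and maxBins as each tree in the forest above.
  val dt = new DecisionTreeClassifier()
    .setLabelCol("label")
    .setFeaturesCol("features")
    .setMaxDepth(30)
    .setMaxBins(32)

  val dtModel = dt.fit(training)  // completes in a couple of minutes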

AFAIK the current random forest implementation grows each tree
sequentially, i.e. the individual DecisionTreeClassifiers are fit one
at a time, so I would expect the forest's peak memory consumption to
be similar to that of a single tree. Am I missing something here?

Thanks
Julio
