Hi, I am running into OOM problems while training a Spark ML RandomForestClassifier (maxDepth = 30, maxBins = 32, numTrees = 100).
My dataset is arguably pretty big relative to the executor count and size (8 executors x 5 GB each): approximately 20M rows and 130 features.

The "fun fact" is that a single DecisionTreeClassifier with the same specs (same maxDepth and maxBins) trains without problems in a couple of minutes. AFAIK the current random forest implementation grows each tree sequentially, which would mean the DecisionTreeClassifiers are fit one by one, so memory consumption during training should be comparable to the single-tree case. Am I missing something here?

Thanks,
Julio