Ewan Higgs created SPARK-13434:
----------------------------------

             Summary: Reduce Spark RandomForest memory footprint
                 Key: SPARK-13434
                 URL: https://issues.apache.org/jira/browse/SPARK-13434
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.6.0
         Environment: Linux
            Reporter: Ewan Higgs


The RandomForest implementation can easily run out of memory on moderate 
datasets. This was raised in a user's benchmarking game on GitHub 
(https://github.com/szilard/benchm-ml/issues/19). I looked to see if there was 
a tracking issue, but I couldn't find one.

Using Spark 1.6, a user of mine is running into problems training a 
RandomForest on largish datasets on machines with 64G of memory and the 
following in {{spark-defaults.conf}} (199 executors at 10G each, i.e. roughly 
2T of executor memory in aggregate):

{code}
spark.executor.cores 2
spark.executor.instances 199
spark.executor.memory 10240M
{code}

I reproduced the excessive memory use from the benchmark example (an input 
CSV of 1.3G with 686 columns) in spark-shell with {{spark-shell 
--driver-memory 30G --executor-memory 30G}} and took a heap profile on a 
single machine with {{jmap -histo:live <spark-pid>}}, sampling every 5 
seconds. At the peak it looks like this:

{code}
 num     #instances         #bytes  class name
----------------------------------------------
   1:       5428073     8458773496  [D
   2:      12293653     4124641992  [I
   3:      32508964     1820501984  org.apache.spark.mllib.tree.model.Node
   4:      53068426     1698189632  org.apache.spark.mllib.tree.model.Predict
   5:      72853787     1165660592  scala.Some
   6:      16263408      910750848  org.apache.spark.mllib.tree.model.InformationGainStats
   7:         72969      390492744  [B
   8:       3327008      133080320  org.apache.spark.mllib.tree.impl.DTStatsAggregator
   9:       3754500      120144000  scala.collection.immutable.HashMap$HashMap1
  10:       3318349      106187168  org.apache.spark.mllib.tree.model.Split
  11:       3534946       84838704  org.apache.spark.mllib.tree.RandomForest$NodeIndexInfo
  12:       3764745       60235920  java.lang.Integer
  13:       3327008       53232128  org.apache.spark.mllib.tree.impurity.EntropyAggregator
  14:        380804       45361144  [C
  15:        268887       34877128  <constMethodKlass>
  16:        268887       34431568  <methodKlass>
  17:        908377       34042760  [Lscala.collection.immutable.HashMap;
  18:       1100000       26400000  org.apache.spark.mllib.regression.LabeledPoint
  19:       1100000       26400000  org.apache.spark.mllib.linalg.SparseVector
  20:         20206       25979864  <constantPoolKlass>
  21:       1000000       24000000  org.apache.spark.mllib.tree.impl.TreePoint
  22:       1000000       24000000  org.apache.spark.mllib.tree.impl.BaggedPoint
  23:        908332       21799968  scala.collection.immutable.HashMap$HashTrieMap
  24:         20206       20158864  <instanceKlassKlass>
  25:         17023       14380352  <constantPoolCacheKlass>
  26:            16       13308288  [Lorg.apache.spark.mllib.tree.impl.DTStatsAggregator;
  27:        445797       10699128  scala.Tuple2
{code}
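
At the peak the heap is dominated by primitive arrays ({{[D}}, {{[I}}) and 
tree-model objects ({{Node}}, {{Predict}}, {{InformationGainStats}}), i.e. 
per-node model and statistics structures rather than the cached input points. 
For reference, here is a minimal sketch of the kind of training run that 
reproduces this in spark-shell; the file path, label column, and tree 
parameters are assumptions rather than the benchmark's exact configuration:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// Parse the ~1.3G CSV (assuming no header row, label in the first
// column, remaining 685 columns as features). The benchmark's data
// is one-hot encoded, hence the SparseVector entries above.
val data = sc.textFile("benchm-ml-train.csv").map { line =>
  val cols = line.split(',').map(_.toDouble)
  LabeledPoint(cols.head, Vectors.dense(cols.tail))
}.cache()

val model = RandomForest.trainClassifier(
  data,
  2,                // numClasses
  Map[Int, Int](),  // categoricalFeaturesInfo: all features continuous
  100,              // numTrees (hypothetical)
  "sqrt",           // featureSubsetStrategy
  "entropy",        // impurity: matches the EntropyAggregator above
  20,               // maxDepth: deep trees multiply Node/Predict counts
  32)               // maxBins
{code}

Since a binary tree of depth d can hold up to 2^(d+1)-1 nodes, a large 
{{maxDepth}} across many trees multiplies quickly, which is consistent with 
the tens of millions of {{Node}} and {{Predict}} instances in the histogram.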
