Hi, I wanted to understand what kind of memory overheads, if any, are expected when using the Java API. My application seems to have a lot of live Tuple2 instances and I am hitting a lot of GC, so I am wondering if I am doing something fundamentally wrong. I create reifier.tuple.Tuple objects, pass them to map methods, and mostly return Tuple2<Tuple, Tuple>; the heap seems to have far too many Tuple2 and $colon$colon instances. A simplified sketch of the map step is below, followed by what the top of my heap looks like.
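To be concrete, the map step follows roughly the pattern below. This is a minimal, self-contained sketch, not my actual code: the Tuple class here is just a stand-in for reifier.tuple.Tuple, and the input data, field names, and local[*] master are placeholders.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

public class TuplePairSketch {

    // Stand-in for reifier.tuple.Tuple: a small serializable value holder.
    public static class Tuple implements Serializable {
        private final List<Object> values;
        public Tuple(Object... values) { this.values = Arrays.asList(values); }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("tuple-pair-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.parallelize(Arrays.asList("a,1", "b,2", "c,3"));

            // Each input record is mapped to a Tuple2<Tuple, Tuple>, so every row
            // allocates one Tuple2 plus two Tuple instances.
            JavaPairRDD<Tuple, Tuple> pairs = lines.mapToPair(line -> {
                String[] parts = line.split(",");
                return new Tuple2<>(new Tuple(parts[0]), new Tuple(parts[1]));
            });

            pairs.count();
        }
    }
}

With this pattern, every input record allocates a Tuple2 plus two Tuple instances, which is the per-record overhead I am asking about.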
 num     #instances         #bytes  class name
----------------------------------------------
   1:      85414872     2049956928  scala.collection.immutable.$colon$colon
   2:      85414852     2049956448  scala.Tuple2
   3:        304221       14765832  [C
   4:        302923        7270152  java.lang.String
   5:         44111        2624624  [Ljava.lang.Object;
   6:          1210        1495256  [B
   7:         39839         956136  java.util.ArrayList
   8:            29         950736  [Lscala.concurrent.forkjoin.ForkJoinTask;
   9:          8129         827792  java.lang.Class
  10:         33839         812136  java.lang.Long
  11:         33400         801600  reifier.tuple.Tuple
  12:          6116         538208  java.lang.reflect.Method
  13:         12767         408544  java.util.concurrent.ConcurrentHashMap$Node
  14:          5994         383616  org.apache.spark.scheduler.ResultTask
  15:         10298         329536  java.util.HashMap$Node
  16:         11988         287712  org.apache.spark.rdd.NarrowCoGroupSplitDep
  17:          5708         228320  reifier.block.Canopy
  18:             9         215784  [Lscala.collection.Seq;
  19:         12078         193248  java.lang.Integer
  20:         12019         192304  java.lang.Object
  21:          5708         182656  reifier.block.Tree
  22:          2776         173152  [I
  23:          6013         144312  scala.collection.mutable.ArrayBuffer
  24:          5994         143856  [Lorg.apache.spark.rdd.CoGroupSplitDep;
  25:          5994         143856  org.apache.spark.rdd.CoGroupPartition
  26:          5994         143856  org.apache.spark.rdd.ShuffledRDDPartition
  27:          4486         143552  java.util.Hashtable$Entry
  28:          6284         132800  [Ljava.lang.Class;
  29:          1819         130968  java.lang.reflect.Field
  30:           605         101208  [Ljava.util.HashMap$Node;

Best Regards,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>