Lots of tuples get used in Spark itself and its Scala-based implementation anyway; Tuple2 is the type of (a, b) in Scala. You can use a profiler to see where these are being allocated, but my guess is that it is not just in translating to/from Java.
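As a quick sanity check, the histogram quoted lower down already gives the per-instance cost: dividing #bytes by #instances for the top two rows yields exactly 24 bytes each. The sketch below just redoes that arithmetic in plain Java; the comment about object headers and compressed references is an assumption about a typical 64-bit HotSpot layout, not something the heap dump itself proves.

```java
// Back-of-envelope check on the heap histogram from the thread below:
// both scala.Tuple2 and scala.collection.immutable.$colon$colon show
// ~85.4M instances totalling ~2.05 GB each.
public class TupleOverhead {
    public static void main(String[] args) {
        long tuple2Instances = 85_414_852L;
        long tuple2Bytes     = 2_049_956_448L;
        long consInstances   = 85_414_872L;
        long consBytes       = 2_049_956_928L;

        long perTuple2 = tuple2Bytes / tuple2Instances; // 24 bytes
        long perCons   = consBytes / consInstances;     // 24 bytes

        // 24 bytes is consistent with a 16-byte object header plus two
        // 4-byte compressed references (assumed 64-bit HotSpot layout).
        System.out.println("bytes per Tuple2:    " + perTuple2);
        System.out.println("bytes per cons cell: " + perCons);

        // One Tuple2 plus one $colon$colon list cell per element means
        // roughly 48 bytes of structural overhead before any payload.
        System.out.println("overhead per element: " + (perTuple2 + perCons));
    }
}
```

So each element carried as a Tuple2 inside an immutable list costs about 48 bytes of wrapper objects on top of the keys and values themselves, which is why the two classes dominate the heap together.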
You can also usually call directly into the Scala implementations from Java, if that helps.

On Thu, Oct 30, 2014 at 5:41 AM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
> Thanks Koert. These numbers indeed tie back to our data and algorithms.
> Would going the Scala route save some memory, as the Java API creates
> wrapper Tuple2 for all pair functions?
>
>
> On Wednesday, October 29, 2014, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> since spark holds data structures on heap (and by default tries to work
>> with all data in memory) and it is written in Scala, seeing lots of Scala
>> Tuple2 is not unexpected. how do these numbers relate to your data size?
>>
>> On Oct 27, 2014 2:26 PM, "Sonal Goyal" <sonalgoy...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I wanted to understand what kind of memory overheads are expected, if
>>> at all, while using the Java API. My application seems to have a lot of
>>> live Tuple2 instances and I am hitting a lot of gc, so I am wondering if
>>> I am doing something fundamentally wrong. Here is what the top of my heap
>>> looks like. I actually create reifier.tuple.Tuple objects, pass them to
>>> map methods, and mostly return Tuple2<Tuple,Tuple>. The heap seems to
>>> have far too many Tuple2 and $colon$colon.
>>>
>>>  num     #instances         #bytes  class name
>>> ----------------------------------------------
>>>    1:      85414872     2049956928  scala.collection.immutable.$colon$colon
>>>    2:      85414852     2049956448  scala.Tuple2
>>>    3:        304221       14765832  [C
>>>    4:        302923        7270152  java.lang.String
>>>    5:         44111        2624624  [Ljava.lang.Object;
>>>    6:          1210        1495256  [B
>>>    7:         39839         956136  java.util.ArrayList
>>>    8:            29         950736  [Lscala.concurrent.forkjoin.ForkJoinTask;
>>>    9:          8129         827792  java.lang.Class
>>>   10:         33839         812136  java.lang.Long
>>>   11:         33400         801600  reifier.tuple.Tuple
>>>   12:          6116         538208  java.lang.reflect.Method
>>>   13:         12767         408544  java.util.concurrent.ConcurrentHashMap$Node
>>>   14:          5994         383616  org.apache.spark.scheduler.ResultTask
>>>   15:         10298         329536  java.util.HashMap$Node
>>>   16:         11988         287712  org.apache.spark.rdd.NarrowCoGroupSplitDep
>>>   17:          5708         228320  reifier.block.Canopy
>>>   18:             9         215784  [Lscala.collection.Seq;
>>>   19:         12078         193248  java.lang.Integer
>>>   20:         12019         192304  java.lang.Object
>>>   21:          5708         182656  reifier.block.Tree
>>>   22:          2776         173152  [I
>>>   23:          6013         144312  scala.collection.mutable.ArrayBuffer
>>>   24:          5994         143856  [Lorg.apache.spark.rdd.CoGroupSplitDep;
>>>   25:          5994         143856  org.apache.spark.rdd.CoGroupPartition
>>>   26:          5994         143856  org.apache.spark.rdd.ShuffledRDDPartition
>>>   27:          4486         143552  java.util.Hashtable$Entry
>>>   28:          6284         132800  [Ljava.lang.Class;
>>>   29:          1819         130968  java.lang.reflect.Field
>>>   30:           605         101208  [Ljava.util.HashMap$Node;
>>>
>>>
>>> Best Regards,
>>> Sonal
>>> Nube Technologies
>>>
>
> --
> Best Regards,
> Sonal
> Nube Technologies
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org