Lots of tuples get used in Spark itself and in its Scala-based
implementation anyway; Tuple2 is simply the type of (a, b) in Scala, and
$colon$colon is the cons cell of Scala's immutable List, so one is
allocated per List element. You can use a profiler to see where these
are being allocated, but my guess is that it is not just in translating
to/from Java.
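
For example, a minimal sketch (the class name and local master here are
made up for illustration) shows scala.Tuple2 coming straight out of the
plain Java API, with no extra Java-only wrapper on top:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class TupleDemo {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("tuple-demo").setMaster("local[*]"));

    JavaRDD<String> words = sc.parallelize(Arrays.asList("a", "b", "a"));

    // Each pair is a scala.Tuple2 -- the same class Spark's Scala
    // internals allocate for (a, b).
    JavaPairRDD<String, Integer> counts = words
        .mapToPair(w -> new Tuple2<>(w, 1))
        .reduceByKey((a, b) -> a + b);

    System.out.println(counts.collect());
    sc.stop();
  }
}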

You can also usually call directly into the Scala implementations from
Java, if it helps.
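
If you want to go that route, a rough sketch (again, names made up)
that unwraps the Scala RDD behind a JavaPairRDD looks like this:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;
import scala.Tuple2;

import java.util.Arrays;

public class UnwrapDemo {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(
        new SparkConf().setAppName("unwrap-demo").setMaster("local[*]"));

    JavaPairRDD<String, Integer> pairs = sc.parallelizePairs(Arrays.asList(
        new Tuple2<String, Integer>("a", 1),
        new Tuple2<String, Integer>("b", 2)));

    // A JavaPairRDD is a thin facade over the Scala RDD it wraps; rdd()
    // just returns that RDD, so no copying is involved.
    RDD<Tuple2<String, Integer>> scalaRdd = pairs.rdd();
    System.out.println(scalaRdd.count());

    sc.stop();
  }
}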

On Thu, Oct 30, 2014 at 5:41 AM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
> Thanks Koert. These numbers do tie back to our data and algorithms.
> Would going the Scala route save some memory, given that the Java API
> creates a wrapper Tuple2 for all pair functions?
>
>
> On Wednesday, October 29, 2014, Koert Kuipers <ko...@tresata.com> wrote:
>>
>> Since Spark holds its data structures on the heap (and by default tries to
>> work with all data in memory) and is written in Scala, seeing lots of
>> scala.Tuple2 instances is not unexpected. How do these numbers relate to
>> your data size?
>>
>> On Oct 27, 2014 2:26 PM, "Sonal Goyal" <sonalgoy...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> I wanted to understand what kind of memory overheads, if any, are expected
>>> when using the Java API. My application seems to have a lot of live Tuple2
>>> instances and I am hitting a lot of GC, so I am wondering if I am doing
>>> something fundamentally wrong. Here is what the top of my heap looks like.
>>> I create reifier.tuple.Tuple objects, pass them to map methods, and mostly
>>> return Tuple2<Tuple, Tuple>. The heap seems to have far too many Tuple2 and
>>> $colon$colon instances.
>>>
>>>
>>> num     #instances         #bytes  class name
>>> ----------------------------------------------
>>>    1:      85414872     2049956928  scala.collection.immutable.$colon$colon
>>>    2:      85414852     2049956448  scala.Tuple2
>>>    3:        304221       14765832  [C
>>>    4:        302923        7270152  java.lang.String
>>>    5:         44111        2624624  [Ljava.lang.Object;
>>>    6:          1210        1495256  [B
>>>    7:         39839         956136  java.util.ArrayList
>>>    8:            29         950736  [Lscala.concurrent.forkjoin.ForkJoinTask;
>>>    9:          8129         827792  java.lang.Class
>>>   10:         33839         812136  java.lang.Long
>>>   11:         33400         801600  reifier.tuple.Tuple
>>>   12:          6116         538208  java.lang.reflect.Method
>>>   13:         12767         408544  java.util.concurrent.ConcurrentHashMap$Node
>>>   14:          5994         383616  org.apache.spark.scheduler.ResultTask
>>>   15:         10298         329536  java.util.HashMap$Node
>>>   16:         11988         287712  org.apache.spark.rdd.NarrowCoGroupSplitDep
>>>   17:          5708         228320  reifier.block.Canopy
>>>   18:             9         215784  [Lscala.collection.Seq;
>>>   19:         12078         193248  java.lang.Integer
>>>   20:         12019         192304  java.lang.Object
>>>   21:          5708         182656  reifier.block.Tree
>>>   22:          2776         173152  [I
>>>   23:          6013         144312  scala.collection.mutable.ArrayBuffer
>>>   24:          5994         143856  [Lorg.apache.spark.rdd.CoGroupSplitDep;
>>>   25:          5994         143856  org.apache.spark.rdd.CoGroupPartition
>>>   26:          5994         143856  org.apache.spark.rdd.ShuffledRDDPartition
>>>   27:          4486         143552  java.util.Hashtable$Entry
>>>   28:          6284         132800  [Ljava.lang.Class;
>>>   29:          1819         130968  java.lang.reflect.Field
>>>   30:           605         101208  [Ljava.util.HashMap$Node;
>>>
>>>
>>>
>>> Best Regards,
>>> Sonal
>>> Nube Technologies
>>>
>
>
> --
> Best Regards,
> Sonal
> Nube Technologies
>
