hi,

when I join several tables and then write the result to another
table, the job runs very slowly. From the worker log and the Spark UI,
I can see a lot of time spent in GC.

The input tables are not very big; their sizes are:
84M
705M
2.7G
2.4M
573M

The resulting output is about 1.5 GB.
The worker is given 70 GB of memory (there is only one worker), and I
have configured Spark to use Kryo serialization. I don't understand why
there is so much GC, since it makes the job very slow.
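For reference, this is roughly how I set it up (a sketch of my session settings; the exact memory value and any GC flags are just what I chose, not recommendations):

```properties
-- in the Hive session / hive-site.xml
set hive.execution.engine=spark;
set spark.serializer=org.apache.spark.serializer.KryoSerializer;
set spark.executor.memory=70g;
```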

With the Spark core API I can call RDD.cache() and then watch in the UI
how much memory the RDD uses. In Hive on Spark, is there any way to
profile memory usage like that?
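What I mean by profiling with the core API is something like the sketch below (assuming a plain SparkContext named sc and an already-built RDD; getRDDStorageInfo is the SparkContext call that backs the Storage tab in the UI):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Build a context with Kryo enabled, as in my config above.
val conf = new SparkConf()
  .setAppName("memory-profile-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

val rdd = sc.textFile("hdfs:///some/input").cache()
rdd.count()  // force materialization so the cache is populated

// Print per-RDD memory footprint, same numbers the Storage tab shows.
sc.getRDDStorageInfo.foreach { info =>
  println(s"${info.name}: ${info.memSize} bytes in memory, " +
          s"${info.diskSize} bytes on disk")
}
```

With Hive on Spark I don't get to call cache() myself, which is why I'm asking whether an equivalent view of memory usage exists.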
