Hi, when I join several tables and then write the result to another table, the job runs very slowly. Looking at the worker log and the Spark UI, I see a lot of GC time.
The input tables are not very big: 84 MB, 705 MB, 2.7 GB, 2.4 MB, and 573 MB, and the output is about 1.5 GB. The worker (there is only one) has 70 GB of memory, and I have configured Spark to use Kryo serialization. I don't understand why there is so much GC; it makes the job very slow. With the Spark core API I can call RDD.cache() and then watch how much memory the RDD uses. In Hive on Spark, is there any way to profile memory usage?
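In case it helps, this is roughly what my serializer setting looks like, plus the standard JVM GC-logging flags I could add to get per-collection detail in the executor logs (a sketch; `spark.serializer` and `spark.executor.extraJavaOptions` are standard Spark properties, and the `-XX` flags are standard HotSpot options):

```properties
# spark-defaults.conf (sketch)
# Kryo serialization, as mentioned above:
spark.serializer                 org.apache.spark.serializer.KryoSerializer
# JVM flags to print each GC event with timestamps in the executor log:
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
```

With those flags the executor log should show each collection's pause time and which generation filled up, which might narrow down where the memory pressure comes from.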
