Hi,

If your job gets stuck or fails, one of the best practices is to increase the number of partitions. Also, you'd be better off using DataFrames instead of RDDs, because DataFrame joins go through Spark SQL's optimizer.
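A minimal sketch of both suggestions, assuming Spark 1.6 and tab-separated key/value text input. The object name, the input paths taken from `args`, the column names, and the partition count of 400 are placeholders for illustration, not values from your job, so tune them for your data:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JoinSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("join-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Assumed partition count; a starting point to tune, not a recommendation.
    val numPartitions = 400

    // Hypothetical inputs: lines of the form "key<TAB>value".
    // Lines without a tab are dropped so that a(1) is always defined.
    def load(path: String) =
      sc.textFile(path)
        .map(_.split("\t", 2))
        .filter(_.length == 2)
        .map(a => (a(0), a(1)))

    val left = load(args(0))
    val right = load(args(1))

    // Suggestion 1: keep the RDD join, but ask for more shuffle partitions.
    val joinedRdd = left.join(right, numPartitions)
    println(s"RDD join rows: ${joinedRdd.count()}")

    // Suggestion 2: do the same join with DataFrames.
    // spark.sql.shuffle.partitions controls the partition count of the join shuffle.
    sqlContext.setConf("spark.sql.shuffle.partitions", numPartitions.toString)
    val leftDf = left.toDF("key", "lval")
    val rightDf = right.toDF("key", "rval")
    val joinedDf = leftDf.join(rightDf, "key")
    println(s"DataFrame join rows: ${joinedDf.count()}")

    sc.stop()
  }
}
```

You would submit this with `spark-submit` and pass the two input paths as arguments. Only one of the two joins is needed in practice; both are shown here for comparison.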
// maropu

On Thu, May 26, 2016 at 11:40 PM, Priya Ch <learnings.chitt...@gmail.com> wrote:

> Hello Team,
>
> I am trying to perform join 2 rdds where one is of size 800 MB and the
> other is 190 MB. During the join step, my job halts and I don't see
> progress in the execution.
>
> This is the message I see on console -
>
> INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output
> locations for shuffle 0 to <hostname1>:40000
> INFO spark.MapOutputTrackerMasterEndPoint: Asked to send map output
> locations for shuffle 1 to <hostname2>:40000
>
> After these messages, I dont see any progress. I am using Spark 1.6.0
> version and yarn scheduler (running in YARN client mode). My cluster
> configurations is - 3 node cluster (1 master and 2 slaves). Each slave has
> 1 TB hard disk space, 300GB memory and 32 cores.
>
> HDFS block size is 128 MB.
>
> Thanks,
> Padma Ch

--
---
Takeshi Yamamuro