Hi,

I'm running Spark on YARN, carrying out a simple reduceByKey followed by another 
reduceByKey after some transformations. After the first stage completes, my 
master runs out of memory.
I have 20G assigned to the master, 145 executors (12G each + 4G overhead), 
around 90k input files, 10+ TB of data, 2000 reducers, and no caching.
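
For reference, a minimal sketch of how that setup might be expressed as Spark 
configuration, assuming Spark 1.x on YARN; the keys and values here are my 
assumptions restating the numbers above, not taken from the actual job:

    import org.apache.spark.SparkConf

    // Sketch only: in practice driver memory must be set at submit time
    // (e.g. via spark-submit), not from application code
    val conf = new SparkConf()
      .setMaster("yarn-cluster")                          // assumed deploy mode
      .set("spark.driver.memory", "20g")                  // 20G for the master/driver
      .set("spark.executor.instances", "145")             // 145 executors
      .set("spark.executor.memory", "12g")                // 12G each
      .set("spark.yarn.executor.memoryOverhead", "4096")  // +4G overhead, in MB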

Below are the two reduceByKey calls:

    val myrdd = field1And2.map(x => (x, 1)).reduceByKey(_ + _, 2000)

The second one feeds off the first:

    val countHistogram = myrdd.map(x => (x._2, 1)).reduceByKey(_ + _, 2000)
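
For context, a minimal self-contained sketch of the whole pipeline; how 
field1And2 is built (input path, delimiter) and the final save are my 
assumptions, not details from the actual job:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("histogram"))

    // Hypothetical: build field1And2 from the ~90k input files
    val field1And2 = sc.textFile("hdfs:///data/input/*")
      .map { line => val f = line.split('\t'); (f(0), f(1)) }

    // Stage 1: count occurrences of each (field1, field2) pair, 2000 reducers
    val myrdd = field1And2.map(x => (x, 1)).reduceByKey(_ + _, 2000)

    // Stage 2: histogram of those counts (how many pairs occur N times)
    val countHistogram = myrdd.map(x => (x._2, 1)).reduceByKey(_ + _, 2000)

    // Only an action triggers execution; nothing here collects to the driver
    countHistogram.saveAsTextFile("hdfs:///data/histogram")  // hypothetical path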


Any idea why the master is consuming so much data and filling up its 
memory? There's no collect-style call that would bring data back to the 
master.


Thanks,
Vipul
