Hi, I'm running Spark on YARN, doing a simple reduceByKey followed by another reduceByKey after some transformations. After the first stage completes, my master (the driver) runs out of memory. I have 20G assigned to the master, 145 executors (12G each + 4G overhead), around 90k input files, 10+ TB of data, 2000 reducers, and no caching.
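For reference, here is a sketch of how those resources map onto Spark-on-YARN configuration. The property keys are standard Spark 1.x YARN settings; the app name is a placeholder, and driver memory has to be given at submit time since the driver JVM is already running by the time SparkConf is read:

    import org.apache.spark.SparkConf

    // Resources as described above. The 20G for the master/driver must be
    // passed via spark-submit --driver-memory (or spark.driver.memory in
    // spark-defaults.conf); it cannot be set here at runtime.
    val conf = new SparkConf()
      .setAppName("ReduceByKeyHistogram")                // placeholder name
      .set("spark.executor.instances", "145")            // 145 executors
      .set("spark.executor.memory", "12g")               // 12G heap per executor
      .set("spark.yarn.executor.memoryOverhead", "4096") // +4G off-heap overhead, in MB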
Below are the two reduceByKey calls. The second one feeds off the first:

    val myrdd = field1And2.map(x => (x, 1)).reduceByKey(_ + _, 2000)
    val countHistogram = myrdd.map(x => (x._2, 1)).reduceByKey(_ + _, 2000)

Any idea why the master is consuming so much memory? There is no collect-style call that would pull data back to the master. Thanks, Vipul
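For completeness, here is a minimal self-contained version of the pipeline. The input path, the output path, and the way field1And2 is derived (first two tab-separated fields of each line) are my assumptions for illustration; the two reduceByKey stages mirror the job above:

    import org.apache.spark.{SparkConf, SparkContext}

    object ReduceByKeyHistogram {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ReduceByKeyHistogram"))

        // Hypothetical input: ~90k files, 10+ TB of tab-separated text.
        val lines = sc.textFile("hdfs:///path/to/input")

        // Assumed transformation: key each record by its first two fields.
        val field1And2 = lines.map { line =>
          val f = line.split("\t")
          (f(0), f(1))
        }

        // Stage 1: count occurrences of each (field1, field2) pair, 2000 reducers.
        val myrdd = field1And2.map(x => (x, 1)).reduceByKey(_ + _, 2000)

        // Stage 2: histogram of those counts, i.e. how many keys occur N times.
        val countHistogram = myrdd.map(x => (x._2, 1)).reduceByKey(_ + _, 2000)

        // Write out instead of collecting, so nothing large comes back to the driver.
        countHistogram.saveAsTextFile("hdfs:///path/to/output")
        sc.stop()
      }
    }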