Just to add: rdd.take(1) won't trigger the entire computation; it only pulls out the first record. You need rdd.count() or one of the rdd.saveAs*File actions to trigger the complete pipeline. How many partitions do you see in the last stage?
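For illustration, a minimal sketch of the difference between the two actions (the input path and the map step here are placeholders, not your actual job):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("lazy-eval-demo"))

  // Placeholder pipeline for illustration only.
  val rdd = sc.textFile("hdfs:///some/input").map(_.toUpperCase)

  // take(1) runs a job on just enough partitions to return one
  // record (typically only the first), so most of the pipeline
  // never executes.
  val first = rdd.take(1)

  // count() evaluates every partition, so it exercises roughly the
  // same full computation that saveAsTextFile does, minus the write.
  val n = rdd.count()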
Thanks
Best Regards

On Tue, Aug 4, 2015 at 7:10 AM, ayan guha <[email protected]> wrote:

> Is your data skewed? What happens if you do rdd.count()?
>
> On 4 Aug 2015 05:49, "Jasleen Kaur" <[email protected]> wrote:
>
>> I am executing a spark job on a cluster as a yarn-client (yarn-cluster is
>> not an option due to permission issues).
>>
>> - num-executors 800
>> - spark.akka.frameSize=1024
>> - spark.default.parallelism=25600
>> - driver-memory=4G
>> - executor-memory=32G
>> - My input size is around 1.5TB.
>>
>> My problem is that when I execute rdd.saveAsTextFile(outputPath,
>> classOf[org.apache.hadoop.io.compress.SnappyCodec]) I get a heap space
>> error. (Saving as Avro is also not an option; I have tried
>> saveAsSequenceFile with GZIP and saveAsNewAPIHadoopFile with the same
>> result.) On the other hand, if I execute rdd.take(1), I get no such
>> issue. So I am assuming the issue is due to the write.
>>
