Just to add: rdd.take(1) won't trigger the entire computation; it only pulls out the first record. You need rdd.count() or one of the rdd.saveAs*File actions to trigger the complete pipeline. How many partitions do you see in the last stage?
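For illustration, a minimal sketch of the difference between the two actions (the input path and the map step here are placeholders, not your actual job):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("lazy-eval-demo"))

  // Placeholder pipeline for illustration only.
  val rdd = sc.textFile("hdfs:///some/input").map(_.toUpperCase)

  // take(1) runs a job on just enough partitions to return one
  // record (typically only the first), so most of the pipeline
  // never executes.
  val first = rdd.take(1)

  // count() evaluates every partition, so it exercises roughly the
  // same full computation that saveAsTextFile does, minus the write.
  val n = rdd.count()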
Thanks
Best Regards

On Tue, Aug 4, 2015 at 7:10 AM, ayan guha <[email protected]> wrote:

> Is your data skewed? What happens if you do rdd.count()?
>
> On 4 Aug 2015 05:49, "Jasleen Kaur" <[email protected]> wrote:
>
>> I am executing a spark job on a cluster as a yarn-client (yarn-cluster is
>> not an option due to permission issues).
>>
>> - num-executors 800
>> - spark.akka.frameSize=1024
>> - spark.default.parallelism=25600
>> - driver-memory=4G
>> - executor-memory=32G
>> - My input size is around 1.5TB.
>>
>> My problem is that when I execute rdd.saveAsTextFile(outputPath,
>> classOf[org.apache.hadoop.io.compress.SnappyCodec]) I get a heap space
>> error. (Saving as Avro is also not an option; I have tried
>> saveAsSequenceFile with GZIP and saveAsNewAPIHadoopFile with the same
>> result.) On the other hand, if I execute rdd.take(1), I get no such
>> issue. So I am assuming the issue is due to the write.
>>
