Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Gourav Sengupta
Hi Saad, may I ask which EMR version and cluster size you are using? As I understand it, c4.4xlarge instances come with a fair amount of disk space. The other thing you can do is attach more disk space to the nodes, an option available in the advanced cluster start
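The extra-disk option Gourav mentions can also be configured non-interactively. A hedged sketch of the `--instance-groups` JSON accepted by `aws emr create-cluster` (the instance type, count, and volume size here are purely illustrative, not values from this thread):

```json
[
  {
    "InstanceGroupType": "CORE",
    "InstanceType": "c4.4xlarge",
    "InstanceCount": 4,
    "EbsConfiguration": {
      "EbsBlockDeviceConfigs": [
        {
          "VolumeSpecification": { "VolumeType": "gp2", "SizeInGB": 500 },
          "VolumesPerInstance": 1
        }
      ]
    }
  }
]
```

Attaching larger EBS volumes this way increases the local space available for HDFS and for any temporary directories the job uses on the core nodes.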

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Saad Mufti
I have been trying to monitor this while the job is running. I think I forgot to account for the 3-way HDFS replication, so right there the output is more like 21 TB instead of my claimed 7 TB. But it still looks like HDFS is losing more disk space than can be accounted for by just the output, going
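The replication accounting Saad is doing can be sketched as a back-of-the-envelope check: with HDFS's default replication factor of 3, logical output size must be multiplied by 3 to get raw cluster disk consumed.

```java
// Back-of-the-envelope check of the replication accounting in the thread:
// 7 TB of logical HDFS output at replication factor 3 occupies ~21 TB raw.
public class ReplicationAccounting {
    // Raw disk consumed = logical data size x replication factor.
    static long rawTb(long logicalTb, int replicationFactor) {
        return logicalTb * replicationFactor;
    }

    public static void main(String[] args) {
        long logicalTb = 7;   // logical output size claimed in the thread
        int replication = 3;  // dfs.replication default
        System.out.println(rawTb(logicalTb, replication)); // prints 21
    }
}
```

This is why `hdfs dfs -du` (logical size) and the datanode-level free-space numbers can legitimately differ by the replication factor before anything is actually "lost".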

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Saad Mufti
Thanks. I checked, and it is using another S3 folder for the temporary restore space. The underlying code insists on the snapshot and the restore directory being on the same filesystem, so it is using EMRFS for both. So unless EMRFS is using some local disk space under the covers, it doesn't seem

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Jörn Franke
As far as I know, the TableSnapshotInputFormat relies on a temporary folder: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormat.html Unfortunately, some InputFormats need a (local) tmp directory; sometimes this cannot be avoided. See also the source:
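For context, the restore directory Jörn refers to is passed in explicitly when the job is configured. A minimal job-configuration sketch (the snapshot name and restore path are hypothetical; this assumes the HBase `hbase-mapreduce` artifacts on the classpath and is not runnable without a cluster):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.mapreduce.Job;

public class SnapshotJobSetup {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "snapshot-etl");

        // setInput re-materializes the snapshot's HFiles under restoreDir.
        // That restore directory is the temporary space discussed in this
        // thread, and it must live on the same filesystem as hbase.rootdir.
        TableSnapshotInputFormat.setInput(
            job,
            "my_snapshot",                          // hypothetical snapshot name
            new Path("s3://my-bucket/restore-tmp")  // hypothetical restore dir
        );
        job.setInputFormatClass(TableSnapshotInputFormat.class);
    }
}
```

This matches Saad's observation above: because the snapshot lives in S3 (via EMRFS), the restore directory must also be an S3 path rather than local disk.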

High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Saad Mufti
Hi, I have a simple ETL Spark job running on AWS EMR with Spark 2.2.1. The input data is HBase files in AWS S3 accessed via EMRFS, but there is no HBase running on the Spark cluster itself. The job restores the HBase snapshot into files on disk in another S3 folder used for temporary storage, then