Hi Saad,
May I ask which EMR version and cluster size you are using? As I understand
it, if you are using c4.4xlarge instances the cluster usually has a fair
amount of disk space, since EMR attaches EBS volumes to them by default.
The other thing you can do is attach more EBS disk space to the nodes; that
option is available in the advanced options when you start the cluster.
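In case it helps, here is a minimal sketch of attaching extra EBS volumes to the core instance group programmatically (assuming the AWS Java SDK v1; the instance type, count, and volume sizes below are just placeholders):

import com.amazonaws.services.elasticmapreduce.model.{
  EbsBlockDeviceConfig, EbsConfiguration, InstanceGroupConfig, VolumeSpecification
}

// Hypothetical sizing: two 500 GB gp2 volumes per core node.
val ebs = new EbsConfiguration()
  .withEbsBlockDeviceConfigs(
    new EbsBlockDeviceConfig()
      .withVolumeSpecification(
        new VolumeSpecification().withVolumeType("gp2").withSizeInGB(500))
      .withVolumesPerInstance(2))

// This group config would then go into the JobFlowInstancesConfig of a
// RunJobFlowRequest when launching the cluster.
val coreNodes = new InstanceGroupConfig()
  .withInstanceRole("CORE")
  .withInstanceType("c4.4xlarge")   // placeholder type
  .withInstanceCount(10)            // placeholder count
  .withEbsConfiguration(ebs)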
I have been trying to monitor this while the job is running. I think I
forgot to account for the 3-way HDFS replication, so right there the output
is more like 21 TB instead of my claimed 7 TB. But it still looks like HDFS
is losing more disk space than can be accounted for by just the output,
going by what I see while monitoring.
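In case it's useful, one way to separate the logical output size from the raw (replicated) usage is Hadoop's ContentSummary; a small sketch, where the output path is a placeholder:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
// Placeholder path: substitute the job's actual HDFS output directory.
val summary = fs.getContentSummary(new Path("/user/hadoop/job-output"))
val tb = math.pow(1024, 4)
// getLength is pre-replication; getSpaceConsumed includes all 3 copies.
println(f"logical: ${summary.getLength / tb}%.2f TB, raw: ${summary.getSpaceConsumed / tb}%.2f TB")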
Thanks. I checked, and it is using another S3 folder for the temporary
restore space. The underlying code insists on the snapshot and the restore
directory being on the same filesystem, so it is using EMRFS for both. So
unless EMRFS is using some local disk space under the covers, it doesn't
seem like the restore can explain the missing space.
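If it helps, a quick way to double-check which filesystem each path actually resolves to (the bucket and paths here are made up):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val conf = new Configuration()
// Hypothetical locations; use your real hbase.rootdir and restore folder.
val rootDir = new Path("s3://my-bucket/hbase")
val restoreDir = new Path("s3://my-bucket/hbase-restore-tmp")
// The snapshot restore requires both URIs to match (same filesystem).
println(rootDir.getFileSystem(conf).getUri)
println(restoreDir.getFileSystem(conf).getUri)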
As far as I know, TableSnapshotInputFormat relies on a temporary folder:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormat.html
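For context, reading a snapshot from Spark is typically wired up roughly like this; a sketch only, where the snapshot name, hbase.rootdir, and restore path are all assumptions:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext

val sc = SparkContext.getOrCreate()
val conf = HBaseConfiguration.create()
conf.set("hbase.rootdir", "s3://my-bucket/hbase")  // assumed location
val job = Job.getInstance(conf)
// The third argument is the restore dir the snapshot is materialized into
// before reading; this is the temporary space discussed above.
TableSnapshotInputFormat.setInput(job, "my_snapshot", new Path("s3://my-bucket/restore-tmp"))
val rdd = sc.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TableSnapshotInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])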
Unfortunately, some InputFormats need a (local) tmp directory. Sometimes
this cannot be avoided.
See also the source:
Hi,
I have a simple ETL Spark job running on AWS EMR with Spark 2.2.1. The
input data is HBase files in AWS S3 accessed via EMRFS, but there is no
HBase running on the Spark cluster itself. The job restores the HBase
snapshot into files in another S3 folder used for temporary storage, then
reads the restored snapshot as its input.