As far as I know, TableSnapshotInputFormat relies on a temporary restore folder: https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormat.html
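For illustration, here is a minimal sketch of how that restore directory is usually wired in when the snapshot is read from Spark. The snapshot name, bucket paths, and the hbase.rootdir value are made-up placeholders, not taken from the job described below:

import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("hbase-snapshot-etl").getOrCreate()

// Placeholder values -- the real snapshot name and paths come from the job's own config.
val snapshotName = "my_table_snapshot"
val restoreDir   = new Path("s3://my-bucket/tmp/snapshot-restore")

val conf = HBaseConfiguration.create()
// hbase.rootdir must point at the filesystem holding the snapshot files
// (S3 via EMRFS in the setup described below).
conf.set("hbase.rootdir", "s3://my-bucket/hbase")

// setInput restores the snapshot under restoreDir; afterwards every region in the
// snapshot becomes one input split, i.e. one Spark partition.
val job = Job.getInstance(conf)
TableSnapshotInputFormat.setInput(job, snapshotName, restoreDir)

val rdd = spark.sparkContext.newAPIHadoopRDD(
  job.getConfiguration,
  classOf[TableSnapshotInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])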
Unfortunately some InputFormats need a (local) tmp directory. Sometimes this cannot be avoided. See also the source: https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapred/TableSnapshotInputFormat.java

> On 7. Apr 2018, at 20:26, Saad Mufti <saad.mu...@gmail.com> wrote:
>
> Hi,
>
> I have a simple ETL Spark job running on AWS EMR with Spark 2.2.1. The input
> data is HBase files in AWS S3 using EMRFS, but there is no HBase running on
> the Spark cluster itself. The job restores the HBase snapshot into files on
> disk in another S3 folder used for temporary storage, then creates an RDD
> over those files using HBase's TableSnapshotInputFormat class. There is a
> large number of HBase regions, around 12000, and each region gets translated
> to one Spark task/partition. We are running in YARN mode, with one core per
> executor, so on our 120-node cluster we have around 1680 executors running
> (not the full 1960, as YARN only gives us so many containers due to memory
> limits).
>
> This is a simple ETL job that transforms the HBase data into Avro/Parquet and
> writes it to disk; there are no reduces or joins of any kind. The output Parquet
> data uses Snappy compression, and the total output is around 7 TB, while we
> have about 28 TB of total disk provisioned in the cluster. The Spark UI shows no
> disk storage being used for cached data, and not much heap being used for
> caching either, which makes sense because in this simple job we have no need
> to call RDD.cache as the RDD is not reused at all.
>
> Lately the job has started failing because, close to finishing, some of the
> YARN nodes start running low on disk and YARN marks them as unhealthy, then
> kills all the executors on that node. But the problem just moves to another
> node where the tasks are relaunched for another attempt, until after 4
> failures for a given task the whole job fails.
>
> So I am trying to understand where all this disk usage is coming from. I can
> see in Ganglia that disk is running low the longer the job runs, no matter
> which node I look at. Like I said, the total size of the final output
> in HDFS is only around 7 TB, while we have around 28 TB of disk provisioned
> for HDFS.
>
> Any advice or pointers for where to look for the large disk usage would be
> most appreciated.
>
> Thanks.
>
> ----
> Saad
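For completeness, the write side of a job like the one quoted above typically reduces to a single narrow map over that RDD followed by a Parquet write. This assumes the `spark` session and `rdd` from the sketch earlier in the thread; the column family, qualifier, column names, and output path are made-up placeholders:

import org.apache.hadoop.hbase.util.Bytes

// `rdd` is the (ImmutableBytesWritable, Result) pair RDD produced by TableSnapshotInputFormat,
// as in the earlier sketch. Column names and paths are placeholders for illustration only.
import spark.implicits._

val rows = rdd.map { case (key, result) =>
  val rowKey = Bytes.toString(key.copyBytes())
  val value  = Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("col")))
    .map(v => Bytes.toString(v))
    .orNull
  (rowKey, value)
}.toDF("row_key", "col")

// One map stage and a write -- no shuffle, so no reduce-side spill to local disk is expected.
rows.write
  .option("compression", "snappy")
  .parquet("s3://my-bucket/output/parquet")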