As far as I know, the TableSnapshotInputFormat relies on a temporary folder:
https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableSnapshotInputFormat.html


Unfortunately, some InputFormats need a (local) tmp directory. Sometimes this 
cannot be avoided.

See also the source:
https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapred/TableSnapshotInputFormat.java
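For reference, here is a minimal sketch (Java; snapshot name, bucket and paths are
placeholders, not your actual job) of how the snapshot input is usually wired up
in Spark. The third argument to TableSnapshotInputFormat.setInput() is exactly
that temporary restore directory:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SnapshotRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.rootdir", "s3://my-hbase-bucket/hbase");      // placeholder

    Job job = Job.getInstance(conf);
    // Third argument = restore directory: the snapshot's region metadata and
    // file links are materialized there before the splits can be read.
    TableSnapshotInputFormat.setInput(
        job,
        "my_snapshot",                                            // placeholder
        new Path("s3://my-hbase-bucket/tmp/snapshot-restore"));   // placeholder

    JavaSparkContext sc =
        new JavaSparkContext(new SparkConf().setAppName("snapshot-etl"));

    // One partition per region of the snapshot.
    JavaPairRDD<ImmutableBytesWritable, Result> rdd = sc.newAPIHadoopRDD(
        job.getConfiguration(),
        TableSnapshotInputFormat.class,
        ImmutableBytesWritable.class,
        Result.class);

    System.out.println("partitions (~ regions): " + rdd.getNumPartitions());
    sc.stop();
  }
}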

> On 7. Apr 2018, at 20:26, Saad Mufti <saad.mu...@gmail.com> wrote:
> 
> Hi,
> 
> I have a simple ETL Spark job running on AWS EMR with Spark 2.2.1 . The input 
> data is HBase files in AWS S3 using EMRFS, but there is no HBase running on 
> the Spark cluster itself. It is restoring the HBase snapshot into files on 
> disk in another S3 folder used for temporary storage, then creating an RDD 
> over those files using HBase's TableSnapshotInputFormat class. There is a 
> large number of HBase regions, around 12000, and each region gets translated 
> to one Spark task/partition. We are running in YARN mode, with one core per 
> executor, so on our 120 node cluster we have around 1680 executors running 
> (not the full 1960 as YARN only gives us so many containers due to memory 
> limits).
> 
> This is a simple ETL job that transforms the HBase data into Avro/Parquet and 
> writes to disk, there are no reduces or joins of any kind. The output Parquet 
> data is using Snappy compression, the total output is around 7 TB while we 
> have about 28 TB total disk provisioned in the cluster. The Spark UI shows no 
> disk storage being used for cached data, and not much heap being used for 
> caching either, which makes sense because in this simple job we have no need 
> to do RDD.cache as the RDD is not reused at all.
> 
> So lately the job has started failing because close to finishing, some of the 
> YARN nodes start running low on disk and YARN marks them as unhealthy, then 
> kills all the executors on that node. But the problem just moves to another 
> node where the tasks are relaunched for another attempt, until after 4 
> failures for a given task the whole job fails.
> 
> So I am trying to understand where all this disk usage is coming from? I can 
> see in Ganglia that disk is running low the longer the job runs no matter 
> which node I look at. Like I said, the total size of the final output in 
> HDFS is only around 7 TB, while we have around 28 TB of disk provisioned 
> for HDFS.
> 
> Any advice or pointers for where to look for the large disk usage would be 
> most appreciated.
> 
> Thanks.
> 
> ----
> Saad
> 
