Re: Spark UNEVENLY distributing data

2018-05-22 Thread Saad Mufti
I think TableInputFormat will try to maintain as much locality as possible, assigning one Spark partition per region and trying to assign that partition to a YARN container/executor on the same node (assuming you're using Spark over YARN). So the reason for the uneven distribution could be that

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Saad Mufti
by the output of the dfsadmin command, so I am still trying to track that down. The total allocated disk space of 28 TB should still be more than enough.
Saad
On Sat, Apr 7, 2018 at 2:40 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
> Thanks. I checked and it is using another

Re: High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Saad Mufti
nputFormat.html
>
> Unfortunately some inputformats need a (local) tmp Directory. Sometimes
> this cannot be avoided.
>
> See also the source:
>
> https://github.com/apache/hbase/blob/master/hbase-mapreduce/src/main/java/org/apache/hadoop/hbase/mapred/TableSnapshotInputFormat.java

High Disk Usage In Spark 2.2.1 With No Shuffle Or Spill To Disk

2018-04-07 Thread Saad Mufti
Hi, I have a simple ETL Spark job running on AWS EMR with Spark 2.2.1. The input data is HBase files in AWS S3 using EMRFS, but there is no HBase running on the Spark cluster itself. It is restoring the HBase snapshot into files on disk in another S3 folder used for temporary storage, then