I assume you are running on 1.6. You will likely need a solution based on the features in [1] and [2], which are coming in 2.0.
[1] https://issues.apache.org/jira/browse/SPARK-12538 / https://issues.apache.org/jira/browse/SPARK-12394
[2] https://issues.apache.org/jira/browse/SPARK-12849

I have not checked for examples myself, so I would add to your question and ask the community to link to examples that use these recent improvements. It would also help if you gave a concrete example of your specific problem, since the stragglers you are hitting are probably caused by data skew. A small sketch of the repartition workaround follows below the quoted message.

Best,
Ovidiu

> On 03 Jun 2016, at 17:31, saif.a.ell...@wellsfargo.com wrote:
>
> Hello everyone!
>
> I was noticing that, when reading Parquet files, or indeed any kind of
> source DataFrame data (spark-csv, etc.), the default partitioning is not
> fair. Action tasks usually run very fast on some partitions and very
> slowly on others; frequently they are fast on all but the last partition,
> which looks like it reads more than 50% of the input data.
>
> I notice that most tasks load some portion of the data, say 1024 MB
> chunks, while some tasks load 20+ GB of data.
>
> Applying repartitioning strategies solves this issue properly and general
> performance increases considerably, but for very large DataFrames,
> repartitioning is a costly process.
>
> In short, what are the available strategies or configurations that help
> reading from disk or HDFS with proper executor data distribution?
>
> If this needs to be more specific, I am strictly focused on PARQUET files
> from HDFS. I know there are some MIN
>
> Really appreciate it,
> Saif
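
For reference, a minimal Scala sketch of the explicit-repartition workaround on the 1.6 SQLContext API. The object name, HDFS path, app name, and the partition count of 200 are all placeholders, not anything from the thread; it first measures per-partition row counts so you can see the skew, then forces an even redistribution with a shuffle:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetSkewCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-skew-check"))
    val sqlContext = new SQLContext(sc)

    // Placeholder input path -- substitute your own Parquet directory.
    val df = sqlContext.read.parquet("hdfs:///tmp/events.parquet")

    // Measure the skew: row count per input partition.
    df.rdd
      .mapPartitionsWithIndex { (i, rows) => Iterator((i, rows.size)) }
      .collect()
      .foreach { case (i, n) => println(s"partition $i: $n rows") }

    // Workaround: full shuffle to an explicit partition count
    // (200 is a placeholder; size it to your cluster's total cores).
    val balanced = df.repartition(200)
    println(s"partitions after repartition: ${balanced.rdd.partitions.length}")

    sc.stop()
  }
}

As noted above, this shuffle is costly on very large DataFrames. If you can move to 2.0, the spark.sql.files.maxPartitionBytes setting should let you cap the split size at read time instead, which may avoid the explicit repartition; I have not benchmarked it myself, though.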