I suppose you are running on 1.6. I guess you will need a solution based on the features in [1] and [2], which are coming in 2.0.

[1] https://issues.apache.org/jira/browse/SPARK-12538 / https://issues.apache.org/jira/browse/SPARK-12394
[2] https://issues.apache.org/jira/browse/SPARK-12849
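
If those JIRAs are indeed about the bucketed-table / pre-partitioned-write support (that is my reading of them), then writing bucketed data in 2.0 should look roughly like the sketch below. Untested, and the bucket count, column and table name are made up by me:

import org.apache.spark.sql.SparkSession

// Sketch only, assuming the 2.0 bucketing API: write the data
// pre-hash-partitioned ("bucketed") so that later reads and joins can
// reuse the layout instead of paying for a full repartition each time.
val spark = SparkSession.builder().appName("bucketing-sketch").getOrCreate()
val df = spark.read.parquet("hdfs:///data/input") // hypothetical path

df.write
  .bucketBy(200, "customer_id")   // hypothetical bucket count and column
  .sortBy("customer_id")
  .saveAsTable("events_bucketed") // bucketing requires saveAsTable

On 1.6 there is no equivalent, so repartition() remains the workaround.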

However, I have not checked for examples myself, so I would like to add to your question and ask the community to link to some examples of the recent improvements/changes.

It would also help to give a concrete example of your specific problem, as you may be hitting stragglers, probably caused by data skew; a quick check like the sketch below can confirm that.
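
Here is a minimal sketch (works on 1.6) that counts the rows each partition holds without shuffling anything, so you can see whether one partition really holds 50% of the input; df stands for the DataFrame you loaded:

// Count rows per partition to see whether the input is skewed,
// e.g. one partition holding half of the data.
val counts = df.rdd
  .mapPartitionsWithIndex((i, rows) => Iterator((i, rows.size)))
  .collect()

// Print the ten heaviest partitions.
counts.sortBy(-_._2).take(10).foreach { case (i, n) =>
  println(s"partition $i: $n rows")
}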

Best,
Ovidiu


> On 03 Jun 2016, at 17:31, saif.a.ell...@wellsfargo.com wrote:
> 
> Hello everyone!
>  
> I was noticing that, when reading Parquet files or actually any kind of 
> source DataFrame data (spark-csv, etc.), the default partitioning is not fair.
> Action tasks usually run very fast on some partitions and very slow on 
> others; frequently they run fast on all but the last partition, which looks 
> like it reads more than 50% of the input data size.
>  
> I notice that most tasks load some portion of the data, say 1024 MB 
> chunks, while some tasks load 20+ GB of data.
>  
> Applying repartition strategies solves this issue properly and increases 
> general performance considerably, but for very large DataFrames, 
> repartitioning is a costly process.
>  
> In short, what strategies or configurations are available that help with 
> reading from disk or HDFS with proper executor data distribution?
>  
> If this needs to be more specific, I am strictly focused on PARQUET files 
> from HDFS. I know there are some MIN
>  
> Really appreciate it,
> Saif
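
As for the trailing "I know there are some MIN" in the question: if that refers to the minimum-split-size settings, here is a hedged sketch for 1.6. The standard Hadoop FileInputFormat keys can nudge how splits are computed, though whether the Parquet reader honours them in your version is something to verify empirically. The sizes below are made up:

// Sketch: tune the Hadoop input-split sizes used when computing
// partitions; check the effect on the Parquet path empirically.
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024)
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 512L * 1024 * 1024)

val df = sqlContext.read.parquet("hdfs:///data/events") // hypothetical path
println(df.rdd.partitions.length) // how many splits you actually got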
