On Thu, Aug 7, 2014 at 3:22 AM, Reinis Vicups <[email protected]> wrote:
> During my tests I observed that there were always 2-3-4 long running tasks
> that determined the critical path of the whole spark job (as in, there was
> one task running for whole 18 minutes). Also I observed that only through
> increasing number of partitions those long running tasks got shorter! So I
> was increasing gradually number of partitions and at 400 partitions it
> finally rocked.

This is likely due to skew in the statistics of different items. As you
increase the size of the data, you may see better balance because the
maximum frequency limit will kick in more often.
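
To make the partition tuning concrete, here is a minimal Scala/Spark sketch
of raising the partition count before running the job. The object name,
input path, and RDD names are placeholders, and 400 is simply the value that
worked in the tests quoted above, not a general recommendation:

import org.apache.spark.{SparkConf, SparkContext}

object RepartitionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))

    // Placeholder input; substitute the real dataset.
    val transactions = sc.textFile("hdfs:///path/to/transactions")

    // Spread the work across more, smaller tasks so that a few skewed
    // partitions no longer dominate the critical path of the job.
    val repartitioned = transactions.repartition(400)

    println("number of partitions: " + repartitioned.partitions.length)

    sc.stop()
  }
}

Note that repartition() triggers a full shuffle, so the extra balance comes
at the cost of moving the data once up front.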
