Quite a good question. If you know the size of the cluster going in, you
can partition the data into some multiple of that and use a RangePartitioner
to split it roughly equally. By default, partitions are created per block on
the filesystem, and the overhead of scheduling that many tasks usually kills
performance.
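
As a rough sketch of that rule (my own illustration; the "a few tasks per
core" multiplier is a common heuristic, not anything Spark mandates):

// Derive the partition count from the cluster instead of hard-coding it.
// sc is the usual shell SparkContext.
val totalCores = sc.defaultParallelism   // parallelism the cluster exposes to Spark
val numPartitions = totalCores * 3       // a few tasks per core keeps executors busy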

import org.apache.spark.RangePartitioner
import org.apache.spark.SparkContext._  // pair-RDD implicits, needed outside the shell on 1.x

val file = sc.textFile("<my local path>")
// RangePartitioner needs a key/value RDD, so pair each record with a dummy value
val partitionedFile = file.map(x => (x, 1))
// 3 partitions here for illustration; in practice use a count derived from the cluster
val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))
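
For the coalesce()/repartition() route from the question below, the same
computed count can drive it (again only a sketch; <my local path> is a
placeholder):

// Rebalance the raw input to the derived count instead of a hand-tuned one.
val rebalanced = sc.textFile("<my local path>").repartition(numPartitions)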




Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi <https://twitter.com/mayur_rustagi>



On Sat, Aug 16, 2014 at 7:04 AM, abhiguruvayya <sharath.abhis...@gmail.com>
wrote:

> My use case is as mentioned below.
>
> 1. Read the input data from the local file system using
> sparkContext.textFile(input path).
> 2. Partition the input data (80 million records) using
> RDD.coalesce(numberOfPartitions) before submitting it to the mapper/reducer
> function. Without coalesce() or repartition() on the input data, Spark
> runs really slowly and fails with an out-of-memory exception.
>
> The issue I am facing here is deciding the number of partitions to apply
> to the input data. *The input data size varies every time, so hard-coding
> a particular value is not an option. Spark performs really well only when
> an optimal number of partitions is applied to the input data, which I have
> to find through lots of iteration (trial and error). That is not an option
> in a production environment.*
>
> My question: Is there a rule of thumb for deciding the number of
> partitions required based on the input data size and the cluster resources
> available (executors, cores, etc.)? If so, please point me in that
> direction. Any help is much appreciated.
>
> I am using Spark 1.0 on YARN.
>
> Thanks,
> AG
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Question-regarding-spark-data-partition-and-coalesce-Need-info-on-my-use-case-tp12214.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
