Hi All,
From the documentation, RDDs are already partitioned and distributed. However,
there is a way to repartition a given RDD using the following function. Can
someone please point out the best practices for using it? I have a 10 GB TSV
file stored in HDFS and a 4-node cluster with 1 master and 3 workers. Each
worker has 15 GB of memory and 4 cores. My processing pipeline is not very deep
as of now. Can someone please tell me when repartitioning is recommended? When
the documentation says "balance", does it refer to memory usage, compute load,
or I/O?
repartition(numPartitions): Reshuffle the data in the RDD randomly to create
either more or fewer partitions and balance it across them. This always
shuffles all data over the network.
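
To make the question concrete, here is a rough back-of-the-envelope sketch in plain Python (no Spark required) of the partition counts involved for the cluster described above. It rests on assumptions that are not stated in this post: an HDFS block size of 128 MB, one input partition per HDFS block when the file is read, and the rule of thumb from the Spark tuning guide of roughly 2-3 tasks per CPU core. The helper name estimate_partitions is hypothetical and only for illustration.

```python
import math

GB = 1024 ** 3
MB = 1024 ** 2

def estimate_partitions(file_bytes, block_bytes, workers, cores_per_worker,
                        tasks_per_core=(2, 3)):
    """Hypothetical helper: compare input partitions with a target range.

    Assumes one input partition per HDFS block and a target of
    tasks_per_core[0]..tasks_per_core[1] tasks per core (rule of thumb).
    """
    input_partitions = math.ceil(file_bytes / block_bytes)
    total_cores = workers * cores_per_worker
    target_range = (total_cores * tasks_per_core[0],
                    total_cores * tasks_per_core[1])
    return input_partitions, total_cores, target_range

# Figures from the post: 10 GB file, 3 workers, 4 cores each.
# The 128 MB block size is an assumption; check dfs.blocksize on the cluster.
input_parts, cores, target = estimate_partitions(10 * GB, 128 * MB, 3, 4)

print("input partitions:", input_parts)
print("total worker cores:", cores)
print("rule-of-thumb partition range:", target)

# In PySpark, the actual count can be checked with rdd.getNumPartitions(),
# and changed with rdd.repartition(n), which triggers a full shuffle.
```

Under these assumptions the file would already be split into more partitions than the rule-of-thumb range, so the number of partitions alone would probably not be a reason to call repartition here. Whether the data is evenly spread across those partitions is a separate question.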