Hi All,
From the documentation, RDDs are already partitioned and distributed. However, there
is a way to repartition a given RDD using the following function. Can someone
please point out the best practices for using it? I have a 10 GB TSV file
stored in HDFS and a 4-node cluster with 1 master and 3 workers. Each
worker has 15 GB of memory and 4 cores. My processing pipeline is not very deep
as of now. Can someone please tell me when repartitioning is recommended? When
the documentation says "balance", does it refer to memory usage, compute load, or
I/O?

repartition(numPartitions): Reshuffle the data in the RDD randomly to create
either more or fewer partitions and balance it across them. This always
shuffles all data over the network.
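
To put rough numbers on the setup described above, here is a minimal Python sketch of the arithmetic involved. It is only a back-of-the-envelope estimate, not a recommendation from the Spark documentation for this specific job. Two values are assumptions that do not come from the post: an HDFS block size of 128 MB (the Spark programming guide says a text file read from HDFS gets one partition per block by default), and a target of 3 tasks per CPU core (the Spark tuning guide suggests 2-3). The function names are made up for illustration.

```python
import math

# Figures taken from the post.
FILE_SIZE_GB = 10
WORKERS = 3
CORES_PER_WORKER = 4

# Assumptions, NOT from the post: adjust to the actual cluster settings.
HDFS_BLOCK_MB = 128   # assumed HDFS block size
TASKS_PER_CORE = 3    # assumed target, based on the 2-3 tasks per core guideline


def input_partitions(file_size_gb, block_mb):
    """Estimate the partition count when the file is read one block per partition."""
    return math.ceil(file_size_gb * 1024 / block_mb)


def target_partitions(workers, cores_per_worker, tasks_per_core):
    """Estimate a partition count that keeps every core busy with a few tasks."""
    return workers * cores_per_worker * tasks_per_core


initial = input_partitions(FILE_SIZE_GB, HDFS_BLOCK_MB)
target = target_partitions(WORKERS, CORES_PER_WORKER, TASKS_PER_CORE)

print("estimated input partitions:", initial)
print("estimated target partitions:", target)
print("input already meets target:", initial >= target)
```

Under these assumptions the file would already arrive in more partitions than the 12 cores need, so calling repartition() just to raise parallelism would add a full shuffle for little gain. If the estimate came out the other way, or if a later step left the partitions very uneven in size, that is where repartition(numPartitions) would be worth trying. When the goal is only to reduce the number of partitions, coalesce(numPartitions) can do so without a full shuffle.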



                                          
