The logic behind the preferred location of an RDD partition is pretty
simple. For RDDs that are based on an HDFS file, the preferred location is
set based on where the HDFS blocks corresponding to the RDD's partitions
are located. This is done by querying HDFS itself. For any RDD that is
cached, the preferred location is set based on where the partition is
cached (it may be replicated on more than one node as well). So the system
does not maintain any history about block / partition access times,
bandwidth, etc.
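
To make the rule above concrete, here is a rough sketch in plain Python.
It is a toy model and not Spark's actual code: the function name
preferred_locations, the two dictionaries, and the host names are all made
up for illustration, and the choice to let cached copies take precedence
over HDFS block hosts is an assumption of the sketch. The point is only
that the answer comes from current placement metadata, with no access-time
or bandwidth history involved.

```python
from typing import Dict, List


def preferred_locations(
    partition: int,
    cached_on: Dict[int, List[str]],
    hdfs_block_hosts: Dict[int, List[str]],
) -> List[str]:
    """Toy model of choosing preferred hosts for one partition.

    Assumption of this sketch: if the partition is cached anywhere,
    those hosts are preferred; otherwise fall back to the hosts that
    hold the corresponding HDFS block replicas.
    """
    cached = cached_on.get(partition, [])
    if cached:
        return list(cached)
    # No cached copy: use whatever the file system reports for the block.
    return list(hdfs_block_hosts.get(partition, []))


# Hypothetical placement data: partition index -> hosts.
hdfs_block_hosts = {
    0: ["node1", "node2", "node3"],
    1: ["node2", "node4", "node5"],
}
cached_on = {1: ["node7"]}  # only partition 1 has been cached so far

print(preferred_locations(0, cached_on, hdfs_block_hosts))
print(preferred_locations(1, cached_on, hdfs_block_hosts))
```

Partition 0 is not cached, so it gets its HDFS block hosts back, while
partition 1 gets the host holding its cached copy. In Spark itself, the
per-partition answer is what preferredLocations(p) on the RDD reports.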


On Fri, Jan 24, 2014 at 1:15 AM, Sai Prasanna <[email protected]> wrote:

> Hello everybody, please help me with this.
>
> The preferredLocations(p) method of an RDD gives the nodes where partition
> p of that RDD can be accessed faster. How does Spark implement this
> internally? Is any history about access times or network bandwidth for
> various partitions across nodes stored and used, or do the jobs allocated
> to a node alone determine the preferredLocations when there are multiple
> copies of an RDD?
> Or is the intelligence derived from the underlying framework, say HDFS?
>
> --
> *Sai Prasanna. AN*
> *II M.Tech (CS), SSSIHL*
>
>
> *Entire water in the ocean can never sink a ship, Unless it gets inside.
> All the pressures of life can never hurt you, Unless you let them in.*
>
