Tathagata Das,

With respect to HDFS, I think the job seeker will return which of the replicated nodes are the preferred locations. On a standalone Spark system using the native filesystem, if the partitions are cached, it is straightforward to return the same. But if a partition is not cached and is replicated across 3 nodes, how will Spark return preferredLocations(p) in the absence of Hadoop/HDFS? What is the logic in that case?
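To make the question concrete, here is a rough sketch of the lookup as I understand it from the quoted reply below. It is a hypothetical model in plain Python, not Spark's actual code: the function name, the dictionary arguments, the choice to check the cache before the storage layer, and the empty-list fallback are all my assumptions.

```python
from typing import Dict, List, Optional


def preferred_locations(
    partition: int,
    cache_locations: Dict[int, List[str]],
    storage_locations: Optional[Dict[int, List[str]]] = None,
) -> List[str]:
    """Hypothetical model of a preferred-location lookup (not Spark's API).

    cache_locations:   partition index -> hosts holding a cached copy.
    storage_locations: partition index -> hosts reported by the storage
                       layer (e.g. HDFS block locations), or None when the
                       storage layer exposes no placement information.
    """
    # Assumption: a cached copy, possibly replicated, is consulted first.
    cached = cache_locations.get(partition, [])
    if cached:
        return list(cached)

    # Otherwise use whatever the storage layer reports for this partition.
    if storage_locations is not None:
        return list(storage_locations.get(partition, []))

    # Assumption: with no cache entry and no storage-level placement data,
    # there is no preference to report.
    return []


if __name__ == "__main__":
    cache = {0: ["node1", "node2"]}
    hdfs = {0: ["node3"], 1: ["node4", "node5", "node6"]}
    print(preferred_locations(0, cache, hdfs))  # cached copy is used
    print(preferred_locations(1, cache, hdfs))  # storage layer is used
    print(preferred_locations(1, cache, None))  # nothing known
```

The last call is the case I am asking about: no cached copy and no HDFS to query.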

On Sat, Jan 25, 2014 at 12:11 AM, Tathagata Das <[email protected]> wrote:

> The logic behind the preferred location of an RDD partition is pretty
> simple. For RDDs that are based on an HDFS file, the preferred location is
> set based on where the HDFS blocks corresponding to the RDD's
> partitions are located. This is done by querying the HDFS framework. For
> any RDD that may be cached, the preferred location is set based on where a
> partition is cached (may be replicated as well). So the system does not
> maintain any history about block / partition access times, bandwidth, etc.
>
>
> On Fri, Jan 24, 2014 at 1:15 AM, Sai Prasanna <[email protected]> wrote:
>
>> Hello Everybody, Please help me with this.
>>
>> The preferredLocations(p) method for an RDD gives the nodes where partition p of
>> a given RDD can be accessed faster. How does Spark inherently implement
>> this? Is any history about access times or network bandwidth for various
>> partitions across nodes stored and used, or do the jobs allocated to a
>> node alone determine the preferredLocations when there are multiple copies of
>> an RDD?
>> Or is the intelligence derived from the underlying framework, say HDFS?
>>
>> --
>> *Sai Prasanna. AN*
>> *II M.Tech (CS), SSSIHL*
>>
>

--
*Sai Prasanna. AN*
*II M.Tech (CS), SSSIHL*

*Entire water in the ocean can never sink a ship, Unless it gets inside.
All the pressures of life can never hurt you, Unless you let them in.*
