Thanks for the clarification, Rahul. In that case the reading is correct: the first replica is written to the DataNode that initiated the write. An HDFS client also behaves the same in and out of MR; it's not really related to MR at all.
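To make the placement behaviour discussed in this thread concrete, here is a rough Python sketch. It is not HDFS code: `choose_replicas`, the topology dict, and the node names are made up for illustration, and it assumes the usual default policy for 3 replicas (first replica on the writer's own DataNode if there is one, otherwise a random node; second on a different rack; third on the same rack as the second). It also assumes the cluster has at least as many nodes as replicas.

```python
import random

def choose_replicas(topology, client_node=None, replication=3, rng=random):
    """Pick DataNodes for one block (illustrative model only).

    topology: dict mapping rack name -> list of DataNode names (hypothetical).
    client_node: name of the node the writer runs on, or None / an unknown
                 name for a client outside the cluster.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    all_nodes = list(rack_of)

    # First replica: the writer's own DataNode if it is one, else a random node.
    first = client_node if client_node in rack_of else rng.choice(all_nodes)
    chosen = [first]

    if replication >= 2:
        # Second replica: prefer a node on a different rack from the first.
        remote = [n for n in all_nodes if rack_of[n] != rack_of[first]]
        pool = remote or [n for n in all_nodes if n not in chosen]
        chosen.append(rng.choice(pool))

    if replication >= 3:
        # Third replica: prefer another node on the same rack as the second.
        same_rack = [n for n in all_nodes
                     if rack_of[n] == rack_of[chosen[1]] and n not in chosen]
        pool = same_rack or [n for n in all_nodes if n not in chosen]
        chosen.append(rng.choice(pool))

    return chosen


topology = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5", "dn6"]}
print(choose_replicas(topology, client_node="dn1"))  # writer on a DataNode
print(choose_replicas(topology))                     # client outside the cluster
```

With the writer on `dn1` the first entry is always `dn1`; for the outside client all three nodes are picked at random, but in both cases the replicas end up on two racks.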
A "client outside" would write to a random set of DataNodes, spread across at least two racks for 3 replicas if rack awareness is turned on.

On Fri, May 17, 2013 at 8:17 AM, Rahul Bhattacharjee <[email protected]> wrote:
> Hi Harsh,
>
> I think what John meant by writing to local disk is writing to the same data
> node first which has initiated the write call.
>
> John can further clarify.
>
>
> On Fri, May 17, 2013 at 4:23 AM, Harsh J <[email protected]> wrote:
>>
>> That is not true. HDFS writes are not staged to a local disk first
>> before being written onto the DataNodes. The old architecture docs
>> seem to suggest that the writes get staged to a local disk but thats
>> not true anymore, see https://issues.apache.org/jira/browse/HDFS-1454.
>>
>> Also worth noting that a HDFS client behaves the same way in almost
>> all contexts, whether its invoked from an MR framework or directly
>> from shell.
>>
>> On Fri, May 17, 2013 at 3:38 AM, John Lilley <[email protected]> wrote:
>> > I seem to recall reading that when a MapReduce task writes a file, the
>> > blocks of the file are always written to local disk, and replicated to
>> > other nodes. If this is true, is this also true for non-MR applications
>> > writing to HDFS from Hadoop worker nodes? What about clients outside of
>> > the cluster doing a file load?
>> >
>> > Thanks
>> >
>> > John
>>
>> --
>> Harsh J

--
Harsh J
