Hi Alejandro,

The WebHDFS API allows specifying an offset and length for the request. If I specify an offset that starts in the second block of a file (thus skipping the first block altogether), will the namenode still direct me to a datanode with the first block, or will it direct me to a datanode with the second block? I.e., am I assured data locality only on the first block of the file (as you're saying) or on the first block I am accessing?
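To make the experiment concrete, here is a minimal Python sketch of what I have in mind. It builds the WebHDFS `OPEN` request with `offset` and `length` and reads the `Location` header of the namenode's redirect without following it. The helper names are hypothetical, the 128 MB block size is an assumption (the real value depends on the cluster and file), and the host, port, and path in the usage note are placeholders. Which datanode the redirect names for a non-zero offset is exactly the open question, so this only shows how one might check.

```python
import http.client
from urllib.parse import quote, urlencode

# Assumed block size for illustration only; the real value is per-file.
ASSUMED_BLOCK_SIZE = 128 * 1024 * 1024


def block_offset(block_index, block_size=ASSUMED_BLOCK_SIZE):
    """Byte offset at which the given zero-based block starts."""
    return block_index * block_size


def open_request_path(hdfs_path, offset, length):
    """Request path and query string for a WebHDFS OPEN with offset/length."""
    query = urlencode({"op": "OPEN", "offset": offset, "length": length})
    return "/webhdfs/v1" + quote(hdfs_path) + "?" + query


def redirect_target(namenode_host, namenode_port, hdfs_path, offset, length):
    """Send OPEN to the namenode and return (status, Location header).

    http.client does not follow redirects, so the Location header shows
    which datanode the namenode chose for this offset.
    """
    conn = http.client.HTTPConnection(namenode_host, namenode_port, timeout=10)
    try:
        conn.request("GET", open_request_path(hdfs_path, offset, length))
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Calling something like `redirect_target("namenode.example.com", 50070, "/user/rj/data.txt", block_offset(1), 1024)` against a real cluster, and comparing the returned host with the known location of the second block, would answer the question empirically.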
If it is as you say, then I may want to reach out to the WebHDFS developers and see if they would be interested in the additional functionality.

Thank you,
RJ

On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur <[email protected]> wrote:

> I may have expressed myself wrong. You don't need to do any test to see
> how locality works with files of multiple blocks. If you are accessing a
> file of more than one block over webhdfs, you only have assured locality
> for the first block of the file.
>
> Thanks.
>
> On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <[email protected]> wrote:
>
>> Thank you, Mingjiang and Alejandro.
>>
>> This is interesting. Since we will use the data locality information for
>> scheduling, we could "hack" this to get the data locality information, at
>> least for the first block. As Alejandro says, we'd have to test what
>> happens for other data blocks -- e.g., what if, knowing the block sizes, we
>> request the second or third block?
>>
>> Interesting food for thought! I see some experiments in my future!
>>
>> Thanks!
>>
>> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur
>> <[email protected]> wrote:
>>
>>> well, this is for the first block of the file, the rest of the file
>>> (blocks being local or not) are streamed out by the same datanode. for
>>> small files (one block) you'll get locality, for large files only the first
>>> block, and by chance if other blocks are local to that datanode.
>>>
>>> Alejandro
>>> (phone typing)
>>>
>>> On Mar 16, 2014, at 18:53, Mingjiang Shi <[email protected]> wrote:
>>>
>>> According to this page:
>>> http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>>
>>>> *Data Locality*: The file read and file write calls are redirected to
>>>> the corresponding datanodes. It uses the full bandwidth of the Hadoop
>>>> cluster for streaming data.
>>>>
>>>> *A HDFS Built-in Component*: WebHDFS is a first class built-in
>>>> component of HDFS. It runs inside Namenodes and Datanodes, therefore, it
>>>> can use all HDFS functionalities. It is a part of HDFS - there are no
>>>> additional servers to install
>>>
>>> So it looks like the data locality is built-into webhdfs, client will be
>>> redirected to the data node automatically.
>>>
>>> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm writing up a Google Summer of Code proposal to add HDFS support to
>>>> Disco, an Erlang MapReduce framework.
>>>>
>>>> We're interested in using WebHDFS. I have two questions:
>>>>
>>>> 1) Does WebHDFS allow querying data locality information?
>>>>
>>>> 2) If the data locality information is known, can data on specific data
>>>> nodes be accessed via Web HDFS? Or do all Web HDFS requests have to go
>>>> through a single server?
>>>>
>>>> Thanks,
>>>> RJ
>>>>
>>>> --
>>>> em [email protected]
>>>> c 954.496.2314
>>>
>>> --
>>> Cheers
>>> -MJ
>>
>> --
>> em [email protected]
>> c 954.496.2314
>
> --
> Alejandro

--
em [email protected]
c 954.496.2314
