Re: Data Locality and WebHDFS

RJ Nowling Mon, 17 Mar 2014 09:48:54 -0700

Hi Alejandro,

The WebHDFS API allows specifying an offset and length for the request.  If
I specify an offset that starts in the second block of a file (thus
skipping the first block altogether), will the namenode still direct me
to a datanode with the first block, or will it direct me to a datanode with
the second block?  I.e., am I assured data locality only on the first block
of the file (as you're saying) or on the first block I am accessing?
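
To make the question concrete, here is a rough sketch of the experiment,
not a tested client.  It assumes the documented OPEN operation with its
offset and length parameters, and that the namenode answers with a redirect
whose Location header names the datanode.  The host name, port 50070, file
path, and both function names are placeholders of mine, and the block size
is assumed to be known in advance.

```python
import http.client
from urllib.parse import quote, urlencode, urlsplit


def block_ranges(file_length, block_size):
    """Return (offset, length) pairs, one per block of the file."""
    return [
        (start, min(block_size, file_length - start))
        for start in range(0, file_length, block_size)
    ]


def open_url(namenode, path, offset, length):
    """Build a WebHDFS OPEN URL for one byte range of a file.

    `namenode` is a "host:port" string; both are placeholders here.
    """
    query = urlencode({"op": "OPEN", "offset": offset, "length": length})
    return "http://{}/webhdfs/v1{}?{}".format(namenode, quote(path), query)


def redirect_target(url):
    """Ask the namenode where it would send us, without following the redirect.

    http.client never follows redirects on its own, so the Location
    header of the namenode's reply is returned untouched.
    """
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.netloc, timeout=10)
    try:
        conn.request("GET", parts.path + "?" + parts.query)
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()


if __name__ == "__main__":
    block_size = 128 * 1024 * 1024      # assumed block size
    file_length = 3 * block_size        # assumed file length
    for offset, length in block_ranges(file_length, block_size):
        url = open_url("namenode.example.com:50070", "/data/input.txt",
                       offset, length)
        print(offset, redirect_target(url))
```

Comparing the host in each Location header against the datanodes that
actually hold that block (e.g., from fsck run with -files -blocks
-locations) would show whether the redirect follows the requested offset
or always follows the first block.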


If it is as you say, then I may want to reach out to the WebHDFS developers
and see if they would be interested in the additional functionality.

Thank you,
RJ


On Mon, Mar 17, 2014 at 2:40 AM, Alejandro Abdelnur <[email protected]> wrote:

> I may have expressed myself poorly. You don't need to run any tests to see
> how locality works with files of multiple blocks. If you are accessing a
> file of more than one block over WebHDFS, you are only assured locality
> for the first block of the file.
>
> Thanks.
>
>
> On Sun, Mar 16, 2014 at 9:18 PM, RJ Nowling <[email protected]> wrote:
>
>> Thank you, Mingjiang and Alejandro.
>>
>> This is interesting.  Since we will use the data locality information for
>> scheduling, we could "hack" this to get the data locality information, at
>> least for the first block.  As Alejandro says, we'd have to test what
>> happens for other data blocks -- e.g., what if, knowing the block sizes, we
>> request the second or third block?
>>
>> Interesting food for thought!  I see some experiments in my future!
>>
>> Thanks!
>>
>>
>> On Sun, Mar 16, 2014 at 10:14 PM, Alejandro Abdelnur
>> <[email protected]> wrote:
>>
>>> Well, this holds only for the first block of the file. The rest of the
>>> file's blocks, local or not, are streamed out by that same datanode. For
>>> small files (one block) you'll get locality; for large files you get it
>>> only for the first block, plus any other blocks that happen to be local
>>> to that datanode.
>>>
>>>
>>> Alejandro
>>> (phone typing)
>>>
>>> On Mar 16, 2014, at 18:53, Mingjiang Shi <[email protected]> wrote:
>>>
>>> According to this page:
>>> http://hortonworks.com/blog/webhdfs-%E2%80%93-http-rest-access-to-hdfs/
>>>
>>>> *Data Locality*: The file read and file write calls are redirected to
>>>> the corresponding datanodes. It uses the full bandwidth of the Hadoop
>>>> cluster for streaming data.
>>>>
>>>> *A HDFS Built-in Component*: WebHDFS is a first class built-in
>>>> component of HDFS. It runs inside Namenodes and Datanodes, therefore, it
>>>> can use all HDFS functionalities. It is a part of HDFS - there are no
>>>> additional servers to install
>>>>
>>>
>>> So it looks like data locality is built into WebHDFS: the client will be
>>> redirected to the datanode automatically.
>>>
>>>
>>>
>>>
>>> On Mon, Mar 17, 2014 at 6:07 AM, RJ Nowling <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I'm writing up a Google Summer of Code proposal to add HDFS support to
>>>> Disco, an Erlang MapReduce framework.
>>>>
>>>> We're interested in using WebHDFS.  I have two questions:
>>>>
>>>> 1) Does WebHDFS allow querying data locality information?
>>>>
>>>> 2) If the data locality information is known, can data on specific
>>>> datanodes be accessed via WebHDFS?  Or do all WebHDFS requests have to
>>>> go through a single server?
>>>>
>>>> Thanks,
>>>> RJ
>>>>
>>>> --
>>>> em [email protected]
>>>> c 954.496.2314
>>>>
>>>
>>>
>>>
>>> --
>>> Cheers
>>> -MJ
>>>
>>>
>>
>>
>> --
>> em [email protected]
>> c 954.496.2314
>>
>
>
>
> --
> Alejandro
>



-- 
em [email protected]
c 954.496.2314
