Re: Anatomy of read in hdfs

Mohammad Tariq Sun, 09 Apr 2017 14:01:32 -0700

Hi Sidharth,

I'm sorry, I didn't quite get the first part of your question. What do you
mean by real time? Could you please elaborate a bit? That'll help me answer
your question better.

And for your second question,

This is how a write happens:

Suppose your file resides in your local file system and you have written a
program (using the HDFS API) to copy it into HDFS. An input stream gets
created on this file and data gets read from it. The data is buffered at the
client side in small packets (64 KB by default), not in whole blocks, and the
packets are streamed to the first datanode in the pipeline for the current
block. That datanode writes each packet and forwards it to the other
datanodes for replication, based on the replication factor you have
specified in your configuration. Once a full block (the block size) has been
written, the client moves on to the next block.
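
To make the pipeline idea concrete, here is a toy model in Python. It is not
the HDFS client API: the names `packets` and `write_file`, the tiny sizes, and
the round-robin placement are assumptions made up for illustration (in a real
cluster the namenode picks the datanodes for each block). It only shows the
shape of the flow: the file is cut into blocks, each block is sent as packets,
and every packet passes through all the datanodes in that block's pipeline.

```python
from typing import Dict, List

PACKET_SIZE = 4   # bytes; toy value, real HDFS packets are far larger
BLOCK_SIZE = 10   # bytes; toy value, real HDFS blocks are e.g. 128 MB
REPLICATION = 3   # replication factor; needs at least this many datanodes


def packets(data: bytes, size: int) -> List[bytes]:
    """Split data into fixed-size packets; the last one may be shorter."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def write_file(data: bytes, datanodes: List[str]) -> Dict[str, Dict[int, bytes]]:
    """Write data block by block, replicating each block along a pipeline."""
    storage: Dict[str, Dict[int, bytes]] = {dn: {} for dn in datanodes}
    for block_id, start in enumerate(range(0, len(data), BLOCK_SIZE)):
        block = data[start:start + BLOCK_SIZE]
        # Hypothetical placement policy: rotate through the datanodes.
        pipeline = [datanodes[(block_id + r) % len(datanodes)]
                    for r in range(REPLICATION)]
        for packet in packets(block, PACKET_SIZE):
            # The packet goes to the first node, which stores it and
            # forwards it to the next node, and so on down the pipeline.
            for dn in pipeline:
                storage[dn][block_id] = storage[dn].get(block_id, b"") + packet
    return storage
```

For example, `write_file(b"abcdefghijklmnopqrstuvwxy", ["dn1", "dn2", "dn3", "dn4"])`
produces three blocks, and each block ends up on three of the four datanodes.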

This process continues until all of the data has been written to the target
HDFS location. Again, since this program is a standalone application, the
write happens sequentially.

However, if your source file is already in HDFS and you have written a
distributed application, say a MapReduce program, to copy it to some other
HDFS location then reads and writes will happen in parallel based on the
number of mappers and reducers you have.
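
The difference between the standalone copy and the distributed copy can be
sketched in plain Python. This is only an analogy, not MapReduce: the threads
stand in for mappers, an in-memory byte string stands in for the HDFS file,
and `sequential_copy` and `parallel_copy` are hypothetical helper names.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential_copy(src: bytes, block_size: int) -> bytes:
    """One reader walks the blocks one after another, like a standalone client."""
    out = bytearray()
    for offset in range(0, len(src), block_size):
        out += src[offset:offset + block_size]
    return bytes(out)


def parallel_copy(src: bytes, block_size: int, workers: int) -> bytes:
    """One task per block, run by a pool of workers (stand-ins for mappers)."""
    offsets = range(0, len(src), block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so the blocks reassemble correctly.
        parts = list(pool.map(lambda off: src[off:off + block_size], offsets))
    return b"".join(parts)
```

Both functions return the same bytes; only the way the per-block work is
scheduled differs.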

One important thing to note here is that parallelism on the read side is
based on the number of mappers, which is determined by the InputFormat you
are using. You cannot control it directly unless you change the way the
InputFormat behaves or do some other tweaking. However, you can tune the
parallelism of the write operation by changing the number of reducers in
your program.
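
As a rough back-of-the-envelope check of how much the number of mappers
matters, the sketch below uses the 100 GB file and 128 MB block size from your
question. It assumes one input split (and so one mapper) per block, which is
only the common default for file-based input formats, and the throughput and
slot numbers are invented for illustration, not measurements.

```python
def num_splits(file_mb: int, split_mb: int) -> int:
    """Number of splits, assuming one split per block (ceiling division)."""
    return -(-file_mb // split_mb)


def sequential_seconds(file_mb: int, mb_per_sec: float) -> float:
    """Time for a single reader to stream the whole file."""
    return file_mb / mb_per_sec


def parallel_seconds(file_mb: int, split_mb: int, mb_per_sec: float, slots: int) -> float:
    """Time when `slots` mappers run at once, each reading one split per wave."""
    waves = -(-num_splits(file_mb, split_mb) // slots)
    return waves * (split_mb / mb_per_sec)


FILE_MB = 100 * 1024   # 100 GB
BLOCK_MB = 128         # HDFS block size from the question
RATE = 100.0           # assumed MB/s per reader (hypothetical)
SLOTS = 100            # assumed number of concurrent mappers (hypothetical)

print(num_splits(FILE_MB, BLOCK_MB))        # 800
print(sequential_seconds(FILE_MB, RATE))    # 1024.0
print(parallel_seconds(FILE_MB, BLOCK_MB, RATE, SLOTS))
```

Under these made-up numbers a single sequential reader needs about 17
minutes, while 100 concurrent mappers finish in roughly 10 seconds.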

Hope this helps!

Tariq, Mohammad
about.me/mti

On Sun, Apr 9, 2017 at 3:20 PM, Sidharth Kumar <[email protected]>
wrote:

> Thanks Tariq, it really helped me understand, but I have one more doubt.
> If reading is not a parallel process, then reading a 100 GB file with an
> HDFS block size of 128 MB would take a very long time, but that's not the
> scenario in real time. My second question: is the write operation a
> sequential process as well? And will every datanode have its own data
> streamer which listens to the data queue to get the packets and create the
> pipeline? So, can you kindly help me get a clear idea of HDFS read and
> write operations?
>
> Regards
> Sidharth
>
> On 08-Apr-2017 12:49 PM, "Mohammad Tariq" <[email protected]> wrote:
>
> Hi Sidharth,
>
> When you read data from HDFS using a framework like MapReduce, the blocks
> of an HDFS file (input splits, to be precise) are read in parallel by the
> multiple mappers created in that particular program.
>
> On the other hand, if you have a standalone Java program, then it's just a
> single-threaded process and will read the data sequentially.
>
>
> On Friday, April 7, 2017, Sidharth Kumar <[email protected]>
> wrote:
>
>> Thanks for your response. But I still didn't understand. If you don't
>> mind, can you tell me what you mean by "*With Hadoop, the idea is to
>> parallelize the readers (one per block for the mapper) with a processing
>> framework like MapReduce.*"
>>
>> And also, how will the concept of parallelizing the readers work with HDFS?
>>
>> Thanks a lot in advance for your help.
>>
>>
>> Regards
>> Sidharth
>>
>> On 07-Apr-2017 1:04 PM, "Philippe Kernévez" <[email protected]> wrote:
>>
>> Hi Sidharth,
>>
>> The reads are sequential.
>> With Hadoop, the idea is to parallelize the readers (one per block for
>> the mapper) with a processing framework like MapReduce.
>>
>> Regards,
>> Philippe
>>
>>
>> On Thu, Apr 6, 2017 at 9:55 PM, Sidharth Kumar <
>> [email protected]> wrote:
>>
>>> Hi Genies,
>>>
>>> I have a small doubt about whether the HDFS read operation is a parallel
>>> or a sequential process. From my understanding it should be parallel, but
>>> the anatomy of a read in "Hadoop: The Definitive Guide" (4th edition)
>>> says: "*Data is streamed from the datanode back to the client, which
>>> calls read() repeatedly on the stream (step 4). When the end of the block
>>> is reached, DFSInputStream will close the connection to the datanode,
>>> then find the best datanode for the next block (step 5). This happens
>>> transparently to the client, which from its point of view is just
>>> reading a continuous stream.*"
>>>
>>> So can you kindly explain to me exactly how the read operation happens?
>>>
>>>
>>> Thanks for your help in advance
>>>
>>> Sidharth
>>>
>>>
>>
>>
>> --
>> Philippe Kernévez
>>
>>
>>
>> Technical Director (Switzerland),
>> [email protected]
>> +41 79 888 33 32
>>
>> Find OCTO on OCTO Talk: http://blog.octo.com
>> OCTO Technology http://www.octo.ch
>>
>>
>>
>
