Re: BLOCK and Split size question

Ahmed Ossama Sat, 21 Feb 2015 18:23:25 -0800

Hi,

Answering the first question:

What happens is that the client on the machine issuing the 'copyFromLocal' command first creates a new instance of DistributedFileSystem, which makes an RPC call to the NameNode (NN) to create a new file in the filesystem's namespace. The NN performs various checks; if they pass, it creates a record for the file. The DistributedFileSystem then returns an FSDataOutputStream to the client, which handles the communication with the datanodes.

The data is then split into packets, which are written to an internal data queue. Packets are much smaller than a block (64KB by default); the block size only determines where one block ends and the next one begins. The data queue is consumed by the DataStreamer, which asks the NN to allocate new blocks by picking a list of datanodes (DNs) to store the replicas on. That list of DNs forms a pipeline with as many nodes as there are replicas. The DataStreamer streams each packet to the first datanode in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline. The packet keeps being forwarded until all the replicas are written.


When the client finishes writing, it closes the stream and notifies the NN that the file is complete.
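
To make the write path above a bit more concrete, here is a toy Python model of it. This is only a sketch, not HDFS code: write_file, the round-robin choice of the pipeline and the tiny sizes are all made up for illustration. In real HDFS the NN picks the datanodes, and each datanode forwards the packet over the network to the next one.

```python
def write_file(data, block_size, packet_size, datanodes, replication=3):
    """Toy model: cut data into blocks, cut each block into packets,
    and push every packet through a pipeline of `replication` datanodes."""
    # storage[datanode][block_id] -> bytes stored on that node for that block
    storage = {dn: {} for dn in datanodes}

    for block_id, offset in enumerate(range(0, len(data), block_size)):
        block = data[offset:offset + block_size]

        # Stand-in for the NN's block allocation: a simple round-robin pick.
        # (Assumption for illustration; real placement is rack-aware.)
        pipeline = [datanodes[(block_id + i) % len(datanodes)]
                    for i in range(replication)]

        for p in range(0, len(block), packet_size):
            packet = block[p:p + packet_size]
            # Each node in the pipeline stores the packet, then the packet
            # moves on to the next node until all replicas are written.
            for dn in pipeline:
                storage[dn][block_id] = storage[dn].get(block_id, b"") + packet

    return storage
```

Calling write_file(data, 20, 8, ["dn1", "dn2", "dn3", "dn4"]) on 50 bytes of data gives three blocks (20, 20 and 10 bytes), each of them present on three of the four datanodes.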

Second question: the split size is used during processing of a file; in other words, the split size is consumed by the mappers, not during HDFS operations. The block is the unit consumed by the HDFS client during read and write operations. The default block size in Hadoop 2.6 is 128MB (dfs.blocksize), and by default the split size of a file-based input format is the same as the block size.
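
For reference, the way a file-based input format derives the split size from the block size can be sketched in a few lines of Python. The formula mirrors what FileInputFormat.computeSplitSize() does in the mapreduce API, as far as I know; the Python names and the 1TB arithmetic are only illustrative, and a 128MB block size is assumed.

```python
MB = 1024 * 1024
TB = 1024 * 1024 * MB


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Split size = max(min_size, min(max_size, block_size)).

    min_size and max_size stand in for the job settings
    mapreduce.input.fileinputformat.split.minsize and
    mapreduce.input.fileinputformat.split.maxsize.
    """
    return max(min_size, min(max_size, block_size))


def count_blocks(file_size, block_size):
    """Number of HDFS blocks needed for a file (ceiling division)."""
    return -(-file_size // block_size)


print(compute_split_size(128 * MB) // MB)                      # default settings
print(compute_split_size(128 * MB, max_size=64 * MB) // MB)    # split smaller than a block
print(compute_split_size(128 * MB, min_size=256 * MB) // MB)   # split larger than a block
print(count_blocks(1 * TB, 128 * MB))                          # blocks in a 1TB file
```

So the split size only differs from the block size when the minimum is raised above it or the maximum is lowered below it.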

Third question: a split is the logical representation of the data in a block, while a block is the physical representation of the data. As I said, splits are consumed when processing the data; a mapper reads data from a block through a split.


Consider the following content of a file:

   000
   111
   222
   333
   444
   555
   666
   777
   888
   999
   aaa
   bbb
   ccc
   ddd
   eee
   fff


The block representation of this file could look something like this:

   block-0

       000
       111
       222
       333
       444
       55

   block-1

       5
       666
       777
       888
       999
       aa

   block-2

       a
       bbb
       ccc
       ddd
       eee
       fff

If you notice, the blocks aren't aligned with the records. This happens because the records in a file aren't all the same size, while blocks are cut at fixed byte offsets. This is where splits come in: they put the records back together before they are processed by a mapper.
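
How a reader puts the records back together can be sketched like this. It is a simplified Python model of the rule that line-oriented record readers follow: skip the partial first line unless the split starts at offset 0, and read past the end of the split to finish the last line. read_split and the 22-byte block size are assumptions chosen for the toy file above, so the byte boundaries don't match the illustration exactly.

```python
RECORDS = ["000", "111", "222", "333", "444", "555", "666", "777",
           "888", "999", "aaa", "bbb", "ccc", "ddd", "eee", "fff"]
DATA = "".join(r + "\n" for r in RECORDS).encode()   # 16 records, 64 bytes


def read_split(data, start, length):
    """Return the whole records that belong to the split [start, start+length)."""
    end = start + length
    pos = start

    if start != 0:
        # The first (possibly partial) line belongs to the previous split,
        # whose reader runs past its own end to finish it. Skip it here.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1

    records = []
    # "<=" on purpose: a line starting exactly at `end` is read by this split,
    # because the next split always skips its first line.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        records.append(data[pos:stop].decode())
        pos = stop + 1
    return records


# One split per 22-byte "block"
for start in range(0, len(DATA), 22):
    length = min(22, len(DATA) - start)
    print(start, read_split(DATA, start, length))
```

Every record comes out exactly once, even though "555" and the newline after it straddle the first block boundary.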

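Unlike traditional filesystems, HDFS doesn't reserve a full block for every file: a block only grows up to the HDFS block size, and a file (or a last block) smaller than that occupies only as much underlying storage as it has data. Note that there are 2 block sizes here: the underlying filesystem's block size and the HDFS block size.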
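
As a quick check of the arithmetic for the 100 MB file from the question, here is a small Python sketch (block_lengths is a hypothetical helper, not an HDFS API). It shows that the last block is simply shorter; the remaining 28 MB isn't reserved for anything.

```python
MB = 1024 * 1024


def block_lengths(file_size, block_size):
    """Lengths of the HDFS blocks a file of `file_size` bytes is cut into."""
    full, rest = divmod(file_size, block_size)
    # Only the last block may be shorter than block_size.
    return [block_size] * full + ([rest] if rest else [])


lengths = block_lengths(100 * MB, 64 * MB)
print([n // MB for n in lengths])   # block lengths in MB
print(sum(lengths) // MB)           # total space used per replica, in MB
```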


On 02/21/2015 01:59 AM, SP wrote:

Hello everyone,

I have a couple of doubts; can anyone please point me in the right direction?
1> What exactly happens when I copy a 1TB file to a Hadoop cluster using the copyFromLocal command?
1> What will be the split size? Will it be the same as the block size?

2> What is a block and what is a split?
If we have a 100 MB file and a block size of 64 MB, as we know it will be divided into 2 blocks of 64 MB and 36 MB. The second block still has 28 MB of space left; what will happen to that free space? Will the cluster have unequal block sizes, or will it be occupied by another file?
3) Let’s say a 64MB block is on node A and replicated among 2 other nodes (B, C), and the input split size for the map-reduce program is 64MB. Will this split just have the location for node A? Or will it have locations for all three nodes A, B, C?
4) How is it handled if the input split size is greater or less than the block size?
Can anyone please help?

thanks

SP


--
Regards,
Ahmed Ossama
