From: Rahul Bhattacharjee [mailto:[email protected]]
Subject: Why big block size for HDFS.

>In many places it has been written that, to avoid a huge number of disk seeks, 
>HDFS stores big blocks, so that once we seek to the location, the data 
>transfer rate is predominant and there are no more seeks. I am not sure 
>if I have understood this correctly.
>My question is: no matter what block size we decide on, the data is finally 
>written to the computer's HDD, which is formatted with a block size in KBs. 
>Also, when writing to the local FS (not HDFS), it is not guaranteed that the 
>blocks we write are contiguous, so there would be disk seeks anyway. The 
>assumption of HDFS would only be true if the underlying FS guarantees to 
>write the data in contiguous blocks.

>Can someone explain a bit?
>Thanks,
>Rahul

While there are no guarantees that disk storage will be contiguous, the OS will 
attempt to keep large files contiguous (and may even defragment over time), and 
if all files are written using large blocks, this is more likely to be the case.  
If storage is contiguous, you can write a complete track without seeking.  Track 
size varies, but a 1TB disk might have around 500KB/track.  Stepping between 
adjacent tracks is also much cheaper than the average seek and, as you might 
expect, has been optimized in hardware to assist sequential I/O.  However, if 
you switch storage units, you will probably encounter at least one full seek at 
the start of the block (since the head was probably somewhere else at the 
time).  The result is that, on average, writing sequential files is very fast 
(>100MB/sec on typical SATA).  But I think that the per-block overhead has more 
to do with finding where to read the next block from, assuming that the data 
has been distributed evenly.
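
To put rough numbers on the seek argument, here is a back-of-envelope sketch. 
It assumes one full seek per block, a 10 ms seek (my assumption, not a 
measurement), and the 100MB/sec sequential rate mentioned above.  The function 
name and constants are made up for illustration.

```python
# Back-of-envelope model: fraction of per-block time spent seeking.
# ASSUMPTIONS (illustrative, not measured): one full seek per block,
# 10 ms per seek, 100 MB/s sequential transfer once the head is in place.

SEEK_SECONDS = 0.010
TRANSFER_BYTES_PER_SECOND = 100e6


def seek_overhead_fraction(block_bytes,
                           seek_s=SEEK_SECONDS,
                           rate=TRANSFER_BYTES_PER_SECOND):
    """Return seek time divided by total (seek + transfer) time for one block."""
    transfer_s = block_bytes / rate
    return seek_s / (seek_s + transfer_s)


if __name__ == "__main__":
    sizes = [
        ("4 KB", 4 * 1024),
        ("1 MB", 1024 ** 2),
        ("64 MB", 64 * 1024 ** 2),
        ("128 MB", 128 * 1024 ** 2),
    ]
    for label, size in sizes:
        fraction = seek_overhead_fraction(size)
        print(f"{label:>7}: {fraction:6.1%} of the time is seek overhead")
```

Under these assumed numbers, a 4 KB block spends over 99% of its time seeking, 
while a 64 MB block spends under 2%.  The exact figures depend entirely on the 
assumed seek time and transfer rate.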

So consider connection overhead when the data is distributed.  I am no expert 
on the Hadoop internals, but I suspect that somewhere, a TCP connection is 
opened to transfer data.  Whether connection overhead is reduced by maintaining 
open connection pools, I don’t know.  But let’s assume that there is *some* 
overhead for switching data transfer from machine “A”, which owns block “1000”, 
to machine “B”, which owns block “1001”.  The larger the block size, the less 
significant this overhead will be relative to the sequential transfer time.

In addition, MapReduce/YARN has an easier time scheduling if there are fewer 
blocks.
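
As a rough illustration of the "fewer blocks" point, this sketch counts how 
many blocks a single file breaks into at different block sizes.  The 1 TiB 
file size is an arbitrary example of mine, and the helper name is 
hypothetical.  Each block is one more unit to track, one more possible 
machine switch, and (as far as I understand the defaults) roughly one more 
task to schedule.

```python
# Count how many blocks a file occupies at a given block size.
# ASSUMPTION: the 1 TiB file size is an arbitrary illustrative value.

FILE_BYTES = 1024 ** 4  # 1 TiB


def block_count(file_bytes, block_bytes):
    """Number of blocks needed to hold file_bytes (last block may be partial)."""
    return -(-file_bytes // block_bytes)  # ceiling division on integers


if __name__ == "__main__":
    for label, size in [("4 KB", 4 * 1024),
                        ("64 MB", 64 * 1024 ** 2),
                        ("128 MB", 128 * 1024 ** 2)]:
        print(f"{label:>7} blocks: {block_count(FILE_BYTES, size):,}")
```

With 64 MB blocks the file is 16,384 blocks; with 4 KB blocks it would be over 
268 million.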
--john
