Thanks a lot, John and Azurya. I had guessed it was about HDD optimization. In that case it might be worth defragmenting the underlying disks during general maintenance downtime.
Thanks,
Rahul

On Mon, Apr 1, 2013 at 12:28 AM, John Lilley <[email protected]> wrote:

> From: Rahul Bhattacharjee [mailto:[email protected]]
> Subject: Why big block size for HDFS.
>
> >In many places it has been written that, to avoid a huge number of disk
> >seeks, we store big blocks in HDFS, so that once we seek to the location,
> >the data transfer rate is predominant and there are no more seeks. I am
> >not sure if I have understood this correctly.
> >
> >My question is: no matter what block size we decide on, it is finally
> >written to the computer's HDD, which is formatted with a block size in
> >KBs. Also, while writing to the FS (not HDFS), it's not guaranteed that
> >the blocks we write are contiguous, so there would be disk seeks anyway.
> >The assumption of HDFS would only be true if the underlying FS guarantees
> >to write the data in contiguous blocks.
> >
> >Can someone explain a bit?
> >
> >Thanks,
> >Rahul
>
> While there are no guarantees that disk storage will be contiguous, the OS
> will attempt to keep large files contiguous (and may even defrag over
> time), and if all files are written using large blocks, this is more likely
> to be the case. If storage is contiguous, you can write a complete track
> without seeking. A complete track size varies, but a 1TB disk might have
> 500KB/track. Stepping between adjacent tracks is also much cheaper than
> the average seek time and, as you might expect, has been optimized in
> hardware to assist sequential I/O. However, if you switch storage units,
> you will probably encounter at least one full seek at the start of the
> block (since the head was probably somewhere else at the time). The result
> is that, on average, writing sequential files is very fast (>100MB/sec on
> typical SATA). But I think that the block overhead has more to do with
> finding where to read the next block from, assuming that data has been
> distributed evenly.
>
> So consider connection overhead when the data is distributed. I am no
> expert on the Hadoop internals, but I suspect that somewhere, a TCP
> connection is opened to transfer data. Whether connection overhead is
> reduced by maintaining open connection pools, I don’t know. But let’s
> assume that there is *some* overhead for switching data transfer from
> machine “A” that owns block “1000” to machine “B” that owns block
> “1001”. The larger the block size, the less significant this overhead
> will be relative to the sequential transfer rate.
>
> In addition, MapR/YARN has an easier time of scheduling if there are fewer
> blocks.
>
> --john
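
To put rough numbers on the argument quoted above, here is a minimal sketch in Python. It only models "one fixed cost per block, then a sequential transfer". The 100MB/sec rate is the figure from John's message; the 10 ms fixed cost (one full seek, or one connection switch) is purely an assumption for illustration, and the function name `overhead_fraction` is made up for this example, not anything from Hadoop.

```python
# Rough model: each block costs one fixed overhead (a full seek or a
# connection switch) plus a sequential transfer at a constant rate.

MB = 1024 * 1024

# Assumption for illustration only: 10 ms of fixed cost per block.
FIXED_COST_S = 0.010
# Sequential rate taken from the message above (>100MB/sec on typical SATA).
TRANSFER_BYTES_PER_S = 100 * MB


def overhead_fraction(block_bytes, fixed_cost_s=FIXED_COST_S,
                      transfer_bytes_per_s=TRANSFER_BYTES_PER_S):
    """Return the share of per-block time spent on the fixed cost."""
    transfer_s = block_bytes / transfer_bytes_per_s
    return fixed_cost_s / (fixed_cost_s + transfer_s)


if __name__ == "__main__":
    # Compare a filesystem-sized block with typical large block sizes.
    for label, size in (("4KB", 4 * 1024), ("1MB", 1 * MB),
                        ("64MB", 64 * MB), ("128MB", 128 * MB)):
        print(f"{label:>6}: {overhead_fraction(size):.2%} of time is overhead")
```

Under these assumed numbers the fixed cost dominates for KB-sized blocks and becomes a small share once blocks reach tens of MB, which is the point being made about large blocks. The real figures depend on the hardware and on how Hadoop actually manages connections, which nobody in this thread has confirmed.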
